NASHVILLE HOUSING DATA CLEANING PROJECT

DATA CLEANING USING SQL

What is Data Cleaning?

Data cleaning is the process of removing incorrect, duplicate, or otherwise erroneous data from a dataset. These errors can include incorrectly formatted data, redundant entries, mislabeled data, and other issues; they often arise when two or more datasets are combined together. Data cleaning improves the quality of your data as well as any business decisions that you draw based on the data.

Here are some of the most common steps and methods of data cleansing experienced development teams swear by:

  1. Dealing with missing data
  2. Standardizing the process
  3. Validating data accuracy
  4. Removing duplicate data
  5. Handling structural errors
  6. Getting rid of unwanted observations

So, Data cleaning helps us improve the quality of your data as well as any business decisions that you draw based on the data.

OBJECTIVE

To clean the given dataset by removing incorrect, duplicate, or otherwise erroneous data from a dataset.

ABOUT DATASET

This is home value data for the hot Nashville market. There are 56,000+ rows altogether.

Link to the dataset — AVAILABLE ON KAGGLE

TOOLS USED

PGAdmin — PostgreSQL Management Tool for the Analysis

Cleaning Data USING SQL Queries

Select *
From NashvilleHousing

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —Standardize Date Format — already in the exact same format that I wanted

Select saleDate
From NashvilleHousing

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Change Y and N to Yes and No in the “SoldasVacant” field

Select soldasvacant
from NashvilleHousing
Select distinct soldasvacant,
count(soldasvacant)
from NashvilleHousing
group by 1
Select
CASE WHEN soldasvacant=’Y’ THEN ‘Yes’
WHEN soldasvacant =’N’THEN ‘No’
ELSE soldasvacant
END
from NashvilleHousing
update nashvilleHousing
Set soldasvacant = (Select
CASE WHEN soldasvacant=’Y’ THEN ‘Yes’
WHEN soldasvacant =’N’THEN ‘No’
ELSE soldasvacant
END)

OUTPUT:- We can see all Y’s and N’s have been replaced with Yes and No which definitely makes more sense.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —Remove Duplicates using a WINDOW FUNCTION

delete from NashvilleHousing
where uniqueID IN(
select UniqueId from
(Select *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID
) as row_num
From NashvilleHousing) rn
where rn.row_num>1)

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Populate Property Address data, filled the null spaces with the relevant data available in the table itself.

Select *
From NashvilleHousing
Where PropertyAddress is null
order by ParcelID
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress,
coalesce(a.PropertyAddress,b.PropertyAddress)
From NashvilleHousing a
JOIN NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
Where a.PropertyAddress is null
Update NashvilleHousing
SET PropertyAddress = Coalesce(a.PropertyAddress,b.PropertyAddress)
From NashvilleHousing a
JOIN NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
Where a.PropertyAddress is null

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — Breaking out Address into Individual Columns (Address, City, State)

Select PropertyAddress
From NashvilleHousing
— Where PropertyAddress is null
— order by ParcelID
SELECT
SUBSTRING(PropertyAddress, 1, CHARINDEX(‘,’ , PropertyAddress) -1 ) as Address,
SUBSTRING(PropertyAddress, CHARINDEX(“,”, PropertyAddress) + 1 , LEN(PropertyAddress)) as AddressFrom NashvilleHousingALTER TABLE NashvilleHousing
Add PropertySplitAddress Nvarchar(255);
Update NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(‘,’, PropertyAddress) -1 )
ALTER TABLE NashvilleHousing
Add PropertySplitCity Nvarchar(255);
Update NashvilleHousing
SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(‘,’, PropertyAddress) + 1 , LEN(PropertyAddress))
Select *
From NashvilleHousing

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

— Delete Unused Columns

Select *
From NashvilleHousing
ALTER TABLE NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate

Also “DO YOU KNOW” that 80% of the time in one Data Science Project goes into getting the data ready to work upon i.e Data Preparation rest all your work is just 20%.

Learning is never-ending. I look forward to learning more and enhancing my skills. Would be happy to have references for more projects. Also, I am a huge fan of constructive feedback, feel free to drop any sort of suggestions with respect to my learning journey or even framing my medium stories to make them better.

The highest in me recognizes the highest in you✊

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store