Data Cleaning using Python
Handling “Unusual” Values
Real-world data often has “unusual values”. Data can have this so called “unusual” values for a number of reasons such as human errors or problem in measuring devices.
from pandas import read_csv from pandas import nan
Dataset (Bling Bling $$)
For this tutorial we will be using diamonds dataset. It is a classic dataset and suitable for beginner to explore data analysis.
You can dowload the dataset from -> https://www.kaggle.com/shivam2503/diamonds
#load the dataset, please make sure on your working directory data = read_csv('diamonds.csv') #view the dataset data
We can use summary statistics to help identify this unusual data.
# summarize the dataset data.describe()
The red arrow indicate the minimum value in each column. Note that column x, y, z, the dimensions of these diamonds, in mm have minimum value 0.
We know that diamonds can’t have a width of 0 mm, so these values kind of unusual and must be incorrect.
Whats wrong with having 0 ?
If 0 is the real value, for example 0 mark in an exam, we can keep it.
Now, imagine 0 is a data error, like the diamods dataset, it will really effect the data distribution.
Handling the Unusual Value
We first locate and count this 0’s.
# Finding and counting 0 (data.loc[:, 'carat':'z'] == 0).sum()
Note that column x, y, z have 8, 7, 20 values with 0 respectively. Now lets view the rows when the column x = 0.
# Filtering rows with column condition, x = 0 data.loc[data.x == 0, :]
Marking with NaN
In Pandas or NumPy, we can replace them with NaN (missing values) or some refer this step as marking with NaN. Values with a NaN value are ignored from operations like sum or count.
Note that we only need to focus on column x, y, z because this are the columns where the 0’s are located.
# Replace 0 with NaN data[['x','y','z']] = data[['x','y','z']].replace(0, nan)
After we have replaced 0 with NaN, we can use the isnull() function to mark the NaN values as True and count the number of missing values for each column.
# Counting NaN data.isnull().sum()
Running this will produce a similiar output as counting 0’s. Note that columns x, y, z have the same number of NaN as zero values identified above. Now you can have a peace of mind!
Now lets view again the rows when the column x = 0, whether they had been replaced with NaN.
# Filtering rows with column condition, x = NaN data.loc[data.x.isnull(), :]
Yes, finally we had successfully replaced the 0’s with NaN’s.
For some machine learning algorithms having missing values can cause trouble. In my next post I will discuss on this.