Lab 11. Data Preparation
Overview
By the end of this lab, you will be able to filter DataFrames with specific conditions; remove duplicate or irrelevant records or columns; convert variables into different data types; replace values in a column; and handle missing values and outlier observations.
This lab will introduce you to the main techniques you can use to handle data issues in order to achieve high quality for your dataset prior to modeling it.
Handling Row Duplication
Let's use the example that we saw in Lab 10, Analyzing a Dataset.
Start by importing the dataset into a DataFrame:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab10/dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
The duplicated() method from pandas checks
whether any of the rows are duplicates and returns a boolean value
for each row, True if the row is a duplicate and
False if not:
df.duplicated()
You should get the following output:
Caption: Output of the duplicated() method
In Python, the True and False binary values
correspond to the numerical values 1 and 0, respectively. To find out
how many rows have been identified as duplicates, you can use the
sum() method on the output of duplicated(). This
will add up all the 1s (that is, the True values) and give us the
count of duplicates:
df.duplicated().sum()
You should get the following output:
5268
pandas provides three APIs for subsetting a DataFrame:
DataFrame[], .loc, and .iloc. The first one subsets the DataFrame by
rows or columns. To filter specific columns, you can provide a list
that contains their names. For instance, if you want to keep only the
InvoiceNo, StockCode, InvoiceDate,
and CustomerID variables, you need to use the following code:
df[['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']]
You should get the following output:
Caption: Subsetting columns
If you only want to filter the rows that are considered duplicates, you
can use the same API call with the output of the
duplicated() method. It will only keep the rows with
True as a value:
df[df.duplicated()]
You should get the following output:
Caption: Subsetting the duplicated rows
If you want to subset the rows and columns at the same time, you must
use one of the other two available APIs: .loc or
.iloc. These two do the same thing, but
.loc takes labels or names while .iloc only takes
integer positions as input. You will use the .loc API to subset the
duplicated rows and keep only the four selected columns, as shown in the
previous example:
df.loc[df.duplicated(), ['InvoiceNo', 'StockCode', \
'InvoiceDate', 'CustomerID']]
You should get the following output:
Caption: Subsetting the duplicated rows and selected columns using .loc
The preceding output shows that the first few duplicates are row
numbers 517, 527, 537, and so on. By
default, pandas doesn't mark the first occurrence of
duplicated rows as duplicates: all identical rows receive a value of
True except for the first occurrence. You can change this
behavior by specifying the keep parameter. If you want to
keep the last occurrence instead (marking all the others as
duplicates), you need to specify keep='last':
df.loc[df.duplicated(keep='last'), ['InvoiceNo', 'StockCode', \
'InvoiceDate', 'CustomerID']]
You should get the following output:
Caption: Subsetting the last duplicated rows
As you can see from the previous outputs, row 485 has the
same value as row 539. As expected, row 539 is
not marked as a duplicate anymore. If you want to mark all the duplicate
records as duplicates, you will have to use keep=False:
df.loc[df.duplicated(keep=False), ['InvoiceNo', 'StockCode',\
'InvoiceDate', 'CustomerID']]
You should get the following output:
Caption: Subsetting all the duplicated rows
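To see .iloc in action as well, here is a minimal sketch (our own example, not code from the lab) that reproduces a similar view using integer positions. It assumes the Online Retail columns appear in their usual order, that is, InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country, so the four columns of interest sit at positions 0, 1, 4, and 6:

```
# With the default RangeIndex, row labels equal integer positions,
# so the duplicated row labels can be passed straight to .iloc.
dup_rows = df.index[df.duplicated(keep=False)]
df.iloc[dup_rows, [0, 1, 4, 6]]
```

In practice, .loc is usually preferred for this kind of subsetting because column names are more robust than column positions.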
To remove duplicate rows, you can use the drop_duplicates() method from
pandas. It has the same keep parameter as
duplicated(), which specifies which duplicated record you
want to keep (or whether you want to remove all of them). In this case,
we want to keep one row from each group of duplicates; here, we will
keep the first occurrence:
df.drop_duplicates(keep='first')
You should get the following output:
Caption: Dropping duplicate rows with keep='first'
The drop_duplicates() and duplicated() methods
also have another very useful parameter: subset. This
parameter allows you to specify the list of columns to consider while
looking for duplicates. By default, all the columns of a DataFrame are
used to find duplicate rows. Let's see how many duplicate rows there
are while only looking at the InvoiceNo,
StockCode, InvoiceDate, and
CustomerID columns:
df.duplicated(subset=['InvoiceNo', 'StockCode', 'InvoiceDate',\
'CustomerID'], keep='first').sum()
You should get the following output:
10677
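Since drop_duplicates() accepts the same subset parameter, deduplicating on those four columns only could look like the following sketch (df_dedup is our own variable name):

```
# Keep the first occurrence of each (InvoiceNo, StockCode,
# InvoiceDate, CustomerID) combination and drop the other 10,677 rows.
df_dedup = df.drop_duplicates(subset=['InvoiceNo', 'StockCode', \
                                      'InvoiceDate', 'CustomerID'], \
                              keep='first')
df_dedup.shape
```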
Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset
In this exercise, you will learn how to identify duplicate records and how to handle such issues so that the dataset only contains unique records. Let's get started:
1. Open a new Jupyter notebook.

2. Import the `pandas` package:

    ```
    import pandas as pd
    ```

3. Assign the link to the Breast Cancer dataset to a variable called `file_url`:

    ```
    file_url = 'https://raw.githubusercontent.com/'\
               'fenago/data-science/'\
               'master/Lab11/dataset/'\
               'breast-cancer-wisconsin.data'
    ```

4. Using the `read_csv()` method from the `pandas` package, load the dataset into a new variable called `df` with the `header=None` parameter. We're doing this because this file doesn't contain column names:

    ```
    df = pd.read_csv(file_url, header=None)
    ```

5. Create a variable called `col_names` that contains the names of the columns: Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, and Class:

    ```
    col_names = ['Sample code number','Clump Thickness',\
                 'Uniformity of Cell Size',\
                 'Uniformity of Cell Shape',\
                 'Marginal Adhesion','Single Epithelial Cell Size',\
                 'Bare Nuclei','Bland Chromatin',\
                 'Normal Nucleoli','Mitoses','Class']
    ```

6. Assign the column names of the DataFrame using the `columns` attribute:

    ```
    df.columns = col_names
    ```

7. Display the shape of the DataFrame using the `.shape` attribute:

    ```
    df.shape
    ```

    You should get the following output:

    ```
    (699, 11)
    ```

    This DataFrame contains `699` rows and `11` columns.

8. Display the first five rows of the DataFrame using the `head()` method:

    ```
    df.head()
    ```

    You should get the following output:

    Caption: The first five rows of the Breast Cancer dataset

    All the variables are numerical. The Sample code number column is an identifier for the measurement samples.

9. Find the number of duplicate rows using the `duplicated()` and `sum()` methods:

    ```
    df.duplicated().sum()
    ```

    You should get the following output:

    ```
    8
    ```

    Looking at the 11 columns of this dataset, we can see that there are `8` duplicate rows.

10. Display the duplicate rows using the `.loc` API and the `duplicated()` method:

    ```
    df.loc[df.duplicated()]
    ```

    You should get the following output:

    Caption: Duplicate records

    The following rows are duplicates: `208`, `253`, `254`, `258`, `272`, `338`, `561`, and `684`.

11. Display the duplicate rows just like we did in the previous step, but with the `keep='last'` parameter instead:

    ```
    df.loc[df.duplicated(keep='last')]
    ```

    You should get the following output:

    Caption: Duplicate records with keep='last'

    By using the `keep='last'` parameter, the following rows are considered duplicates: `42`, `62`, `168`, `207`, `267`, `314`, `560`, and `683`. By comparing this output to the one from the previous step, we can see that rows 253 and 42 are identical.

12. Remove the duplicate rows using the `drop_duplicates()` method along with the `keep='first'` parameter and save this into a new DataFrame called `df_unique`:

    ```
    df_unique = df.drop_duplicates(keep='first')
    ```

13. Display the shape of `df_unique` with the `.shape` attribute:

    ```
    df_unique.shape
    ```

    You should get the following output:

    ```
    (691, 11)
    ```

    Now that we have removed the eight duplicate records, only `691` rows remain. The dataset now only contains unique observations.
In this exercise, you learned how to identify and remove duplicate records from a real-world dataset.
Converting Data Types
Another problem you may face in a project is incorrect data types being
inferred for some columns. As we saw in Lab 10, Analyzing a
Dataset, the pandas package provides us with a very easy
way to display the data type of each column using the
.dtypes attribute. You may be wondering, when did
pandas identify the type of each column? The types are
detected when you load the dataset into a pandas DataFrame
using methods such as read_csv(), read_excel(),
and so on.
During loading, pandas tries its best to
automatically infer the best type for each column according to the
values it contains. Let's see how this works on the Online Retail
dataset.
First, you must import pandas:
import pandas as pd
Then, you need to assign the URL to the dataset to a new variable:
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab10/dataset/'\
'Online%20Retail.xlsx?raw=true'
Let's load the dataset into a pandas DataFrame using
read_excel():
df = pd.read_excel(file_url)
Finally, let's print the data type of each column:
df.dtypes
You should get the following output:
Caption: The data type of each column of the Online Retail dataset
The preceding output shows the data types that have been assigned to
each column. Quantity, UnitPrice, and
CustomerID have been identified as numerical variables
(int64, float64), InvoiceDate is a
datetime variable, and all the other columns are considered
text (object). This is not too bad. pandas did a
great job of recognizing non-text columns.
But what if you want to change the types of some columns? You have two ways to achieve this.
The first way is to reload the dataset, but this time, you will need to
specify the data types of the columns of interest using the
dtype parameter. This parameter takes a dictionary with the
column names as keys and the correct data types as values, such as
{'col1': np.float64, 'col2': np.int32}, as input. Let's try this on
CustomerID. We know this isn't really a numerical variable, as it
contains a unique identifier (code). Here, we are going to change
its type to category:
df = pd.read_excel(file_url, dtype={'CustomerID': 'category'})
df.dtypes
You should get the following output:
Caption: The data types of each column after converting CustomerID
As you can see, the data type for CustomerID has effectively
changed to a category type.
Now, let's look at the second way of converting a single column into a
different type. In pandas, you can use the
astype() method and specify the new data type that it will
be converted into as its parameter. It will return a new column (a
new pandas series, to be more precise), so you need to
reassign it to the same column of the DataFrame. For instance, if you
want to change the InvoiceNo column into a categorical
variable, you would do the following:
df['InvoiceNo'] = df['InvoiceNo'].astype('category')
df.dtypes
You should get the following output:
Caption: The data types of each column after converting InvoiceNo
As you can see, the data type for InvoiceNo has changed to a
categorical variable. The difference between object and
category is that the latter has a finite number of possible
values (also called discrete variables). Once a column has been changed
into a categorical variable, pandas automatically keeps
track of the list of all its possible values. They can be accessed using the
.cat.categories attribute:
df['InvoiceNo'].cat.categories
You should get the following output:
Caption: List of categories (possible values) for the InvoiceNo categorical variable
pandas has identified that there are 25,900 different values
in this column and has listed all of them. Depending on the data type
that's assigned to a variable, pandas provides different
attributes and methods that are very handy for data transformation or
feature engineering (this will be covered in Lab 12, Feature
Engineering).
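For instance, categorical columns expose a .cat accessor and datetime columns a .dt accessor. The following is a small illustrative sketch (our own example) using standard pandas accessors on two columns from this dataset:

```
# Integer code assigned to each row's InvoiceNo category
print(df['InvoiceNo'].cat.codes.head())

# Year component of each invoice's datetime value
print(df['InvoiceDate'].dt.year.head())
```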
As a final note, you may be wondering when you would use the first way
of changing the types of certain columns (while loading the dataset). To
find out the current type of each variable, you must load the data
first, so why reload it with new data types when it would be easier to
change them with the astype() method after the first load? There are a
few reasons why you might still prefer the first option. One reason
could be that you have already explored the dataset in a different
tool, such as Excel, and already know what the correct data types are.
The second reason could be that your dataset is big, and you cannot load
it in its entirety. As you may have noticed, by default,
pandas uses 64-bit encoding for numerical variables. This
requires a lot of memory and may be overkill.
For example, the Quantity column has an int64 data type,
which means that the range of possible values is
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. However, in
Lab 10, Analyzing a Dataset, while analyzing the distribution of
this column, you learned that the range of values for this column is
only from -80,995 to 80,995. You don't need to use so much space. By
reducing the data type of this variable to int32 (which ranges from
-2,147,483,648 to 2,147,483,647), you may be able to reload the entire
dataset.
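As a hedged sketch of this idea (the exact memory savings depend on your system and pandas version), you could declare the smaller type at load time or downcast the column afterwards; df_small is our own variable name:

```
import numpy as np

# Option 1: declare the smaller type while loading the file
df_small = pd.read_excel(file_url, dtype={'Quantity': np.int32})

# Option 2: downcast a column that has already been loaded
df['Quantity'] = df['Quantity'].astype('int32')

# Inspect the column's memory footprint in bytes
print(df['Quantity'].memory_usage(deep=True))
```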
Exercise 11.02: Converting Data Types for the Ames Housing Dataset
In this exercise, you will prepare a dataset by converting its variables into the correct data types.
You will use the Ames Housing dataset to do this, which we also used in Lab 10, Analyzing a Dataset. For more information about this dataset, refer to the following note. Let's get started:
1. Open a new Jupyter notebook.

2. Import the `pandas` package:

    ```
    import pandas as pd
    ```

3. Assign the link to the Ames dataset to a variable called `file_url`:

    ```
    file_url = 'https://raw.githubusercontent.com/'\
               'fenago/data-science/'\
               'master/Lab10/dataset/ames_iowa_housing.csv'
    ```

4. Using the `read_csv` method from the `pandas` package, load the dataset into a new variable called `df`:

    ```
    df = pd.read_csv(file_url)
    ```

5. Print the data type of each column using the `dtypes` attribute:

    ```
    df.dtypes
    ```

    You should get the following output:

6. Using the `astype()` method, convert the `Id` column into a categorical variable, as shown in the following code snippet:

    ```
    df['Id'] = df['Id'].astype('category')
    ```

7. Convert the `MSSubClass`, `OverallQual`, and `OverallCond` columns into categorical variables, like we did in the previous step:

    ```
    df['MSSubClass'] = df['MSSubClass'].astype('category')
    df['OverallQual'] = df['OverallQual'].astype('category')
    df['OverallCond'] = df['OverallCond'].astype('category')
    ```

8. Create a `for` loop that will iterate through the four categorical columns (`Id`, `MSSubClass`, `OverallQual`, and `OverallCond`) and print their names and categories using the `.cat.categories` attribute:

    ```
    for col_name in ['Id', 'MSSubClass', 'OverallQual', \
                     'OverallCond']:
        print(col_name)
        print(df[col_name].cat.categories)
    ```

    You should get the following output:

9. Create a new DataFrame called `obj_df` that will only contain variables of the `object` type using the `select_dtypes` method along with the `include='object'` parameter:

    ```
    obj_df = df.select_dtypes(include='object')
    ```

10. Create a new variable called `obj_cols` that contains a list of column names from the `obj_df` DataFrame using the `.columns` attribute and display its content:

    ```
    obj_cols = obj_df.columns
    obj_cols
    ```

    You should get the following output:

    Caption: List of variables of the 'object' type

11. Like we did in Step 8, create a `for` loop that will iterate through the column names contained in `obj_cols` and print their names and unique values using the `unique()` method:

    ```
    for col_name in obj_cols:
        print(col_name)
        print(df[col_name].unique())
    ```

    You should get the following output:

    Caption: List of unique values for each variable of the 'object' type

    As you can see, all these columns have a finite number of unique values composed of text, which shows us that they are categorical variables.

12. Now, create a `for` loop that will iterate through the column names contained in `obj_cols` and convert each of them into a categorical variable using the `astype()` method:

    ```
    for col_name in obj_cols:
        df[col_name] = df[col_name].astype('category')
    ```

13. Print the data type of each column using the `dtypes` attribute:

    ```
    df.dtypes
    ```

    You should get the following output:
Handling Incorrect Values
Let's learn how to detect incorrect values in real life by using the
Online Retail dataset.
First, you need to load the data into a pandas DataFrame:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab10/dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
In this dataset, there are two variables that seem to be related to each
other: StockCode and Description. The first one
contains the identifier code of the items sold and the other one
contains their descriptions. However, if you look at some of the
examples, such as StockCode 23131, the
Description column has different values:
df.loc[df['StockCode'] == 23131, 'Description'].unique()
You should get the following output:
Let's focus on the misspelling issue. What we need to do here is modify
the incorrect spelling and replace it with the correct value. First,
let's create a new column called StockCodeDescription,
which is an exact copy of the Description column:
df['StockCodeDescription'] = df['Description']
You will use this new column to fix the misspelling issue. To do this,
use the subsetting technique you learned about earlier in this lab.
You need to use .loc and filter the rows and columns you
want, that is, all the rows with StockCode == 23131 and
StockCodeDescription == 'MISELTOE HEART WREATH CREAM', and the
StockCodeDescription column:
df.loc[(df['StockCode'] == 23131) \
& (df['StockCodeDescription'] \
== 'MISELTOE HEART WREATH CREAM'), \
'StockCodeDescription'] = 'MISTLETOE HEART WREATH CREAM'
If you reprint the values for this product, you will see that the misspelled value has been fixed and is not present anymore:
df.loc[df['StockCode'] == 23131, 'StockCodeDescription'].unique()
You should get the following output:
Caption: List of unique values for the Description column and StockCode 23131 after fixing the first misspelling issue
As you can see, there are still five different values for this product,
and one of them is spelled incorrectly: MISELTOE instead of
MISTLETOE.
This time, rather than looking at an exact match (a word must be the
same as another one), we will look at performing a partial match (part
of a word will be present in another word). In our case, instead of
looking at the spelling of MISELTOE, we will only look at
MISEL. The pandas package provides a method
called .str.contains() that we can use to look for
observations that partially match with a given expression.
Let's use this to see whether the same misspelling
(MISEL) occurs anywhere in the dataset. Since this method
doesn't handle missing values, you will also have to exclude the rows
that have missing values for the Description column. This
can be done by providing the na=False parameter to the
.str.contains() method:
df.loc[df['StockCodeDescription']\
.str.contains('MISEL', na=False),]
You should get the following output:
Caption: Displaying all the rows containing the misspelling 'MISELTOE'
This misspelling issue (MISELTOE) is not only related to
StockCode 23131, but also to other items. You will need to
fix all of these using the str.replace() method. This method
takes the string of characters to be replaced and the replacement string
as parameters:
df['StockCodeDescription'] = df['StockCodeDescription']\
.str.replace\
('MISELTOE', 'MISTLETOE')
Now, if you print all the rows that contain the misspelling of
MISEL, you will see that no such rows exist anymore:
df.loc[df['StockCodeDescription']\
.str.contains('MISEL', na=False),]
You should get the following output:
Exercise 11.03: Fixing Incorrect Values in the State Column
In this exercise, you will clean the State variable in a
modified version of a dataset that lists finance officers in the
USA. We are doing this because the dataset contains some incorrect
values. Let's get started:
1. Open a new Jupyter notebook.

2. Import the `pandas` package:

    ```
    import pandas as pd
    ```

3. Assign the link to the dataset to a variable called `file_url`:

    ```
    file_url = 'https://raw.githubusercontent.com/'\
               'fenago/data-science/'\
               'master/Lab11/dataset/officers.csv'
    ```

4. Using the `read_csv()` method from the `pandas` package, load the dataset into a new variable called `df`:

    ```
    df = pd.read_csv(file_url)
    ```

5. Print the first five rows of the DataFrame using the `.head()` method:

    ```
    df.head()
    ```

    You should get the following output:

    Caption: The first five rows of the finance officers dataset

6. Print out all the unique values of the `State` variable:

    ```
    df['State'].unique()
    ```

    You should get the following output:

    Caption: List of unique values in the State column

7. Print out the rows that have the `il` value in the `State` column using the `pandas` `.str.contains()` method and the subsetting API, that is, DataFrame[condition]. You will also have to set the `na` parameter to `False` in `str.contains()` in order to exclude observations with missing values:

    ```
    df[df['State'].str.contains('il', na=False)]
    ```

    You should get the following output:

    Caption: Observations with a value of il

8. Now, create a `for` loop that will iterate through the following values in the `State` column: `Il`, `iL`, `Il`. Then, for each of them, print out the values of the `City` and `State` variables using the `pandas` subsetting API, that is, DataFrame.loc[row_condition, column_condition]:

    ```
    for state in ['Il', 'iL', 'Il']:
        print(df.loc[df['State'] == state, ['City', 'State']])
    ```

    You should get the following output:

    Caption: Observations with the il value

9. Create a condition mask (`il_mask`) to subset all the rows that contain the four incorrect values (`il`, `Il`, `iL`, and `Il`) by using the `isin()` method and a list of these values as a parameter. Then, save the result into a variable called `il_mask`:

    ```
    il_mask = df['State'].isin(['il', 'Il', 'iL', 'Il'])
    ```

10. Print the number of rows that match the condition we set in `il_mask` using the `.sum()` method. This will sum all the rows that have a value of `True` (they match the condition):

    ```
    il_mask.sum()
    ```

    You should get the following output:

    ```
    672
    ```

11. Using the `pandas` `.loc` API, subset the rows with the `il_mask` condition mask and replace the value of the `State` column with `IL`:

    ```
    df.loc[il_mask, 'State'] = 'IL'
    ```

12. Print out all the unique values of the `State` variable once more:

    ```
    df['State'].unique()
    ```

    You should get the following output:

    Caption: List of unique values for the 'State' column

    As you can see, the four incorrect values are not present anymore. Let's have a look at the other remaining incorrect values: `II`, `I`, `8I`, and `60`. We will look at dealing with `II` in the next step.

13. Print out the rows that have a value of `II` in the `State` column using the `pandas` subsetting API, that is, DataFrame.loc[row_condition, column_condition]:

    ```
    df.loc[df['State'] == 'II',]
    ```

    You should get the following output:

    Caption: Subsetting the rows with a value of II in the State column

    There are only two cases where the `II` value has been used for the `State` column and both have `Bloomington` as the city, which is in Illinois. Here, the correct `State` value should be `IL`.

14. Now, create a `for` loop that iterates through the three incorrect values (`I`, `8I`, and `60`) and print out the subsetted rows using the same logic that we used in the previous step. Only display the `City` and `State` columns:

    ```
    for val in ['I', '8I', '60']:
        print(df.loc[df['State'] == val, ['City', 'State']])
    ```

    You should get the following output:

    Caption: Observations with incorrect values (I, 8I, and 60)

    All the observations that have incorrect values are cities in Illinois. Let's fix them now.

15. Create a `for` loop that iterates through the four incorrect values (`II`, `I`, `8I`, and `60`) and reuse the same subsetting logic to replace the value in `State` with `IL`:

    ```
    for val in ['II', 'I', '8I', '60']:
        df.loc[df['State'] == val, 'State'] = 'IL'
    ```

16. Print out all the unique values of the `State` variable:

    ```
    df['State'].unique()
    ```

    You should get the following output:

    Caption: List of unique values for the State column

    You fixed the issues for the state of Illinois. However, there are two more incorrect values in this column: `In` and `ng`.

17. Repeat the `for` loop from Step 14, but iterate through the `In` and `ng` values instead:

    ```
    for val in ['In', 'ng']:
        print(df.loc[df['State'] == val, ['City', 'State']])
    ```

    You should get the following output:

    Caption: Observations with incorrect values (In, ng)

    The rows that have the `ng` value in `State` are missing values. We will cover this topic in the next section. The observation that has `In` as its `State` is a city in Indiana, so the correct value should be `IN`. Let's fix this.

18. Subset the rows containing the `In` value in `State` using the `.loc` and `.str.contains()` methods and replace the state value with `IN`. Don't forget to specify the `na=False` parameter to `.str.contains()`:

    ```
    df.loc[df['State']\
       .str.contains('In', na=False), 'State'] = 'IN'
    ```

    Print out all the unique values of the `State` variable:

    ```
    df['State'].unique()
    ```

    You should get the following output:
Caption: List of unique values for the State column
You just fixed all the incorrect values for the State
variable using the methods provided by the pandas package.
In the next section, we are going to look at handling missing values.
Handling Missing Values
The pandas package provides a method that we can use to
identify missing values in a DataFrame: .isna(). Let's see
it in action on the Online Retail dataset. First, you need
to import pandas and load the data into a DataFrame:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab10/dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
The .isna() method returns an object of the same shape as
the DataFrame, with a binary value for each cell stating whether the
value is missing (True) or not (False):
df.isna()
You should get the following output:
Caption: Output of the .isna() method
As we saw previously, we can give the output of a binary variable to the
.sum() method, which will add all the True
values together (cells that have missing values) and provide a summary
for each column:
df.isna().sum()
You should get the following output:
Caption: Summary of missing values for each variable
As you can see, there are 1,454 missing values in the
Description column and 135,080 in the
CustomerID column. Let's have a look at the observations
with missing values for Description. You can use the output of the
.isna() method to subset the rows with missing values:
df[df['Description'].isna()]
You should get the following output:
Caption: Subsetting the rows with missing values for Description
The pandas package provides a method that we can use to
easily remove missing values: .dropna(). This method returns
a new DataFrame without the rows that have missing values. By
default, it looks at all the columns. You can restrict it to a list of
columns with the subset parameter:
df.dropna(subset=['Description'])
This method returns a new DataFrame with no missing values for the
specified columns. If you want to replace the original dataset directly,
you can use the inplace=True parameter:
df.dropna(subset=['Description'], inplace=True)
Now, look at the summary of the missing values for each variable:
df.isna().sum()
You should get the following output:
Caption: Summary of missing values for each variable
As you can see, there are no more missing values in the
Description column. Let's have a look at the
CustomerID column:
df[df['CustomerID'].isna()]
You should get the following output:
Most algorithms require a value for each observation, so you
need to provide one for these cases. We will use the
.fillna() method from pandas to do this. Provide
the value to be imputed as Missing and use
inplace=True as a parameter:
df['CustomerID'].fillna('Missing', inplace=True)
df[1443:1448]
You should get the following output:
Caption: Examples of rows where missing values for CustomerID have been replaced with Missing
Let's see if we have any missing values in the dataset:
df.isna().sum()
You should get the following output:
Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
In this exercise, you will be cleaning out all the missing values for
all the numerical variables in the Horse Colic dataset.
Colic is a painful condition that horses can suffer from, and this dataset contains various pieces of information related to specific cases of this condition. You can use the link provided in the Note section if you want to find out more about the dataset's attributes. Let's get started:
1. Open a new Jupyter notebook.

2. Import the `pandas` package:

    ```
    import pandas as pd
    ```

3. Assign the link to the dataset to a variable called `file_url`:

    ```
    file_url = 'http://raw.githubusercontent.com/'\
               'fenago/data-science/'\
               'master/Lab11/dataset/horse-colic.data'
    ```

4. Using the `.read_csv()` method from the `pandas` package, load the dataset into a new variable called `df` and specify the `header=None`, `sep='\s+'`, and `prefix='X'` parameters:

    ```
    df = pd.read_csv(file_url, header=None, \
                     sep='\s+', prefix='X')
    ```

5. Print the first five rows of the DataFrame using the `.head()` method:

    ```
    df.head()
    ```

    You should get the following output:

    Caption: The first five rows of the Horse Colic dataset

6. Reload the dataset into a `pandas` DataFrame using the `.read_csv()` method, but this time, add the `na_values='?'` parameter in order to specify that this value needs to be treated as a missing value:

    ```
    df = pd.read_csv(file_url, header=None, sep='\s+', \
                     prefix='X', na_values='?')
    ```

7. Print the first five rows of the DataFrame using the `.head()` method:

    ```
    df.head()
    ```

    You should get the following output:

    Caption: The first five rows of the Horse Colic dataset

    Now, you can see that `pandas` has converted all the `?` values into missing values.

8. Print the data type of each column using the `dtypes` attribute:

    ```
    df.dtypes
    ```

    You should get the following output:

    Caption: Data type of each column

9. Print the number of missing values for each column by combining the `.isna()` and `.sum()` methods:

    ```
    df.isna().sum()
    ```

    You should get the following output:

    Caption: Number of missing values for each column

10. Create a condition mask called `x0_mask` so that you can find the missing values in the `X0` column using the `.isna()` method:

    ```
    x0_mask = df['X0'].isna()
    ```

11. Display the number of missing values for this column by using the `.sum()` method on `x0_mask`:

    ```
    x0_mask.sum()
    ```

    You should get the following output:

    ```
    1
    ```

    Here, you got the exact same number of missing values for `X0` that you did in Step 9.

12. Extract the median of `X0` using the `.median()` method, store it in a new variable called `x0_median`, and print its value:

    ```
    x0_median = df['X0'].median()
    print(x0_median)
    ```

    You should get the following output:

    ```
    1.0
    ```

    The median value for this column is `1`. You will replace all the missing values in the `X0` column with this value.

13. Replace all the missing values in the `X0` variable with the median using the `.fillna()` method, along with the `inplace=True` parameter:

    ```
    df['X0'].fillna(x0_median, inplace=True)
    ```

14. Print the number of missing values for `X0` by combining the `.isna()` and `.sum()` methods:

    ```
    df['X0'].isna().sum()
    ```

    You should get the following output:

    ```
    0
    ```

    There are no more missing values in this variable.

15. Create a `for` loop that will iterate through all the columns of the DataFrame. Within it, calculate the median of each column and save it into a variable called `col_median`, impute missing values with this median value using the `.fillna()` method along with the `inplace=True` parameter, and print the name of the column and its median value:

    ```
    for col_name in df.columns:
        col_median = df[col_name].median()
        df[col_name].fillna(col_median, inplace=True)
        print(col_name)
        print(col_median)
    ```

    You should get the following output:

    Caption: Median values for each column

16. Print the number of missing values for each column by combining the `.isna()` and `.sum()` methods:

    ```
    df.isna().sum()
    ```

    You should get the following output:

    Caption: Number of missing values for each column
You have successfully fixed the missing values for all the numerical
variables using the methods provided by the pandas package:
.isna() and .fillna().
Activity 11.01: Preparing the Speed Dating Dataset
As an entrepreneur, you are planning to launch a new dating app into the market. The key feature that will differentiate your app from its competitors will be your high-performing user-matching algorithm. Before building this model, you have partnered with a speed dating company to collect data from real events. You have just received the dataset from your partner company but realized it is not as clean as you expected: there are missing and incorrect values. Your task is to fix the main data quality issues in this dataset.
The following steps will help you complete this activity:
1. Download and load the dataset into Python using `.read_csv()`.

2. Print out the dimensions of the DataFrame using `.shape`.

3. Check for duplicate rows by using `.duplicated()` and `.sum()` on all the columns.

4. Check for duplicate rows by using `.duplicated()` and `.sum()` on the identifier columns (`iid`, `id`, `partner`, and `pid`).

5. Check for unexpected values for the following numerical variables: `imprace`, `imprelig`, `sports`, `tvsports`, `exercise`, `dining`, `museums`, `art`, `hiking`, `gaming`, `clubbing`, `reading`, `tv`, `theater`, `movies`, `concerts`, `music`, `shopping`, and `yoga`.

6. Replace the identified incorrect values.

7. Check the data type of the different columns using `.dtypes`.

8. Change the data types to categorical for the columns that don't contain numerical values using `.astype()`.

9. Check for any missing values using `.isna()` and `.sum()` for each numerical variable.

10. Replace the missing values of each numerical variable with its corresponding mean or median value using `.fillna()`, `.mean()`, and `.median()`.
You should get the following output. The figure represents the number of
rows with unexpected values for imprace and a list of
unexpected values:
The following figure illustrates the number of rows with unexpected values and a list of unexpected values for each column:
The following figure illustrates a list of unique values for gaming:
The following figure displays the data types of each column:
The following figure displays the updated data types of each column:
The following figure displays the number of missing values for numerical variables:
The following figure displays the list of unique values for
int_corr:
The following figure displays the list of unique values for numerical variables:
The following figure displays the number of missing values for numerical variables:
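If you need a nudge to get started, here is a minimal sketch of the first few steps only. The dataset URL below is an assumption based on the repository pattern used in the other exercises; replace it with the link provided for this activity:

```
import pandas as pd

# Hypothetical URL following the repository pattern of the other labs
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab11/dataset/speed-dating.csv'
df = pd.read_csv(file_url)

print(df.shape)                # Step 2: dimensions of the DataFrame
print(df.duplicated().sum())   # Step 3: duplicates across all columns
# Step 4: duplicates on the identifier columns only
print(df.duplicated(subset=['iid', 'id', 'partner', 'pid']).sum())
```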
Summary
In this lab, you learned how important it is to prepare any given dataset and fix its main quality issues. This is critical because the cleaner a dataset is, the easier it will be for any machine learning model to learn the relevant patterns. On top of this, most algorithms can't handle issues such as missing values, so these must be dealt with prior to the modeling phase. In this lab, you covered the most frequent issues faced in data science projects: duplicate rows, incorrect data types, unexpected values, and missing values.