LabĀ 11. Data Preparation ==================== Overview By the end of this lab you will be able to filter DataFrames with specific conditions; remove duplicate or irrelevant records or columns; convert variables into different data types; replace values in a column and handle missing values and outlier observations. This lab will introduce you to the main techniques you can use to handle data issues in order to achieve high quality for your dataset prior to modeling it. Handling Row Duplication ======================== Let\'s use the example that we saw in *Lab 10*, *Analyzing a Dataset*. Start by **importing** the dataset into a DataFrame: ``` import pandas as pd file_url = 'https://github.com/fenago/'\ 'data-science/blob/'\ 'master/Lab10/dataset/'\ 'Online%20Retail.xlsx?raw=true' df = pd.read_excel(file_url) ``` The `duplicated()` method from `pandas` checks whether any of the rows are duplicates and returns a **boolean** value for each row, `True` if the row is a duplicate and `False` if not: ``` df.duplicated() ``` You should get the following output: ![](./images/B15019_11_01.jpg) Caption: Output of the duplicated() method In Python, the `True` and `False` binary values correspond to the numerical values 1 and 0, respectively. To find out how many rows have been identified as duplicates, you can use the `sum()` method on the output of `duplicated()`. This will add all the 1s (that is, `True` values) and gives us the count of duplicates: ``` df.duplicated().sum() ``` You should get the following output: ``` 5268 ``` The first API subsets the DataFrame by **row** or **column**. To filter specific columns, you can provide a list that contains their names. For instance, if you want to keep only the variables, that is, `InvoiceNo`, `StockCode`, `InvoiceDate`, and `CustomerID`, you need to use the following code: ``` df[['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']] ``` You should get the following output: ![](./images/B15019_11_02.jpg) Caption: Subsetting columns If you only want to filter the rows that are considered duplicates, you can use the same API call with the output of the `duplicated()` method. It will only keep the rows with **True** as a value: ``` df[df.duplicated()] ``` You should get the following output: ![](./images/B15019_11_03.jpg) Caption: Subsetting the duplicated rows If you want to subset the rows and columns at the same time, you must use one of the other two available APIs: `.loc` or `.iloc`. These APIs do the exact same thing but `.loc` uses labels or names while `.iloc` only takes indices as input. You will use the `.loc` API to subset the duplicated rows and keep only the selected four columns, as shown in the previous example: ``` df.loc[df.duplicated(), ['InvoiceNo', 'StockCode', \ 'InvoiceDate', 'CustomerID']] ``` You should get the following output: ![](./images/B15019_11_04.jpg) Caption: Subsetting the duplicated rows and selected columns using .loc This preceding output shows that the first few duplicates are row numbers `517`, `527`, `537`, and so on. By default, `pandas` doesn\'t mark the first occurrence of duplicates as duplicates: all the same, duplicates will have a value of `True` except for the first occurrence. You can change this behavior by specifying the `keep` parameter. If you want to keep the last duplicate, you need to specify `keep='last'`: ``` df.loc[df.duplicated(keep='last'), ['InvoiceNo', 'StockCode', \ 'InvoiceDate', 'CustomerID']] ``` You should get the following output: ![](./images/B15019_11_05.jpg) Caption: Subsetting the last duplicated rows As you can see from the previous outputs, row `485` has the same value as row `539`. As expected, row `539` is not marked as a duplicate anymore. If you want to mark all the duplicate records as duplicates, you will have to use `keep=False`: ``` df.loc[df.duplicated(keep=False), ['InvoiceNo', 'StockCode',\ 'InvoiceDate', 'CustomerID']] ``` You should get the following output: ![](./images/B15019_11_06.jpg) Caption: Subsetting all the duplicated rows You can use the `drop_duplicates()` method from `pandas`. It has the same `keep` parameter as `duplicated()`, which specifies which duplicated record you want to keep or if you want to remove all of them. In this case, we want to keep at least one duplicate row. Here, we want to keep the first occurrence: ``` df.drop_duplicates(keep='first') ``` You should get the following output: ![](./images/B15019_11_07.jpg) Caption: Dropping duplicate rows with keep=\'first\' The `drop_duplicates()` and `duplicated()` methods also have another very useful parameter: `subset`. This parameter allows you to specify the list of columns to consider while looking for duplicates. By default, all the columns of a DataFrame are used to find duplicate rows. Let\'s see how many duplicate rows there are while only looking at the `InvoiceNo`, `StockCode`, `invoiceDate`, and `CustomerID` columns: ``` df.duplicated(subset=['InvoiceNo', 'StockCode', 'InvoiceDate',\ 'CustomerID'], keep='first').sum() ``` You should get the following output: ``` 10677 ``` Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset -------------------------------------------------------------- In this exercise, you will learn how to identify duplicate records and how to handle such issues so that the dataset only contains **unique** records. Let\'s get started: 1. Open a new **Jupyter** notebook. 2. Import the `pandas` package: ``` import pandas as pd ``` 3. Assign the link to the `Breast Cancer` dataset to a variable called `file_url`: ``` file_url = 'https://raw.githubusercontent.com/'\ 'fenago/data-science/'\ 'master/Lab11/dataset/'\ 'breast-cancer-wisconsin.data' ``` 4. Using the `read_csv()` method from the `pandas` package, load the dataset into a new variable called `df` with the `header=None` parameter. We\'re doing this because this file doesn\'t contain column names: ``` df = pd.read_csv(file_url, header=None) ``` 5. Create a variable called `col_names` that contains the names of the columns: `Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses`, and `Class`: ``` col_names = ['Sample code number','Clump Thickness',\ 'Uniformity of Cell Size',\ 'Uniformity of Cell Shape',\ 'Marginal Adhesion','Single Epithelial Cell Size',\ 'Bare Nuclei','Bland Chromatin',\ 'Normal Nucleoli','Mitoses','Class'] ``` 6. Assign the column names of the DataFrame using the `columns` attribute: ``` df.columns = col_names ``` 7. Display the shape of the DataFrame using the `.shape` attribute: ``` df.shape ``` You should get the following output: ``` (699, 11) ``` This DataFrame contains `699` rows and `11` columns. 8. Display the first five rows of the DataFrame using the `head()` method: ``` df.head() ``` You should get the following output: ![](./images/B15019_11_08.jpg) Caption: The first five rows of the Breast Cancer dataset All the variables are numerical. The Sample code number column is an identifier for the measurement samples. 9. Find the number of duplicate rows using the `duplicated()` and `sum()` methods: ``` df.duplicated().sum() ``` You should get the following output: ``` 8 ``` Looking at the 11 columns in this dataset, we can see that there are `8` duplicate rows. 10. Display the duplicate rows using the `loc()` and `duplicated()` methods: ``` df.loc[df.duplicated()] ``` You should get the following output: ![](./images/B15019_11_09.jpg) Caption: Duplicate records The following rows are duplicates: `208`, `253`, `254`, `258`, `272`, `338`, `561`, and `684`. 11. Display the duplicate rows just like we did in *Step 9*, but with the `keep='last'` parameter instead: ``` df.loc[df.duplicated(keep='last')] ``` You should get the following output: ![](./images/B15019_11_10.jpg) Caption: Duplicate records with keep=\'last\' By using the `keep='last'` parameter, the following rows are considered duplicates: `42`, `62`, `168`, `207`, `267`, `314`, `560`, and `683`. By comparing this output to the one from the previous step, we can see that rows 253 and 42 are identical. 12. Remove the duplicate rows using the `drop_duplicates()` method along with the `keep='first'` parameter and save this into a new DataFrame called `df_unique`: ``` df_unique = df.drop_duplicates(keep='first') ``` 13. Display the shape of `df_unique` with the `.shape` attribute: ``` df_unique.shape ``` You should get the following output: ``` (691, 11) ``` Now that we have removed the eight duplicate records, only `691` rows remain. Now, the dataset only contains unique observations. In this exercise, you learned how to identify and remove duplicate records from a real-world dataset. Converting Data Types ===================== Another problem you may face in a project is incorrect data types being inferred for some columns. As we saw in *Lab 10*, *Analyzing a Dataset*, the `pandas` package provides us with a very easy way to display the data type of each column using the `.dtypes` attribute. You may be wondering, when did `pandas` identify the type of each column? The types are detected when you load the dataset into a `pandas` DataFrame using methods such as `read_csv()`, `read_excel()`, and so on. When you\'ve done this, `pandas` will try its best to automatically find the best type according to the values contained in each column. Let\'s see how this works on the `Online Retail` dataset. First, you must import `pandas`: ``` import pandas as pd ``` Then, you need to assign the URL to the dataset to a new variable: ``` file_url = 'https://github.com/fenago/'\ 'data-science/blob/'\ 'master/Lab10/dataset/'\ 'Online%20Retail.xlsx?raw=true' ``` Let\'s load the dataset into a `pandas` DataFrame using `read_excel()`: ``` df = pd.read_excel(file_url) ``` Finally, let\'s print the data type of each column: ``` df.dtypes ``` You should get the following output: ![Caption: The data type of each column of the Online Retail dataset ](./images/B15019_11_11.jpg) Caption: The data type of each column of the Online Retail dataset The preceding output shows the data types that have been assigned to each column. `Quantity`, `UnitPrice`, and `CustomerID` have been identified as numerical variables (`int64`, `float64`), `InvoiceDate` is a `datetime` variable, and all the other columns are considered text (`object`). This is not too bad. `pandas` did a great job of recognizing non-text columns. But what if you want to change the types of some columns? You have two ways to achieve this. The first way is to reload the dataset, but this time, you will need to specify the data types of the columns of interest using the `dtype` parameter. This parameter takes a dictionary with the column names as keys and the correct data types as values, such as {\'col1\': np.float64, \'col2\': np.int32}, as input. Let\'s try this on `CustomerID`. We know this isn\'t a numerical variable as it contains a unique **identifier** (code). Here, we are going to change its type to **object**: ``` df = pd.read_excel(file_url, dtype={'CustomerID': 'category'}) df.dtypes ``` You should get the following output: ![](./images/B15019_11_12.jpg) Caption: The data types of each column after converting CustomerID As you can see, the data type for `CustomerID` has effectively changed to a `category` type. Now, let\'s look at the second way of converting a single column into a different type. In `pandas`, you can use the `astype()` method and specify the new data type that it will be converted into as its **parameter**. It will return a new column (a new `pandas` series, to be more precise), so you need to reassign it to the same column of the DataFrame. For instance, if you want to change the `InvoiceNo` column into a categorical variable, you would do the following: ``` df['InvoiceNo'] = df['InvoiceNo'].astype('category') df.dtypes ``` You should get the following output: ![](./images/B15019_11_13.jpg) Caption: The data types of each column after converting InvoiceNo As you can see, the data type for `InvoiceNo` has changed to a categorical variable. The difference between `object` and `category` is that the latter has a finite number of possible values (also called discrete variables). Once these have been changed into categorical variables, `pandas` will automatically list all the values. They can be accessed using the `.cat.categories` attribute: ``` df['InvoiceNo'].cat.categories ``` You should get the following output: ![Caption: List of categories (possible values) for the InvoiceNo categorical variable ](./images/B15019_11_14.jpg) Caption: List of categories (possible values) for the InvoiceNo categorical variable `pandas` has identified that there are 25,900 different values in this column and has listed all of them. Depending on the data type that\'s assigned to a variable, `pandas` provides different attributes and methods that are very handy for data transformation or feature engineering (this will be covered in *Lab 12*, *Feature Engineering*). As a final note, you may be wondering when you would use the first way of changing the types of certain columns (while loading the dataset). To find out the current type of each variable, you must load the data first, so why will you need to reload the data again with new data types? It will be easier to change the type with the `astype()` method after the first load. There are a few reasons why you would use it. One reason could be that you have already explored the dataset on a different tool, such as Excel, and already know what the correct data types are. The second reason could be that your dataset is big, and you cannot load it in its entirety. As you may have noticed, by default, `pandas` use 64-bit encoding for numerical variables. This requires a lot of memory and may be overkill. For example, the `Quantity` column has an int64 data type, which means that the range of possible values is -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. However, in *Lab 10*, *Analyzing a Dataset* while analyzing the distribution of this column, you learned that the range of values for this column is only from -80,995 to 80,995. You don\'t need to use so much space. By reducing the data type of this variable to int32 (which ranges from -2,147,483,648 to 2,147,483,647), you may be able to reload the entire dataset. Exercise 11.02: Converting Data Types for the Ames Housing Dataset ------------------------------------------------------------------ In this exercise, you will prepare a dataset by converting its variables into the correct data types. You will use the Ames Housing dataset to do this, which we also used in *Lab 10*, *Analyzing a Dataset*. For more information about this dataset, refer to the following note. Let\'s get started: 1. Open a new Jupyter notebook. 2. Import the `pandas` package: ``` import pandas as pd ``` 3. Assign the link to the Ames dataset to a variable called `file_url`: ``` file_url = 'https://raw.githubusercontent.com/'\ 'fenago/data-science/'\ 'master/Lab10/dataset/ames_iowa_housing.csv' ``` 4. Using the `read_csv` method from the `pandas` package, load the dataset into a new variable called `df`: ``` df = pd.read_csv(file_url) ``` 5. Print the data type of each column using the `dtypes` attribute: ``` df.dtypes ``` You should get the following output: ![](./images/B15019_11_15.jpg) 6. Using the `astype()` method, convert the `'Id'` column into a categorical variable, as shown in the following code snippet: ``` df['Id'] = df['Id'].astype('category') ``` 7. Convert the `'MSSubClass'`, `'OverallQual'`, and `'OverallCond'` columns into categorical variables, like we did in the previous step: ``` df['MSSubClass'] = df['MSSubClass'].astype('category') df['OverallQual'] = df['OverallQual'].astype('category') df['OverallCond'] = df['OverallCond'].astype('category') ``` 8. Create a for loop that will iterate through the four categorical columns `('Id', 'MSSubClass', 'OverallQual', `and` 'OverallCond'`) and print their names and categories using the `.cat.categories` attribute: ``` for col_name in ['Id', 'MSSubClass', 'OverallQual', \ 'OverallCond']: print(col_name) print(df[col_name].cat.categories) ``` You should get the following output: ![](./images/B15019_11_16.jpg) 9. Create a new DataFrame called `obj_df` that will only contain variables of the `object` type using the `select_dtypes` method along with the `include='object'` parameter: ``` obj_df = df.select_dtypes(include='object') ``` 10. Create a new variable called `obj_cols` that contains a list of column names from the `obj_df` DataFrame using the `.columns` attribute and display its content: ``` obj_cols = obj_df.columns obj_cols ``` You should get the following output: ![](./images/B15019_11_17.jpg) Caption: List of variables of the \'object\' type 11. Like we did in *Step 8*, create a `for` loop that will iterate through the column names contained in `obj_cols` and print their names and unique values using the `unique()` method: ``` for col_name in obj_cols: print(col_name) print(df[col_name].unique()) ``` You should get the following output: ![Caption: List of unique values for each variable of the \'object\' type ](./images/B15019_11_18.jpg) Caption: List of unique values for each variable of the \'object\' type As you can see, all these columns have a finite number of unique values that are composed of text, which shows us that they are categorical variables. 12. Now, create a `for` loop that will iterate through the column names contained in `obj_cols` and convert each of them into a categorical variable using the `astype()` method: ``` for col_name in obj_cols: df[col_name] = df[col_name].astype('category') ``` 13. Print the data type of each column using the `dtypes` attribute: ``` df.dtypes ``` You should get the following output: ![](./images/B15019_11_19.jpg) Handling Incorrect Values ========================= Let\'s learn how to detect such issues in real life by using the `Online Retail` dataset. First, you need to load the data into a `pandas` DataFrame: ``` import pandas as pd file_url = 'https://github.com/fenago/'\ 'data-science/blob/'\ 'master/Lab10/dataset/'\ 'Online%20Retail.xlsx?raw=true' df = pd.read_excel(file_url) ``` In this dataset, there are two variables that seem to be related to each other: `StockCode` and `Description`. The first one contains the identifier code of the items sold and the other one contains their descriptions. However, if you look at some of the examples, such as `StockCode 23131`, the `Description` column has different values: ``` df.loc[df['StockCode'] == 23131, 'Description'].unique() ``` You should get the following output ![](./images/B15019_11_20.jpg) Let\'s focus on the misspelling issue. What we need to do here is modify the incorrect spelling and replace it with the correct value. First, let\'s create a new column called `StockCodeDescription`, which is an exact copy of the `Description` column: ``` df['StockCodeDescription'] = df['Description'] ``` You will use this new column to fix the misspelling issue. To do this, use the subsetting technique you learned about earlier in this lab. You need to use `.loc` and filter the rows and columns you want, that is, all rows with `StockCode == 21131` and `Description == MISELTOE HEART WREATH CREAM` and the `Description` column: ``` df.loc[(df['StockCode'] == 23131) \ & (df['StockCodeDescription'] \ == 'MISELTOE HEART WREATH CREAM'), \ 'StockCodeDescription'] = 'MISTLETOE HEART WREATH CREAM' ``` If you reprint the value for this issue, you will see that the misspelling value has been fixed and is not present anymore: ``` df.loc[df['StockCode'] == 23131, 'StockCodeDescription'].unique() ``` You should get the following output: ![](./images/B15019_11_21.jpg) Caption: List of unique values for the Description column and StockCode 23131 after fixing the first misspelling issue As you can see, there are still five different values for this product, but for one of them, that is, `MISTLETOE`, has been spelled incorrectly: `MISELTOE`. This time, rather than looking at an exact match (a word must be the same as another one), we will look at performing a partial match (part of a word will be present in another word). In our case, instead of looking at the spelling of `MISELTOE`, we will only look at `MISEL`. The `pandas` package provides a method called `.str.contains()` that we can use to look for observations that partially match with a given expression. Let\'s use this to see if we have the same misspelling issue (`MISEL`) in the entire dataset. You will need to add one additional parameter since this method doesn\'t handle missing values. You will also have to subset the rows that don\'t have missing values for the `Description` column. This can be done by providing the `na=False` parameter to the `.str.contains()` method: ``` df.loc[df['StockCodeDescription']\ .str.contains('MISEL', na=False),] ``` You should get the following output: ![](./images/B15019_11_22.jpg) Caption: Displaying all the rows containing the misspelling \'MISELTOE\' This misspelling issue (`MISELTOE`) is not only related to `StockCode 23131`, but also to other items. You will need to fix all of these using the `str.replace()` method. This method takes the string of characters to be replaced and the replacement string as parameters: ``` df['StockCodeDescription'] = df['StockCodeDescription']\ .str.replace\ ('MISELTOE', 'MISTLETOE') ``` Now, if you print all the rows that contain the misspelling of `MISEL`, you will see that no such rows exist anymore: ``` df.loc[df['StockCodeDescription']\ .str.contains('MISEL', na=False),] ``` You should get the following output ![](./images/B15019_11_23.jpg) Exercise 11.03: Fixing Incorrect Values in the State Column ----------------------------------------------------------- In this exercise, you will clean the `State` variable in a modified version of a dataset by listing all the finance officers in the USA. We are doing this because the dataset contains some incorrect values. Let\'s get started: 1. Open a new Jupyter notebook. 2. Import the `pandas` package: ``` import pandas as pd ``` 3. Assign the link to the dataset to a variable called `file_url`: ``` file_url = 'https://raw.githubusercontent.com/'\ 'fenago/data-science/'\ 'master/Lab11/dataset/officers.csv' ``` 4. Using the `read_csv()` method from the `pandas` package, load the dataset into a new variable called `df`: ``` df = pd.read_csv(file_url) ``` 5. Print the first five rows of the DataFrame using the `.head()` method: ``` df.head() ``` You should get the following output: ![](./images/B15019_11_24.jpg) Caption: The first five rows of the finance officers dataset 6. Print out all the unique values of the `State` variable: ``` df['State'].unique() ``` You should get the following output: ![](./images/B15019_11_25.jpg) Caption: List of unique values in the State column 7. Print out the rows that have the `il` value in the `State` column using the `pandas` `.str.contains()` method and the subsetting API, that is, DataFrame \[condition\]. You will also have to set the `na` parameter to `False` in `str.contains()` in order to exclude observations with missing values: ``` df[df['State'].str.contains('il', na=False)] ``` You should get the following output: ![](./images/B15019_11_26.jpg) Caption: Observations with a value of il 8. Now, create a `for` loop that will iterate through the following values in the `State` column: `Il`, `iL`, `Il`. Then, print out the values of the City and State variables using the `pandas` method for subsetting, that is, `.loc()`: DataFrame.loc\[row\_condition, column condition\]. Do this for each observation: ``` for state in ['Il', 'iL', 'Il']: print(df.loc[df['State'] == state, ['City', 'State']]) ``` You should get the following output: ![](./images/B15019_11_27.jpg) Caption: Observations with the il value 9. Create a condition mask (`il_mask`) to subset all the rows that contain the four incorrect values (`il`, `Il`, `iL`, and `Il`) by using the `isin()` method and a list of these values as a parameter. Then, save the result into a variable called `il_mask`: ``` il_mask = df['State'].isin(['il', 'Il', 'iL', 'Il']) ``` 10. Print the number of rows that match the condition we set in `il_mask` using the `.sum()` method. This will sum all the rows that have a value of `True` (they match the condition): ``` il_mask.sum() ``` You should get the following output: ``` 672 ``` 11. Using the `pandas` `.loc()` method, subset the rows with the `il_mask` condition mask and replace the value of the `State` column with `IL`: ``` df.loc[il_mask, 'State'] = 'IL' ``` 12. Print out all the unique values of the `State` variable once more: ``` df['State'].unique() ``` You should get the following output: ![](./images/B15019_11_28.jpg) Caption: List of unique values for the \'State\' column As you can see, the four incorrect values are not present anymore. Let\'s have a look at the other remaining incorrect values: `II`, `I`, `8I`, and `60`. We will look at dealing `II` in the next step. Print out the rows that have a value of `II` into the `State` column using the `pandas` subsetting API, that is, DataFrame.loc\[row\_condition, column\_condition\]: ``` df.loc[df['State'] == 'II',] ``` You should get the following output: ![](./images/B15019_11_29.jpg) Caption: Subsetting the rows with a value of IL in the State column There are only two cases where the `II` value has been used for the `State` column and both have `Bloomington` as the city, which is in Illinois. Here, the correct `State` value should be `IL`. 13. Now, create a `for` loop that iterates through the three incorrect values (`I`, `8I`, and `60`) and print out the subsetted rows using the same logic that we used in *Step 12*. Only display the `City` and `State` columns: ``` for val in ['I', '8I', '60']: print(df.loc[df['State'] == val, ['City', 'State']]) ``` You should get the following output: ![](./images/B15019_11_30.jpg) Caption: Observations with incorrect values (I, 8I, and 60) All the observations that have incorrect values are cities in Illinois. Let\'s fix them now. 14. Create a `for` loop that iterates through the four incorrect values (`II`, `I`, `8I`, and `60`) and reuse the subsetting logic from *Step 12* to replace the value in `State` with `IL`: ``` for val in ['II', 'I', '8I', '60']: df.loc[df['State'] == val, 'State'] = 'IL' ``` 15. Print out all the unique values of the `State` variable: ``` df['State'].unique() ``` You should get the following output: ![](./images/B15019_11_31.jpg) Caption: List of unique values for the State column You fixed the issues for the state of Illinois. However, there are two more incorrect values in this column: `In` and `ng`. 16. Repeat *Step 13*, but iterate through the `In` and `ng` values instead: ``` for val in ['In', 'ng']: print(df.loc[df['State'] == val, ['City', 'State']]) ``` You should get the following output: ![](./images/B15019_11_32.jpg) Caption: Observations with incorrect values (In, ng) The rows that have the `ng` value in `State` are missing values. We will cover this topic in the next section. The observation that has `In` as its `State` is a city in Indiana, so the correct value should be `IN`. Let\'s fix this. 17. Subset the rows containing the `In` value in `State` using the `.loc()` and `.str.contains()` methods and replace the state value with `IN`. Don\'t forget to specify the `na=False` parameter as `.str.contains()`: ``` df.loc[df['State']\ .str.contains('In', na=False), 'State'] = 'IN' ``` Print out all the unique values of the `State` variable: ``` df['State'].unique() ``` You should get the following output: ![](./images/B15019_11_31.jpg) Caption: List of unique values for the State column You just fixed all the incorrect values for the `State` variable using the methods provided by the `pandas` package. In the next section, we are going to look at handling missing values. Handling Missing Values ======================= The `pandas` package provides a method that we can use to identify missing values in a DataFrame: `.isna()`. Let\'s see it in action on the `Online Retail` dataset. First, you need to import `pandas` and load the data into a DataFrame: ``` import pandas as pd file_url = 'https://github.com/fenago/'\ 'data-science/blob/'\ 'master/Lab10/dataset/'\ 'Online%20Retail.xlsx?raw=true' df = pd.read_excel(file_url) ``` The `.isna()` method returns a `pandas` series with a binary value for each cell of a DataFrame and states whether it is missing a value (`True`) or not (`False`): ``` df.isna() ``` You should get the following output: ![](./images/B15019_11_34.jpg) Caption: Output of the .isna() method As we saw previously, we can give the output of a binary variable to the `.sum()` method, which will add all the `True` values together (cells that have missing values) and provide a summary for each column: ``` df.isna().sum() ``` You should get the following output: ![](./images/B15019_11_35.jpg) Caption: Summary of missing values for each variable As you can see, there are `1454` missing values in the `Description` column and `135080` in the `CustomerID` column. Let\'s have a look at the missing value observations for `Description`. You can use the output of the `.isna()` method to subset the rows with missing values: ``` df[df['Description'].isna()] ``` You should get the following output: ![](./images/B15019_11_36.jpg) Caption: Subsetting the rows with missing values for Description The `pandas` package provides a method that we can use to easily remove missing values: `.dropna()`. This method returns a new DataFrame without all the rows that have missing values. By default, it will look at all the columns. You can specify a list of columns for it to look for with the `subset` parameter: ``` df.dropna(subset=['Description']) ``` This method returns a new DataFrame with no missing values for the specified columns. If you want to replace the original dataset directly, you can use the `inplace=True` parameter: ``` df.dropna(subset=['Description'], inplace=True) ``` Now, look at the summary of the missing values for each variable: ``` df.isna().sum() ``` You should get the following output: ![](./images/B15019_11_37.jpg) Caption: Summary of missing values for each variable As you can see, there are no more missing values in the `Description` column. Let\'s have a look at the `CustomerID` column: ``` df[df['CustomerID'].isna()] ``` You should get the following output: ![](./images/B15019_11_38.jpg) Most algorithms require a value for each observation, so you need to provide one for these cases. We will use the `.fillna()` method from `pandas` to do this. Provide the value to be imputed as `Missing` and use `inplace=True` as a parameter: ``` df['CustomerID'].fillna('Missing', inplace=True) df[1443:1448] ``` You should get the following output: ![Caption: Examples of rows where missing values for CustomerID have been replaced with Missing ](./images/B15019_11_39.jpg) Caption: Examples of rows where missing values for CustomerID have been replaced with Missing Let\'s see if we have any missing values in the dataset: ``` df.isna().sum() ``` You should get the following output: ![](./images/B15019_11_40.jpg) Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset ----------------------------------------------------------------- In this exercise, you will be cleaning out all the missing values for all the numerical variables in the `Horse Colic` dataset. Colic is a painful condition that horses can suffer from, and this dataset contains various pieces of information related to specific cases of this condition. You can use the link provided in the Note section if you want to find out more about the dataset\'s attributes. Let\'s get started: 1. Open a new Jupyter notebook. 2. Import the `pandas` package: ``` import pandas as pd ``` 3. Assign the link to the dataset to a variable called `file_url`: ``` file_url = 'http://raw.githubusercontent.com/'\ 'fenago/data-science/'\ 'master/Lab11/dataset/horse-colic.data' ``` 4. Using the `.read_csv()` method from the `pandas` package, load the dataset into a new variable called `df` and specify the `header=None`,` sep='\s+'`, and` prefix='X'` parameters: ``` df = pd.read_csv(file_url, header=None, \ sep='\s+', prefix='X') ``` 5. Print the first five rows of the DataFrame using the `.head()` method: ``` df.head() ``` You should get the following output: ![](./images/B15019_11_41.jpg) Caption: The first five rows of the Horse Colic dataset 6. Reload the dataset into a `pandas` DataFrame using the `.read_csv()` method, but this time, add the `na_values='?'` parameter in order to specify that this value needs to be treated as a missing value: ``` df = pd.read_csv(file_url, header=None, sep='\s+', \ prefix='X', na_values='?') ``` 7. Print the first five rows of the DataFrame using the `.head()` method: ``` df.head() ``` You should get the following output: ![](./images/B15019_11_42.jpg) Caption: The first five rows of the Horse Colic dataset Now, you can see that `pandas` have converted all the `?` values into missing values. 8. Print the data type of each column using the `dtypes` attribute: ``` df.dtypes ``` You should get the following output: ![](./images/B15019_11_43.jpg) Caption: Data type of each column 9. Print the number of missing values for each column by combining the `.isna()` and `.sum()` methods: ``` df.isna().sum() ``` You should get the following output: ![](./images/B15019_11_44.jpg) Caption: Number of missing values for each column 10. Create a condition mask called `x0_mask` so that you can find the missing values in the `X0` column using the `.isna()` method: ``` x0_mask = df['X0'].isna() ``` 11. Display the number of missing values for this column by using the `.sum()` method on `x0_mask`: ``` x0_mask.sum() ``` You should get the following output: ``` 1 ``` Here, you got the exact same number of missing values for `X0` that you did in *Step 9*. 12. Extract the mean of `X0` using the `.median()` method and store it in a new variable called `x0_median`. Print its value: ``` x0_median = df['X0'].median() print(x0_median) ``` You should get the following output: ``` 1.0 ``` The median value for this column is `1`. You will replace all the missing values with this value in the `X0` column. 13. Replace all the missing values in the `X0` variable with their median using the `.fillna()` method, along with the `inplace=True` parameter: ``` df['X0'].fillna(x0_median, inplace=True) ``` 14. Print the number of missing values for `X0` by combining the `.isna()` and `.sum()` methods: ``` df['X0'].isna().sum() ``` You should get the following output: ``` 0 ``` There are no more missing values in the variables. 15. Create a `for` loop that will iterate through all the columns of the DataFrame. In the for loop, calculate the median for each and save them into a variable called `col_median`. Then, impute missing values with this median value using the `.fillna()` method, along with the `inplace=True` parameter, and print the name of the column and its median value: ``` for col_name in df.columns: col_median = df[col_name].median() df[col_name].fillna(col_median, inplace=True) print(col_name) print(col_median) ``` You should get the following output: ![](./images/B15019_11_45.jpg) Caption: Median values for each column 16. Print the number of missing values for each column by combining the `.isna()` and `.sum()` methods: ``` df.isna().sum() ``` You should get the following output: ![](./images/B15019_11_46.jpg) Caption: Number of missing values for each column You have successfully fixed the missing values for all the numerical variables using the methods provided by the `pandas` package: `.isna()` and `.fillna()`. Activity 11.01: Preparing the Speed Dating Dataset -------------------------------------------------- As an entrepreneur, you are planning to launch a new dating app into the market. The key feature that will differentiate your app from other competitors will be your high performing user-matching algorithm. Before building this model, you have partnered with a speed dating company to collect data from real events. You just received the dataset from your partner company but realized it is not as clean as you expected; there are missing and incorrect values. Your task is to fix the main data quality issues in this dataset. The following steps will help you complete this activity: 1. Download and load the dataset into Python using `.read_csv()`. 2. Print out the dimensions of the DataFrame using `.shape`. 3. Check for duplicate rows by using `.duplicated()` and `.sum()` on all the columns. 4. Check for duplicate rows by using `.duplicated() `and `.sum()` for the identifier columns (`iid`, `id`, `partner`, and `pid`). 5. Check for unexpected values for the following numerical variables: `'imprace', 'imprelig', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping',` and `'yoga'`. 6. Replace the identified incorrect values. 7. Check the data type of the different columns using `.dtypes`. 8. Change the data types to categorical for the columns that don\'t contain numerical values using `.astype()`. 9. Check for any missing values using `.isna()` and `.sum()` for each numerical variable. 10. Replace the missing values for each numerical variable with their corresponding mean or median values using `.fillna()`, `.mean()`, and `.median()`. You should get the following output. The figure represents the number of rows with unexpected values for `imprace` and a list of unexpected values: ![](./images/B15019_11_47.jpg) The following figure illustrates the number of rows with unexpected values and a list of unexpected values for each column: ![](./images/B15019_11_48.jpg) The following figure illustrates a list of unique values for gaming: ![](./images/B15019_11_49.jpg) The following figure displays the data types of each column: ![](./images/B15019_11_50.jpg) The following figure displays the updated data types of each column: ![](./images/B15019_11_51.jpg) The following figure displays the number of missing values for numerical variables: ![](./images/B15019_11_52.jpg) The following figure displays the list of unique values for `int_corr`: ![](./images/B15019_11_53.jpg) The following figure displays the list of unique values for numerical variables: ![](./images/B15019_11_54.jpg) The following figure displays the number of missing values for numerical variables: ![](./images/B15019_11_55.jpg) Summary ======= In this lab, you learned how important it is to prepare any given dataset and fix the main quality issues it has. This is critical because the cleaner a dataset is, the easier it will be for any machine learning model to easily learn about the relevant patterns. On top of this, most algorithms can\'t handle issues such as missing values, so they must be handled prior to the modeling phase. In this lab, you covered the most frequent issues that are faced in data science projects: duplicate rows, incorrect data types, unexpected values, and missing values.