diff --git a/Lab01/Data-Science-in-Python-the-Simple-Way.iml b/Lab01/Data-Science-in-Python-the-Simple-Way.iml deleted file mode 100644 index a42dedc..0000000 --- a/Lab01/Data-Science-in-Python-the-Simple-Way.iml +++ /dev/null @@ -1,14 +0,0 @@ - - - - - - - - - - - - \ No newline at end of file diff --git a/Lab01/misc.xml b/Lab01/misc.xml deleted file mode 100644 index 6ab0bd6..0000000 --- a/Lab01/misc.xml +++ /dev/null @@ -1,7 +0,0 @@ - - - - - - \ No newline at end of file diff --git a/Lab01/modules.xml b/Lab01/modules.xml deleted file mode 100644 index 53fb054..0000000 --- a/Lab01/modules.xml +++ /dev/null @@ -1,8 +0,0 @@ - - - - - - - - \ No newline at end of file diff --git a/Lab01/vcs.xml b/Lab01/vcs.xml deleted file mode 100644 index 94a25f7..0000000 --- a/Lab01/vcs.xml +++ /dev/null @@ -1,6 +0,0 @@ - - - - - - \ No newline at end of file diff --git a/lab_guides/Lab_1.md b/lab_guides/Lab_1.md new file mode 100644 index 0000000..a9259e0 --- /dev/null +++ b/lab_guides/Lab_1.md @@ -0,0 +1,1365 @@ + +1. Introduction to Data Science in Python +========================================= + + + +Overview + +This very first lab will introduce you to the field of data science +and walk you through an overview of Python\'s core concepts and their +application in the world of data science. + +By the end of this lab, you will be able to explain what data +science is and distinguish between supervised and unsupervised learning. +You will also be able to explain what machine learning is and +distinguish between regression, classification, and clustering problems. +You\'ll have learnt to create and manipulate different types of Python +variable, including core variables, lists, and dictionaries. You\'ll be +able to build a `for` loop, print results using f-strings, +define functions, import Python packages and load data in different +formats using `pandas`. You will also have had your first +taste of training a model using scikit-learn. + + +Introduction +============ + + +Welcome to the fascinating world of data science! We are sure you must +be pretty excited to start your journey and learn interesting and +exciting techniques and algorithms. This is exactly what this book is +intended for. + +But before diving into it, let\'s define what data science is: it is a +combination of multiple disciplines, including business, statistics, and +programming, that intends to extract meaningful insights from data by +running controlled experiments similar to scientific research. + +The objective of any data science project is to derive valuable +knowledge for the business from data in order to make better decisions. +It is the responsibility of data scientists to define the goals to be +achieved for a project. This requires business knowledge and expertise. +In this book, you will be exposed to some examples of data science tasks +from real-world datasets. + +Statistics is a mathematical field used for analyzing and finding +patterns from data. A lot of the newest and most advanced techniques +still rely on core statistical approaches. This book will present to you +the basic techniques required to understand the concepts we will be +covering. + +With an exponential increase in data generation, more computational +power is required for processing it efficiently. This is the reason why +programming is a required skill for data scientists. You may wonder why +we chose Python for this Workshop. That\'s because Python is one of the +most popular programming languages for data science. 
It is extremely +easy to learn how to code in Python thanks to its simple and easily +readable syntax. It also has an incredible number of packages available +to anyone for free, such as pandas, scikit-learn, TensorFlow, and +PyTorch. Its community is expanding at an incredible rate, adding more +and more new functionalities and improving its performance and +reliability. It\'s no wonder companies such as Facebook, Airbnb, and +Google are using it as one of their main stacks. No prior knowledge of +Python is required for this book. If you do have some experience with +Python or other programming languages, then this will be an advantage, +but all concepts will be fully explained, so don\'t worry if you are new +to programming. + + +Application of Data Science +=========================== + + +As mentioned in the introduction, data science is a multidisciplinary +approach to analyzing and identifying complex patterns and extracting +valuable insights from data. Running a data science project usually +involves multiple steps, including the following: + +1. Defining the business problem to be solved +2. Collecting or extracting existing data +3. Analyzing, visualizing, and preparing data +4. Training a model to spot patterns in data and make predictions +5. Assessing a model\'s performance and making improvements +6. Communicating and presenting findings and gained insights +7. Deploying and maintaining a model + +As its name implies, data science projects require data, but it is +actually more important to have defined a clear business problem to +solve first. If it\'s not framed correctly, a project may lead to +incorrect results as you may have used the wrong information, not +prepared the data properly, or led a model to learn the wrong patterns. +So, it is absolutely critical to properly define the scope and objective +of a data science project with your stakeholders. + +There are a lot of data science applications in real-world situations or +in business environments. For example, healthcare providers may train a +model for predicting a medical outcome or its severity based on medical +measurements, or a high school may want to predict which students are at +risk of dropping out within a year\'s time based on their historical +grades and past behaviors. Corporations may be interested to know the +likelihood of a customer buying a certain product based on his or her +past purchases. They may also need to better understand which customers +are more likely to stop using existing services and churn. These are +examples where data science can be used to achieve a clearly defined +goal, such as increasing the number of patients detected with a heart +condition at an early stage or reducing the number of customers +canceling their subscriptions after six months. That sounds exciting, +right? Soon enough, you will be working on such interesting projects. + + + +What Is Machine Learning? +------------------------- + +When we mention data science, we usually think about machine learning, +and some people may not understand the difference between them. Machine +learning is the field of building algorithms that can learn patterns by +themselves without being programmed explicitly. So machine learning is a +family of techniques that can be used at the modeling stage of a data +science project. 
+ +Machine learning is composed of three different types of learning: + +- Supervised learning +- Unsupervised learning +- Reinforcement learning + + + +### Supervised Learning + +Supervised learning refers to a type of task where an algorithm is +trained to learn patterns based on prior knowledge. That means this kind +of learning requires the labeling of the outcome (also called the +response variable, dependent variable, or target variable) to be +predicted beforehand. For instance, if you want to train a model that +will predict whether a customer will cancel their subscription, you will +need a dataset with a column (or variable) that already contains the +churn outcome (cancel or not cancel) for past or existing customers. +This outcome has to be labeled by someone prior to the training of a +model. If this dataset contains 5,000 observations, then all of them +need to have the outcome being populated. The objective of the model is +to learn the relationship between this outcome column and the other +features (also called independent variables or predictor variables). +Following is an example of such a dataset: + +![](./images/B15019_01_01.jpg) + +Caption: Example of customer churn dataset + +The `Cancel` column is the response variable. This is the +column you are interested in, and you want the model to predict +accurately the outcome for new input data (in this case, new customers). +All the other columns are the predictor variables. + +The model, after being trained, may find the following pattern: a +customer is more likely to cancel their subscription after 12 months and +if their average monthly spent is over `$50`. So, if a new +customer has gone through 15 months of subscription and is spending \$85 +per month, the model will predict this customer will cancel their +contract in the future. + +When the response variable contains a limited number of possible values +(or classes), it is a classification problem (you will learn more about +this in *Lab 3, Binary Classification*, and *Lab 4, Multiclass +Classification with RandomForest*). The model will learn how to predict +the right class given the values of the independent variables. The churn +example we just mentioned is a classification problem as the response +variable can only take two different values: `yes` or +`no`. + +On the other hand, if the response variable can have a value from an +infinite number of possibilities, it is called a regression problem. + +An example of a regression problem is where you are trying to predict +the exact number of mobile phones produced every day for some +manufacturing plants. This value can potentially range from 0 to an +infinite number (or a number big enough to have a large range of +potential values), as shown in *Figure 1.2*. + +![](./images/B15019_01_02.jpg) + +Caption: Example of a mobile phone production dataset + +In the preceding figure, you can see that the values for +`Daily output` can take any value from `15000` to +more than `50000`. This is a regression problem, which we will +look at in *Lab 2, Regression*. + + + +### Unsupervised Learning + +Unsupervised learning is a type of algorithm that doesn\'t require any +response variables at all. In this case, the model will learn patterns +from the data by itself. You may ask what kind of pattern it can find if +there is no target specified beforehand. + +This type of algorithm usually can detect similarities between variables +or records, so it will try to group those that are very close to each +other. 
This kind of algorithm can be used for clustering (grouping +records) or dimensionality reduction (reducing the number of variables). +Clustering is very popular for performing customer segmentation, where +the algorithm will look to group customers with similar behaviors +together from the data. *Lab 5*, *Performing Your First Cluster +Analysis*, will walk you through an example of clustering analysis. + + + +### Reinforcement Learning + +Reinforcement learning is another type of algorithm that learns how to +act in a specific environment based on the feedback it receives. You may +have seen some videos where algorithms are trained to play Atari games +by themselves. Reinforcement learning techniques are being used to teach +the agent how to act in the game based on the rewards or penalties it +receives from the game. + +For instance, in the game Pong, the agent will learn to not let the ball +drop after multiple rounds of training in which it receives high +penalties every time the ball drops. + +Note + +Reinforcement learning algorithms are out of scope and will not be +covered in this book. + + +Overview of Python +================== + + +As mentioned earlier, Python is one of the most popular programming +languages for data science. But before diving into Python\'s data +science applications, let\'s have a quick introduction to some core +Python concepts. + + + +Types of Variable +----------------- + +In Python, you can handle and manipulate different types of variables. +Each has its own specificities and benefits. We will not go through +every single one of them but rather focus on the main ones that you will +have to use in this book. For each of the following code examples, you +can run the code in Google Colab to view the given output. + + + +### Numeric Variables + +The most basic variable type is numeric. This can contain integer or +decimal (or float) numbers, and some mathematical operations can be +performed on top of them. + +Let\'s use an integer variable called `var1` that will take +the value `8` and another one called `var2` with the +value `160.88`, and add them together with the `+` +operator, as shown here: + +``` +var1 = 8 +var2 = 160.88 +var1 + var2 +``` +You should get the following output: + +![](./images/B15019_01_03.jpg) + +Caption: Output of the addition of two variables + +In Python, you can perform other mathematical operations on numerical +variables, such as multiplication (with the `*` operator) and +division (with `/`). + + + +### Text Variables + +Another interesting type of variable is `string`, which +contains textual information. You can create a variable with some +specific text using the single or double quote, as shown in the +following example: + +``` +var3 = 'Hello, ' +var4 = 'World' +``` + +In order to display the content of a variable, you can call the +`print()` function: + +``` +print(var3) +print(var4) +``` +You should get the following output: + +![](./images/B15019_01_04.jpg) + +Caption: Printing the two text variables + +Python also provides an interface called f-strings for printing text +with the value of defined variables. It is very handy when you want to +print results with additional text to make it more readable and +interpret results. It is also quite common to use f-strings to print +logs. You will need to add `f` before the quotes (or double +quotes) to specify that the text will be an f-string. Then you can add +an existing variable inside the quotes and display the text with the +value of this variable. 
You need to wrap the variable with curly +brackets, `{}`. + +For instance, if we want to print `Text:` before the values of +`var3` and `var4`, we will write the following code: + +``` +print(f"Text: {var3} {var4}!") +``` +You should get the following output: + +![](./images/B15019_01_05.jpg) + +Caption: Printing with f-strings + +You can also perform some text-related transformations with string +variables, such as capitalizing or replacing characters. For instance, +you can concatenate the two variables together with the `+` +operator: + +``` +var3 + var4 +``` +You should get the following output: + +![](./images/B15019_01_06.jpg) + +Caption: Concatenation of the two text variables + + + +### Python List + +Another very useful type of variable is the list. It is a collection of +items that can be changed (you can add, update, or remove items). To +declare a list, you will need to use square brackets, `[]`, +like this: + +``` +var5 = ['I', 'love', 'data', 'science'] +print(var5) +``` +You should get the following output: + +![](./images/B15019_01_07.jpg) + +Caption: List containing only string items + +A list can have different item types, so you can mix numerical and text +variables in it: + +``` +var6 = ['Fenago', 15019, 2020, 'Data Science'] +print(var6) +``` + + +An item in a list can be accessed by its index (its position in the +list). To access the first (index 0) and third elements (index 2) of a +list, you do the following: + +``` +print(var6[0]) +print(var6[2]) +``` +Note + +In Python, all indexes start at `0`. + + +Python provides an API to access a range of items using the +`:` operator. You just need to specify the starting index on +the left side of the operator and the ending index on the right side. +The ending index is always excluded from the range. So, if you want to +get the first three items (index 0 to 2), you should do as follows: + +``` +print(var6[0:3]) +``` + +You can also iterate through every item of a list using a +`for` loop. If you want to print every item of the +`var6` list, you should do this: + +``` +for item in var6: + print(item) +``` +You should get the following output: + + + +You can add an item at the end of the list using the +`.append()` method: + +``` +var6.append('Python') +print(var6) +``` + + + +To delete an item from the list, you use the `.remove()` +method: + +``` +var6.remove(15019) +print(var6) +``` + + +### Python Dictionary + +A dictionary contains multiple elements, like a **list**, but each element +is organized as a key-value pair. A dictionary is not indexed by numbers +but by keys. So, to access a specific value, you will have to call the +item by its corresponding key. To define a dictionary in Python, you +will use curly brackets, `{}`, and specify the keys and values +separated by `:`, as shown here: + +``` +var7 = {'Topic': 'Data Science', 'Language': 'Python'} +print(var7) +``` +You should get the following output: + +![](./images/B15019_01_14.jpg) + +Caption: Output of var7 + +To access a specific value, you need to provide the corresponding key +name. For instance, if you want to get the value `Python`, you +do this: + +``` +var7['Language'] +``` +You should get the following output: + +![](./images/B15019_01_15.jpg) + +Caption: Value for the \'Language\' key + +Note + +Each key-value pair in a dictionary needs to be unique. 
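
To see what this means in practice, consider the following small sketch (the `var8` dictionary is just a hypothetical example and is not used anywhere else in this lab). Because keys have to be unique, assigning a value to a key that already exists does not create a second entry; it simply overwrites the previous value:

```
# var8 is a throwaway example dictionary
var8 = {'Topic': 'Data Science', 'Language': 'Python'}
# Re-assigning an existing key replaces its previous value
# instead of adding a duplicate key-value pair
var8['Language'] = 'R'
print(var8)
```

You should get the following output:

```
{'Topic': 'Data Science', 'Language': 'R'}
```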
+ +Python provides a method to access all the key names from a dictionary, +`.keys()`, which is used as shown in the following code +snippet: + +``` +var7.keys() +``` +You should get the following output: + +![](./images/B15019_01_16.jpg) + +Caption: List of key names + +There is also a method called `.values()`, which is used to +access all the values of a dictionary: + +``` +var7.values() +``` +You should get the following output: + +![](./images/B15019_01_17.jpg) + +Caption: List of values + +You can iterate through all items from a dictionary using a +`for` loop and the `.items()` method, as shown in +the following code snippet: + +``` +for key, value in var7.items(): + print(key) + print(value) +``` +You should get the following output: + +![](./images/B15019_01_18.jpg) + +Caption: Output after iterating through the items of a dictionary + +You can add a new element in a dictionary by providing the key name like +this: + +``` +var7['Publisher'] = 'Fenago' +print(var7) +``` + + +You can delete an item from a dictionary with the `del` +command: + +``` +del var7['Publisher'] +print(var7) +``` +You should get the following output: + +![](./images/B15019_01_20.jpg) + +Caption: Output of a dictionary after removing an item + +In *Exercise 1.01*, *Creating a Dictionary That Will Contain Machine +Learning Algorithms*, we will be looking to use these concepts that +we\'ve just looked at. + + + +Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms +---------------------------------------------------------------------------------- + +In this exercise, we will create a dictionary using Python that will +contain a collection of different machine learning algorithms that will +be covered in this book. + +The following steps will help you complete the exercise: + +Note + +Every exercise and activity in this book is to be executed on Google +Colab. + +1. Open on a new Colab notebook. + +2. Create a list called `algorithm` that will contain the + following elements: `Linear Regression`, + `Logistic Regression`, `RandomForest`, and + `a3c`: + + ``` + algorithm = ['Linear Regression', 'Logistic Regression', \ + 'RandomForest', 'a3c'] + ``` + + + Note + + The code snippet shown above uses a backslash ( `\` ) to + split the logic across multiple lines. When the code is executed, + Python will ignore the backslash, and treat the code on the next + line as a direct continuation of the current line. + +3. Now, create a list called `learning` that will contain the + following elements: `Supervised`, `Supervised`, + `Supervised`, and `Reinforcement`: + ``` + learning = ['Supervised', 'Supervised', 'Supervised', \ + 'Reinforcement'] + ``` + + +4. Create a list called `algorithm_type` that will contain + the following elements: `Regression`, + `Classification`, + `Regression or Classification`, and `Game AI`: + ``` + algorithm_type = ['Regression', 'Classification', \ + 'Regression or Classification', 'Game AI'] + ``` + + +5. Add an item called `k-means` into the + `algorithm` list using the `.append()` method: + ``` + algorithm.append('k-means') + ``` + + +6. Display the content of `algorithm` using the + `print()` function: + + ``` + print(algorithm) + ``` + + + You should get the following output: + + +![](./images/B15019_01_21.jpg) + + + Caption: Output of \'algorithm\' + + From the preceding output, we can see that we added the + `k-means` item to the list. + +7. 
Now, add the `Unsupervised` item into the + `learning` list using the `.append()` method: + ``` + learning.append('Unsupervised') + ``` + + +8. Display the content of `learning` using the + `print()` function: + + ``` + print(learning) + ``` + + + You should get the following output: + + +![](./images/B15019_01_22.jpg) + + + Caption: Output of \'learning\' + + From the preceding output, we can see that we added the + `Unsupervised` item into the list. + +9. Add the `Clustering` item into the + `algorithm_type` list using the `.append()` + method: + ``` + algorithm_type.append('Clustering') + ``` + + +10. Display the content of `algorithm_type` using the + `print()` function: + + ``` + print(algorithm_type) + ``` + + + You should get the following output: + + +![](./images/B15019_01_23.jpg) + + + Caption: Output of \'algorithm\_type\' + + From the preceding output, we can see that we added the + `Clustering` item into the list. + +11. Create an empty dictionary called `machine_learning` using + curly brackets, `{}`: + ``` + machine_learning = {} + ``` + + +12. Create a new item in `machine_learning` with the key as + `algorithm` and the value as all the items from the + `algorithm` list: + ``` + machine_learning['algorithm'] = algorithm + ``` + + +13. Display the content of `machine_learning` using the + `print()` function. + + ``` + print(machine_learning) + ``` + + + You should get the following output: + + +![](./images/B15019_01_24.jpg) + + + Caption: Output of \'machine\_learning\' + + From the preceding output, we notice that we have created a + dictionary from the `algorithm` list. + +14. Create a new item in `machine_learning` with the key as + `learning` and the value as all the items from the + `learning` list: + ``` + machine_learning['learning'] = learning + ``` + + +15. Now, create a new item in `machine_learning` with the key + as `algorithm_type` and the value as all the items from + the algorithm\_type list: + ``` + machine_learning['algorithm_type'] = algorithm_type + ``` + + +16. Display the content of `machine_learning` using the + `print()` function. + + ``` + print(machine_learning) + ``` + + + You should get the following output: + + +![](./images/B15019_01_25.jpg) + + + Caption: Output of \'machine\_learning\' + +17. Remove the `a3c` item from the `algorithm` key + using the `.remove()` method: + ``` + machine_learning['algorithm'].remove('a3c') + ``` + + +18. Display the content of the `algorithm` item from the + `machine_learning` dictionary using the + `print()` function: + + ``` + print(machine_learning['algorithm']) + ``` + + + You should get the following output: + + +![](./images/B15019_01_26.jpg) + + + Caption: Output of \'algorithm\' from \'machine\_learning\' + +19. Remove the `Reinforcement` item from the + `learning` key using the `.remove()` method: + ``` + machine_learning['learning'].remove('Reinforcement') + ``` + + +20. Remove the `Game AI` item from the + `algorithm_type` key using the `.remove()` + method: + ``` + machine_learning['algorithm_type'].remove('Game AI') + ``` + + +21. Display the content of `machine_learning` using the + `print()` function: + + ``` + print(machine_learning) + ``` + + + You should get the following output: + + +![](./images/B15019_01_27.jpg) + + +Caption: Output of \'machine\_learning\' + + + +Python for Data Science +======================= + + +In this section, we will present to you two of the most popular ones: +`pandas` and `scikit-learn`. 
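
Both packages come pre-installed on Google Colab, so no extra installation is needed for the exercises in this book. If you want to confirm that they are available in your environment (and see which versions you are running), the following sketch is one way to check; the exact version numbers printed will depend on your setup:

```
# Import both packages and print the installed version of each
import pandas
import sklearn
print(pandas.__version__)
print(sklearn.__version__)
```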
+ + + +The pandas Package +------------------ + +The pandas package provides an incredible amount of APIs for +manipulating data structures. The two main data structures defined in +the `pandas` package are `DataFrame` and +`Series`. + + + +### DataFrame and Series + + +![](./images/B15019_01_28.jpg) + +Caption: Components of a DataFrame + + +In pandas, a DataFrame is represented by the `DataFrame` +class. A `pandas` DataFrame is composed of `pandas` +Series, which are 1-dimensional arrays. A `pandas` Series is +basically a single column in a DataFrame. + + +### CSV Files + +CSV files use the comma character---`,`---to separate columns +and newlines for a new row. The previous example of a DataFrame would +look like this in a CSV file: + +``` +algorithm,learning,type +Linear Regression,Supervised,Regression +Logistic Regression,Supervised,Classification +RandomForest,Supervised,Regression or Classification +k-means,Unsupervised,Clustering +``` + +In Python, you need to first import the packages you require before +being able to use them. To do so, you will have to use the +`import` command. You can create an alias of each imported +package using the `as` keyword. It is quite common to import +the `pandas` package with the alias `pd`: + +``` +import pandas as pd +``` +`pandas` provides a `.read_csv()` method to easily +load a CSV file directly into a DataFrame. You just need to provide the +path or the URL to the CSV file, as shown below. + +Note + +Watch out for the slashes in the string below. Remember that the +backslashes ( `\` ) are used to split the code across multiple +lines, while the forward slashes ( `/` ) are part of the URL. + +``` +pd.read_csv('https://raw.githubusercontent.com/fenago'\ + '/data-science/master/Lab01/'\ + 'Dataset/csv_example.csv') +``` +You should get the following output: + +![](./images/B15019_01_29.jpg) + + + + +### Excel Spreadsheets + +Excel is a Microsoft tool and is very popular in the industry. It has +its own internal structure for recording additional information, such as +the data type of each cell or even Excel formulas. There is a specific +method in `pandas` to load Excel spreadsheets called +`.read_excel()`: + +``` +pd.read_excel('https://github.com/fenago'\ + '/data-science/blob/master'\ + '/Lab01/Dataset/excel_example.xlsx?raw=true') +``` +You should get the following output: + +![](./images/B15019_01_31.jpg) + +Caption: Dataframe after loading an Excel spreadsheet + + + +### JSON + +JSON is a very popular file format, mainly used for transferring data +from web APIs. Its structure is very similar to that of a Python +dictionary with key-value pairs. 
The example DataFrame we used before +would look like this in JSON format: + +``` +{ + "algorithm":{ + "0":"Linear Regression", + "1":"Logistic Regression", + "2":"RandomForest", + "3":"k-means" + }, + "learning":{ + "0":"Supervised", + "1":"Supervised", + "2":"Supervised", + "3":"Unsupervised" + }, + "type":{ + "0":"Regression", + "1":"Classification", + "2":"Regression or Classification", + "3":"Clustering" + } +} +``` +As you may have guessed, there is a `pandas` method for +reading JSON data as well, and it is called `.read_json()`: + +``` +pd.read_json('https://raw.githubusercontent.com/fenago'\ + '/data-science/master/Lab01'\ + '/Dataset/json_example.json') +``` + +You should get the following output: + +![](./images/B15019_01_32.jpg) + +Caption: Dataframe after loading JSON data + + + +Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame +------------------------------------------------------------------------ + +In this exercise, we will practice loading different data formats, such +as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use +is the Top 10 Postcodes for the First Home Owner Grants dataset (this is +a grant provided by the Australian government to help first-time real +estate buyers). It lists the 10 postcodes (also known as zip codes) with +the highest number of First Home Owner grants. + +In this dataset, you will find the number of First Home Owner grant +applications for each postcode and the corresponding suburb. + + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Import the pandas package, as shown in the following code snippet: + ``` + import pandas as pd + ``` + + +3. Create a new variable called `csv_url` containing the URL + to the raw CSV file: + ``` + csv_url = 'https://raw.githubusercontent.com/fenago'\ + '/data-science/master/Lab01'\ + '/Dataset/overall_topten_2012-2013.csv' + ``` + + +4. Load the CSV file into a DataFrame using the pandas + `.read_csv()` method. The first row of this CSV file + contains the name of the file, which you can see if you open the + file directly. You will need to exclude this row by using the + `skiprows=1` parameter. Save the result in a variable + called `csv_df` and print it: + + ``` + csv_df = pd.read_csv(csv_url, skiprows=1) + csv_df + ``` + + + You should get the following output: + + +![](./images/B15019_01_33.jpg) + + + Caption: The DataFrame after loading the CSV file + +5. Create a new variable called `tsv_url` containing the URL + to the raw TSV file: + + ``` + tsv_url = 'https://raw.githubusercontent.com/fenago'\ + '/data-science/master/Lab01'\ + '/Dataset/overall_topten_2012-2013.tsv' + ``` + + + Note + + A TSV file is similar to a CSV file but instead of using the comma + character (`,`) as a separator, it uses the tab character + (`\t`). + +6. Load the TSV file into a DataFrame using the pandas + .`read_csv()` method and specify the + `skiprows=1` and `sep='\t'` parameters. Save the + result in a variable called `tsv_df` and print it: + + ``` + tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t') + tsv_df + ``` + + + You should get the following output: + + +![](./images/B15019_01_34.jpg) + + + Caption: The DataFrame after loading the TSV file + +7. Create a new variable called `xlsx_url` containing the URL + to the raw Excel spreadsheet: + ``` + xlsx_url = 'https://github.com/fenago'\ + '/data-science/blob/master/'\ + 'Lab01/Dataset'\ + '/overall_topten_2012-2013.xlsx?raw=true' + ``` + + +8. 
Load the Excel spreadsheet into a DataFrame using the pandas + `.read_excel()` method. Save the result in a variable + called `xlsx_df` and print it: + + ``` + xlsx_df = pd.read_excel(xlsx_url) + xlsx_df + ``` + + + You should get the following output: + + +![](./images/B15019_01_35.jpg) + + + + By default, `.read_excel()` loads the first sheet of an + Excel spreadsheet. In this example, the data we\'re looking for is + actually stored in the second sheet. + +9. Load the Excel spreadsheet into a Dataframe using the pandas + `.read_excel()` method and specify the + `skiprows=1` and `sheet_name=1` parameters. + (Note that the `sheet_name` parameter is zero-indexed, so + `sheet_name=0` returns the first sheet, while + `sheet_name=1` returns the second sheet.) Save the result + in a variable called `xlsx_df1` and print it: + + ``` + xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1) + xlsx_df1 + ``` + + + You should get the following output: + + +![](./images/B15019_01_36.jpg) + + + +### The sklearn API + + +`sklearn` groups algorithms by family. For instance, +`RandomForest` and `GradientBoosting` are part of +the `ensemble` module. In order to make use of an algorithm, +you will need to import it first like this: + +``` +from sklearn.ensemble import RandomForestClassifier +``` + + +It is recommended to at least set the `random_state` +hyperparameter in order to get reproducible results every time that you +have to run the same code: + +``` +rf_model = RandomForestClassifier(random_state=1) +``` + +The second step is to train the model with some data. In this example, +we will use a simple dataset that classifies 178 instances of Italian +wines into 3 categories based on 13 features. This dataset is part of +the few examples that `sklearn` provides within its API. We +need to load the data first: + +``` +from sklearn.datasets import load_wine +features, target = load_wine(return_X_y=True) +``` + +Then using the `.fit()` method to train the model, you will +provide the features and the target variable as input: + +``` +rf_model.fit(features, target) +``` +You should get the following output: + +![](./images/B15019_01_44.jpg) + +Caption: Logs of the trained Random Forest model + +In the preceding output, we can see a Random Forest model with the +default hyperparameters. You will be introduced to some of them in +*Lab 4*, *Multiclass Classification with RandomForest*. + +Once trained, we can use the `.predict()` method to predict +the target for one or more observations. Here we will use the same data +as for the training step: + +``` +preds = rf_model.predict(features) +preds +``` +You should get the following output: + +![](./images/B15019_01_45.jpg) + +Caption: Predictions of the trained Random Forest model + + + +Finally, we want to assess the model\'s performance by comparing its +predictions to the actual values of the target variable. There are a lot +of different metrics that can be used for assessing model performance, +and you will learn more about them later in this book. For now, though, +we will just use a metric called **accuracy**. 
This metric calculates +the ratio of correct predictions to the total number of observations: + +``` +from sklearn.metrics import accuracy_score +accuracy_score(target, preds) +``` +You should get the following output + +![](./images/B15019_01_46.jpg) + +Caption: Accuracy of the trained Random Forest model + + + +Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn +-------------------------------------------------------------------- + +In this exercise, we will build a machine learning classifier using +`RandomForest` from `sklearn` to predict whether the +breast cancer of a patient is malignant (harmful) or benign (not +harmful). + + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Import the `load_breast_cancer` function from + `sklearn.datasets`: + ``` + from sklearn.datasets import load_breast_cancer + ``` + + +3. Load the dataset from the `load_breast_cancer` function + with the `return_X_y=True` parameter to return the + features and response variable only: + ``` + features, target = load_breast_cancer(return_X_y=True) + ``` + + +4. Print the variable features: + + ``` + print(features) + ``` + + + You should get the following output: + + +![](./images/B15019_01_47.jpg) + + + Caption: Output of the variable features + + The preceding output shows the values of the features. (You can + learn more about the features from the link given previously.) + +5. Print the `target` variable: + + ``` + print(target) + ``` + + + You should get the following output: + + +![](./images/B15019_01_48.jpg) + + + Caption: Output of the variable target + + The preceding output shows the values of the target variable. There + are two classes shown for each instance in the dataset. These + classes are `0` and `1`, representing whether + the cancer is malignant or benign. + +6. Import the `RandomForestClassifier` class from + `sklearn.ensemble`: + ``` + from sklearn.ensemble import RandomForestClassifier + ``` + + +7. Create a new variable called `seed`, which will take the + value `888` (chosen arbitrarily): + ``` + seed = 888 + ``` + + +8. Instantiate `RandomForestClassifier` with the + `random_state=seed` parameter and save it into a variable + called `rf_model`: + ``` + rf_model = RandomForestClassifier(random_state=seed) + ``` + + +9. Train the model with the `.fit()` method with + `features` and `target` as parameters: + + ``` + rf_model.fit(features, target) + ``` + + + You should get the following output: + + +![](./images/B15019_01_49.jpg) + + + Caption: Logs of RandomForestClassifier + +10. Make predictions with the trained model using the + `.predict()` method and `features` as a + parameter and save the results into a variable called + `preds`: + ``` + preds = rf_model.predict(features) + ``` + + +11. Print the `preds` variable: + + ``` + print(preds) + ``` + + + You should get the following output: + + +![](./images/B15019_01_50.jpg) + + + Caption: Predictions of the Random Forest model + + The preceding output shows the predictions for the training set. You + can compare this with the actual target variable values shown in + *Figure 1.48*. + +12. Import the `accuracy_score` method from + `sklearn.metrics`: + ``` + from sklearn.metrics import accuracy_score + ``` + + +13. 
Calculate `accuracy_score()` with `target` and + `preds` as parameters: + + ``` + accuracy_score(target, preds) + ``` + + + You should get the following output: + + +![](./images/B15019_01_51.jpg) + + + +Activity 1.01: Train a Spam Detector Algorithm +---------------------------------------------- + +You are working for an email service provider and have been tasked with +training an algorithm that recognizes whether an email is spam or not +from a given dataset and checking its performance. + +In this dataset, the authors have already created 57 different features +based on some statistics for relevant keywords in order to classify +whether an email is spam or not. + + +The following steps will help you to complete this activity: + +1. Import the required libraries. + +2. Load the dataset using `.pd.read_csv()`. + +3. Extract the response variable using .`pop()` from + `pandas`. This method will extract the column provided as + a parameter from the DataFrame. You can then assign it a variable + name, for example, `target = df.pop('class')`. + +4. Instantiate `RandomForestClassifier`. + +5. Train a Random Forest model to predict the outcome with + .`fit()`. + +6. Predict the outcomes from the input data using + `.predict()`. + +7. Calculate the accuracy score using `accuracy_score`. + + The output will be similar to the following: + + +![](./images/B15019_01_52.jpg) + + + +Summary +======= + + +This lab provided you with an overview of what data science is in +general. We also learned the different types of machine learning +algorithms, including supervised and unsupervised, as well as regression +and classification. We had a quick introduction to Python and how to +manipulate the main data structures (lists and dictionaries) that will +be used in this book. + +Then we walked you through what a DataFrame is and how to create one by +loading data from different file formats using the famous pandas +package. Finally, we learned how to use the sklearn package to train a +machine learning model and make predictions with it. + +This was just a quick glimpse into the fascinating world of data +science. In this book, you will learn much more and discover new +techniques for handling data science projects from end to end. + +The next lab will show you how to perform a regression task on a +real-world dataset. diff --git a/lab_guides/Lab_10.md b/lab_guides/Lab_10.md new file mode 100644 index 0000000..97c40c5 --- /dev/null +++ b/lab_guides/Lab_10.md @@ -0,0 +1,1641 @@ + +10. Analyzing a Dataset +======================= + + + +Overview + +By the end of this lab, you will be able to explain the key steps +involved in performing exploratory data analysis; identify the types of +data contained in the dataset; summarize the dataset and at a detailed +level for each variable; visualize the data distribution in each column; +find relationships between variables and analyze missing values and +outliers for each variable + +This lab will introduce you to the art of performing exploratory +data analysis and visualizing the data in order to identify quality +issues, potential data transformations, and interesting patterns. + + + +Exploring Your Data +=================== + + +If you are running your project by following the CRISP-DM methodology, +the first step will be to discuss the project with the stakeholders and +clearly define their requirements and expectations. Only once this is +clear can you start having a look at the data and see whether you will +be able to achieve these objectives. 
+ +After receiving a dataset, you may want to make sure that the dataset +contains the information you need for your project. For instance, if you +are working on a supervised project, you will check whether this dataset +contains the target variable you need and whether there are any missing +or incorrect values for this field. You may also check how many +observations (rows) and variables (columns) there are. These are the +kind of questions you will have initially with a new dataset. This +section will introduce you to some techniques you can use to get the +answers to these questions. + +For the rest of this section, we will be working with a dataset +containing transactions from an online retail store. + + + +Our dataset is an Excel spreadsheet. Luckily, the `pandas` +package provides a method we can use to load this type of file: +`read_excel()`. + +Let\'s read the data using the `.read_excel()` method and +store it in a `pandas` DataFrame, as shown in the following +code snippet: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab10/dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` +After loading the data into a DataFrame, we want to know the size of +this dataset, that is, its number of rows and columns. To get this +information, we just need to call the `.shape` attribute from +`pandas`: + +``` +df.shape +``` +You should get the following output: + +``` +(541909, 8) +``` +This attribute returns a tuple containing the number of rows as the +first element and the number of columns as the second element. The +loaded dataset contains `541909` rows and `8` +different columns. + +Since this attribute returns a tuple, we can access each of its elements +independently by providing the relevant index. Let\'s extract the number +of rows (index `0`): + +``` +df.shape[0] +``` +You should get the following output: + +``` +541909 +``` +Similarly, we can get the number of columns with the second index: + +``` +df.shape[1] +``` +You should get the following output: + +``` +8 +``` +Usually, the first row of a dataset is the header. It contains the name +of each column. By default, the `read_excel()` method assumes +that the first row of the file is the header. If the `header` +is stored in a different row, you can specify a different index for the +header with the parameter header from `read_excel()`, such as +`pd.read_excel(header=1)` for specifying the header column is +the second row. + +Once loaded into a `pandas` DataFrame, you can print out its +content by calling it directly: + +``` +df +``` +You should get the following output: + +![](./images/B15019_10_01.jpg) + +Caption: First few rows of the loaded online retail DataFrame + +To access the names of the columns for this DataFrame, we can call the +`.columns` attribute: + +``` +df.columns +``` +You should get the following output: + +![](./images/B15019_10_02.jpg) + +Caption: List of the column names for the online retail DataFrame + +The columns from this dataset are `InvoiceNo`, +`StockCode`, `Description`, `Quantity`, +`InvoiceDate`, `UnitPrice`, `CustomerID`, +and `Country`. We can infer that a row from this dataset +represents the sale of an article for a given quantity and price for a +specific customer at a particular date. 
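
One quick way to sanity-check this interpretation (assuming the `df` DataFrame has been loaded as shown previously) is to display all the rows belonging to a single invoice. If our reading is correct, they should share the same `InvoiceNo`, `InvoiceDate`, and `CustomerID` but describe different items. In the following sketch, `first_invoice` is just a temporary variable used for this check:

```
# Take the invoice number of the very first row and display every
# line item recorded against that same invoice
first_invoice = df['InvoiceNo'].iloc[0]
df[df['InvoiceNo'] == first_invoice]
```

Each row of the resulting output should describe one product sold as part of that invoice.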
+ +Looking at these names, we can potentially guess what types of +information are contained in these columns, however, to be sure, we can +use the `dtypes` attribute, as shown in the following code +snippet: + +``` +df.dtypes +``` +You should get the following output: + +![Caption: Description of the data type for each column of the +DataFrame ](./images/B15019_10_03.jpg) + +Caption: Description of the data type for each column of the +DataFrame + +From this output, we can see that the `InvoiceDate` column is +a date type (`datetime64[ns]`), `Quantity` is an +integer (`int64`), and that `UnitPrice` and +`CustomerID` are decimal numbers (`float64`). The +remaining columns are text (`object`). + +The `pandas` package provides a single method that can display +all the information we have seen so far, that is, the `info()` +method: + +``` +df.info() +``` +You should get the following output: + +![](./images/B15019_10_04.jpg) + +Caption: Output of the info() method + +In just a few lines of code, we learned some high-level information +about this dataset, such as its size, the column names, and their types. + +In the next section, we will analyze the content of a dataset. + + +Analyzing Your Dataset +====================== + + +Previously, we learned about the overall structure of a dataset and the +kind of information it contains. Now, it is time to really dig into it +and look at the values of each column. + +First, we need to import the `pandas` package: + +``` +import pandas as pd +``` + +Then, we\'ll load the data into a `pandas` DataFrame: + +``` +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab10/dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` + +The `pandas` package provides several methods so that you can +display a snapshot of your dataset. The most popular ones are +`head()`, `tail()`, and `sample()`. + +The `head()` method will show the top rows of your dataset. By +default, `pandas` will display the first five rows: + +``` +df.head() +``` +You should get the following output: + +![](./images/B15019_10_05.jpg) + +Caption: Displaying the first five rows using the head() method + +The output of the `head()` method shows that the +`InvoiceNo`, `StockCode`, and `CustomerID` +columns are unique identifier fields for each purchasing invoice, item +sold, and customer. The `Description` field is text describing +the item sold. `Quantity` and `UnitPrice` are the +number of items sold and their unit price, respectively. +`Country` is a text field that can be used for specifying +where the customer or the item is located or from which country version +of the online store the order has been made. In a real project, you may +reach out to the team who provided this dataset and confirm what the +meaning of the `Country` column is, or any other column +details that you may need, for that matter. + +With `pandas`, you can specify the number of top rows to be +displayed with the `head()` method by providing an integer as +its parameter. Let\'s try this by displaying the first `10` +rows: + +``` +df.head(10) +``` +You should get the following output: + +![](./images/B15019_10_06.jpg) + +Caption: Displaying the first 10 rows using the head() method + +Looking at this output, we can assume that the data is sorted by the +`InvoiceDate` column and grouped by `CustomerID` and +`InvoiceNo`. We can only see one value in the +`Country` column: `United Kingdom`. 
Let\'s check +whether this is really the case by looking at the last rows of the +dataset. This can be achieved by calling the `tail()` method. +Like `head()`, this method, by default, will display only five +rows, but you can specify the number of rows you want as a parameter. +Here, we will display the last eight rows: + +``` +df.tail(8) +``` +You should get the following output: + +![](./images/B15019_10_07.jpg) + +Caption: Displaying the last eight rows using the tail() method + +It seems that we were right in assuming that the data is sorted in +ascending order by the `InvoiceDate` column. We can also +confirm that there is actually more than one value in the +`Country` column. + +We can also use the `sample()` method to randomly pick a given +number of rows from the dataset with the `n` parameter. You +can also specify a **seed** (which we covered in *Lab 5*, +*Performing Your First Cluster Analysis*) in order to get reproducible +results if you run the same code again with the `random_state` +parameter: + +``` +df.sample(n=5, random_state=1) +``` +You should get the following output: + +![Caption: Displaying five random sampled rows using the sample() +method ](./images/B15019_10_08.jpg) + +Caption: Displaying five random sampled rows using the sample() +method + +In this output, we can see an additional value in the +`Country` column: `Germany`. We can also notice a +few interesting points: + +- `InvoiceNo` can also contain alphabetical letters (row + `94,801` starts with a `C`, which may have a + special meaning). +- `Quantity` can have negative values: `-2` (row + `94801`). +- `CustomerID` contains missing values: `NaN` (row + `210111`). + + + +Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics +------------------------------------------------------------------------------ + +In this exercise, we will explore the `Ames Housing dataset` +in order to get a good understanding of it by analyzing its structure +and looking at some of its rows. + + +The following steps will help you to complete this exercise: + +1. Open a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the AMES dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab10/dataset/ames_iowa_housing.csv' + ``` + + +4. Use the `.read_csv()` method from the + `pandas `package and load the dataset into a new variable + called `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Print the number of rows and columns of the DataFrame using the + `shape` attribute from the `pandas` package: + + ``` + df.shape + ``` + + + You should get the following output: + + ``` + (1460, 81) + ``` + + + We can see that this dataset contains `1460` rows and + `81` different columns. + +6. Print the names of the variables contained in this DataFrame using + the `columns` attribute from the `pandas` + package: + + ``` + df.columns + ``` + + + You should get the following output: + + +![](./images/B15019_10_09.jpg) + + + Caption: List of columns in the housing dataset + + We can infer the type of information contained in some of the + variables by looking at their names, such as `LotArea` + (property size), `YearBuilt` (year of construction), and + `SalePrice` (property sale price). + +7. 
Print out the type of each variable contained in this DataFrame + using the `dtypes` attribute from the `pandas` + package: + + ``` + df.dtypes + ``` + + + You should get the following output: + + +![Caption: List of columns and their type from the housing + dataset ](./images/B15019_10_10.jpg) + + + Caption: List of columns and their type from the housing + dataset + + We can see that the variables are either numerical or text types. + There is no date column in this dataset. + +8. Display the top rows of the DataFrame using the `head()` + method from `pandas`: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_10_11.jpg) + + + Caption: First five rows of the housing dataset + +9. Display the last five rows of the DataFrame using the + `tail()` method from `pandas`: + + ``` + df.tail() + ``` + + + You should get the following output: + + +![](./images/B15019_10_12.jpg) + + + Caption: Last five rows of the housing dataset + + It seems that the `Alley` column has a lot of missing + values, which are represented by the `NaN` value (which + stands for `Not a Number`). The `Street` and + `Utilities` columns seem to have only one value. + +10. Now, display `5` random sampled rows of the DataFrame + using the `sample()` method from `pandas` and + pass it a `'random_state'` of `8`: + + ``` + df.sample(n=5, random_state=8) + ``` + + + You should get the following output: + + +![](./images/B15019_10_13.jpg) + + + +We learned quite a lot about this dataset in just a few lines of code, +such as the number of rows and columns, the data type of each variable, +and their information. We also identified some issues with missing +values. + + +Analyzing the Content of a Categorical Variable +=============================================== + + +Now that we\'ve got a good feel for the kind of information contained in +the `online retail dataset`, we want to dig a little deeper +into each of its columns: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob'\ + '/master/Lab10/dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` +For instance, we would like to know how many different values are +contained in each of the variables by calling the `nunique()` +method. This is particularly useful for a categorical variable with a +limited number of values, such as `Country`: + +``` +df['Country'].nunique() +``` +You should get the following output: + +``` +38 +``` +We can see that there are 38 different countries in this dataset. It +would be great if we could get a list of all the values in this column. +Thankfully, the `pandas` package provides a method to get +these results: `unique()`: + +``` +df['Country'].unique() +``` +You should get the following output: + +![](./images/B15019_10_14.jpg) + +Caption: List of unique values for the \'Country\' column + +We can see that there are multiple countries from different continents, +but most of them come from Europe. We can also see that there is a value +called `Unspecified` and another one for +`European Community`, which may be for all the countries of +the eurozone that are not listed separately. + +Another very useful method from `pandas `is +`value_counts()`. This method lists all the values from a +given column but also their occurrence. 
By providing the +`dropna=False` and `normalise=True` parameters, this +method will include the missing value in the listing and calculate the +number of occurrences as a ratio, respectively: + +``` +df['Country'].value_counts(dropna=False, normalize=True) +``` +You should get the following output: + +![Caption: A truncated list of unique values and their occurrence ](./images/B15019_10_15.jpg) + + +From this output, we can see that the `United Kingdom` value +is totally dominating this column as it represents over 91% of the rows +and that other values such as `Austria` and +`Denmark` are quite rare as they represent less than 1% of +this dataset. + + + +Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset +--------------------------------------------------------------------------------- + +In this exercise, we will continue our dataset exploration by analyzing +the categorical variables of this dataset. To do so, we will implement +our own `describe` functions. + + +1. Open a new Colab notebook. + +2. Import the `pandas `package: + ``` + import pandas as pd + ``` + + +3. Assign the following link to the AMES dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab10/dataset/ames_iowa_housing.csv' + ``` + + +4. Use the `.read_csv()` method from the `pandas` + package and load the dataset into a new variable called + `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Create a new DataFrame called `obj_df` with only the + columns that are of numerical types using the + `select_dtypes` method from `pandas` package. + Then, pass in the `object` value to the + `include `parameter: + ``` + obj_df = df.select_dtypes(include='object') + ``` + + +6. Using the `columns` attribute from `pandas`, + extract the list of columns of this DataFrame, `obj_df`, + assign it to a new variable called `obj_cols`, and print + its content: + + ``` + obj_cols = obj_df.columns + obj_cols + ``` + + + You should get the following output: + + +![](./images/B15019_10_16.jpg) + + + Caption: List of categorical variables + +7. Create a function called `describe_object` that takes a + `pandas `DataFrame and a column name as input parameters. + Then, inside the function, print out the name of the given column, + its number of unique values using the `nunique()` method, + and the list of values and their occurrence using the + `value_counts()` method, as shown in the following code + snippet: + ``` + def describe_object(df, col_name): + print(f"\nCOLUMN: {col_name}") + print(f"{df[col_name].nunique()} different values") + print(f"List of values:") + print(df[col_name].value_counts\ + (dropna=False, normalize=True)) + ``` + + +8. Test this function by providing the `df` DataFrame and the + `'MSZoning'` column: + + ``` + describe_object(df, 'MSZoning') + ``` + + + You should get the following output: + + +![Caption: Display of the created function for the MSZoning + column ](./images/B15019_10_17.jpg) + + + Caption: Display of the created function for the MSZoning + column + + For the `MSZoning` column, the `RL` value + represents almost `79%` of the values, while `C` + `(all)` is only present in less than `1%` of the + rows. + +9. 
Create a `for `loop that will call the created function + for every element from the `obj_cols` list: + + ``` + for col_name in obj_cols: + describe_object(df, col_name) + ``` + + + You should get the following output: + + +![](./images/B15019_10_18.jpg) + + + + +Summarizing Numerical Variables +=============================== + + +Now, let\'s have a look at a numerical column and get a good +understanding of its content. We will use some statistical measures that +summarize a variable. All of these measures are referred to as +descriptive statistics. In this lab, we will introduce you to the +most popular ones. + +With the `pandas` package, a lot of these measures have been +implemented as methods. For instance, if we want to know what the +highest value contained in the `'Quantity'` column is, we can +use the `.max()` method: + +``` +df['Quantity'].max() +``` +You should get the following output: + +``` +80995 +``` +We can see that the maximum quantity of an item sold in this dataset is +`80995`, which seems extremely high for a retail business. In +a real project, this kind of unexpected value will have to be discussed +and confirmed with the data owner or key stakeholders to see whether +this is a genuine or an incorrect value. Now, let\'s have a look at the +lowest value for `'Quantity'` using the `.min()` +method: + +``` +df['Quantity'].min() +``` +You should get the following output: + +``` +-80995 +``` + +The lowest value in this variable is extremely low. We can think that +having negative values is possible for returned items, but here, the +minimum (`-80995`) is very low. This, again, will be something +to be confirmed with the relevant people in your organization. + +Now, we are going to have a look at the central tendency of this column. +**Central tendency** is a statistical term referring to the central +point where the data will cluster around. The most famous central +tendency measure is the average (or mean). The average is calculated by +summing all the values of a column and dividing them by the number of +values. + +If we plot the `Quantity `column on a graph with its average, +it would look as follows: + +![](./images/B15019_10_19.jpg) + +Caption: Average value for the \'Quantity\' column + +We can see the average for the `Quantity `column is very close +to 0 and most of the data is between `-50` and +`+50`. + +We can get the average value of a feature by using the +`mean()` method from `pandas`: + +``` +df['Quantity'].mean() +``` +You should get the following output: + +``` +9.55224954743324 +``` + +In this dataset, the average quantity of items sold is around +`9.55`. The average measure is very sensitive to outliers and, +as we saw previously, the minimum and maximum values of the +`Quantity` column are quite extreme +(`-80995 to +80995`). + +We can use the median instead as another measure of central tendency. +The median is calculated by splitting the column into two groups of +equal lengths and getting the value of the middle point by separating +these two groups, as shown in the following example: + +![](./images/B15019_10_20.jpg) + +Caption: Sample median example + +In `pandas`, you can call the `median()` method to +get this value: + +``` +df['Quantity'].median() +``` +You should get the following output: + +``` +3.0 +``` + +The median value for this column is 3, which is quite different from the +mean (`9.55`) we found earlier. 
This tells us that there are +some outliers in this dataset and we will have to decide on how to +handle them after we\'ve done more investigation (this will be covered +in *Lab 11*, *Data Preparation*). + +We can also evaluate the spread of this column (how much the data points +vary from the central point). A common measure of spread is the standard +deviation. The smaller this measure is, the closer the data is to its +mean. On the other hand, if the standard deviation is high, this means +there are some observations that are far from the average. We will use +the `std()` method from `pandas `to calculate this +measure: + +``` +df['Quantity'].std() +``` +You should get the following output: + +``` +218.08115784986612 +``` +As expected, the standard deviation for this column is quite high, so +the data is quite spread from the average, which is `9.55` in +this example. + +In the `pandas `package, there is a method that can display +most of these descriptive statistics with one single line of code: +`describe()`: + +``` +df.describe() +``` +You should get the following output: + +![](./images/B15019_10_21.jpg) + +Caption: Output of the describe() method + +We got the exact same values for the `Quantity` column as we +saw previously. This method has calculated the descriptive statistics +for the three numerical columns (`Quantity`, +`UnitPrice`, and `CustomerID`). + +Even though the `CustomerID` column contains only numerical +data, we know these values are used to identify each customer and have +no mathematical meaning. For instance, it will not make sense to add +customer ID `12680 to 17850` in the table or calculate the +mean of these identifiers. This column is not actually numerical but +categorical. + +The `describe()` method doesn\'t know this information and +just noticed there are numbers, so it assumed this is a numerical +variable. This is the perfect example of why you should understand your +dataset perfectly and identify the issues to be fixed before feeding the +data to an algorithm. In this case, we will have to change the type of +this column to categorical. In *Lab 11*, *Data Preparation*, we will +see how we can handle this kind of issue, but for now, we will look at +some graphical tools and techniques that will help us have an even +better understanding of the data. + + + +Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset +--------------------------------------------------------------------------- + +In this exercise, we will continue our dataset exploration by analyzing +the numerical variables of this dataset. To do so, we will implement our +own `describe `functions. + + +1. Open a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the AMES dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab10/dataset/ames_iowa_housing.csv' + ``` + + +4. Use the `.read_csv()` method from the + `pandas `package and load the dataset into a new variable + called `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Create a new DataFrame called `num_df` with only the + columns that are numerical using the `select_dtypes` + method from the `pandas `package and pass in the + `'number'` value to the `include` parameter: + ``` + num_df = df.select_dtypes(include='number') + ``` + + +6. 
Using the `columns` attribute from `pandas`, + extract the list of columns of this DataFrame, `num_df`, + assign it to a new variable called `num_cols`, and print + its content: + + ``` + num_cols = num_df.columns + num_cols + ``` + + + You should get the following output: + + +![](./images/B15019_10_22.jpg) + + + Caption: List of numerical columns + +7. Create a function called `describe_numeric` that takes a + `pandas `DataFrame and a column name as input parameters. + Then, inside the function, print out the name of the given column, + its minimum value using `min()`, its maximum value using + `max()`, its average value using `mean()`, its + standard deviation using `std()`, and its + `median` using `median()`: + ``` + def describe_numeric(df, col_name): + print(f"\nCOLUMN: {col_name}") + print(f"Minimum: {df[col_name].min()}") + print(f"Maximum: {df[col_name].max()}") + print(f"Average: {df[col_name].mean()}") + print(f"Standard Deviation: {df[col_name].std()}") + print(f"Median: {df[col_name].median()}") + ``` + + +8. Now, test this function by providing the `df` DataFrame + and the `SalePrice` column: + + ``` + describe_numeric(df, 'SalePrice') + ``` + + + You should get the following output: + + +![](./images/B15019_10_23.jpg) + + + Caption: The display of the created function for the + \'SalePrice\' column + + The sale price ranges from `34,900` to + `755,000 `with an average of `180,921`. The + median is slightly lower than the average, which tells us there are + some outliers with high sales prices. + +9. Create a `for `loop that will call the created function + for every element from the `num_cols` list: + + ``` + for col_name in num_cols: + describe_numeric(df, col_name) + ``` + + + You should get the following output: + + +![](./images/B15019_10_24.jpg) + + + +Visualizing Your Data +===================== + + +In the previous section, we saw how to explore a new dataset and +calculate some simple descriptive statistics. These measures helped +summarize the dataset into interpretable metrics, such as the average or +maximum values. Now it is time to dive even deeper and get a more +granular view of each column using data visualization. + +In a data science project, data visualization can be used either for +data analysis or communicating gained insights. Presenting results in a +visual way that stakeholders can easily understand and interpret them in +is definitely a must-have skill for any good data scientist. + +However, in this lab, we will be focusing on using data +visualization for analyzing data. Most people tend to interpret +information more easily on a graph than reading written information. For +example, when looking at the following descriptive statistics and the +scatter plot for the same variable, which one do you think is easier to +interpret? Let\'s take a look: + +![](./images/B15019_10_25.jpg) + +Caption: Sample visual data analysis + +Even though the information shown with the descriptive statistics are +more detailed, by looking at the graph, you have already seen that the +data is stretched and mainly concentrated around the value 0. It +probably took you less than 1 or 2 seconds to come up with this +conclusion, that is, there is a cluster of points near the 0 value and +that it gets reduced while moving away from it. Coming to this +conclusion would have taken you more time if you were interpreting the +descriptive statistics. This is the reason why data visualization is a +very powerful tool for effectively analyzing data. 
+ + + +Using the Altair API +-------------------- + +We will be using a package called `altair` (if you recall, we +already briefly used it in *Lab 5*, *Performing Your First Cluster +Analysis*). There are quite a lot of Python packages for data +visualization on the market, such as `matplotlib`, +`seaborn`, or `Bokeh`, and compared to them, +`altair` is relatively new, but its community of users is +growing fast thanks to its simple API syntax. + +Let\'s see how we can display a bar chart step by step on the online +retail dataset. + +First, import the `pandas` and `altair` packages: + +``` +import pandas as pd +import altair as alt +``` + +Then, load the data into a `pandas` DataFrame: + +``` +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab10/dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` +We will randomly sample 5,000 rows of this DataFrame using the +`sample()` method (`altair `requires additional +steps in order to display a larger dataset): + +``` +sample_df = df.sample(n=5000, random_state=8) +``` +Now instantiate a `Chart` object from `altair` with +the `pandas `DataFrame as its input parameter: + +``` +base = alt.Chart(sample_df) +``` +Next, we call the `mark_circle()` method to specify the type +of graph we want to plot: a scatter plot: + +``` +chart = base.mark_circle() +``` +Finally, we specify the names of the columns that will be displayed on +the *x* and *y* axes using the `encode()` method: + +``` +chart.encode(x='Quantity', y='UnitPrice') +``` +We just plotted a scatter plot in seven lines of code: + +![](./images/B15019_10_26.jpg) + +Caption: Output of the scatter plot + +Altair provides the option for combining its methods all together into +one single line of code, like this: + +``` +alt.Chart(sample_df).mark_circle()\ + .encode(x='Quantity', y='UnitPrice') +``` +You should get the following output: + +![](./images/B15019_10_27.jpg) + +Caption: Output of the scatter plot with combined altair methods + +We can see that we got the exact same output as before. This graph shows +us that there are a lot of outliers (extreme values) for both variables: +most of the values of `UnitPrice` are below 100, but there are +some over 300, and `Quantity` ranges from -200 to 800, while +most of the observations are between -50 to 150. We can also notice a +pattern where items with a high unit price have lower quantity (items +over 50 in terms of unit price have a quantity close to 0) and the +opposite is also true (items with a quantity over 100 have a unit price +close to 0). + +Now, let\'s say we want to visualize the same plot while adding the +`Country` column\'s information. One easy way to do this is to +use the `color` parameter from the `encode()` +method. This will color all the data points according to their value in +the `Country` column: + +``` +alt.Chart(sample_df).mark_circle()\ + .encode(x='Quantity', y='UnitPrice', color='Country') +``` +You should get the following output: + +![](./images/B15019_10_28.jpg) + +Caption: Scatter plot with colors based on the \'Country\' column + +We added the information from the `Country` column into the +graph, but as we can see, there are too many values and it is hard to +differentiate between countries: there are a lot of blue points, but it +is hard to tell which countries they are representing. 
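One possible workaround (this is only a sketch and is not part of the original
walkthrough; the `top_countries` and `CountryGroup` names are
illustrative) is to keep the most frequent countries and group every other
country under a single `Other` label before plotting:

```
# Keep the five most frequent countries and relabel the rest as 'Other'
top_countries = sample_df['Country'].value_counts().nlargest(5).index
sample_df['CountryGroup'] = sample_df['Country']\
                            .where(sample_df['Country']\
                                   .isin(top_countries), 'Other')
alt.Chart(sample_df).mark_circle()\
   .encode(x='Quantity', y='UnitPrice', color='CountryGroup')
```

This keeps the legend short enough to read while the rare countries remain
visible as a single color.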
+ +With `altair`, we can easily add some interactions on the +graph in order to display more information for each observation; we just +need to use the `tooltip` parameter from the +`encode()` method and specify the list of columns to be +displayed and then call the `interactive()` method to make the +whole thing interactive (as seen previously in *Lab 5*, *Performing +Your First Cluster Analysis*): + +``` +alt.Chart(sample_df).mark_circle()\ + .encode(x='Quantity', y='UnitPrice', color='Country', \ + tooltip=['InvoiceNo','StockCode','Description',\ + 'InvoiceDate','CustomerID']).interactive() +``` +You should get the following output: + +![](./images/B15019_10_29.jpg) + +Caption: Interactive scatter plot with tooltip + +Now, if we hover on the observation with the highest +`UnitPrice` value (the one near 600), we can see the +information displayed by the tooltip: this observation doesn\'t have any +value for `StockCode` and its `Description` is +`Manual`. So, it seems that this is not a normal transaction +to happen on the website. It may be a special order that has been +manually entered into the system. This is something you will have to +discuss with your stakeholder and confirm. + + + +Histogram for Numerical Variables +--------------------------------- + +Now that we are familiar with the `altair` API, let\'s have a +look at some specific type of charts that will help us analyze and +understand each variable. First, let\'s focus on numerical variables +such as `UnitPrice` or `Quantity` in the online +retail dataset. + +For this type of variable, a histogram is usually used to show the +distribution of a given variable. The x axis of a histogram will show +the possible values in this column and the y axis will plot the number +of observations that fall under each value. Since the number of possible +values can be very high for a numerical variable (potentially an +infinite number of potential values), it is better to group these values +by chunks (also called bins). For instance, we can group prices into +bins of 10 steps (that is, groups of 10 items each) such as 0 to 10, 11 +to 20, 21 to 30, and so on. + +Let\'s look at this by using a real example. We will plot a histogram +for `'UnitPrice'` using the `mark_bar()` and +`encode()` methods with the following parameters: + +- `alt.X("UnitPrice:Q", bin=True)`: This is another + `altair `API syntax that allows you to tune some of the + parameters for the x axis. Here, we are telling altair to use the + `'UnitPrice'` column as the axis. `':Q'` + specifies that this column is quantitative data (that is, numerical) + and `bin=True` forces the grouping of the possible values + into bins. +- `y='count()'`: This is used for calculating the number of + observations and plotting them on the y axis, like so: + +``` +alt.Chart(sample_df).mark_bar()\ + .encode(alt.X("UnitPrice:Q", bin=True), \ + y='count()') +``` +You should get the following output: + +![](./images/B15019_10_30.jpg) + +Caption: Histogram for UnitPrice with the default bin step size + +By default, `altair` grouped the observations by bins of 100 +steps: 0 to 100, then 100 to 200, and so on. The step size that was +chosen is not optimal as almost all the observations fell under the +first bin (0 to 100) and we can\'t see any other bin. 
With +`altair`, we can specify the values of the parameter bin and +we will try this with 5, that is, `alt.Bin(step=5)`: + +``` +alt.Chart(sample_df).mark_bar()\ + .encode(alt.X("UnitPrice:Q", bin=alt.Bin(step=5)), \ + y='count()') +``` +You should get the following output: + +![](./images/B15019_10_31.jpg) + +Caption: Histogram for UnitPrice with a bin step size of 5 + +This is much better. With this step size, we can see that most of the +observations have a unit price under 5 (almost 4,200 observations). We +can also see that a bit more than 500 data points have a unit price +under 10. The count of records keeps decreasing as the unit price +increases. + +Let\'s plot the histogram for the `Quantity` column with a bin +step size of 10: + +``` +alt.Chart(sample_df).mark_bar()\ + .encode(alt.X("Quantity:Q", bin=alt.Bin(step=10)), \ + y='count()') +``` +You should get the following output: + +![](./images/B15019_10_32.jpg) + +Caption: Histogram for Quantity with a bin step size of 10 + +In this histogram, most of the records have a positive quantity between +0 and 30 (first three highest bins). There is also a bin with around 50 +observations that have a negative quantity from -10 to 0. As we +mentioned earlier, these may be returned items from customers. + + + +Bar Chart for Categorical Variables +----------------------------------- + +Now, we are going to have a look at categorical variables. For such +variables, there is no need to group the values into bins as, by +definition, they have a limited number of potential values. We can still +plot the distribution of such columns using a simple bar chart. In +`altair`, this is very simple -- it is similar to plotting a +histogram but without the `bin` parameter. Let\'s try this on +the `Country` column and look at the number of records for +each of its values: + +``` +alt.Chart(sample_df).mark_bar()\ + .encode(x='Country',y='count()') +``` +You should get the following output: + +![](./images/B15019_10_33.jpg) + +Caption: Bar chart of the Country column\'s occurrence + +We can confirm that `United Kingdom` is the most represented +country in this dataset (and by far), followed by `Germany`, +`France`, and `EIRE`. We clearly have imbalanced +data that may affect the performance of a predictive model. In *Lab +13*, *Imbalanced Datasets*, we will look at how we can handle this +situation. + +Now, let\'s analyze the datetime column, that is, +`InvoiceDate`. The `altair` package provides some +functionality that we can use to group datetime information by period, +such as day, day of week, month, and so on. For instance, if we want to +have a monthly view of the distribution of a variable, we can use the +`yearmonth` function to group datetimes. We also need to +specify that the type of this variable is ordinal (there is an order +between the values) by adding `:O` to the column name: + +``` +alt.Chart(sample_df).mark_bar()\ + .encode(alt.X('yearmonth(InvoiceDate):O'),\ + y='count()') +``` +You should get the following output: + +![](./images/B15019_10_34.jpg) + +Caption: Distribution of InvoiceDate by month + +This graph tells us that there was a huge spike of items sold in +November 2011. It peaked to 800 items sold in this month, while the +average is around 300. Was there a promotion or an advertising campaign +run at that time that can explain this increase? These are the questions +you may want to ask your stakeholders so that they can confirm this +sudden increase of sales. 
+ + +Boxplots +======== + + +Now, we will have a look at another specific type of chart called a +**boxplot**. This kind of graph is used to display the distribution of a +variable based on its quartiles. Quartiles are the values that split a +dataset into quarters. Each quarter contains exactly 25% of the +observations. For example, in the following sample data, the quartiles +will be as follows: + +![](./images/B15019_10_35.jpg) + +Caption: Example of quartiles for the given data + +So, the first quartile (usually referred to as Q1) is 4; the second one +(Q2), which is also the median, is 5; and the third quartile (Q3) is 8. + +A boxplot will show these quartiles but also additional information, +such as the following: + +- The **interquartile range (or IQR)**, which corresponds to Q3 - Q1 +- The *lowest* value, which corresponds to Q1 - (1.5 \* IQR) +- The *highest* value, which corresponds to Q3 + (1.5 \* IQR) +- Outliers, that is, any point outside of the lowest and highest + points: + +![](./images/B15019_10_36.jpg) + + +Caption: Example of a boxplot + +With a boxplot, it is quite easy to see the central point (median), +where 50% of the data falls under (IQR), and the outliers. + +Another benefit of using a boxplot is to plot the distribution of +categorical variables against a numerical variable and compare them. +Let\'s try it with the `Country` and `Quantity` +columns using the `mark_boxplot()` method: + +``` +alt.Chart(sample_df).mark_boxplot()\ + .encode(x='Country:O', y='Quantity:Q') +``` +You should receive the following output: + +![](./images/B15019_10_37.jpg) + +Caption: Boxplot of the \'Country\' and \'Quantity\' columns + +This chart shows us how the `Quantity` variable is distributed +across the different countries for this dataset. We can see that +`United Kingdom` has a lot of outliers, especially in the +negative values. `Eire` is another country that has negative +outliers. Most of the countries have very low value quantities except +for `Japan`, `Netherlands`, and `Sweden`, +who sold more items. + +In this section, we saw how to use the `altair` package to +generate graphs that helped us get additional insights about the dataset +and identify some potential issues. + + + +Exercise 10.04: Visualizing the Ames Housing Dataset with Altair +---------------------------------------------------------------- + +In this exercise, we will learn how to get a better understanding of a +dataset and the relationship between variables using data visualization +features such as histograms, scatter plots, or boxplots. + +Note + +You will be using the same Ames housing dataset that was used in the +previous exercises. + +1. Open a new Colab notebook. + +2. Import the `pandas` and `altair` packages: + ``` + import pandas as pd + import altair as alt + ``` + + +3. Assign the link to the AMES dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab10/dataset/ames_iowa_housing.csv' + ``` + + +4. Using the `read_csv` method from the pandas package, load + the dataset into a new variable called `'df'`: + + ``` + df = pd.read_csv(file_url) + ``` + + + Plot the histogram for the `SalePrice` variable using the + `mark_bar()` and `encode()` methods from the + `altair` package. 
Use the `alt.X` and + `alt.Bin` APIs to specify the number of bin steps, that + is, `50000`: + + ``` + alt.Chart(df).mark_bar()\ + .encode(alt.X("SalePrice:Q", bin=alt.Bin(step=50000)),\ + y='count()') + ``` + + + You should get the following output: + + +![](./images/B15019_10_38.jpg) + + + Caption: Histogram of SalePrice + + This chart shows that most of the properties have a sale price + centered around `100,000 – 150,000`. There are also a few + outliers with a high sale price over `500,000`. + +5. Now, let\'s plot the histogram for `LotArea` but this time + with a bin step size of `10000`: + + ``` + alt.Chart(df).mark_bar()\ + .encode(alt.X("LotArea:Q", bin=alt.Bin(step=10000)),\ + y='count()') + ``` + + + You should get the following output: + + +![](./images/B15019_10_39.jpg) + + + Caption: Histogram of LotArea + + `LotArea` has a totally different distribution compared to + `SalePrice`. Most of the observations are between + `0` and `20,000`. The rest of the observations + represent a small portion of the dataset. We can also notice some + extreme outliers over `150,000`. + +6. Now, plot a scatter plot with `LotArea` as the *x* axis + and `SalePrice` as the *y* axis to understand the + interactions between these two variables: + + ``` + alt.Chart(df).mark_circle()\ + .encode(x='LotArea:Q', y='SalePrice:Q') + ``` + + + You should get the following output: + + +![](./images/B15019_10_40.jpg) + + + Caption: Scatter plot of SalePrice and LotArea + + There is clearly a correlation between the size of the property and + the sale price. If we look only at the properties with + `LotArea` under 50,000, we can see a linear relationship: + if we draw a straight line from the (`0,0`) coordinates to + the (`20000,800000`) coordinates, we can say that + `SalePrice` increases by 40,000 for each additional + increase of 1,000 for `LotArea`. The formula of this + straight line (or regression line) will be + `SalePrice = 40000 * LotArea / 1000`. We can also see + that, for some properties, although their size is quite high, their + price didn\'t follow this pattern. For instance, the property with a + size of 160,000 has been sold for less than 300,000. + +7. Now, let\'s plot the histogram for `OverallCond`, but this + time with the default bin step size, that is, + (`bin=True`): + + ``` + alt.Chart(df).mark_bar()\ + .encode(alt.X("OverallCond", bin=True), \ + y='count()') + ``` + + + You should get the following output: + + +![](./images/B15019_10_41.jpg) + + + Caption: Histogram of OverallCond + + We can see that the values contained in this column are discrete: + they can only take a finite number of values (any integer between + `1` and `9`). This variable is not numerical, + but ordinal: the order matters, but you can\'t perform some + mathematical operations on it such as adding value `2` to + value `8`. This column is an arbitrary mapping to assess + the overall quality of the property. In the next lab, we will + look at how we can change the type of such a column. + +8. 
Build a boxplot with `OverallCond:O` (`':O'` is + for specifying that this column is ordinal) on the *x* axis and + `SalePrice` on the *y* axis using the + `mark_boxplot()` method, as shown in the following code + snippet: + + ``` + alt.Chart(df).mark_boxplot()\ + .encode(x='OverallCond:O', y='SalePrice:Q') + ``` + + + You should get the following output: + + +![](./images/B15019_10_42.jpg) + + + Caption: Boxplot of OverallCond + + It seems that the `OverallCond` variable is in ascending + order: the sales price is higher if the condition value is high. + However, we will notice that `SalePrice` is quite high for + the value 5, which seems to represent a medium condition. There may + be other factors impacting the sales price for this category, such + as higher competition between buyers for such types of properties. + +9. Now, let\'s plot a bar chart for `YrSold` as its *x* axis + and `count()` as its *y* axis. Don\'t forget to specify + that `YrSold` is an ordinal variable and not numerical + using `':O'`: + + ``` + alt.Chart(df).mark_bar()\ + .encode(alt.X('YrSold:O'), y='count()') + ``` + + + You should get the following output: + + +![](./images/B15019_10_43.jpg) + + + Caption: Bar chart of YrSold + + We can see that, roughly, the same number of properties are sold + every year, except for 2010. From 2006 to 2009, there was, on + average, 310 properties sold per year, while there were only 170 + in 2010. + +10. Plot a boxplot similar to the one shown in *Step 8* but for + `YrSold` as its *x* axis: + + ``` + alt.Chart(df).mark_boxplot()\ + .encode(x='YrSold:O', y='SalePrice:Q') + ``` + + + You should get the following output: + + +![](./images/B15019_10_44.jpg) + + + Caption: Boxplot of YrSold and SalePrice + + Overall, the median sale price is quite stable across the years, + with a slight decrease in 2010. + +11. Let\'s analyze the relationship between `SalePrice` and + `Neighborhood` by plotting a bar chart, similar to the one + shown in *Step 9*: + + ``` + alt.Chart(df).mark_bar()\ + .encode(x='Neighborhood',y='count()') + ``` + + + You should get the following output: + + +![](./images/B15019_10_45.jpg) + + + Caption: Bar chart of Neighborhood + + The number of sold properties differs, depending on their location. + The `'NAmes'` neighborhood has the higher number of + properties sold: over 220. On the other hand, neighborhoods such as + `'Blueste'` or `'NPkVill'` only had a few + properties sold. + +12. Let\'s analyze the relationship between `SalePrice` and + `Neighborhood` by plotting a boxplot chart similar to the + one in *Step 10*: + + ``` + alt.Chart(df).mark_boxplot()\ + .encode(x='Neighborhood:O', y='SalePrice:Q') + ``` + + + You should get the following output: + + +![](./images/B15019_10_46.jpg) + + +Caption: Boxplot of Neighborhood and SalePrice + + + +Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques +-------------------------------------------------------------------------- + +You are working for a major telecommunications company. The marketing +department has noticed a recent spike of customer churn (*customers that +stopped using or canceled their service from the company*). + + +The following steps will help you complete this activity: + +1. Download and load the dataset into Python using + `.read_csv()`. +2. Explore the structure and content of the dataset by using + `.shape`, `.dtypes`, `.head()`, + `.tail()`, or `.sample()`. +3. Calculate and interpret descriptive statistics with + `.describe()`. +4. 
Analyze each variable using data visualization with bar charts, + histograms, or boxplots. +5. Identify areas that need clarification from the marketing department + and potential data quality issues. + +**Expected Output** + +Here is the expected bar chart output: + +![](./images/B15019_10_47.jpg) + +Caption: Expected bar chart output + +Here is the expected histogram output: + +![](./images/B15019_10_48.jpg) + +Caption: Expected histogram output + +Here is the expected boxplot output: + +![](./images/B15019_10_49.jpg) + +Caption: Expected boxplot output + + + +Summary +======= + + +You just learned a lot regarding how to analyze a dataset. This a very +critical step in any data science project. Getting a deep understanding +of the dataset will help you to better assess the feasibility of +achieving the requirements from the business. + +You learned how to use descriptive statistics to summarize key +attributes of the dataset such as the average value of a numerical +column, its spread with standard deviation or its range (minimum and +maximum values), the unique values of a categorical variable, and its +most frequent values. You also saw how to use data visualization to get +valuable insights for each variable. Now, you know how to use scatter +plots, bar charts, histograms, and boxplots to understand the +distribution of a column. + diff --git a/lab_guides/Lab_11.md b/lab_guides/Lab_11.md new file mode 100644 index 0000000..8889cb8 --- /dev/null +++ b/lab_guides/Lab_11.md @@ -0,0 +1,1794 @@ + +11. Data Preparation +==================== + + + +Overview + +By the end of this lab you will be able to filter DataFrames with +specific conditions; remove duplicate or irrelevant records or columns; +convert variables into different data types; replace values in a column +and handle missing values and outlier observations. + +This lab will introduce you to the main techniques you can use to +handle data issues in order to achieve high quality for your dataset +prior to modeling it. + + +Introduction +============ + + +In the previous lab, you saw how critical it was to get a very good +understanding of your data and learned about different techniques and +tools to achieve this goal. While performing **Exploratory Data +Analysis** (**EDA**) on a given **dataset**, you may find some potential +issues that need to be addressed before the modeling stage. This is +exactly the topic that will be covered in this lab. You will learn +how you can handle some of the most frequent data quality issues and +prepare the dataset properly. + +This lab will introduce you to the issues that you will encounter +frequently during your data scientist career (such as **duplicated** +**rows**, incorrect data types, incorrect values, and missing values) +and you will learn about the techniques you can use to easily fix them. +But be careful -- some issues that you come across don\'t necessarily +need to be fixed. Some of the suspicious or unexpected values you find +may be genuine from a business point of view. This includes values that +crop up very rarely but are totally genuine. Therefore, it is extremely +important to get confirmation either from your stakeholder or the data +engineering team before you alter the dataset. It is your responsibility +to make sure you are making the right decisions for the business while +preparing the dataset. + +For instance, in *Lab 10*, *Analyzing a Dataset*, you looked at the +*Online Retail dataset*, which had some negative values in the +`Quantity` column. 
Here, we expected only positive values. But
before fixing this issue straight away (by either dropping the records
or transforming them into positive values), it is preferable to get in
touch with your stakeholders first and check whether these values are
significant for the business. They may tell you that
these values are extremely important as they represent returned items
and cost the company a lot of money, so they want to analyze these cases
in order to reduce these numbers. If you had moved to the data cleaning
stage straight away, you would have missed this critical piece of
information and potentially come up with incorrect results.


Handling Row Duplication
========================


Most of the time, the datasets you will receive or have access to will
not have been 100% cleaned. They usually have some issues that need to
be fixed. One of these issues could be duplicated rows. Row duplication
means that several observations contain the exact same information in
the dataset. With the `pandas` package, it is extremely easy
to find these cases.

Let\'s use the example that we saw in *Lab 10*, *Analyzing a
Dataset*.

Start by **importing** the dataset into a DataFrame:

```
import pandas as pd
file_url = 'https://github.com/fenago/'\
           'data-science/blob/'\
           'master/Lab10/dataset/'\
           'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
```

The `duplicated()` method from `pandas` checks
whether any of the rows are duplicates and returns a **boolean** value
for each row, `True` if the row is a duplicate and
`False` if not:

```
df.duplicated()
```
You should get the following output:

![](./images/B15019_11_01.jpg)

Caption: Output of the duplicated() method

Note

The outputs in this lab have been truncated to effectively use the
page area.

In Python, the `True` and `False` binary values
correspond to the numerical values 1 and 0, respectively. To find out
how many rows have been identified as duplicates, you can use the
`sum()` method on the output of `duplicated()`. This
will add up all the 1s (that is, `True` values), giving us the
count of duplicates:

```
df.duplicated().sum()
```
You should get the following output:

```
5268
```
Since the output of the `duplicated()` method is a
`pandas` series of binary values for each row, you can also
use it to subset the rows of a DataFrame. The `pandas` package
provides different APIs for subsetting a DataFrame, as follows:

- `df[<columns or rows>]`
- `df.loc[<rows>, <columns>]`
- `df.iloc[<rows>, <columns>]`

The first API subsets the DataFrame by **row** or **column**. To filter
specific columns, you can provide a list that contains their names. For
instance, if you want to keep only the variables, that is,
`InvoiceNo`, `StockCode`, `InvoiceDate`,
and `CustomerID`, you need to use the following code:

```
df[['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']]
```
You should get the following output:

![](./images/B15019_11_02.jpg)

Caption: Subsetting columns

If you only want to filter the rows that are considered duplicates, you
can use the same API call with the output of the
`duplicated()` method. It will only keep the rows with
**True** as a value:

```
df[df.duplicated()]
```
You should get the following output:

![](./images/B15019_11_03.jpg)

Caption: Subsetting the duplicated rows

If you want to subset the rows and columns at the same time, you must
use one of the other two available APIs: `.loc` or
`.iloc`. 
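As a quick side-by-side illustration (the row and column selections below are
purely an example and are not part of the original walkthrough), the two calls
shown here return the same 2 x 2 slice of the DataFrame, assuming the default
integer index:

```
# Same slice, two APIs: labels with .loc, integer positions with .iloc
df.loc[0:1, ['InvoiceNo', 'StockCode']]
df.iloc[0:2, 0:2]
```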
These APIs do the exact same thing but +`.loc` uses labels or names while `.iloc` only takes +indices as input. You will use the `.loc` API to subset the +duplicated rows and keep only the selected four columns, as shown in the +previous example: + +``` +df.loc[df.duplicated(), ['InvoiceNo', 'StockCode', \ + 'InvoiceDate', 'CustomerID']] +``` +You should get the following output: + +![Caption: Subsetting the duplicated rows and selected columns using +.loc ](./images/B15019_11_04.jpg) + +Caption: Subsetting the duplicated rows and selected columns using +.loc + +This preceding output shows that the first few duplicates are row +numbers `517`, `527`, `537`, and so on. By +default, `pandas` doesn\'t mark the first occurrence of +duplicates as duplicates: all the same, duplicates will have a value of +`True` except for the first occurrence. You can change this +behavior by specifying the `keep` parameter. If you want to +keep the last duplicate, you need to specify `keep='last'`: + +``` +df.loc[df.duplicated(keep='last'), ['InvoiceNo', 'StockCode', \ + 'InvoiceDate', 'CustomerID']] +``` +You should get the following output: + +![](./images/B15019_11_05.jpg) + +Caption: Subsetting the last duplicated rows + +As you can see from the previous outputs, row `485` has the +same value as row `539`. As expected, row `539` is +not marked as a duplicate anymore. If you want to mark all the duplicate +records as duplicates, you will have to use `keep=False`: + +``` +df.loc[df.duplicated(keep=False), ['InvoiceNo', 'StockCode',\ + 'InvoiceDate', 'CustomerID']] +``` +You should get the following output: + +![](./images/B15019_11_06.jpg) + +Caption: Subsetting all the duplicated rows + +This time, rows `485` and `539` have been listed as +duplicates. Now that you know how to identify duplicate observations, +you can decide whether you wish to remove them from the dataset. As we +mentioned previously, you must be careful when changing the data. You +may want to confirm with the business that they are comfortable with you +doing so. You will have to explain the reason why you want to remove +these rows. In the Online Retail dataset, if you take rows +`485` and `539` as an example, these two +observations are identical. From a business perspective, this means that +a specific customer (`CustomerID 17908`) has bought the same +item (`StockCode 22111`) at the exact same date and time +(`InvoiceDate 2010-12-01 11:45:00`) on the same invoice +(`InvoiceNo 536409`). This is highly suspicious. When you\'re +talking with the business, they may tell you that new software was +released at that time and there was a bug that was splitting all the +purchased items into single-line items. + +In this case, you know that you shouldn\'t remove these rows. On the +other hand, they may tell you that duplication shouldn\'t happen and +that it may be due to human error as the data was entered or during the +data extraction step. Let\'s assume this is the case; now, it is safe +for you to remove these rows. + +To do so, you can use the `drop_duplicates()` method from +`pandas`. It has the same `keep` parameter as +`duplicated()`, which specifies which duplicated record you +want to keep or if you want to remove all of them. In this case, we want +to keep at least one duplicate row. 
Here, we want to keep the first +occurrence: + +``` +df.drop_duplicates(keep='first') +``` +You should get the following output: + +![](./images/B15019_11_07.jpg) + +Caption: Dropping duplicate rows with keep=\'first\' + +The output of this method is a new DataFrame that contains unique +records where only the first occurrence of duplicates has been kept. If +you want to replace the existing DataFrame rather than getting a new +DataFrame, you need to use the `inplace=True` parameter. + +The `drop_duplicates()` and `duplicated()` methods +also have another very useful parameter: `subset`. This +parameter allows you to specify the list of columns to consider while +looking for duplicates. By default, all the columns of a DataFrame are +used to find duplicate rows. Let\'s see how many duplicate rows there +are while only looking at the `InvoiceNo`, +`StockCode`, `invoiceDate`, and +`CustomerID` columns: + +``` +df.duplicated(subset=['InvoiceNo', 'StockCode', 'InvoiceDate',\ + 'CustomerID'], keep='first').sum() +``` +You should get the following output: + +``` +10677 +``` + +By looking only at these four columns instead of all of them, we can see +that the number of duplicate rows has increased from `5268` to +`10677`. This means that there are rows that have the exact +same values as these four columns but have different values in other +columns, which means they may be different records. In this case, it is +better to use all the columns to identify duplicate records. + + + +Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset +-------------------------------------------------------------- + +In this exercise, you will learn how to identify duplicate records and +how to handle such issues so that the dataset only contains **unique** +records. Let\'s get started: + + +1. Open a new **Colab** notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the `Breast Cancer` dataset to a + variable called `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab11/dataset/'\ + 'breast-cancer-wisconsin.data' + ``` + + +4. Using the `read_csv()` method from the `pandas` + package, load the dataset into a new variable called `df` + with the `header=None` parameter. We\'re doing this + because this file doesn\'t contain column names: + ``` + df = pd.read_csv(file_url, header=None) + ``` + + +5. Create a variable called `col_names` that contains the + names of the columns: + `Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses`, + and `Class`: + + + + ``` + col_names = ['Sample code number','Clump Thickness',\ + 'Uniformity of Cell Size',\ + 'Uniformity of Cell Shape',\ + 'Marginal Adhesion','Single Epithelial Cell Size',\ + 'Bare Nuclei','Bland Chromatin',\ + 'Normal Nucleoli','Mitoses','Class'] + ``` + + +6. Assign the column names of the DataFrame using the + `columns` attribute: + ``` + df.columns = col_names + ``` + + +7. Display the shape of the DataFrame using the `.shape` + attribute: + + ``` + df.shape + ``` + + + You should get the following output: + + ``` + (699, 11) + ``` + + + This DataFrame contains `699` rows and `11` + columns. + +8. 
Display the first five rows of the DataFrame using the + `head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_11_08.jpg) + + + Caption: The first five rows of the Breast Cancer dataset + + All the variables are numerical. The Sample code number column is an + identifier for the measurement samples. + +9. Find the number of duplicate rows using the `duplicated()` + and `sum()` methods: + + ``` + df.duplicated().sum() + ``` + + + You should get the following output: + + ``` + 8 + ``` + + + Looking at the 11 columns in this dataset, we can see that there are + `8` duplicate rows. + +10. Display the duplicate rows using the `loc()` and + `duplicated()` methods: + + ``` + df.loc[df.duplicated()] + ``` + + + You should get the following output: + + +![](./images/B15019_11_09.jpg) + + + Caption: Duplicate records + + The following rows are duplicates: `208`, `253`, + `254`, `258`, `272`, `338`, + `561`, and `684`. + +11. Display the duplicate rows just like we did in *Step 9*, but with + the `keep='last'` parameter instead: + + ``` + df.loc[df.duplicated(keep='last')] + ``` + + + You should get the following output: + + +![](./images/B15019_11_10.jpg) + + + Caption: Duplicate records with keep=\'last\' + + By using the `keep='last'` parameter, the following rows + are considered duplicates: `42`, `62`, + `168`, `207`, `267`, `314`, + `560`, and `683`. By comparing this output to + the one from the previous step, we can see that rows 253 and 42 are + identical. + +12. Remove the duplicate rows using the `drop_duplicates()` + method along with the `keep='first'` parameter and save + this into a new DataFrame called `df_unique`: + ``` + df_unique = df.drop_duplicates(keep='first') + ``` + + +13. Display the shape of `df_unique` with the + `.shape` attribute: + + ``` + df_unique.shape + ``` + + + You should get the following output: + + ``` + (691, 11) + ``` + + + Now that we have removed the eight duplicate records, only + `691` rows remain. Now, the dataset only contains unique + observations. + + + +In this exercise, you learned how to identify and remove duplicate +records from a real-world dataset. + + +Converting Data Types +===================== + + +Another problem you may face in a project is incorrect data types being +inferred for some columns. As we saw in *Lab 10*, *Analyzing a +Dataset*, the `pandas` package provides us with a very easy +way to display the data type of each column using the +`.dtypes` attribute. You may be wondering, when did +`pandas` identify the type of each column? The types are +detected when you load the dataset into a `pandas` DataFrame +using methods such as `read_csv()`, `read_excel()`, +and so on. + +When you\'ve done this, `pandas` will try its best to +automatically find the best type according to the values contained in +each column. Let\'s see how this works on the `Online Retail` +dataset. 
First, you must import `pandas`:

```
import pandas as pd
```

Then, you need to assign the URL to the dataset to a new variable:

```
file_url = 'https://github.com/fenago/'\
           'data-science/blob/'\
           'master/Lab10/dataset/'\
           'Online%20Retail.xlsx?raw=true'
```
Let\'s load the dataset into a `pandas` DataFrame using
`read_excel()`:

```
df = pd.read_excel(file_url)
```
Finally, let\'s print the data type of each column:

```
df.dtypes
```
You should get the following output:

![Caption: The data type of each column of the Online Retail
dataset ](./images/B15019_11_11.jpg)

Caption: The data type of each column of the Online Retail dataset

The preceding output shows the data types that have been assigned to
each column. `Quantity`, `UnitPrice`, and
`CustomerID` have been identified as numerical variables
(`int64`, `float64`), `InvoiceDate` is a
`datetime` variable, and all the other columns are considered
text (`object`). This is not too bad. `pandas` did a
great job of recognizing non-text columns.

But what if you want to change the types of some columns? You have two
ways to achieve this.

The first way is to reload the dataset, but this time, you will need to
specify the data types of the columns of interest using the
`dtype` parameter. This parameter takes a dictionary with the
column names as keys and the correct data types as values, such as
`{'col1': np.float64, 'col2': np.int32}`, as input. Let\'s try this on
`CustomerID`. We know this isn\'t a numerical variable as it
contains a unique **identifier** (code). Here, we are going to change
its type to a **categorical** type:

```
df = pd.read_excel(file_url, dtype={'CustomerID': 'category'})
df.dtypes
```
You should get the following output:

![](./images/B15019_11_12.jpg)

Caption: The data types of each column after converting CustomerID

As you can see, the data type for `CustomerID` has effectively
changed to a `category` type.

Now, let\'s look at the second way of converting a single column into a
different type. In `pandas`, you can use the
`astype()` method and specify the new data type that it will
be converted into as its **parameter**. It will return a new column (a
new `pandas` series, to be more precise), so you need to
reassign it to the same column of the DataFrame. For instance, if you
want to change the `InvoiceNo` column into a categorical
variable, you would do the following:

```
df['InvoiceNo'] = df['InvoiceNo'].astype('category')
df.dtypes
```
You should get the following output:

![](./images/B15019_11_13.jpg)

Caption: The data types of each column after converting InvoiceNo

As you can see, the data type for `InvoiceNo` has changed to a
categorical variable. The difference between `object` and
`category` is that the latter has a finite number of possible
values (also called discrete variables). Once these have been changed
into categorical variables, `pandas` will automatically list
all the values. They can be accessed using the
`.cat.categories` attribute:

```
df['InvoiceNo'].cat.categories
```
You should get the following output:

![Caption: List of categories (possible values) for the InvoiceNo
categorical variable ](./images/B15019_11_14.jpg)

Caption: List of categories (possible values) for the InvoiceNo
categorical variable

`pandas` has identified that there are 25,900 different values
in this column and has listed all of them. 
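Another accessor that becomes available once a column is categorical is
`.cat.codes`, which maps each value to the integer code of its
category (a short sketch; the exact codes you see will depend on the data):

```
# Each invoice number is now backed by an integer code pointing to its category
df['InvoiceNo'].cat.codes.head()
```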
Depending on the data type +that\'s assigned to a variable, `pandas` provides different +attributes and methods that are very handy for data transformation or +feature engineering (this will be covered in *Lab 12*, *Feature +Engineering*). + +As a final note, you may be wondering when you would use the first way +of changing the types of certain columns (while loading the dataset). To +find out the current type of each variable, you must load the data +first, so why will you need to reload the data again with new data +types? It will be easier to change the type with the +`astype()` method after the first load. There are a few +reasons why you would use it. One reason could be that you have already +explored the dataset on a different tool, such as Excel, and already +know what the correct data types are. + +The second reason could be that your dataset is big, and you cannot load +it in its entirety. As you may have noticed, by default, +`pandas` use 64-bit encoding for numerical variables. This +requires a lot of memory and may be overkill. + +For example, the `Quantity` column has an int64 data type, +which means that the range of possible values is +-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. However, in +*Lab 10*, *Analyzing a Dataset* while analyzing the distribution of +this column, you learned that the range of values for this column is +only from -80,995 to 80,995. You don\'t need to use so much space. By +reducing the data type of this variable to int32 (which ranges from +-2,147,483,648 to 2,147,483,647), you may be able to reload the entire +dataset. + + + +Exercise 11.02: Converting Data Types for the Ames Housing Dataset +------------------------------------------------------------------ + +In this exercise, you will prepare a dataset by converting its variables +into the correct data types. + +You will use the Ames Housing dataset to do this, which we also used in +*Lab 10*, *Analyzing a Dataset*. For more information about this +dataset, refer to the following note. Let\'s get started: + + + +1. Open a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the Ames dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab10/dataset/ames_iowa_housing.csv' + ``` + + +4. Using the `read_csv` method from the `pandas` + package, load the dataset into a new variable called `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Print the data type of each column using the `dtypes` + attribute: + + ``` + df.dtypes + ``` + + + You should get the following output: + + +![](./images/B15019_11_15.jpg) + + + Caption: List of columns and their assigned data types + + Note + + The preceding output has been truncated. + + From *Lab 10*, *Analyzing a Dataset* you know that the + `Id`, `MSSubClass`, `OverallQual`, and + `OverallCond` columns have been incorrectly classified as + numerical variables. They have a finite number of unique values and + you can\'t perform any mathematical operations on them. For example, + it doesn\'t make sense to add, remove, multiply, or divide two + different values from the `Id` column. Therefore, you need + to convert them into categorical variables. + +6. Using the `astype()` method, convert the `'Id'` + column into a categorical variable, as shown in the following code + snippet: + ``` + df['Id'] = df['Id'].astype('category') + ``` + + +7. 
Convert the `'MSSubClass'`, `'OverallQual'`, and + `'OverallCond'` columns into categorical variables, like + we did in the previous step: + ``` + df['MSSubClass'] = df['MSSubClass'].astype('category') + df['OverallQual'] = df['OverallQual'].astype('category') + df['OverallCond'] = df['OverallCond'].astype('category') + ``` + + +8. Create a for loop that will iterate through the four categorical + columns + `('Id', 'MSSubClass', 'OverallQual', `and` 'OverallCond'`) + and print their names and categories using the + `.cat.categories` attribute: + + ``` + for col_name in ['Id', 'MSSubClass', 'OverallQual', \ + 'OverallCond']: + print(col_name) + print(df[col_name].cat.categories) + ``` + + + You should get the following output: + + +![](./images/B15019_11_16.jpg) + + + Caption: List of categories for the four newly converted + variables + + Now, these four columns have been converted into categorical + variables. From the output of *Step 5*, we can see that there are a + lot of variables of the `object` type. Let\'s have a look + at them and see if they need to be converted as well. + +9. Create a new DataFrame called `obj_df` that will only + contain variables of the `object` type using the + `select_dtypes` method along with the + `include='object'` parameter: + ``` + obj_df = df.select_dtypes(include='object') + ``` + + +10. Create a new variable called `obj_cols` that contains a + list of column names from the `obj_df` DataFrame using the + `.columns` attribute and display its content: + + ``` + obj_cols = obj_df.columns + obj_cols + ``` + + + You should get the following output: + + +![](./images/B15019_11_17.jpg) + + + Caption: List of variables of the \'object\' type + +11. Like we did in *Step 8*, create a `for` loop that will + iterate through the column names contained in `obj_cols` + and print their names and unique values using the + `unique()` method: + + ``` + for col_name in obj_cols: + print(col_name) + print(df[col_name].unique()) + ``` + + + You should get the following output: + + +![Caption: List of unique values for each variable of the + \'object\' type ](./images/B15019_11_18.jpg) + + + Caption: List of unique values for each variable of the + \'object\' type + + As you can see, all these columns have a finite number of unique + values that are composed of text, which shows us that they are + categorical variables. + +12. Now, create a `for` loop that will iterate through the + column names contained in `obj_cols` and convert each of + them into a categorical variable using the `astype()` + method: + ``` + for col_name in obj_cols: + df[col_name] = df[col_name].astype('category') + ``` + + +13. Print the data type of each column using the `dtypes` + attribute: + + ``` + df.dtypes + ``` + + + You should get the following output: + + +![](./images/B15019_11_19.jpg) + + +Caption: List of variables and their new data types + + +You have successfully converted the columns that have incorrect data +types (numerical or object) into categorical variables. Your dataset is +now one step closer to being prepared for modeling. + +In the next section, we will look at handling incorrect values. + + +Handling Incorrect Values +========================= + + +Let\'s learn how to detect such issues in real life by using the +`Online Retail` dataset. 
+ +First, you need to load the data into a `pandas` DataFrame: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab10/dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` + +In this dataset, there are two variables that seem to be related to each +other: `StockCode` and `Description`. The first one +contains the identifier code of the items sold and the other one +contains their descriptions. However, if you look at some of the +examples, such as `StockCode 23131`, the +`Description` column has different values: + +``` +df.loc[df['StockCode'] == 23131, 'Description'].unique() +``` +You should get the following output + +![](./images/B15019_11_20.jpg) + +Caption: List of unique values for the Description column and +StockCode 23131 + +There are multiple issues in the preceding output. One issue is that the +word `Mistletoe` has been misspelled so that it reads +`Miseltoe`. The other errors are unexpected values and missing +values, which will be covered in the next section. It seems that the +`Description` column has been used to record comments such as +`had been put aside`. + +Let\'s focus on the misspelling issue. What we need to do here is modify +the incorrect spelling and replace it with the correct value. First, +let\'s create a new column called `StockCodeDescription`, +which is an exact copy of the `Description` column: + +``` +df['StockCodeDescription'] = df['Description'] +``` +You will use this new column to fix the misspelling issue. To do this, +use the subsetting technique you learned about earlier in this lab. +You need to use `.loc` and filter the rows and columns you +want, that is, all rows with `StockCode == 21131` and +`Description == MISELTOE HEART WREATH CREAM` and the +`Description` column: + +``` +df.loc[(df['StockCode'] == 23131) \ + & (df['StockCodeDescription'] \ + == 'MISELTOE HEART WREATH CREAM'), \ + 'StockCodeDescription'] = 'MISTLETOE HEART WREATH CREAM' +``` +If you reprint the value for this issue, you will see that the +misspelling value has been fixed and is not present anymore: + +``` +df.loc[df['StockCode'] == 23131, 'StockCodeDescription'].unique() +``` +You should get the following output: + +![](./images/B15019_11_21.jpg) + +Caption: List of unique values for the Description column and +StockCode 23131 after fixing the first misspelling issue + +As you can see, there are still five different values for this product, +but for one of them, that is, `MISTLETOE`, has been spelled +incorrectly: `MISELTOE`. + +This time, rather than looking at an exact match (a word must be the +same as another one), we will look at performing a partial match (part +of a word will be present in another word). In our case, instead of +looking at the spelling of `MISELTOE`, we will only look at +`MISEL`. The `pandas` package provides a method +called `.str.contains()` that we can use to look for +observations that partially match with a given expression. + +Let\'s use this to see if we have the same misspelling issue +(`MISEL`) in the entire dataset. You will need to add one +additional parameter since this method doesn\'t handle missing values. +You will also have to subset the rows that don\'t have missing values +for the `Description` column. 
This can be done by providing +the `na=False` parameter to the `.str.contains()` +method: + +``` +df.loc[df['StockCodeDescription']\ + .str.contains('MISEL', na=False),] +``` +You should get the following output: + +![](./images/B15019_11_22.jpg) + +Caption: Displaying all the rows containing the misspelling +\'MISELTOE\' + +This misspelling issue (`MISELTOE`) is not only related to +`StockCode 23131`, but also to other items. You will need to +fix all of these using the `str.replace()` method. This method +takes the string of characters to be replaced and the replacement string +as parameters: + +``` +df['StockCodeDescription'] = df['StockCodeDescription']\ + .str.replace\ + ('MISELTOE', 'MISTLETOE') +``` +Now, if you print all the rows that contain the misspelling of +`MISEL`, you will see that no such rows exist anymore: + +``` +df.loc[df['StockCodeDescription']\ + .str.contains('MISEL', na=False),] +``` +You should get the following output + +![](./images/B15019_11_23.jpg) + + +You just saw how easy it is to clean observations that have incorrect +values using the `.str.contains` and +`.str.replace()` methods that are provided by the +`pandas` package. These methods can only be used for variables +containing strings, but the same logic can be applied to numerical +variables and can also be used to handle extreme values or outliers. You +can use the ==, \>, \<, \>=, or \<= operator to subset the rows you want +and then replace the observations with the correct values. + + + +Exercise 11.03: Fixing Incorrect Values in the State Column +----------------------------------------------------------- + +In this exercise, you will clean the `State` variable in a +modified version of a dataset by listing all the finance officers in the +USA. We are doing this because the dataset contains some incorrect +values. Let\'s get started: + +1. Open a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab11/dataset/officers.csv' + ``` + + +4. Using the `read_csv()` method from the `pandas` + package, load the dataset into a new variable called `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Print the first five rows of the DataFrame using the + `.head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_11_24.jpg) + + + Caption: The first five rows of the finance officers dataset + +6. Print out all the unique values of the `State` variable: + + ``` + df['State'].unique() + ``` + + + You should get the following output: + + +![](./images/B15019_11_25.jpg) + + + Caption: List of unique values in the State column + + All the states have been encoded into a two-capitalized character + format. As you can see, there are some incorrect values with + non-capitalized characters, such as `il` and + `iL` (they look like spelling errors for Illinois), and + unexpected values such as `8I`, `I`, and + `60`. In the next few steps, you are going to fix these + issues. + +7. Print out the rows that have the `il` value in the + `State` column using the `pandas` + `.str.contains()` method and the subsetting API, that is, + DataFrame \[condition\]. 
You will also have to set the + `na` parameter to `False` in + `str.contains()` in order to exclude observations with + missing values: + + ``` + df[df['State'].str.contains('il', na=False)] + ``` + + + You should get the following output: + + +![](./images/B15019_11_26.jpg) + + + Caption: Observations with a value of il + + As you can see, all the cities with the `il` value are + from the state of Illinois. So, the correct `State` value + should be `IL`. You may be thinking that the following + values are also referring to Illinois: `Il`, + `iL`, and `Il`. We\'ll have a look at them next. + +8. Now, create a `for` loop that will iterate through the + following values in the `State` column: `Il`, + `iL`, `Il`. Then, print out the values of the + City and State variables using the `pandas` method for + subsetting, that is, `.loc()`: + DataFrame.loc\[row\_condition, column condition\]. Do this for each + observation: + + ``` + for state in ['Il', 'iL', 'Il']: + print(df.loc[df['State'] == state, ['City', 'State']]) + ``` + + + You should get the following output: + + +![](./images/B15019_11_27.jpg) + + + Caption: Observations with the il value + + Note + + The preceding output has been truncated. + + As you can see, all these cities belong to the state of Illinois. + Let\'s replace them with the correct values. + +9. Create a condition mask (`il_mask`) to subset all the rows + that contain the four incorrect values (`il`, + `Il`, `iL`, and `Il`) by using the + `isin()` method and a list of these values as a parameter. + Then, save the result into a variable called `il_mask`: + ``` + il_mask = df['State'].isin(['il', 'Il', 'iL', 'Il']) + ``` + + +10. Print the number of rows that match the condition we set in + `il_mask` using the `.sum()` method. This will + sum all the rows that have a value of `True` (they match + the condition): + + ``` + il_mask.sum() + ``` + + + You should get the following output: + + ``` + 672 + ``` + + +11. Using the `pandas` `.loc()` method, subset the + rows with the `il_mask` condition mask and replace the + value of the `State` column with `IL`: + ``` + df.loc[il_mask, 'State'] = 'IL' + ``` + + +12. Print out all the unique values of the `State` variable + once more: + + ``` + df['State'].unique() + ``` + + + You should get the following output: + + +![](./images/B15019_11_28.jpg) + + + Caption: List of unique values for the \'State\' column + + As you can see, the four incorrect values are not present anymore. + Let\'s have a look at the other remaining incorrect values: + `II`, `I`, `8I`, and `60`. + We will look at dealing `II` in the next step. + + Print out the rows that have a value of `II` into the + `State` column using the `pandas` subsetting + API, that is, DataFrame.loc\[row\_condition, column\_condition\]: + + ``` + df.loc[df['State'] == 'II',] + ``` + + + You should get the following output: + + +![](./images/B15019_11_29.jpg) + + + Caption: Subsetting the rows with a value of IL in the State + column + + There are only two cases where the `II` value has been + used for the `State` column and both have + `Bloomington` as the city, which is in Illinois. Here, the + correct `State` value should be `IL`. + +13. Now, create a `for` loop that iterates through the three + incorrect values (`I`, `8I`, and `60`) + and print out the subsetted rows using the same logic that we used + in *Step 12*. 
Only display the `City` and + `State` columns: + + ``` + for val in ['I', '8I', '60']: + print(df.loc[df['State'] == val, ['City', 'State']]) + ``` + + + You should get the following output: + + +![](./images/B15019_11_30.jpg) + + + Caption: Observations with incorrect values (I, 8I, and 60) + + All the observations that have incorrect values are cities in + Illinois. Let\'s fix them now. + +14. Create a `for` loop that iterates through the four + incorrect values (`II`, `I`, `8I`, and + `60`) and reuse the subsetting logic from *Step 12* to + replace the value in `State` with `IL`: + ``` + for val in ['II', 'I', '8I', '60']: + df.loc[df['State'] == val, 'State'] = 'IL' + ``` + + +15. Print out all the unique values of the `State` variable: + + ``` + df['State'].unique() + ``` + + + You should get the following output: + + +![](./images/B15019_11_31.jpg) + + + Caption: List of unique values for the State column + + You fixed the issues for the state of Illinois. However, there are + two more incorrect values in this column: `In` and + `ng`. + +16. Repeat *Step 13*, but iterate through the `In` and + `ng` values instead: + + ``` + for val in ['In', 'ng']: + print(df.loc[df['State'] == val, ['City', 'State']]) + ``` + + + You should get the following output: + + +![](./images/B15019_11_32.jpg) + + + Caption: Observations with incorrect values (In, ng) + + The rows that have the `ng` value in `State` are + missing values. We will cover this topic in the next section. The + observation that has `In` as its `State` is a + city in Indiana, so the correct value should be `IN`. + Let\'s fix this. + +17. Subset the rows containing the `In` value in + `State` using the `.loc()` and + `.str.contains()` methods and replace the state value with + `IN`. Don\'t forget to specify the `na=False` + parameter as `.str.contains()`: + + ``` + df.loc[df['State']\ + .str.contains('In', na=False), 'State'] = 'IN' + ``` + + + Print out all the unique values of the `State` variable: + + ``` + df['State'].unique() + ``` + + + You should get the following output: + + +![](./images/B15019_11_31.jpg) + + +Caption: List of unique values for the State column + + +You just fixed all the incorrect values for the `State` +variable using the methods provided by the `pandas` package. +In the next section, we are going to look at handling missing values. + + +Handling Missing Values +======================= + + +So far, you have looked at a variety of issues when it comes to +datasets. Now it is time to discuss another issue that occurs quite +frequently: missing values. As you may have guessed, this type of issue +means that certain values are missing for certain variables. + +The `pandas` package provides a method that we can use to +identify missing values in a DataFrame: `.isna()`. Let\'s see +it in action on the `Online Retail` dataset. 
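Before applying it to the full dataset, here is a minimal sketch of what `.isna()` does on a small hand-made DataFrame (the column names and values are invented for illustration):

```
import pandas as pd
import numpy as np

# A small, hand-made DataFrame with two missing cells
toy_df = pd.DataFrame({'item': ['mug', 'pen', None], \
                       'price': [2.5, np.nan, 1.0]})

# True marks a missing cell, False a filled one
toy_df.isna()

# Counting the missing cells per column
toy_df.isna().sum()
```
Each column ends up with a count of its missing cells, which is the summary we will now produce for the real data.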
First, you need +to import `pandas` and load the data into a DataFrame: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab10/dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` + +The `.isna()` method returns a `pandas` series with +a binary value for each cell of a DataFrame and states whether it is +missing a value (`True`) or not (`False`): + +``` +df.isna() +``` +You should get the following output: + +![](./images/B15019_11_34.jpg) + +Caption: Output of the .isna() method + +As we saw previously, we can give the output of a binary variable to the +`.sum()` method, which will add all the `True` +values together (cells that have missing values) and provide a summary +for each column: + +``` +df.isna().sum() +``` +You should get the following output: + +![](./images/B15019_11_35.jpg) + +Caption: Summary of missing values for each variable + +As you can see, there are `1454` missing values in the +`Description` column and `135080` in the +`CustomerID` column. Let\'s have a look at the missing value +observations for `Description`. You can use the output of the +`.isna()` method to subset the rows with missing values: + +``` +df[df['Description'].isna()] +``` +You should get the following output: + +![](./images/B15019_11_36.jpg) + +Caption: Subsetting the rows with missing values for Description + +From the preceding output, you can see that all the rows with missing +values have `0.0` as the unit price and are missing the +`CustomerID` column. In a real project, you will have to +discuss these cases with the business and check whether these +transactions are genuine or not. If the business confirms that these +observations are irrelevant, then you will need to remove them from the +dataset. + +The `pandas` package provides a method that we can use to +easily remove missing values: `.dropna()`. This method returns +a new DataFrame without all the rows that have missing values. By +default, it will look at all the columns. You can specify a list of +columns for it to look for with the `subset` parameter: + +``` +df.dropna(subset=['Description']) +``` +This method returns a new DataFrame with no missing values for the +specified columns. If you want to replace the original dataset directly, +you can use the `inplace=True` parameter: + +``` +df.dropna(subset=['Description'], inplace=True) +``` +Now, look at the summary of the missing values for each variable: + +``` +df.isna().sum() +``` +You should get the following output: + +![](./images/B15019_11_37.jpg) + +Caption: Summary of missing values for each variable + +As you can see, there are no more missing values in the +`Description` column. Let\'s have a look at the +`CustomerID` column: + +``` +df[df['CustomerID'].isna()] +``` +You should get the following output: + +![](./images/B15019_11_38.jpg) + +Caption: Rows with missing values in CustomerID + +This time, all the transactions look normal, except they are missing +values for the `CustomerID` column; all the other variables +have been filled in with values that seem genuine. There is no other way +to infer the missing values for the `CustomerID` column. These +rows represent almost 25% of the dataset, so we can\'t remove them. + +However, most algorithms require a value for each observation, so you +need to provide one for these cases. We will use the +`.fillna()` method from `pandas` to do this. 
Provide +the value to be imputed as `Missing` and use +`inplace=True` as a parameter: + +``` +df['CustomerID'].fillna('Missing', inplace=True) +df[1443:1448] +``` +You should get the following output: + +![Caption: Examples of rows where missing values for CustomerID +have been replaced with Missing ](./images/B15019_11_39.jpg) + +Caption: Examples of rows where missing values for CustomerID have +been replaced with Missing + +Let\'s see if we have any missing values in the dataset: + +``` +df.isna().sum() +``` +You should get the following output: + +![](./images/B15019_11_40.jpg) + +Caption: Summary of missing values for each variable + +You have successfully fixed all the missing values in this dataset. +These methods also work when we want to handle missing numerical +variables. We will look at this in the following exercise. All you need +to do is provide a numerical value when you want to impute a value with +`.fillna()`. + + + +Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset +----------------------------------------------------------------- + +In this exercise, you will be cleaning out all the missing values for +all the numerical variables in the `Horse Colic` dataset. + +Colic is a painful condition that horses can suffer from, and this +dataset contains various pieces of information related to specific cases +of this condition. You can use the link provided in the Note section if +you want to find out more about the dataset\'s attributes. Let\'s get +started: + + +1. Open a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the dataset to a variable called + `file_url`: + ``` + file_url = 'http://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab11/dataset/horse-colic.data' + ``` + + +4. Using the `.read_csv()` method from the `pandas` + package, load the dataset into a new variable called `df` + and specify the `header=None`,` sep='\s+'`, + and` prefix='X'` parameters: + ``` + df = pd.read_csv(file_url, header=None, \ + sep='\s+', prefix='X') + ``` + + +5. Print the first five rows of the DataFrame using the + `.head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_11_41.jpg) + + + Caption: The first five rows of the Horse Colic dataset + + As you can see, the authors have used the `?` character + for missing values, but the `pandas` package thinks that + this is a normal value. You need to transform them into missing + values. + +6. Reload the dataset into a `pandas` DataFrame using the + `.read_csv()` method, but this time, add the + `na_values='?'` parameter in order to specify that this + value needs to be treated as a missing value: + ``` + df = pd.read_csv(file_url, header=None, sep='\s+', \ + prefix='X', na_values='?') + ``` + + +7. Print the first five rows of the DataFrame using the + `.head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_11_42.jpg) + + + Caption: The first five rows of the Horse Colic dataset + + Now, you can see that `pandas` have converted all the + `?` values into missing values. + +8. Print the data type of each column using the `dtypes` + attribute: + + ``` + df.dtypes + ``` + + + You should get the following output: + + +![](./images/B15019_11_43.jpg) + + + Caption: Data type of each column + +9. 
Print the number of missing values for each column by combining the + `.isna()` and `.sum()` methods: + + ``` + df.isna().sum() + ``` + + + You should get the following output: + + +![](./images/B15019_11_44.jpg) + + + Caption: Number of missing values for each column + +10. Create a condition mask called `x0_mask` so that you can + find the missing values in the `X0` column using the + `.isna()` method: + ``` + x0_mask = df['X0'].isna() + ``` + + +11. Display the number of missing values for this column by using the + `.sum()` method on `x0_mask`: + + ``` + x0_mask.sum() + ``` + + + You should get the following output: + + ``` + 1 + ``` + + + Here, you got the exact same number of missing values for + `X0` that you did in *Step 9*. + +12. Extract the mean of `X0` using the `.median()` + method and store it in a new variable called `x0_median`. + Print its value: + + ``` + x0_median = df['X0'].median() + print(x0_median) + ``` + + + You should get the following output: + + ``` + 1.0 + ``` + + + The median value for this column is `1`. You will replace + all the missing values with this value in the `X0` column. + +13. Replace all the missing values in the `X0` variable with + their median using the `.fillna()` method, along with the + `inplace=True` parameter: + ``` + df['X0'].fillna(x0_median, inplace=True) + ``` + + +14. Print the number of missing values for `X0` by combining + the `.isna()` and `.sum()` methods: + + ``` + df['X0'].isna().sum() + ``` + + + You should get the following output: + + ``` + 0 + ``` + + + There are no more missing values in the variables. + +15. Create a `for` loop that will iterate through all the + columns of the DataFrame. In the for loop, calculate the median for + each and save them into a variable called `col_median`. + Then, impute missing values with this median value using the + `.fillna()` method, along with the + `inplace=True` parameter, and print the name of the column + and its median value: + + ``` + for col_name in df.columns: + col_median = df[col_name].median() + df[col_name].fillna(col_median, inplace=True) + print(col_name) + print(col_median) + ``` + + + You should get the following output: + + +![](./images/B15019_11_45.jpg) + + + Caption: Median values for each column + +16. Print the number of missing values for each column by combining the + `.isna()` and `.sum()` methods: + + ``` + df.isna().sum() + ``` + + + You should get the following output: + + +![](./images/B15019_11_46.jpg) + + +Caption: Number of missing values for each column + + +You have successfully fixed the missing values for all the numerical +variables using the methods provided by the `pandas` package: +`.isna()` and `.fillna()`. + + + +Activity 11.01: Preparing the Speed Dating Dataset +-------------------------------------------------- + +As an entrepreneur, you are planning to launch a new dating app into the +market. The key feature that will differentiate your app from other +competitors will be your high performing user-matching algorithm. Before +building this model, you have partnered with a speed dating company to +collect data from real events. You just received the dataset from your +partner company but realized it is not as clean as you expected; there +are missing and incorrect values. Your task is to fix the main data +quality issues in this dataset. + +The following steps will help you complete this activity: + +1. Download and load the dataset into Python using + `.read_csv()`. + +2. Print out the dimensions of the DataFrame using `.shape`. + +3. 
Check for duplicate rows by using `.duplicated()` and + `.sum()` on all the columns. + +4. Check for duplicate rows by using `.duplicated() `and + `.sum()` for the identifier columns (`iid`, + `id`, `partner`, and `pid`). + +5. Check for unexpected values for the following numerical variables: + `'imprace', 'imprelig', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping',` + and `'yoga'`. + +6. Replace the identified incorrect values. + +7. Check the data type of the different columns using + `.dtypes`. + +8. Change the data types to categorical for the columns that don\'t + contain numerical values using `.astype()`. + +9. Check for any missing values using `.isna()` and + `.sum()` for each numerical variable. + +10. Replace the missing values for each numerical variable with their + corresponding mean or median values using `.fillna()`, + `.mean()`, and `.median()`. + + + +You should get the following output. The figure represents the number of +rows with unexpected values for `imprace` and a list of +unexpected values: + +![](./images/B15019_11_47.jpg) + + +The following figure illustrates the number of rows with unexpected +values and a list of unexpected values for each column: + +![](./images/B15019_11_48.jpg) + +The following figure illustrates a list of unique values for gaming: + +![](./images/B15019_11_49.jpg) + +Caption: List of unique values for gaming + +The following figure displays the data types of each column: + +![](./images/B15019_11_50.jpg) + +Caption: Data types of each column + +The following figure displays the updated data types of each column: + +![](./images/B15019_11_51.jpg) + +Caption: Data types of each column + +The following figure displays the number of missing values for numerical +variables: + +![](./images/B15019_11_52.jpg) + +Caption: Number of missing values for numerical variables + +The following figure displays the list of unique values for +`int_corr`: + +![](./images/B15019_11_53.jpg) + +Caption: List of unique values for \'int\_corr\' + +The following figure displays the list of unique values for numerical +variables: + +![](./images/B15019_11_54.jpg) + +Caption: List of unique values for numerical variables + +The following figure displays the number of missing values for numerical +variables: + +![](./images/B15019_11_55.jpg) + +Caption: Number of missing values for numerical variables + + +Summary +======= + + +In this lab, you learned how important it is to prepare any given +dataset and fix the main quality issues it has. This is critical because +the cleaner a dataset is, the easier it will be for any machine learning +model to easily learn about the relevant patterns. On top of this, most +algorithms can\'t handle issues such as missing values, so they must be +handled prior to the modeling phase. In this lab, you covered the +most frequent issues that are faced in data science projects: duplicate +rows, incorrect data types, unexpected values, and missing values. diff --git a/lab_guides/Lab_12.md b/lab_guides/Lab_12.md new file mode 100644 index 0000000..4510e4d --- /dev/null +++ b/lab_guides/Lab_12.md @@ -0,0 +1,1749 @@ + +12. Feature Engineering +======================= + + + +Overview + +By the end of this lab, you will be able to merge multiple datasets +together; bin categorical and numerical variables; perform aggregation +on data; and manipulate dates using `pandas`. 
+ +This lab will introduce you to some of the key techniques for +creating new variables on an existing dataset. + + +Merging Datasets +---------------- + + +First, we need to import the Online Retail dataset into a +`pandas` DataFrame: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab12/Dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +df.head() +``` +You should get the following output. + +![](./images/B15019_12_01.jpg) + +Caption: First five rows of the Online Retail dataset + +Next, we are going to load all the public holidays in the UK into +another `pandas` DataFrame. From *Lab 10*, *Analyzing a +Dataset* we know the records of this dataset are only for the years 2010 +and 2011. So we are going to extract public holidays for those two +years, but we need to do so in two different steps as the API provided +by `date.nager` is split into single years only. + +Let\'s focus on 2010 first: + +``` +uk_holidays_2010 = pd.read_csv\ + ('https://date.nager.at/PublicHoliday/'\ + 'Country/GB/2010/CSV') +``` +We can print its shape to see how many rows and columns it has: + +``` +uk_holidays_2010.shape +``` +You should get the following output. + +``` +(13, 8) +``` +We can see there were `13` public holidays in that year and +there are `8` different columns. + +Let\'s print the first five rows of this DataFrame: + +``` +uk_holidays_2010.head() +``` +You should get the following output: + +![](./images/B15019_12_02.jpg) + +Caption: First five rows of the UK 2010 public holidays DataFrame + +Now that we have the list of public holidays for 2010, let\'s extract +the ones for 2011: + +``` +uk_holidays_2011 = pd.read_csv\ + ('https://date.nager.at/PublicHoliday/'\ + 'Country/GB/2011/CSV') +uk_holidays_2011.shape +``` +You should get the following output. + +``` +(15, 8) +``` + +There were `15` public holidays in 2011. Now we need to +combine the records of these two DataFrames. We will use the +`.append()` method from `pandas` and assign the +results into a new DataFrame: + +``` +uk_holidays = uk_holidays_2010.append(uk_holidays_2011) +``` +Let\'s check we have the right number of rows after appending the two +DataFrames: + +``` +uk_holidays.shape +``` +You should get the following output: + +``` +(28, 8) +``` +We got `28` records, which corresponds with the total number +of public holidays in 2010 and 2011. + +In order to merge two DataFrames together, we need to have at least one +common column between them, meaning the two DataFrames should have at +least one column that contains the same type of information. In our +example, we are going to merge this DataFrame using the `Date` +column with the Online Retail DataFrame on the `InvoiceDate` +column. We can see that the data format of these two columns is +different: one is a date (`yyyy-mm-dd`) and the other is a +datetime (`yyyy-mm-dd hh:mm:ss`). + +So, we need to transform the `InvoiceDate` column into date +format (`yyyy-mm-dd`). One way to do it (we will see another +one later in this lab) is to transform this column into text and +then extract the first 10 characters for each cell using the +`.str.slice()` method. + +For example, the date 2010-12-01 08:26:00 will first be converted into a +string and then we will keep only the first 10 characters, which will be +2010-12-01. 
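A minimal sketch of that conversion on a single hand-made timestamp (purely illustrative) looks as follows:

```
import pandas as pd

# One hand-made timestamp, just to illustrate the slicing step
sample = pd.Series(pd.to_datetime(['2010-12-01 08:26:00']))

# Cast to string, then keep only the first 10 characters (yyyy-mm-dd)
sample.astype(str).str.slice(stop=10)
```
The result is the string `2010-12-01`, which can be matched directly against the `Date` column of the public holidays DataFrame.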
We are going to save these results into a new column called +`InvoiceDay`: + +``` +df['InvoiceDay'] = df['InvoiceDate'].astype(str)\ + .str.slice(stop=10) +df.head() +``` + +The output is as follows: + +![](./images/B15019_12_03.jpg) + +Caption: First five rows after creating InvoiceDay + +Now `InvoiceDay` from the online retail DataFrame and +`Date` from the UK public holidays DataFrame have similar +information, so we can merge these two DataFrames together using +`.merge()` from `pandas`. + +There are multiple ways to join two tables together: + +- The left join +- The right join +- The inner join +- The outer join + + + +### The Left Join + +The left join will keep all the rows from the first DataFrame, which is +the *Online Retail* dataset (the left-hand side) and join it to the +matching rows from the second DataFrame, which is the *UK Public +Holidays* dataset (the right-hand side), as shown in *Figure 12.04*: + +![](./images/B15019_12_04.jpg) + +Caption: Venn diagram for left join + +To perform a left join, we need to specify to the .merge() method the +following parameters: + +- `how = 'left'` for a left join +- `left_on = InvoiceDay` to specify the column used for + merging from the left-hand side (here, the `Invoiceday` + column from the Online Retail DataFrame) +- `right_on = Date` to specify the column used for merging + from the right-hand side (here, the `Date` column from the + UK Public Holidays DataFrame) + +These parameters are clubbed together as shown in the following code +snippet: + +``` +df_left = pd.merge(df, uk_holidays, left_on='InvoiceDay', \ + right_on='Date', how='left') +df_left.shape +``` +You should get the following output: + +``` +(541909, 17) +``` +We got the exact same number of rows as the original Online Retail +DataFrame, which is expected for a left join. Let\'s have a look at the +first five rows: + +``` +df_left.head() +``` +You should get the following output: + +![](./images/B15019_12_05.jpg) + +Caption: First five rows of the left-merged DataFrame + +We can see that the eight columns from the public holidays DataFrame +have been merged to the original one. If no row has been matched from +the second DataFrame (in this case, the public holidays one), +`pandas` will fill all the cells with missing values +(`NaT` or `NaN`), as shown in *Figure 12.05*. + + + +### The Right Join + +The right join is similar to the left join except it will keep all the +rows from the second DataFrame (the right-hand side) and tries to match +it with the first one (the left-hand side), as shown in *Figure 12.06*: + +![](./images/B15019_12_06.jpg) + +Caption: Venn diagram for right join + +We just need to specify the parameters: + +- `how` `= 'right`\' to the `.merge()` + method to perform this type of join. +- We will use the exact same columns used for merging as the previous + example, which is `InvoiceDay` for the Online Retail + DataFrame and `Date` for the UK Public Holidays one. + +These parameters are clubbed together as shown in the following code +snippet: + +``` +df_right = df.merge(uk_holidays, left_on='InvoiceDay', \ + right_on='Date', how='right') +df_right.shape +``` +You should get the following output: + +``` +(9602, 17) +``` +We can see there are fewer rows as a result of the right join, but it +doesn\'t get the same number as for the Public Holidays DataFrame. This +is because there are multiple rows from the Online Retail DataFrame that +match one single date in the public holidays one. 
+ +For instance, looking at the first rows of the merged DataFrame, we can +see there were multiple purchases on January 4, 2011, so all of them +have been matched with the corresponding public holiday. Have a look at +the following code snippet: + +``` +df_right.head() +``` +You should get the following output: + +![](./images/B15019_12_07.jpg) + +Caption: First five rows of the right-merged DataFrame + +There are two other types of merging: inner and outer. + +An inner join will only keep the rows that match between the two tables: + +![](./images/B15019_12_08.jpg) + +Caption: Venn diagram for inner join + +You just need to specify the `how = 'inner'` parameter in the +`.merge()` method. + +These parameters are clubbed together as shown in the following code +snippet: + +``` +df_inner = df.merge(uk_holidays, left_on='InvoiceDay', \ + right_on='Date', how='inner') +df_inner.shape +``` +You should get the following output: + +``` +(9579, 17) +``` +We can see there are only 9,579 observations that happened during a +public holiday in the UK. + +The outer join will keep all rows from both tables (matched and +unmatched), as shown in *Figure 12.09*: + +![](./images/B15019_12_09.jpg) + +Caption: Venn diagram for outer join + +As you may have guessed, you just need to specify the +`how == 'outer'` parameter in the `.merge()` method: + +``` +df_outer = df.merge(uk_holidays, left_on='InvoiceDay', \ + right_on='Date', how='outer') +df_outer.shape +``` +You should get the following output: + +``` +(541932, 17) +``` +Before merging two tables, it is extremely important for you to know +what your focus is. If your objective is to expand the number of +features from an original dataset by adding the columns from another +one, then you will probably use a left or right join. But be aware you +may end up with more observations due to potentially multiple matches +between the two tables. On the other hand, if you are interested in +knowing which observations matched or didn\'t match between the two +tables, you will either use an inner or outer join. + + + +Exercise 12.01: Merging the ATO Dataset with the Postcode Data +-------------------------------------------------------------- + +In this exercise, we will merge the ATO dataset (28 columns) with the +Postcode dataset (150 columns) to get a richer dataset with an increased +number of columns. + + +The following steps will help you complete the exercise: + +1. Open up a new Colab notebook. + +2. Now, begin with the `import` of the `pandas` + package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the ATO dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab12/Dataset/taxstats2015.csv' + ``` + + +4. Using the `.read_csv()` method from the `pandas` + package, load the dataset into a new DataFrame called + `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Display the dimensions of this DataFrame using the + `.shape` attribute: + + ``` + df.shape + ``` + + + You should get the following output: + + ``` + (2473, 28) + ``` + + + The ATO dataset contains `2471` rows and `28` + columns. + +6. Display the first five rows of the ATO DataFrame using the + `.head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_10.jpg) + + + Caption: First five rows of the ATO dataset + + Both DataFrames have a column called `Postcode` containing + postcodes, so we will use it to merge them together. 
+ + Note + + Postcode is the name used in Australia for zip code. It is an + identifier for postal areas. + + We are interested in learning more about each of these postcodes. + Let\'s make sure they are all unique in this dataset. + +7. Display the number of unique values for the `Postcode` + variable using the `.nunique()` method: + + ``` + df['Postcode'].nunique() + ``` + + + You should get the following output: + + ``` + 2473 + ``` + + + There are `2473` unique values in this column and the + DataFrame has `2473` rows, so we are sure the + `Postcode` variable contains only unique values. + +8. Now, assign the link to the second Postcode dataset to a variable + called `postcode_df`: + ``` + postcode_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab12/Dataset/'\ + 'taxstats2016individual06taxablestatusstate'\ + 'territorypostcodetaxableincome%20(2).xlsx?'\ + 'raw=true' + ``` + + +9. Load the second Postcode dataset into a new DataFrame called + `postcode_df` using the `.read_excel()` method. + + We will only load the *Individuals Table 6B* sheet as this is where + the data is located so we need to provide this name to the + `sheet_name` parameter. Also, the header row (containing + the name of the variables) in this spreadsheet is located on the + third row so we need to specify it to the header parameter. + + Note + + Don\'t forget the `index` starts with 0 in Python. + + Have a look at the following code snippet: + + ``` + postcode_df = pd.read_excel(postcode_url, \ + sheet_name='Individuals Table 6B', \ + header=2) + ``` + + +10. Print the dimensions of `postcode_df` using the + `.shape` attribute: + + ``` + postcode_df.shape + ``` + + + You should get the following output: + + ``` + (2567, 150) + ``` + + + This DataFrame contains `2567` rows for `150` + columns. By merging it with the ATO dataset, we will get additional + information for each postcode. + +11. Print the first five rows of `postcode_df` using the + `.head()` method: + + ``` + postcode_df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_11.jpg) + + + Caption: First five rows of the Postcode dataset + + We can see that the second column contains the postcode value, and + this is the one we will use to merge on with the ATO dataset. Let\'s + check if they are unique. + +12. Print the number of unique values in this column using the + `.nunique()` method as shown in the following code + snippet: + + ``` + postcode_df['Postcode'].nunique() + ``` + + + You should get the following output: + + ``` + 2567 + ``` + + + There are `2567` unique values, and this corresponds + exactly to the number of rows of this DataFrame, so we\'re + absolutely sure this column contains unique values. This also means + that after merging the two tables, there will be only one-to-one + matches. We won\'t have a case where we get multiple rows from one + of the datasets matching with only one row of the other one. For + instance, postcode `2029` from the ATO DataFrame will have + exactly one match in the second Postcode DataFrame. + +13. Perform a left join on the two DataFrames using the + `.merge()` method and save the results into a new + DataFrame called `merged_df`. Specify the + `how='left'` and `on='Postcode'` parameters: + ``` + merged_df = pd.merge(df, postcode_df, \ + how='left', on='Postcode') + ``` + + +14. 
Print the dimensions of the new merged DataFrame using the + `.shape` attribute: + + ``` + merged_df.shape + ``` + + + You should get the following output: + + ``` + (2473, 177) + ``` + + + We got exactly `2473` rows after merging, which is what we + expect as we used a left join and there was a one-to-one match on + the `Postcode` column from both original DataFrames. Also, + we now have `177` columns, which is the objective of this + exercise. But before concluding it, we want to see whether there are + any postcodes that didn\'t match between the two datasets. To do so, + we will be looking at one column from the right-hand side DataFrame + (the Postcode dataset) and see if there are any missing values. + +15. Print the total number of missing values from the + `'State/Territory1'` column by combining the + `.isna()` and `.sum()` methods: + + ``` + merged_df['State/ Territory1'].isna().sum() + ``` + + + You should get the following output: + + ``` + 4 + ``` + + + There are four postcodes from the ATO dataset that didn\'t match the + Postcode code. + + Let\'s see which ones they are. + +16. Print the missing postcodes using the `.iloc()` method, as + shown in the following code snippet: + + ``` + merged_df.loc[merged_df['State/ Territory1'].isna(), \ + 'Postcode'] + ``` + + + You should get the following output: + + +![](./images/B15019_12_12.jpg) + + +Caption: List of unmatched postcodes + +The missing postcodes from the Postcode dataset are `3010`, +`4462`, `6068`, and `6758`. In a real +project, you would have to get in touch with your stakeholders or the +data team to see if you are able to get this data. + +We have successfully merged the two datasets of interest and have +expanded the number of features from `28` to `177`. +We now have a much richer dataset and will be able to perform a more +detailed analysis of it. + + +In the next topic, you will be introduced to the binning variables. + + + +Binning Variables +----------------- + +As mentioned earlier, feature engineering is not only about getting +information not present in a dataset. Quite often, you will have to +create new features from existing ones. One example of this is +consolidating values from an existing column to a new list of values. + +For instance, you may have a very high number of unique values for some +of the categorical columns in your dataset, let\'s say over 1,000 values +for each variable. This is actually quite a lot of information that will +require extra computation power for an algorithm to process and learn +the patterns from. This can have a significant impact on the project +cost if you are using cloud computing services or on the delivery time +of the project. + +One possible solution is to not use these columns and drop them, but in +that case, you may lose some very important and critical information for +the business. Another solution is to create a more consolidated version +of these columns by reducing the number of unique values to a smaller +number, let\'s say 100. This would drastically speed up the training +process for the algorithm without losing too much information. This kind +of transformation is called binning and, traditionally, it refers to +numerical variables, but the same logic can be applied to categorical +variables as well. + +Let\'s see how we can achieve this on the Online Retail dataset. 
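Before turning to the Online Retail dataset, here is one common binning pattern, sketched on a small hand-made column (the names and values are invented for illustration): keep only the most frequent categories and group everything else into an `Other` bucket.

```
import pandas as pd

# Hand-made categorical column standing in for a high-cardinality variable
cities = pd.Series(['London', 'Paris', 'London', 'Oslo', \
                    'London', 'Paris', 'Reykjavik'])

# Keep the 2 most frequent values and bin the rest as 'Other'
top_values = cities.value_counts().nlargest(2).index
cities_binned = cities.where(cities.isin(top_values), \
                             other='Other')
cities_binned.unique()
```
Only `London`, `Paris`, and `Other` remain. The same idea, applied by hand with `.loc()` and `.isin()`, is what we will use on the `Country` column.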
First, +we need to load the data: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab12/Dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +``` + +In *Lab 10*, *Analyzing a Dataset* we learned that the +`Country` column contains `38` different unique +values: + +``` +df['Country'].unique() +``` +You should get the following output: + +![](./images/B15019_12_13.jpg) + +Caption: List of unique values for the Country column + +We are going to group some of the countries together into regions such +as Asia, the Middle East, and America. We will leave the European +countries as is. + +First, let\'s create a new column called `Country_bin` by +copying the `Country` column: + +``` +df['Country_bin'] = df['Country'] +``` + +Then, we are going to create a list called `asian_countries` +containing the name of Asian countries from the list of unique values +for the `Country` column: + +``` +asian_countries = ['Japan', 'Hong Kong', 'Singapore'] +``` +And finally, using the `.loc()` and `.isin()` +methods from `pandas`, we are going to change the value of +`Country_bin` to `Asia` for all of the countries +that are present in the `asian_countries` list: + +``` +df.loc[df['Country'].isin(asian_countries), \ + 'Country_bin'] = 'Asia' +``` +Now, if we print the list of unique values for this new column, we will +see the three Asian countries (`Japan`, `Hong Kong`, +and `Singapore`) have been replaced by the value +`Asia`: + +``` +df['Country_bin'].unique() +``` +You should get the following output: + +![Caption: List of unique values for the Country\_bin column after +binning Asian countries ](./images/B15019_12_14.jpg) + +Caption: List of unique values for the Country\_bin column after +binning Asian countries + +Let\'s perform the same process for Middle Eastern countries: + +``` +m_east_countries = ['Israel', 'Bahrain', 'Lebanon', \ + 'United Arab Emirates', 'Saudi Arabia'] +df.loc[df['Country'].isin(m_east_countries), \ + 'Country_bin'] = 'Middle East' +df['Country_bin'].unique() +``` +You should get the following output: + +![](./images/B15019_12_15.jpg) + + + +Finally, let\'s group all countries from North and South America +together: + +``` +american_countries = ['Canada', 'Brazil', 'USA'] +df.loc[df['Country'].isin(american_countries), \ + 'Country_bin'] = 'America' +df['Country_bin'].unique() +``` +You should get the following output: + +![Caption: List of unique values for the Country\_bin column after +binning countries from North and South America](./images/B15019_12_16.jpg) + +Caption: List of unique values for the Country\_bin column after +binning countries from North and South America + +``` +df['Country_bin'].nunique() +``` +You should get the following output: + +``` +30 +``` +`30` is the number of unique values for the +`Country_bin` column. So we reduced the number of unique +values in this column from `38` to `30`: + +We just saw how to group categorical values together, but the same +process can be applied to numerical values as well. For instance, it is +quite common to group people\'s ages into bins such as 20s (20 to 29 +years old), 30s (30 to 39), and so on. + +Have a look at *Exercise 12.02*, *Binning the YearBuilt variable from +the AMES Housing dataset*. 
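As a quick taste of numerical binning before the exercise, here is a minimal sketch on a hand-made `age` column (the values, bin edges, and labels are invented for illustration):

```
import pandas as pd

# Hand-made ages, purely for illustration
ages = pd.Series([23, 35, 31, 47, 52, 38])

# Bin the ages into decades with pd.cut()
pd.cut(ages, bins=[20, 30, 40, 50, 60], \
       labels=['20s', '30s', '40s', '50s'])
```
Each age is replaced by its decade label, reducing many distinct values to just a handful of bins, which is exactly what we will do with the `YearBuilt` column next.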
+ + + +Exercise 12.02: Binning the YearBuilt Variable from the AMES Housing Dataset +---------------------------------------------------------------------------- + +In this exercise, we will create a new feature by binning an existing +numerical column in order to reduce the number of unique values from +`112` to `15`. + +Note + +The dataset we will be using in this exercise is the Ames Housing +dataset. +This dataset contains the list of residential home sales in the city of +Ames, Iowa between 2010 and 2016. + + +1. Open up a new Colab notebook. + +2. Import the `pandas` and `altair` packages: + ``` + import pandas as pd + import altair as alt + ``` + + +3. Assign the link to the dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab12/Dataset/ames_iowa_housing.csv' + ``` + + +4. Using the `.read_csv()` method from the `pandas` + package, load the dataset into a new DataFrame called + `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Display the first five rows using the` .head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_17.jpg) + + + Caption: First five rows of the AMES housing DataFrame + +6. Display the number of unique values on the column using + `.nunique()`: + + ``` + df['YearBuilt'].nunique() + ``` + + + You should get the following output: + + ``` + 112 + ``` + + + There are `112` different or unique values in the + `YearBuilt` column: + +7. Print a scatter plot using `altair` to visualize the + number of records built per year. Specify `YearBuilt:O` as + the x-axis and `count()` as the y-axis in the + `.encode()` method: + + ``` + alt.Chart(df).mark_circle().encode(alt.X('YearBuilt:O'),\ + y='count()') + ``` + + + You should get the following output: + + +![](./images/B15019_12_18.jpg) + + + Caption: First five rows of the AMES housing DataFrame + + Note + + The output is not shown on GitHub due to its limitations. If you run + this on your Colab file, the graph will be displayed. + + There weren\'t many properties sold in some of the years. So, you + can group them by decades (groups of 10 years). + +8. Create a list called `year_built` containing all the + unique values in the `YearBuilt `column: + ``` + year_built = df['YearBuilt'].unique() + ``` + + +9. Create another list that will compute the decade for each year in + `year_built`. Use list comprehension to loop through each + year and apply the following formula: + `year - (year % 10)`. + + For example, this formula applied to the year 2015 will give 2015 - + (2015 % 10), which is 2015 -- 5 equals 2010. + + Note + + \% corresponds to the modulo operator and will return the last digit + of each year. + + Have a look at the following code snippet: + + ``` + decade_list = [year - (year % 10) for year in year_built] + ``` + + +10. Create a sorted list of unique values from `decade_list` + and save the result into a new variable called + `decade_built`. To do so, transform + `decade_list` into a set (this will exclude all + duplicates) and then use the `sorted()` function as shown + in the following code snippet: + ``` + decade_built = sorted(set(decade_list)) + ``` + + +11. Print the values of `decade_built`: + + ``` + decade_built + ``` + + + You should get the following output: + + +![](./images/B15019_12_19.jpg) + + + Caption: List of decades + + Now we have the list of decades we are going to bin the + `YearBuilt` column with. + +12. 
Create a new column on the `df` DataFrame called + `DecadeBuilt` that will bin each value from + `YearBuilt` into a decade. You will use the + `.cut()` method from `pandas` and specify the + `bins=decade_built` parameter: + ``` + df['DecadeBuilt'] = pd.cut(df['YearBuilt'], \ + bins=decade_built) + ``` + + +13. Print the first five rows of the DataFrame but only for the + `'YearBuilt'` and `'DecadeBuilt'` columns: + + ``` + df[['YearBuilt', 'DecadeBuilt']].head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_20.jpg) + + + + +Manipulating Dates +------------------ + + +In *Lab 10*, *Analyzing a Dataset* you were introduced to the +concept of data types in `pandas`. At that time, we mainly +focused on numerical variables and categorical ones but there is another +important one: `datetime`. Let\'s have a look again at the +type of each column from the Online Retail dataset: + +``` +import pandas as pd +file_url = 'https://github.com/fenago/'\ + 'data-science/blob/'\ + 'master/Lab12/Dataset/'\ + 'Online%20Retail.xlsx?raw=true' +df = pd.read_excel(file_url) +df.dtypes +``` +You should get the following output: + +![](./images/B15019_12_21.jpg) + +Caption: Data types for the variables in the Online Retail dataset + +We can see that `pandas` automatically detected that +`InvoiceDate` is of type `datetime`. But for some +other datasets, it may not recognize dates properly. In this case, you +will have to manually convert them using the `.to_datetime()` +method: + +``` +df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate']) +``` +Once the column is converted to `datetime`, pandas provides a +lot of attributes and methods for extracting time-related information. +For instance, if you want to get the year of a date, you use the +`.dt.year` attribute: + +``` +df['InvoiceDate'].dt.year +``` +You should get the following output: + +![](./images/B15019_12_22.jpg) + +Caption: Extracted year for each row for the InvoiceDate column + +As you may have guessed, there are attributes for extracting the month +and day of a date: `.dt.month` and `.dt.day` +respectively. You can get the day of the week from a date using the +`.dt.dayofweek` attribute: + +``` +df['InvoiceDate'].dt.dayofweek +``` +You should get the following output. + +![](./images/B15019_12_23.jpg) + +Caption: Extracted day of the week for each row for the InvoiceDate column + + +With datetime columns, you can also perform some mathematical +operations. We can, for instance, add `3` days to each date by +using pandas time-series offset object, +`pd.tseries.offsets.Day(3)`: + +``` +df['InvoiceDate'] + pd.tseries.offsets.Day(3) +``` +You should get the following output: + +![](./images/B15019_12_24.jpg) + +Caption: InvoiceDate column offset by three days + +You can also offset days by business days using +`pd.tseries.offsets.BusinessDay()`. For instance, if we want +to get the previous business days, we do: + +``` +df['InvoiceDate'] + pd.tseries.offsets.BusinessDay(-1) +``` +You should get the following output: + +![](./images/B15019_12_25.jpg) + +Caption: InvoiceDate column offset by -1 business day + +Another interesting date manipulation operation is to apply a specific +time-frequency using `pd.Timedelta()`. 
For instance, if you +want to get the first day of the month from a date, you do: + +``` +df['InvoiceDate'] + pd.Timedelta(1, unit='MS') +``` +You should get the following output: + +![](./images/B15019_12_26.jpg) + +Caption: InvoiceDate column transformed to the start of the month + +As you have seen in this section, the `pandas` package +provides a lot of different APIs for manipulating dates. You have +learned how to use a few of the most popular ones. You can now explore +the other ones on your own. + + + +Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints +--------------------------------------------------------------------------- + +In this exercise, we will learn how to extract time-related information +from two existing date columns using `pandas` in order to +create six new columns: + +Note + +The dataset we will be using in this exercise is the Financial Services +Customer Complaints dataset + + +1. Open up a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Assign the link to the dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab12/Dataset/Consumer_Complaints.csv' + ``` + + +4. Use the `.read_csv()` method from the `pandas` + package and load the dataset into a new DataFrame called + `df`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Display the first five rows using the `.head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_27.jpg) + + + Caption: First five rows of the Customer Complaint DataFrame + +6. Print out the data types for each column using + the` .dtypes` attribute: + + ``` + df.dtypes + ``` + + + You should get the following output: + + +![](./images/B15019_12_28.jpg) + + + Caption: Data types for the Customer Complaint DataFrame + + The `Date received` and `Date sent to company` + columns haven\'t been recognized as datetime, so we need to manually + convert them. + +7. Convert the `Date received` and + `Date sent to company` columns to datetime using the + `pd.to_datetime()` method: + ``` + df['Date received'] = pd.to_datetime(df['Date received']) + df['Date sent to company'] = pd.to_datetime\ + (df['Date sent to company']) + ``` + + +8. Print out the data types for each column using the + `.dtypes` attribute: + + ``` + df.dtypes + ``` + + + You should get the following output: + + +![ ](./images/B15019_12_29.jpg) + + + Caption: Data types for the Customer Complaint DataFrame after + conversion + + Now these two columns have the right data types. Now let\'s create + some new features from these two dates. + +9. Create a new column called `YearReceived`, which will + contain the year of each date from the `Date Received` + column using the `.dt.year` attribute: + ``` + df['YearReceived'] = df['Date received'].dt.year + ``` + + +10. Create a new column called `MonthReceived`, which will + contain the month of each date using the `.dt.month` + attribute: + ``` + df['MonthReceived'] = df['Date received'].dt.month + ``` + + +11. Create a new column called `DayReceived`, which will + contain the day of the month for each date using the + `.dt.day` attribute: + ``` + df['DomReceived'] = df['Date received'].dt.day + ``` + + +12. Create a new column called `DowReceived`, which will + contain the day of the week for each date using the + `.dt.dayofweek` attribute: + ``` + df['DowReceived'] = df['Date received'].dt.dayofweek + ``` + + +13. 
Display the first five rows using the `.head()` method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_30.jpg) + + + Caption: First five rows of the Customer Complaint DataFrame + after creating four new features + + We can see we have successfully created four new features: + `YearReceived`, `MonthReceived`, + `DayReceived`, and `DowReceived`. Now let\'s + create another that will indicate whether the date was during a + weekend or not. + +14. Create a new column called `IsWeekendReceived`, which will + contain binary values indicating whether the `DowReceived` + column is over or equal to `5` (`0` corresponds + to Monday, `5` and `6` correspond to Saturday + and Sunday respectively): + ``` + df['IsWeekendReceived'] = df['DowReceived'] >= 5 + ``` + + +15. Display the first `5` rows using the `.head()` + method: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_12_31.jpg) + + + Caption: First five rows of the Customer Complaint DataFrame + after creating the weekend feature + + We have created a new feature stating whether each complaint was + received during a weekend or not. Now we will feature engineer a new + column with the numbers of days between + `Date sent to company` and `Date received`. + +16. Create a new column called `RoutingDays`, which will + contain the difference between `Date sent to company` and + `Date received`: + ``` + df['RoutingDays'] = df['Date sent to company'] \ + - df['Date received'] + ``` + + +17. Print out the data type of the new `'RoutingDays'` column + using the `.dtype` attribute: + + ``` + df['RoutingDays'].dtype + ``` + + + You should get the following output: + + +![](./images/B15019_12_32.jpg) + + + Caption: Data type of the RoutingDays column + + The result of subtracting two datetime columns is a new datetime + column (`dtype(' 72) \ + & (bankData['balance'] < 448), \ + 'balanceClass'] = 'Quant2' + bankData.loc[(bankData['balance'] > 448) \ + & (bankData['balance'] < 1428), \ + 'balanceClass'] = 'Quant3' + bankData.loc[bankData['balance'] > 1428, \ + 'balanceClass'] = 'Quant4' + bankData.head() + ``` + + + You should get the following output: + + +![](./images/B15019_03_17.jpg) + + + Caption: New features from bank balance data + + We did this is by looking at the quantile thresholds we took in the + *Step 4*, and categorizing the numerical data into the corresponding + quantile class. For example, all values lower than the + 25[th] quantile value, 72, were classified as + `Quant1`, values between 72 and 448 were classified as + `Quant2`, and so on. To store the quantile categories, we + created a new feature in the bank dataset called + `balanceClass` and set its default value to + `Quan1`. After this, based on each value threshold, the + data points were classified to the respective quantile class. + +9. Next, we need to find the propensity of term deposit purchases based + on each quantile the customers fall into. This task is similar to + what we did in *Exercise 3.02*, *Business Hypothesis Testing for Age + versus Propensity for a Term Loan*: + + ``` + # Calculating the customers under each quantile + balanceTot = bankData.groupby(['balanceClass'])['y']\ + .agg(balanceTot='count').reset_index() + balanceTot + ``` + + + You should get the following output: + + +![](./images/B15019_03_18.jpg) + + + Caption: Classification based on quantiles + +10. 
Calculate the total number of customers categorized by quantile and + propensity classification, as mentioned in the following code + snippet: + + ``` + """ + Calculating the total customers categorised as per quantile + and propensity classification + """ + balanceProp = bankData.groupby(['balanceClass', 'y'])['y']\ + .agg(balanceCat='count').reset_index() + balanceProp + ``` + + + You should get the following output: + + +![](./images/B15019_03_19.jpg) + + + Caption: Total number of customers categorized by quantile and + propensity classification + +11. Now, `merge` both DataFrames: + + ``` + # Merging both the data frames + balanceComb = pd.merge(balanceProp, balanceTot, \ + on = ['balanceClass']) + balanceComb['catProp'] = (balanceComb.balanceCat \ + / balanceComb.balanceTot)*100 + balanceComb + ``` + + + You should get the following output: + + +![](./images/B15019_03_20.jpg) + + +Caption: Propensity versus balance category + + + +In the next exercise, we will use these intuitions to derive a new +feature. + + + +Exercise 3.04: Feature Engineering -- Creating New Features from Existing Ones +------------------------------------------------------------------------------ + +In this exercise, we will combine the individual variables we analyzed +in *Exercise 3.03*, *Feature Engineering -- Exploration of Individual +Features* to derive a new feature called an asset index. One methodology +to create an asset index is by assigning weights based on the asset or +liability of the customer. + +For instance, a higher bank balance or home ownership will have a +positive bearing on the overall asset index and, therefore, will be +assigned a higher weight. In contrast, the presence of a loan will be a +liability and, therefore, will have to have a lower weight. Let\'s give +a weight of 5 if the customer has a house and 1 in its absence. +Similarly, we can give a weight of 1 if the customer has a loan and 5 in +case of no loans: + +1. Open a new Colab notebook. + +2. Import the pandas and numpy package: + ``` + import pandas as pd + import numpy as np + ``` + + +3. Assign the link to the dataset to a variable called \'file\_url\'. + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab03/bank-full.csv' + ``` + + +4. Read the banking dataset using the `.read_csv()` function: + ``` + # Reading the banking data + bankData = pd.read_csv(file_url,sep=";") + ``` + + +5. The first step we will follow is to normalize the numerical + variables. This is implemented using the following code snippet: + ``` + # Normalizing data + from sklearn import preprocessing + x = bankData[['balance']].values.astype(float) + ``` + + +6. As the bank balance dataset contains numerical values, we need to + first normalize the data. The purpose of normalization is to bring + all of the variables that we are using to create the new feature + into a common scale. One effective method we can use here for the + normalizing function is called `MinMaxScaler()`, which + converts all of the numerical data between a scaled range of 0 to 1. + The `MinMaxScaler` function is available within the + `preprocessing` method in `sklearn`: + ``` + minmaxScaler = preprocessing.MinMaxScaler() + ``` + + +7. Transform the balance data by normalizing it with + `minmaxScaler`: + + ``` + bankData['balanceTran'] = minmaxScaler.fit_transform(x) + ``` + + + In this step, we created a new feature called + `'balanceTran'` to store the normalized bank balance + values. + +8. 
Print the head of the data using the `.head()` function: + + ``` + bankData.head() + ``` + + + You should get the following output: + + +![](./images/B15019_03_21.jpg) + + + Caption: Normalizing the bank balance data + +9. After creating the normalized variable, add a small value of + `0.001` so as to eliminate the 0 values in the variable. + This is mentioned in the following code snippet: + + ``` + # Adding a small numerical constant to eliminate 0 values + bankData['balanceTran'] = bankData['balanceTran'] + 0.00001 + ``` + + + The purpose of adding this small value is because, in the subsequent + steps, we will be multiplying three transformed variables together + to form a composite index. The small value is added to avoid the + variable values becoming 0 during the multiplying operation. + +10. Now, add two additional columns for introducing the transformed + variables for loans and housing, as per the weighting approach + discussed at the start of this exercise: + + ``` + # Let us transform values for loan data + bankData['loanTran'] = 1 + # Giving a weight of 5 if there is no loan + bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5 + bankData.head() + ``` + + + You should get the following output: + + +![](./images/B15019_03_22.jpg) + + + Caption: Additional columns with the transformed variables + + We transformed values for the loan data as per the weighting + approach. When a customer has a loan, it is given a weight of + `1`, and when there\'s no loan, the weight assigned is + `5`. The value of `1` and `5` are + intuitive weights we are assigning. What values we assign can vary + based on the business context you may be provided with. + +11. Now, transform values for the `Housing data`, as mentioned + here: + ``` + # Let us transform values for Housing data + bankData['houseTran'] = 5 + ``` + + +12. Give a weight of `1` if the customer has a house and print + the results, as mentioned in the following code snippet: + + ``` + bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1 + print(bankData.head()) + ``` + + + You should get the following output: + + +![](./images/B15019_03_23.jpg) + + + Caption: Transforming loan and housing data + + Once all the transformed variables are created, we can multiply all + of the transformed variables together to create a new index called + `assetIndex`. This is a composite index that represents + the combined effect of all three variables. + +13. Now, create a new variable, which is the product of all of the + transformed variables: + + ``` + """ + Let us now create the new variable which is a product of all + these + """ + bankData['assetIndex'] = bankData['balanceTran'] \ + * bankData['loanTran'] \ + * bankData['houseTran'] + bankData.head() + ``` + + + You should get the following output: + + +![](./images/B15019_03_24.jpg) + + + Caption: Creating a composite index + +14. Explore the propensity with respect to the composite index. + + We observe the relationship between the asset index and the + propensity of term deposit purchases. 
We adopt a similar strategy of + converting the numerical values of the asset index into ordinal + values by taking the quantiles and then mapping the quantiles to the + propensity of term deposit purchases, as mentioned in *Exercise + 3.03*, *Feature Engineering -- Exploration of Individual Features*: + + ``` + # Finding the quantile + np.quantile(bankData['assetIndex'],[0.25,0.5,0.75]) + ``` + + + You should get the following output: + + +![](./images/B15019_03_25.jpg) + + + Caption: Conversion of numerical values into ordinal values + +15. Next, create quantiles from the `assetindex` data, as + mentioned in the following code snippet: + + ``` + bankData['assetClass'] = 'Quant1' + bankData.loc[(bankData['assetIndex'] > 0.38) \ + & (bankData['assetIndex'] < 0.57), \ + 'assetClass'] = 'Quant2' + bankData.loc[(bankData['assetIndex'] > 0.57) \ + & (bankData['assetIndex'] < 1.9), \ + 'assetClass'] = 'Quant3' + bankData.loc[bankData['assetIndex'] > 1.9, \ + 'assetClass'] = 'Quant4' + bankData.head() + bankData.assetClass[bankData['assetIndex'] > 1.9] = 'Quant4' + bankData.head() + ``` + + + You should get the following output: + + +![](./images/B15019_03_26.jpg) + + + Caption: Quantiles for the asset index + +16. Calculate the total of each asset class and the category-wise + counts, as mentioned in the following code snippet: + ``` + # Calculating total of each asset class + assetTot = bankData.groupby('assetClass')['y']\ + .agg(assetTot='count').reset_index() + # Calculating the category wise counts + assetProp = bankData.groupby(['assetClass', 'y'])['y']\ + .agg(assetCat='count').reset_index() + ``` + + +17. Next, merge both DataFrames: + + ``` + # Merging both the data frames + assetComb = pd.merge(assetProp, assetTot, on = ['assetClass']) + assetComb['catProp'] = (assetComb.assetCat \ + / assetComb.assetTot)*100 + assetComb + ``` + + + You should get the following output: + + +![](./images/B15019_03_27.jpg) + + +Caption: Composite index relationship mapping + + + +A Quick Peek at Data Types and a Descriptive Summary +---------------------------------------------------- + +Looking at the data types such as categorical or numeric and then +deriving summary statistics is a good way to take a quick peek into data +before you do some of the downstream feature engineering steps. Let\'s +take a look at an example from our dataset: + +``` +# Looking at Data types +print(bankData.dtypes) +# Looking at descriptive statistics +print(bankData.describe()) +``` +You should get the following output: + +![](./images/B15019_03_28.jpg) + +Caption: Output showing the different data types in the dataset + +In the preceding output, you see the different types of information in +the dataset and its corresponding data types. For instance, +`age` is an integer and so is `day`. + +The following output is that of a descriptive summary statistic, which +displays some of the basic measures such as `mean`, +`standard deviation`, `count`, and the +`quantile values` of the respective features: + +![](./images/B15019_03_29.jpg) + +Caption: Data types and a descriptive summary + +The purpose of a descriptive summary is to get a quick feel of the data +with respect to the distribution and some basic statistics such as mean +and standard deviation. Getting a perspective on the summary statistics +is critical for thinking about what kind of transformations are required +for each variable. + +For instance, in the earlier exercises, we converted the numerical data +into categorical variables based on the quantile values. 
Intuitions for +transforming variables would come from the quick summary statistics that +we can derive from the dataset. + +In the following sections, we will be looking at the correlation matrix +and visualization. + + +Correlation Matrix and Visualization +==================================== + + +Correlation, as you know, is a measure that indicates how two variables +fluctuate together. Any correlation value of 1, or near 1, indicates +that those variables are highly correlated. Highly correlated variables +can sometimes be damaging for the veracity of models and, in many +circumstances, we make the decision to eliminate such variables or to +combine them to form composite or interactive variables. + +Let\'s look at how data correlation can be generated and then visualized +in the following exercise. + + + +Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data +--------------------------------------------------------------------------------------------- + +In this exercise, we will be creating a correlation plot and analyzing +the results of the bank dataset. + +The following steps will help you to complete the exercise: + +1. Open a new Colab notebook, install the `pandas` packages + and load the banking data: + ``` + import pandas as pd + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab03/bank-full.csv' + bankData = pd.read_csv(file_url, sep=";") + ``` + + +2. Now, `import` the `set_option` library from + `pandas`, as mentioned here: + + ``` + from pandas import set_option + ``` + + + The `set_option` function is used to define the display + options for many operations. + +3. Next, create a variable that would store numerical variables such as + `'age','balance','day','duration','campaign','pdays','previous', `as + mentioned in the following code snippet. A correlation plot can be + extracted only with numerical data. This is why the numerical data + has to be extracted separately: + ``` + bankNumeric = bankData[['age','balance','day','duration',\ + 'campaign','pdays','previous']] + ``` + + +4. Now, use the `.corr()` function to find the correlation + matrix for the dataset: + + ``` + set_option('display.width',150) + set_option('precision',3) + bankCorr = bankNumeric.corr(method = 'pearson') + bankCorr + ``` + + + You should get the following output: + + +![](./images/B15019_03_30.jpg) + + + Caption: Correlation matrix + + The method we use for correlation is the **Pearson** correlation + coefficient. We can see from the correlation matrix that the + diagonal elements have a correlation of 1. This is because the + diagonals are a correlation of a variable with itself, which will + always be 1. This is the Pearson correlation coefficient. + +5. Now, plot the data: + + ``` + from matplotlib import pyplot + corFig = pyplot.figure() + figAxis = corFig.add_subplot(111) + corAx = figAxis.matshow(bankCorr,vmin=-1,vmax=1) + corFig.colorbar(corAx) + pyplot.show() + ``` + + + You should get the following output: + + +![](./images/B15019_03_31.jpg) + + +Caption: Correlation plot + + +Skewness of Data +---------------- + +Another area for feature engineering is skewness. Skewed data means data +that is shifted in one direction or the other. Skewness can cause +machine learning models to underperform. Many machine learning models +assume normally distributed data or data structures to follow the +Gaussian structure. 
Any deviation from the assumed Gaussian structure, +which is the popular bell curve, can affect model performance. A very +effective area where we can apply feature engineering is by looking at +the skewness of data and then correcting the skewness through +normalization of the data. Skewness can be visualized by plotting the +data using histograms and density plots. We will investigate each of +these techniques. + +Let\'s take a look at the following example. Here, we use the +`.skew()` function to find the skewness in data. For instance, +to find the skewness of data in our `bank-full.csv` dataset, +we perform the following: + +``` +# Skewness of numeric attributes +bankNumeric.skew() +``` +Note + +This code refers to the `bankNumeric` data, so you should +ensure you are working in the same notebook as the previous exercise. + +You should get the following output: + +![](./images/B15019_03_32.jpg) + +Caption: Degree of skewness + +The preceding matrix is the skewness index. Any value closer to 0 +indicates a low degree of skewness. Positive values indicate right skew +and negative values, left skew. Variables that show higher values of +right skew and left skew are candidates for further feature engineering +by normalization. Let\'s now visualize the skewness by plotting +histograms and density plots. + + + +Histograms +---------- + +Histograms are an effective way to plot the distribution of data and to +identify skewness in data, if any. The histogram outputs of two columns +of `bankData` are listed here. The histogram is plotted with +the `pyplot` package from `matplotlib` using the +`.hist()` function. The number of subplots we want to include +is controlled by the `.subplots()` function. `(1,2)` +in subplots would mean one row and two columns. The titles are set by +the `set_title()` function: + +``` +# Histograms +from matplotlib import pyplot as plt +fig, axs = plt.subplots(1,2) +axs[0].hist(bankNumeric['age']) +axs[0].set_title('Distribution of age') +axs[1].hist(bankNumeric['balance']) +axs[1].set_title('Distribution of Balance') +# Ensure plots do not overlap +plt.tight_layout() +``` +You should get the following output: + +![](./images/B15019_03_33.jpg) + +Caption: Code showing the generation of histograms + +From the histogram, we can see that the `age` variable has a +distribution closer to the bell curve with a lower degree of skewness. +In contrast, the asset index shows a relatively higher right skew, which +makes it a more probable candidate for normalization. + + + +Density Plots +------------- + +Density plots help in visualizing the distribution of data. A density +plot can be created using the `kind = 'density'` parameter: + +``` +from matplotlib import pyplot as plt +# Density plots +bankNumeric['age'].plot(kind = 'density', subplots = False, \ + layout = (1,1)) +plt.title('Age Distribution') +plt.xlabel('Age') +plt.ylabel('Normalised age distribution') +pyplot.show() +``` +You should get the following output: + +![](./images/B15019_03_34.jpg) + +Caption: Code showing the generation of a density plot + +Density plots help in getting a smoother visualization of the +distribution of the data. From the density plot of Age, we can see that +it has a distribution similar to a bell curve. + + + +Other Feature Engineering Methods +--------------------------------- + +So far, we were looking at various descriptive statistics and +visualizations that are precursors for applying many feature engineering +techniques on data structures. 
We investigated one such feature
engineering technique in *Exercise 3.04*, *Feature Engineering -- Creating
New Features from Existing Ones*, where we applied the **min
max** scaler for normalizing data.

We will now look into two other similar data transformation techniques,
namely, standard scaler and normalizer. Standard scaler standardizes
data to a mean of 0 and a standard deviation of 1. The mean is the average
of the data and the standard deviation is a measure of the spread of
data. By standardizing to the same mean and standard deviation,
comparison across different distributions of data is enabled.

The normalizer function rescales each row to unit length. This means that
each value in a row is divided by the norm (length) of the row vector.
The normalizer function is applied on the rows while standard
scaler is applied columnwise. The normalizer and standard
scaler functions are important feature engineering steps that are
applied to the data before the downstream modeling steps. Let\'s look at
both of these techniques:

```
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
from numpy import set_printoptions
scaling = StandardScaler().fit(bankNumeric)
rescaledNum = scaling.transform(bankNumeric)
set_printoptions(precision = 3)
print(rescaledNum)
```
You should get the following output:

![](./images/B15019_03_35.jpg)

Caption: Output from standardizing the data

The following code uses the normalizer data transformation technique:

```
# Normalizing Data (Length of 1)
from sklearn.preprocessing import Normalizer
normaliser = Normalizer().fit(bankNumeric)
normalisedNum = normaliser.transform(bankNumeric)
set_printoptions(precision = 3)
print(normalisedNum)
```
You should get the following output:

![](./images/B15019_03_36.jpg)

Caption: Output from the normalizer

The output from standard scaler is standardized along the columns. The
output has seven columns corresponding to the seven numeric columns (age,
balance, day, duration, campaign, pdays, and previous). If we observe the
output, we can see that each value along a column is rescaled so as to
have a mean of 0 and a standard deviation of 1. By transforming data in
this way, we can easily compare across columns.

For instance, in the `age` variable, we have data ranging from
18 up to 95. In contrast, for the balance data, we have data ranging
from -8,019 to 102,127. We can see that these two variables have very
different ranges that cannot be compared directly. The standard scaler
function converts these data points at very different scales into a
common scale so that we can compare the distributions of the data.
Normalizer, on the other hand, rescales each row so as to have a vector
with a length of 1.

The big question we have to think about is why do we have to standardize
or normalize data? Many machine learning algorithms converge faster when
the features are of a similar scale or are normally distributed.
Standardizing is more useful for algorithms that assume the input
variables have a Gaussian structure, such as linear regression, logistic
regression, and linear discriminant analysis. Normalization is better
suited to sparse datasets (datasets with lots of zeros) when using
algorithms such as k-nearest neighbors or neural networks. 
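
To make the column-wise versus row-wise distinction concrete, here is a
minimal sketch that applies both transformers to a tiny, made-up
two-column array (the numbers are purely illustrative and are not taken
from `bankData`):

```
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# A toy array with two features on very different scales
toy = np.array([[20., 1000.],
                [40., 3000.],
                [60., 5000.]])

# StandardScaler works column by column:
# each column ends up with mean 0 and standard deviation 1
print(StandardScaler().fit_transform(toy))

# Normalizer works row by row:
# each row is divided by its own length, so every row has unit norm
print(Normalizer().fit_transform(toy))
```

You can verify that each column of the first output has a mean of 0,
while each row of the second output has a Euclidean length of 1.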
+ + + +Summarizing Feature Engineering +------------------------------- + +In this section, we investigated the process of feature engineering from +a business perspective and data structure perspective. Feature +engineering is a very important step in the life cycle of a data science +project and helps determine the veracity of the models that we build. As +seen in *Exercise 3.02*, *Business Hypothesis Testing for Age versus +Propensity for a Term Loan* we translated our understanding of the +domain and our intuitions to build intelligent features. Let\'s +summarize the processes that we followed: + +1. We obtain intuitions from a business perspective through EDA +2. Based on the business intuitions, we devised a new feature that is a + combination of three other variables. +3. We verified the influence of constituent variables of the new + feature and devised an approach for weights to be applied. +4. Converted ordinal data into corresponding weights. +5. Transformed numerical data by normalizing them using an + appropriate normalizer. +6. Combined all three variables into a new feature. +7. Observed the relationship between the composite index and the + propensity to purchase term deposits and derived our intuitions. +8. Explored techniques for visualizing and extracting summary + statistics from data. +9. Identified techniques for transforming data into feature engineered + data structures. + +Now that we have completed the feature engineering step, the next +question is where do we go from here and what is the relevance of the +new feature we created? As you will see in the subsequent sections, the +new features that we created will be used for the modeling process. The +preceding exercises are an example of a trail we can follow in creating +new features. There will be multiple trails like these, which should be +thought of as based on more domain knowledge and understanding. The +veracity of the models that we build will be dependent on all such +intelligent features we can build by translating business knowledge into +data. + + + +Building a Binary Classification Model Using the Logistic Regression Function +----------------------------------------------------------------------------- + +The essence of data science is about mapping a business problem into its +data elements and then transforming those data elements to get our +desired business outcomes. In the previous sections, we discussed how we +do the necessary transformation on the data elements. The right +transformation of the data elements can highly influence the generation +of the right business outcomes by the downstream modeling process. + +Let\'s look at the business outcome generation process from the +perspective of our use case. The desired business outcome, in our use +case, is to identify those customers who are likely to buy a term +deposit. To correctly identify which customers are likely to buy a term +deposit, we first need to learn the traits or features that, when +present in a customer, helps in the identification process. This +learning of traits is what is achieved through machine learning. + +By now, you may have realized that the goal of machine learning is to +estimate a mapping function (*f*) between an output variable and input +variables. In mathematical form, this can be written as follows: + +![](./images/B15019_03_37.jpg) + +Caption: A mapping function in mathematical form + +Let\'s look at this equation from the perspective of our use case. 
+ +*Y* is the dependent variable, which is our prediction as to whether a +customer has the probability to buy a term deposit or not. + +*X* is the independent variable(s), which are those attributes such as +age, education, and marital status and are part of the dataset. + +*f()* is a function that connects various attributes of the data to the +probability or whether a customer will buy a term deposit or not. This +function is learned during the machine learning process. This function +is a combination of different coefficients or parameters applied to each +of the attributes to get the probability of term deposit purchases. +Let\'s unravel this concept using a simple example of our bank data +use case. + +For simplicity, let\'s assume that we have only two attributes, age and +bank balance. Using these, we have to predict whether a customer is +likely to buy a term deposit or not. Let the age be 40 years and the +balance \$1,000. With all of these attribute values, let\'s assume that +the mapping equation is as follows: + +![](./images/B15019_03_38.jpg) + +Caption: Updated mapping equation + +Using the preceding equation, we get the following: + +*Y = 0.1 + 0.4 \* 40 + 0.002 \* 1000* + +*Y = 18.1* + +Now, you might be wondering, we are getting a real number and how does +this represent a decision of whether a customer will buy a term deposit +or not? This is where the concept of a decision boundary comes in. +Let\'s also assume that, on analyzing the data, we have also identified +that if the value of *Y* goes above 15 (an assumed value in this case), +then the customer is likely to buy the term deposit, otherwise they will +not buy a term deposit. This means that, as per this example, the +customer is likely to buy a term deposit. + +Let\'s now look at the dynamics in this example and try to decipher the +concepts. The values such as 0.1, 0.4, and 0.002, which are applied to +each of the attributes, are the coefficients. These coefficients, along +with the equation connecting the coefficients and the variables, are the +functions that we are learning from the data. The essence of machine +learning is to learn all of these from the provided data. All of these +coefficients along with the functions can also be called by another +common name called the **model**. A model is an approximation of the +data generation process. During machine learning, we are trying to get +as close to the real model that has generated the data we are analyzing. +To learn or estimate the data generating models, we use different +machine learning algorithms. + +Machine learning models can be broadly classified into two types, +parametric models and non-parametric models. Parametric models are where +we assume the form of the function we are trying to learn and then learn +the coefficients from the training data. By assuming a form for the +function, we simplify the learning process. + +To understand the concept better, let\'s take the example of a linear +model. For a linear model, the mapping function takes the following +form: + +![](./images/B15019_03_39.jpg) + +Caption: Linear model mapping function + +The terms *C0*, *M1*, and *M2* are the coefficients of the line that +influences the intercept and slope of the line. *X1* and *X2* are the +input variables. What we are doing here is that we assume that the data +generating model is a linear model and then, using the data, we estimate +the coefficients, which will enable the generation of the predictions. 
By assuming the data generating model, we have simplified the whole
learning process. However, these simple processes also come with their
pitfalls. Only if the underlying function is linear or similar to linear
will we get good results. If the assumptions about the form are wrong,
we are bound to get bad results.

Some examples of parametric models include:

- Linear and logistic regression
- Naïve Bayes
- Linear support vector machines
- Perceptron

Machine learning models that do not make strong assumptions on the
function are called non-parametric models. In the absence of an assumed
form, non-parametric models are free to learn any functional form from
the data. Non-parametric models generally require a lot of training data
to estimate the underlying function. Some examples of non-parametric
models include the following:

- Decision trees
- K-nearest neighbors
- Neural networks
- Support vector machines with Gaussian kernels



Logistic Regression Demystified
-------------------------------

Logistic regression is a linear model similar to the linear regression
that was covered in the previous lab. At the core of logistic
regression is the sigmoid function, which squashes any real-valued number
into a value between 0 and 1, which renders this function ideal for
predicting probabilities. The mathematical equation for a logistic
regression function can be written as follows:

![](./images/B15019_03_40.jpg)

Caption: Logistic regression function

Here, *Y* is the probability of whether a customer is likely to buy a
term deposit or not.

The terms *C0 + M1 \* X1 + M2 \* X2* are very similar to the ones we
have seen in the linear regression function, covered in an earlier
lab. As you would have learned by now, a linear regression function
gives a real-valued output. To transform the real-valued output into a
probability, we use the logistic function, which has the following form:

![Caption: An expression to transform the real-valued output to a
probability ](./images/B15019_03_41.jpg)

Caption: An expression to transform the real-valued output to a
probability

Here, *e* is the base of the natural logarithm. We will not dive deep
into the math behind this; however, let\'s realize that, using the
logistic function, we can transform the real-valued output into a
probability.

Let\'s now look at the logistic regression function from the business
problem that we are trying to solve. In the business problem, we are
trying to predict the probability of whether a customer would buy a term
deposit or not. To do that, let\'s return to the example we derived from
the problem statement:

![](./images/B15019_03_42.jpg)

Caption: The logistic regression function updated with the business
problem statement

Adding the values from our example, we get *Y = 0.1 + 0.4 \* 40 + 0.002 \*
1000 = 18.1*.

To get the probability, we must transform this real-valued output using
the logistic function, as follows:

![Caption: Transformed problem statement to find the probability of
using the logistic function ](./images/B15019_03_43.jpg)

Caption: Transformed problem statement to find the probability of
using the logistic function

In applying this, we get a value of *Y* that is very close to 1, which
means there is almost a 100% probability that the customer will buy the
term deposit. As discussed in the previous example, the coefficients of
the model such as 0.1, 0.4, and 0.002 are what we learn using the
logistic regression algorithm during the training process. 
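
A short snippet can verify this arithmetic. This is only a sketch of the
calculation; the coefficients 0.1, 0.4, and 0.002 are the illustrative
values used above, not coefficients learned from data:

```
import math

# Illustrative coefficients and inputs from the example above
age, balance = 40, 1000
linear_output = 0.1 + 0.4 * age + 0.002 * balance   # 18.1

# Logistic (sigmoid) transformation: 1 / (1 + e^(-x))
probability = 1 / (1 + math.exp(-linear_output))

print(linear_output)   # 18.1
print(probability)     # ~0.99999999, effectively a 100% probability
```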
+ + + +Metrics for Evaluating Model Performance +---------------------------------------- + +As a data scientist, you always have to make decisions on the models you +build. These evaluations are done based on various metrics on the +predictions. In this section, we introduce some of the important metrics +that are used for evaluating the performance of models. + +Note + +Model performance will be covered in much more detail in *Lab 6*, +*How to Assess Performance*. This section provides you with an +introduction to work with classification models. + + + +Confusion Matrix +---------------- + +As you will have learned, we evaluate a model based on its performance +on a test set. A test set will have its labels, which we call the ground +truth, and, using the model, we also generate predictions for the test +set. The evaluation of model performance is all about comparison of the +ground truth and the predictions. Let\'s see this in action with a dummy +test set: + +![](./images/B15019_03_44.jpg) + +Caption: Confusion matrix generation + +The preceding table shows a dummy dataset with seven examples. The +second column is the ground truth, which are the actual labels, and the +third column contains the results of our predictions. From the data, we +can see that four have been correctly classified and three were +misclassified. + +A confusion matrix generates the resultant comparison between prediction +and ground truth, as represented in the following table: + +![](./images/B15019_03_45.jpg) + +Caption: Confusion matrix + +As you can see from the table, there are five examples whose labels +(ground truth) are` Yes` and the balance is two examples that +have the labels of` No`. + +The first row of the confusion matrix is the evaluation of the label +`Yes`. `True positive` shows those examples whose +ground truth and predictions are `Yes` (examples 1, 3, and 5). +`False negative` shows those examples whose ground truth is +`Yes` and who have been wrongly predicted as `No` +(examples 2 and 7). + +Similarly, the second row of the confusion matrix evaluates the +performance of the label `No`. `False positive` are +those examples whose ground truth is `No` and who have been +wrongly classified as `Yes` (example 6). +`True negative` examples are those examples whose ground truth +and predictions are both `No` (example 4). + +The generation of a confusion matrix is used for calculating many of the +matrices such as accuracy and classification reports, which are +explained later. It is based on metrics such as accuracy or other +detailed metrics shown in the classification report such as precision or +recall the models for testing. We generally pick models where these +metrics are the highest. + + + +Accuracy +-------- + +Accuracy is the first level of evaluation, which we will resort to in +order to have a quick check on model performance. Referring to the +preceding table, accuracy can be represented as follows: + +![](./images/B15019_03_46.jpg) + +Caption: A function that represents accuracy + +Accuracy is the proportion of correct predictions out of all of the +predictions. + + + +Classification Report +--------------------- + +A classification report outputs three key metrics: **precision**, +**recall**, and the **F1 score**. 
+ +Precision is the ratio of true positives to the sum of true positives +and false positives: + +![](./images/B15019_03_47.jpg) + +Caption: The precision ratio + +Precision is the indicator that tells you, out of all of the positives +that were predicted, how many were true positives. + +Recall is the ratio of true positives to the sum of true positives and +false negatives: + +![](./images/B15019_03_48.jpg) + +Caption: The recall ratio + +Recall manifests the ability of the model to identify all true +positives. + +The F1 score is a weighted score of both precision and recall. An F1 +score of 1 indicates the best performance and 0 indicates the worst +performance. + +In the next section, let\'s take a look at data preprocessing, which is +an important process to work with data and come to conclusions in data +analysis. + + + +Data Preprocessing +------------------ + +Data preprocessing has an important role to play in the life cycle of +data science projects. These processes are often the most time-consuming +part of the data science life cycle. Careful implementation of the +preprocessing steps is critical and will have a strong bearing on the +results of the data science project. + +The various preprocessing steps include the following: + +- **Data loading**: This involves loading the data from different + sources into the notebook. + +- **Data cleaning**: Data cleaning process entails removing anomalies, + for instance, special characters, duplicate data, and identification + of missing data from the available dataset. Data cleaning is one of + the most time-consuming steps in the data science process. + +- **Data imputation**: Data imputation is filling missing data with + new data points. + +- **Converting data types**: Datasets will have different types of + data such as numerical data, categorical data, and character data. + Running models will necessitate the transformation of data types. + + Note + + Data processing will be covered in depth in the following labs + of this book. + +We will implement some of these preprocessing steps in the subsequent +sections and in *Exercise 3.06*, *A Logistic Regression Model for +Predicting the Propensity of Term Deposit Purchases in a Bank*. + + + +Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank +------------------------------------------------------------------------------------------------------------ + +In this exercise, we will build a logistic regression model, which will +be used for predicting the propensity of term deposit purchases. This +exercise will have three parts. The first part will be the preprocessing +of the data, the second part will deal with the training process, and +the last part will be spent on prediction, analysis of metrics, and +deriving strategies for further improvement of the model. + +You begin with data preprocessing. + +In this part, we will first load the data, convert the ordinal data into +dummy data, and then split the data into training and test sets for the +subsequent training phase: + +1. Open a Colab notebook, mount the drives, install necessary packages, + and load the data, as in previous exercises: + ``` + import pandas as pd + import altair as alt + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab03/bank-full.csv' + bankData = pd.read_csv(file_url, sep=";") + ``` + + +2. 
Now, load the library functions and data: + ``` + from sklearn.linear_model import LogisticRegression + from sklearn.model_selection import train_test_split + ``` + + +3. Now, find the data types: + + ``` + bankData.dtypes + ``` + + + You should get the following output: + + +![](./images/B15019_03_49.jpg) + + + Caption: Data types + +4. Convert the ordinal data into dummy data. + + As you can see in the dataset, we have two types of data: the + numerical data and the ordinal data. Machine learning algorithms + need numerical representation of data and, therefore, we must + convert the ordinal data into a numerical form by creating dummy + variables. The dummy variable will have values of either 1 or 0 + corresponding to whether that category is present or not. The + function we use for converting ordinal data into numerical form is + `pd.get_dummies()`. This function converts the data + structure into a long form or horizontal form. So, if there are + three categories in a variable, there will be three new variables + created as dummy variables corresponding to each of the categories. + + The value against each variable would be either 1 or 0, depending on + whether that category was present in the variable as an example. + Let\'s look at the code for doing that: + + ``` + """ + Converting all the categorical variables to dummy variables + """ + bankCat = pd.get_dummies\ + (bankData[['job','marital',\ + 'education','default','housing',\ + 'loan','contact','month','poutcome']]) + bankCat.shape + ``` + + + You should get the following output: + + ``` + (45211, 44) + ``` + + + We now have a new subset of the data corresponding to the + categorical data that was converted into numerical form. Also, we + had some numerical variables in the original dataset, which did not + need any transformation. The transformed categorical data and the + original numerical data have to be combined to get all of the + original features. To combine both, let\'s first extract the + numerical data from the original DataFrame. + +5. Now, separate the numerical variables: + + ``` + bankNum = bankData[['age','balance','day','duration',\ + 'campaign','pdays','previous']] + bankNum.shape + ``` + + + You should get the following output: + + ``` + (45211, 7) + ``` + + +6. Now, prepare the `X` and `Y` variables and print + the `Y` shape. The `X` variable is the + concatenation of the transformed categorical variable and the + separated numerical data: + + ``` + # Preparing the X variables + X = pd.concat([bankCat, bankNum], axis=1) + print(X.shape) + # Preparing the Y variable + Y = bankData['y'] + print(Y.shape) + X.head() + ``` + + + The output shown below is truncated: + + +![](./images/B15019_03_50.jpg) + + + Figure 3.50 Combining categorical and numerical DataFrames + + Once the DataFrame is created, we can split the data into training + and test sets. We specify the proportion in which the DataFrame must + be split into training and test sets. + +7. Split the data into training and test sets: + + ``` + # Splitting the data into train and test sets + X_train, X_test, y_train, y_test = train_test_split\ + (X, Y, test_size=0.3, \ + random_state=123) + ``` + + + Now, the data is all prepared for the modeling task. Next, we begin + with modeling. + + In this part, we will train the model using the training set we + created in the earlier step. First, we call the + `logistic regression `function and then fit the model with + the training set data. + +8. 
Define the `LogisticRegression` function: + + ``` + bankModel = LogisticRegression() + bankModel.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_03_51.jpg) + + + Caption: Parameters of the model that fits + +9. Now, that the model is created, use it for predicting on the test + sets and then getting the accuracy level of the predictions: + + ``` + pred = bankModel.predict(X_test) + print('Accuracy of Logistic regression model' \ + 'prediction on test set: {:.2f}'\ + .format(bankModel.score(X_test, y_test))) + ``` + + + You should get the following output: + + +![](./images/B15019_03_52.jpg) + + + Caption: Prediction with the model + +10. From an initial look, an accuracy metric of 90% gives us the + impression that the model has done a decent job of approximating the + data generating process. Or is it otherwise? Let\'s take a closer + look at the details of the prediction by generating the metrics for + the model. We will use two metric-generating functions, the + confusion matrix and classification report: + + ``` + # Confusion Matrix for the model + from sklearn.metrics import confusion_matrix + confusionMatrix = confusion_matrix(y_test, pred) + print(confusionMatrix) + ``` + + + You should get the following output in the following format; + however, the values can vary as the modeling task will involve + variability: + + +![](./images/B15019_03_53.jpg) + + + Caption: Generation of the confusion matrix + + Note + + The end results that you get will be different from what you see + here as it depends on the system you are using. This is because the + modeling part is stochastic in nature and there will always be + differences. + +11. Next, let\'s generate a `classification_report`: + + ``` + from sklearn.metrics import classification_report + print(classification_report(y_test, pred)) + ``` + + + You should get a similar output; however, with different values due + to variability in the modeling process: + + +![](./images/B15019_03_54.jpg) + + + +From the metrics, we can see that, out of the total 11,998 examples of +`no`, 11,754 were correctly classified as `no` and +the balance, 244, were classified as `yes`. This gives a +recall value of *11,754/11,998*, which is nearly 98%. From a precision +perspective, out of the total 12,996 examples that were predicted as +`no`, only 11,754 of them were really `no`, which +takes our precision to 11,754/12,996 or 90%. + +However, the metrics for `yes` give a different picture. Out +of the total 1,566 cases of `yes`, only 324 were correctly +identified as `yes`. This gives us a recall of *324/1,566 = +21%*. The precision is *324 / (324 + 244) = 57%*. + +From an overall accuracy level, this can be calculated as follows: +correctly classified *examples / total examples = (11754 + 324) / 13564 += 89%*. + +The metrics might seem good when you look only at the accuracy level. +However, looking at the details, we can see that the classifier, in +fact, is doing a poor job of classifying the `yes` cases. The +classifier has been trained to predict mostly `no` values, +which from a business perspective is useless. From a business +perspective, we predominantly want the `yes` estimates, so +that we can target those cases for focused marketing to try to sell term +deposits. However, with the results we have, we don\'t seem to have done +a good job in helping the business to increase revenue from term deposit +sales. 
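
The per-class numbers quoted above can be reproduced directly from the
confusion matrix counts. The sketch below hard-codes the counts from this
particular run; your own confusion matrix will differ slightly because of
the variability in the modeling process mentioned earlier:

```
# Counts taken from the confusion matrix discussed above
tn = 11754   # actual 'no' predicted as 'no'
fp = 244     # actual 'no' predicted as 'yes'
fn = 1242    # actual 'yes' predicted as 'no' (1566 - 324)
tp = 324     # actual 'yes' predicted as 'yes'

recall_no = tn / (tn + fp)                    # ~0.98
precision_no = tn / (tn + fn)                 # ~0.90
recall_yes = tp / (tp + fn)                   # ~0.21
precision_yes = tp / (tp + fp)                # ~0.57
accuracy = (tn + tp) / (tn + fp + fn + tp)    # ~0.89

print(recall_no, precision_no, recall_yes, precision_yes, accuracy)
```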
+ +In this exercise, we have preprocessed data, then we performed the +training process, and finally, we found useful prediction, analysis of +metrics, and deriving strategies for further improvement of the model. + +What we have now built is the first model or a benchmark model. The next +step is to try to improve on the benchmark model through different +strategies. One such strategy is to feature engineer variables and build +new models with new features. Let\'s achieve that in the next activity. + + + +Activity 3.02: Model Iteration 2 -- Logistic Regression Model with Feature Engineered Variables +----------------------------------------------------------------------------------------------- + +As the data scientist of the bank, you created a benchmark model to +predict which customers are likely to buy a term deposit. However, +management wants to improve the results you got in the benchmark model. +In *Exercise 3.04*, *Feature Engineering -- Creating New Features from +Existing Ones,* you discussed the business scenario with the marketing +and operations teams and created a new variable, `assetIndex`, +by feature engineering three raw variables. You are now fitting another +logistic regression model on the feature engineered variables and are +trying to improve the results. + +In this activity, you will be feature engineering some of the variables +to verify their effects on the predictions. + +The steps are as follows: + +1. Open the Colab notebook used for the feature engineering in + *Exercise 3.04*, *Feature Engineering -- Creating New Features from + Existing Ones,* and execute all of the steps from that exercise. + +2. Create dummy variables for the categorical variables using the + `pd.get_dummies()` function. Exclude original raw + variables such as loan and housing, which were used to create the + new variable, `assetIndex`. + +3. Select the numerical variables including the new feature engineered + variable, `assetIndex`, that was created. + +4. Transform some of the numerical variables by normalizing them using + the `MinMaxScaler()` function. + +5. Concatenate the numerical variables and categorical variables using + the `pd.concat()` function and then create `X` + and `Y` variables. + +6. Split the dataset using the `train_test_split()` function + and then fit a new model using the `LogisticRegression()` + model on the new features. + +7. Analyze the results after generating the confusion matrix and + classification report. + + You should get the following output: + + +![](./images/B15019_03_55.jpg) + + +Caption: Expected output with the classification report + + +Summary +======= + + +In this lab, we learned about binary classification using logistic +regression from the perspective of solving a use case. Let\'s summarize +our learnings in this lab. We were introduced to classification +problems and specifically binary classification problems. We also looked +at the classification problem from the perspective of predicting term +deposit propensity through a business discovery process. In the business +discovery process, we identified different business drivers that +influence business outcomes. \ No newline at end of file diff --git a/lab_guides/Lab_4.md b/lab_guides/Lab_4.md new file mode 100644 index 0000000..a0bc0bb --- /dev/null +++ b/lab_guides/Lab_4.md @@ -0,0 +1,1767 @@ + +4. 
Multiclass Classification with RandomForest +============================================== + + + +Overview + +This lab will show you how to train a multiclass classifier using +the Random Forest algorithm. You will also see how to evaluate the +performance of multiclass models. + +By the end of the lab, you will be able to implement a Random Forest +classifier, as well as tune hyperparameters in order to improve model +performance. + + + + +Training a Random Forest Classifier +=================================== + + + +Let\'s see how we can train a Random Forest classifier on this dataset. +First, we need to load the data from the GitHub repository using +`pandas` and then we will print its first five rows using the +`head()` method. + +Note + +All the example code given outside of Exercises in this lab relates +to this Activity Recognition dataset. It is recommended that all code +from these examples is entered and run in a single Google Colab +Notebook, and kept separate from your Exercise Notebooks. + +``` +import pandas as pd +file_url = 'https://raw.githubusercontent.com/fenago'\ + '/data-science/master/Lab04/'\ + 'Dataset/activity.csv' +df = pd.read_csv(file_url) +df.head() +``` + +The output will be as follows: + +![](./images/B15019_04_01.jpg) + +Caption: First five rows of the dataset + +Each row represents an activity that was performed by a person and the +name of the activity is stored in the `Activity` column. There +are seven different activities in this variable: `bending1`, +`bending2`, `cycling`, `lying`, +`sitting`, `standing`, and `Walking`. The +other six columns are different measurements taken from sensor data. + +In this example, you will accurately predict the target variable +(`'Activity'`) from the features (the six other columns) using +Random Forest. For example, for the first row of the preceding example, +the model will receive the following features as input and will predict +the `'bending1'` class: + +![](./images/B15019_04_02.jpg) + +Caption: Features for the first row of the dataset + +But before that, we need to do a bit of data preparation. The +`sklearn` package (we will use it to train Random Forest +model) requires the target variable and the features to be separated. +So, we need to extract the response variable using the +`.pop()` method from `pandas`. The +`.pop()` method extracts the specified column and removes it +from the DataFrame: + +``` +target = df.pop('Activity') +``` +Now the response variable is contained in the variable called +`target` and all the features are in the DataFrame called +`df`. + +Now we are going to split the dataset into training and testing sets. +The model uses the training set to learn relevant parameters in +predicting the response variable. The test set is used to check whether +a model can accurately predict unseen data. We say the model is +overfitting when it has learned the patterns relevant only to the +training set and makes incorrect predictions about the testing set. In +this case, the model performance will be much higher for the training +set compared to the testing one. Ideally, we want to have a very similar +level of performance for the training and testing sets. This topic will +be covered in more depth in *Lab 7*, *The Generalization of Machine +Learning Models*. + +The `sklearn` package provides a function called +`train_test_split()` to randomly split the dataset into two +different sets. 
We need to specify the following parameters for this +function: the feature and target variables, the ratio of the testing set +(`test_size`), and `random_state` in order to get +reproducible results if we have to run the code again: + +``` +from sklearn.model_selection import train_test_split +X_train, X_test, y_train, y_test = train_test_split\ + (df, target, test_size=0.33, \ + random_state=42) +``` + +There are four different outputs to the `train_test_split()` +function: the features for the training set, the target variable for the +training set, the features for the testing set, and its target variable. + +Now that we have got our training and testing sets, we are ready for +modeling. Let\'s first import the `RandomForestClassifier` +class from `sklearn.ensemble`: + +``` +from sklearn.ensemble import RandomForestClassifier +``` +Now we can instantiate the Random Forest classifier with some +hyperparameters. Remember from *Lab 1, Introduction to Data Science +in Python*, a hyperparameter is a type of parameter the model can\'t +learn but is set by data scientists to tune the model\'s learning +process. This topic will be covered more in depth in *Lab 8, +Hyperparameter Tuning*. For now, we will just specify the +`random_state` value. We will walk you through some of the key +hyperparameters in the following sections: + +``` +rf_model = RandomForestClassifier(random_state=1, \ + n_estimators=10) +``` + +The next step is to train (also called fit) the model with the training +data. During this step, the model will try to learn the relationship +between the response variable and the independent variables and save the +parameters learned. We need to specify the features and target variables +as parameters: + +``` +rf_model.fit(X_train, y_train) +``` + +The output will be as follows: + +![](./images/B15019_04_03.jpg) + +Caption: Logs of the trained RandomForest + +Now that the model has completed its training, we can use the parameters +it learned to make predictions on the input data we will provide. In the +following example, we are using the features from the training set: + +``` +preds = rf_model.predict(X_train) +``` +Now we can print these predictions: + +``` +preds +``` + +The output will be as follows: + +![Caption: Predictions of the RandomForest algorithm on the training +set ](./images/B15019_04_04.jpg) + +Caption: Predictions of the RandomForest algorithm on the training +set + +This output shows us the model predicted, respectively, the values +`lying`, `bending1`, and `cycling` for the +first three observations and `cycling`, `bending1`, +and `standing` for the last three observations. Python, by +default, truncates the output for a long list of values. This is why it +shows only six values here. + +These are basically the key steps required for training a Random Forest +classifier. This was quite straightforward, right? Training a machine +learning model is incredibly easy but getting meaningful and accurate +results is where the challenges lie. In the next section, we will learn +how to assess the performance of a trained model. + + +Evaluating the Model\'s Performance +=================================== + + +Now that we know how to train a Random Forest classifier, it is time to +check whether we did a good job or not. What we want is to get a model +that makes extremely accurate predictions, so we need to assess its +performance using some kind of metric. 
+ +For a classification problem, multiple metrics can be used to assess the +model\'s predictive power, such as F1 score, precision, recall, or ROC +AUC. Each of them has its own specificity and depending on the projects +and datasets, you may use one or another. + +In this lab, we will use a metric called **accuracy score**. It +calculates the ratio between the number of correct predictions and the +total number of predictions made by the model: + +![](./images/B15019_04_05.jpg) + +Caption: Formula for accuracy score + +For instance, if your model made 950 correct predictions out of 1,000 +cases, then the accuracy score would be 950/1000 = 0.95. This would mean +that your model was 95% accurate on that dataset. The +`sklearn` package provides a function to calculate this score +automatically and it is called `accuracy_score()`. We need to +import it first: + +``` +from sklearn.metrics import accuracy_score +``` + +Then, we just need to provide the list of predictions for some +observations and the corresponding true value for the target variable. +Using the previous example, we will use the `y_train` and +`preds` variables, which respectively contain the response +variable (also known as the target) for the training set and the +corresponding predictions made by the Random Forest model. We will reuse +the predictions from the previous section -- `preds`: + +``` +accuracy_score(y_train, preds) +``` + +The output will be as follows: + +![](./images/B15019_04_06.jpg) + +Caption: Accuracy score on the training set + +We achieved an accuracy score of 0.988 on our training data. This means +we accurately predicted more than `98%` of these cases. +Unfortunately, this doesn\'t mean you will be able to achieve such a +high score for new, unseen data. Your model may have just learned the +patterns that are only relevant to this training set, and in that case, +the model will overfit. + +If we take the analogy of a student learning a subject for a semester, +they could memorize by heart all the textbook exercises but when given a +similar but unseen exercise, they wouldn\'t be able to solve it. +Ideally, the student should understand the underlying concepts of the +subject and be able to apply that learning to any similar exercises. +This is exactly the same for our model: we want it to learn the generic +patterns that will help it to make accurate predictions even on unseen +data. + +But how can we assess the performance of a model for unseen data? Is +there a way to get that kind of assessment? The answer to these +questions is yes. + +Remember, in the last section, we split the dataset into training and +testing sets. We used the training set to fit the model and assess its +predictive power on it. But it hasn\'t seen the observations from the +testing set at all, so we can use it to assess whether our model is +capable of generalizing unseen data. Let\'s calculate the accuracy score +for the testing set: + +``` +test_preds = rf_model.predict(X_test) +accuracy_score(y_test, test_preds) +``` + +The output will be as follows: + +![](./images/B15019_04_07.jpg) + +Caption: Accuracy score on the testing set + +OK. Now the accuracy has dropped drastically to `0.77`. The +difference between the training and testing sets is quite big. This +tells us our model is actually overfitting and learned only the patterns +relevant to the training set. In an ideal case, the performance of your +model should be equal or very close to equal for those two sets. 
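Because we will compare the training and testing accuracy again and again while tuning hyperparameters, it can help to wrap these steps in a small helper function. The following snippet is a minimal sketch rather than part of the original example: it assumes the `X_train`, `X_test`, `y_train`, and `y_test` variables created earlier for the Activity Recognition dataset are still in memory, and the `fit_and_report()` name is purely illustrative:

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def fit_and_report(X_train, X_test, y_train, y_test, **hyperparams):
    # Fit a Random Forest with the given hyperparameters and print
    # the accuracy scores for the training and testing sets
    model = RandomForestClassifier(random_state=1, **hyperparams)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Training accuracy: {train_acc:.3f}")
    print(f"Testing accuracy: {test_acc:.3f}")
    print(f"Gap: {train_acc - test_acc:.3f}")
    return model

rf_check = fit_and_report(X_train, X_test, y_train, y_test, \
                          n_estimators=10)
```

The larger the gap printed on the last line, the more the model is overfitting the training data.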
+ +In the next sections, we will look at tuning some Random Forest +hyperparameters in order to reduce overfitting. + + + +Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance +----------------------------------------------------------------------------------------- + +In this exercise, we will train a Random Forest classifier to predict +the type of an animal based on its attributes and check its accuracy +score: + + +1. Open a new Colab notebook. + +2. Import the `pandas` package: + ``` + import pandas as pd + ``` + + +3. Create a variable called `file_url` that contains the URL + of the dataset: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab04/Dataset'\ + '/openml_phpZNNasq.csv' + ``` + + +4. Load the dataset into a DataFrame using the `.read_csv()` + method from pandas: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Print the first five rows of the DataFrame: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_04_08.jpg) + + + Caption: First five rows of the DataFrame + + We will be using the `type` column as our target variable. + We will need to remove the `animal` column from the + DataFrame and only use the remaining columns as features. + +6. Remove the `'animal'` column using the `.drop()` + method from `pandas` and specify the + `columns='animal'` and `inplace=True` parameters + (to directly update the original DataFrame): + ``` + df.drop(columns='animal', inplace=True) + ``` + + +7. Extract the `'type'` column using the `.pop()` + method from `pandas`: + ``` + y = df.pop('type') + ``` + + +8. Print the first five rows of the updated DataFrame: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_04_09.jpg) + + + Caption: First five rows of the DataFrame + +9. Import the `train_test_split` function from + `sklearn.model_selection`: + ``` + from sklearn.model_selection import train_test_split + ``` + + +10. Split the dataset into training and testing sets with the + `df`, `y`, `test_size=0.4`, and + `random_state=188` parameters: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.4, \ + random_state=188) + ``` + + +11. Import `RandomForestClassifier` from + `sklearn.ensemble`: + ``` + from sklearn.ensemble import RandomForestClassifier + ``` + + +12. Instantiate the `RandomForestClassifier` object with + `random_state` equal to `42`. Set the + `n-estimators` value to an initial default value of + `10`. We\'ll discuss later how changing this value affects + the result. + ``` + rf_model = RandomForestClassifier(random_state=42, \ + n_estimators=10) + ``` + + +13. Fit `RandomForestClassifier` with the training set: + + ``` + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_10.jpg) + + + Caption: Logs of RandomForestClassifier + +14. Predict the outcome of the training set with the + `.predict()`method, save the results in a variable called + \'`train_preds`\', and print its value: + + ``` + train_preds = rf_model.predict(X_train) + train_preds + ``` + + + You should get the following output: + + +![](./images/B15019_04_11.jpg) + + + Caption: Predictions on the training set + +15. Import the `accuracy_score` function from + `sklearn.metrics`: + ``` + from sklearn.metrics import accuracy_score + ``` + + +16. 
Calculate the accuracy score on the training set, save the result in + a variable called `train_acc`, and print its value: + + ``` + train_acc = accuracy_score(y_train, train_preds) + print(train_acc) + ``` + + + You should get the following output: + + +![](./images/B15019_04_12.jpg) + + + Caption: Accuracy score on the training set + + Our model achieved an accuracy of `1` on the training set, + which means it perfectly predicted the target variable on all of + those observations. Let\'s check the performance on the testing set. + +17. Predict the outcome of the testing set with the + `.predict()` method and save the results into a variable + called `test_preds`: + ``` + test_preds = rf_model.predict(X_test) + ``` + + +18. Calculate the accuracy score on the testing set, save the result in + a variable called `test_acc`, and print its value: + + ``` + test_acc = accuracy_score(y_test, test_preds) + print(test_acc) + ``` + + + You should get the following output: + + +![](./images/B15019_04_13.jpg) + + + +Number of Trees Estimator +------------------------- + +Now that we know how to fit a Random Forest classifier and assess its +performance, it is time to dig into the details. In the coming sections, +we will learn how to tune some of the most important hyperparameters for +this algorithm. As mentioned in *Lab 1, Introduction to Data Science +in Python*, hyperparameters are parameters that are not learned +automatically by machine learning algorithms. Their values have to be +set by data scientists. These hyperparameters can have a huge impact on +the performance of a model, its ability to generalize to unseen data, +and the time taken to learn patterns from the data. + +The first hyperparameter you will look at in this section is called +`n_estimators`. This hyperparameter is responsible for +defining the number of trees that will be trained by the +`RandomForest` algorithm. + +Before looking at how to tune this hyperparameter, we need to understand +what a tree is and why it is so important for the +`RandomForest` algorithm. + +A tree is a logical graph that maps a decision and its outcomes at each +of its nodes. Simply speaking, it is a series of yes/no (or true/false) +questions that lead to different outcomes. + +A leaf is a special type of node where the model will make a prediction. +There will be no split after a leaf. A single node split of a tree may +look like this: + +![](./images/B15019_04_14.jpg) + +Caption: Example of a single tree node + +A tree node is composed of a question and two outcomes depending on +whether the condition defined by the question is met or not. In the +preceding example, the question is `is avg_rss12 > 41?` If the +answer is yes, the outcome is the `bending_1` leaf and if not, +it will be the `sitting` leaf. + +A tree is just a series of nodes and leaves combined together: + +![](./images/B15019_04_15.jpg) + +Caption: Example of a tree + +In the preceding example, the tree is composed of three nodes with +different questions. Now, for an observation to be predicted as +`sitting`, it will need to meet the conditions: +`avg_rss13 <= 41`, `var_rss > 0.7`, and +`avg_rss13 <= 16.25`. + +The `RandomForest` algorithm will build this kind of tree +based on the training data it sees. We will not go through the +mathematical details about how it defines the split for each node but, +basically, it will go through every column of the dataset and see which +split value will best help to separate the data into two groups of +similar classes. 
Taking the preceding example, the first node with the +`avg_rss13 > 41` condition will help to get the group of data +on the left-hand side with mostly the `bending_1` class. The +`RandomForest` algorithm usually builds several of this kind +of tree and this is the reason why it is called a forest. + +As you may have guessed now, the `n_estimators` hyperparameter +is used to specify the number of trees the `RandomForest` +algorithm will build. For example (as in the previous exercise), say we +ask it to build 10 trees. For a given observation, it will ask each tree +to make a prediction. Then, it will average those predictions and use +the result as the final prediction for this input. For instance, if, out +of 10 trees, 8 of them predict the outcome `sitting`, then the +`RandomForest` algorithm will use this outcome as the final +prediction. + +Note + +If you don\'t pass in a specific `n_estimators` +hyperparameter, it will use the default value. The default depends on +the version of scikit-learn you\'re using. In early versions, the +default value is 10. From version 0.22 onwards, the default is 100. You +can find out which version you are using by executing the following +code: + +`import sklearn` + +`sklearn.__version__` + +For more information, see here: + + +In general, the higher the number of trees is, the better the +performance you will get. Let\'s see what happens with +`n_estimators = 2` on the Activity Recognition dataset: + +``` +rf_model2 = RandomForestClassifier(random_state=1, \ + n_estimators=2) +rf_model2.fit(X_train, y_train) +preds2 = rf_model2.predict(X_train) +test_preds2 = rf_model2.predict(X_test) +print(accuracy_score(y_train, preds2)) +print(accuracy_score(y_test, test_preds2)) +``` + +The output will be as follows: + +![](./images/B15019_04_16.jpg) + +Caption: Accuracy of RandomForest with n\_estimators = 2 + +As expected, the accuracy is significantly lower than the previous +example with `n_estimators = 10`. Let\'s now try with +`50` trees: + +``` +rf_model3 = RandomForestClassifier(random_state=1, \ + n_estimators=50) +rf_model3.fit(X_train, y_train) +preds3 = rf_model3.predict(X_train) +test_preds3 = rf_model3.predict(X_test) +print(accuracy_score(y_train, preds3)) +print(accuracy_score(y_test, test_preds3)) +``` + +The output will be as follows: + +![](./images/B15019_04_17.jpg) + +Caption: Accuracy of RandomForest with n\_estimators = 50 + +With `n_estimators = 50`, we respectively gained +`1%` and `2%` on the accuracy scored for the +training and testing sets, which is great. But the main drawback of +increasing the number of trees is that it requires more computational +power. So, it will take more time to train a model. In a real project, +you will need to find the right balance between performance and training +duration. + + + +Exercise 4.02: Tuning n\_estimators to Reduce Overfitting +--------------------------------------------------------- + +In this exercise, we will train a Random Forest classifier to predict +the type of an animal based on its attributes and will try two different +values for the `n_estimators` hyperparameter: + +We will be using the same zoo dataset as in the previous exercise. + +1. Open a new Colab notebook. + +2. Import the `pandas `package, `train_test_split`, + `RandomForestClassifier`, and `accuracy_score` + from `sklearn`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestClassifier + from sklearn.metrics import accuracy_score + ``` + + +3. 
Create a variable called `file_url` that contains the URL + to the dataset: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab04/Dataset'\ + '/openml_phpZNNasq.csv' + ``` + + +4. Load the dataset into a DataFrame using the `.read_csv()` + method from `pandas`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Remove the `animal` column using `.drop()` and + then extract the `type` target variable into a new + variable called `y` using `.pop()`: + ``` + df.drop(columns='animal', inplace=True) + y = df.pop('type') + ``` + + +6. Split the data into training and testing sets with + `train_test_split()` and the `test_size=0.4` and + `random_state=188` parameters: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.4, \ + random_state=188) + ``` + + +7. Instantiate `RandomForestClassifier` with + `random_state=42` and `n_estimators=1`, and then + fit the model with the training set: + + ``` + rf_model = RandomForestClassifier(random_state=42, \ + n_estimators=1) + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_18.jpg) + + + Caption: Logs of RandomForestClassifier + +8. Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds` and `test_preds`: + ``` + train_preds = rf_model.predict(X_train) + test_preds = rf_model.predict(X_test) + ``` + + +9. Calculate the accuracy score for the training and testing sets and + save the results in two new variables called `train_acc` + and `test_acc`: + ``` + train_acc = accuracy_score(y_train, train_preds) + test_acc = accuracy_score(y_test, test_preds) + ``` + + +10. Print the accuracy scores: `train_acc` and + `test_acc`: + + ``` + print(train_acc) + print(test_acc) + ``` + + + You should get the following output: + + +![](./images/B15019_04_19.jpg) + + + Caption: Accuracy scores for the training and testing sets + + The accuracy score decreased for both the training and testing sets. + But now the difference is smaller compared to the results from + *Exercise 4.01*, *Building a Model for Classifying Animal Type and + Assessing Its Performance*. + +11. Instantiate another `RandomForestClassifier` with + `random_state=42` and `n_estimators=30`, and + then fit the model with the training set: + + ``` + rf_model2 = RandomForestClassifier(random_state=42, \ + n_estimators=30) + rf_model2.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_20.jpg) + + + Caption: Logs of RandomForest with n\_estimators = 30 + +12. Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds2` and `test_preds2`: + ``` + train_preds2 = rf_model2.predict(X_train) + test_preds2 = rf_model2.predict(X_test) + ``` + + +13. Calculate the accuracy score for the training and testing sets and + save the results in two new variables called `train_acc2` + and `test_acc2`: + ``` + train_acc2 = accuracy_score(y_train, train_preds2) + test_acc2 = accuracy_score(y_test, test_preds2) + ``` + + +14. 
Print the accuracy scores: `train_acc2` and
    `test_acc2`:

    ```
    print(train_acc2)
    print(test_acc2)
    ```

    You should get the following output:

![](./images/B15019_04_21.jpg)

Caption: Accuracy scores for the training and testing sets


Maximum Depth
=============


In the previous section, we learned how Random Forest builds multiple
trees to make predictions. Increasing the number of trees does improve
model performance but it usually doesn\'t help much to decrease the risk
of overfitting. Our model in the previous example is still performing
much better on the training set (data it has already seen) than on the
testing set (unseen data).

So, we are not confident enough yet to say the model will perform well
in production. There are different hyperparameters that can help to
lower the risk of overfitting for Random Forest and one of them is
called `max_depth`.

This hyperparameter defines the depth of the trees built by Random
Forest. Basically, it tells the Random Forest model how many nodes
(questions) it can create before making predictions. But how will that
help to reduce overfitting, you may ask. Well, let\'s say you built a
single tree and set the `max_depth` hyperparameter to
`50`. This would mean that there would be some cases where you
could ask 49 different questions (the value of 50 includes the
final leaf node) before making a prediction. So, the logic would be
`IF X1 > value1 AND X2 > value2 AND X1 <= value3 AND … AND X3 > value49 THEN predict class A`.

As you can imagine, this is a very specific rule. In the end, it may
apply to only a few observations in the training set, with this case
appearing very infrequently. Therefore, your model would be overfitting.
By default, the value of this `max_depth` parameter is
`None`, which means there is no limit set for the depth of the
trees.

What you really want is to find some rules that are generic enough to be
applied to bigger groups of observations. This is why it is recommended
to not create deep trees with Random Forest. Let\'s try several values
for this hyperparameter on the Activity Recognition dataset:
`3`, `10`, and `50`:

```
rf_model4 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, max_depth=3)
rf_model4.fit(X_train, y_train)
preds4 = rf_model4.predict(X_train)
test_preds4 = rf_model4.predict(X_test)
print(accuracy_score(y_train, preds4))
print(accuracy_score(y_test, test_preds4))
```
You should get the following output:

![Caption: Accuracy scores for the training and testing sets and a
max\_depth of 3 ](./images/B15019_04_22.jpg)

Caption: Accuracy scores for the training and testing sets and a
max\_depth of 3

For a `max_depth` of `3`, we got extremely similar
results for the training and testing sets but the overall performance
decreased drastically to `0.61`. Our model is not overfitting
anymore, but it is now underfitting; that is, it is not predicting the
target variable very well (only in `61%` of cases). 
Let\'s +increase `max_depth` to `10`: + +``` +rf_model5 = RandomForestClassifier(random_state=1, \ + n_estimators=50, \ + max_depth=10) +rf_model5.fit(X_train, y_train) +preds5 = rf_model5.predict(X_train) +test_preds5 = rf_model5.predict(X_test) +print(accuracy_score(y_train, preds5)) +print(accuracy_score(y_test, test_preds5)) +``` +![Caption: Accuracy scores for the training and testing sets and a +max\_depth of 10 ](./images/B15019_04_23.jpg) + +Caption: Accuracy scores for the training and testing sets and a +max\_depth of 10 + +The accuracy of the training set increased and is relatively close to +the testing set. We are starting to get some good results, but the model +is still slightly overfitting. Now we will see the results for +`max_depth = 50`: + +``` +rf_model6 = RandomForestClassifier(random_state=1, \ + n_estimators=50, \ + max_depth=50) +rf_model6.fit(X_train, y_train) +preds6 = rf_model6.predict(X_train) +test_preds6 = rf_model6.predict(X_test) +print(accuracy_score(y_train, preds6)) +print(accuracy_score(y_test, test_preds6)) +``` + +The output will be as follows: + +![Caption: Accuracy scores for the training and testing sets and a +max\_depth of 50 ](./images/B15019_04_24.jpg) + +Caption: Accuracy scores for the training and testing sets and a +max\_depth of 50 + +The accuracy jumped to `0.99` for the training set but it +didn\'t improve much for the testing set. So, the model is overfitting +with `max_depth = 50`. It seems the sweet spot to get good +predictions and not much overfitting is around `10` for the +`max_depth` hyperparameter in this dataset. + + + +Exercise 4.03: Tuning max\_depth to Reduce Overfitting +------------------------------------------------------ + +In this exercise, we will keep tuning our RandomForest classifier that +predicts animal type by trying two different values for the +`max_depth` hyperparameter: + +We will be using the same zoo dataset as in the previous exercise. + +1. Open a new Colab notebook. + +2. Import the `pandas` package, `train_test_split`, + `RandomForestClassifier`, and `accuracy_score` + from `sklearn`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestClassifier + from sklearn.metrics import accuracy_score + ``` + + +3. Create a variable called `file_url` that contains the URL + to the dataset: + ``` + file_url = 'https://raw.githubusercontent.com'\ + 'fenago/data-science'\ + '/master/Lab04/Dataset'\ + '/openml_phpZNNasq.csv' + ``` + + +4. Load the dataset into a DataFrame using the `.read_csv()` + method from `pandas`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Remove the `animal` column using `.drop()` and + then extract the `type` target variable into a new + variable called `y` using `.pop()`: + ``` + df.drop(columns='animal', inplace=True) + y = df.pop('type') + ``` + + +6. Split the data into training and testing sets with + `train_test_split()` and the parameters + `test_size=0.4` and `random_state=188`: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.4, \ + random_state=188) + ``` + + +7. Instantiate `RandomForestClassifier` with + `random_state=42`, `n_estimators=30`, and + `max_depth=5`, and then fit the model with the training + set: + + ``` + rf_model = RandomForestClassifier(random_state=42, \ + n_estimators=30, \ + max_depth=5) + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_25.jpg) + + + Caption: Logs of RandomForest + +8. 
Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds` and `test_preds`: + ``` + train_preds = rf_model.predict(X_train) + test_preds = rf_model.predict(X_test) + ``` + + +9. Calculate the accuracy score for the training and testing sets and + save the results in two new variables called `train_acc` + and `test_acc`: + ``` + train_acc = accuracy_score(y_train, train_preds) + test_acc = accuracy_score(y_test, test_preds) + ``` + + +10. Print the accuracy scores: `train_acc` and + `test_acc`: + + ``` + print(train_acc) + print(test_acc) + ``` + + + You should get the following output: + + +![](./images/B15019_04_26.jpg) + + + Caption: Accuracy scores for the training and testing sets + + We got the exact same accuracy scores as for the best result we + obtained in the previous exercise. This value for the + `max_depth` hyperparameter hasn\'t impacted the model\'s + performance. + +11. Instantiate another `RandomForestClassifier` with + `random_state=42`, `n_estimators=30`, and + `max_depth=2`, and then fit the model with the training + set: + + ``` + rf_model2 = RandomForestClassifier(random_state=42, \ + n_estimators=30, \ + max_depth=2) + rf_model2.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_27.jpg) + + + Caption: Logs of RandomForestClassifier with max\_depth = 2 + +12. Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds2 `and `test_preds2`: + ``` + train_preds2 = rf_model2.predict(X_train) + test_preds2 = rf_model2.predict(X_test) + ``` + + +13. Calculate the accuracy scores for the training and testing sets and + save the results in two new variables called `train_acc2` + and `test_acc2`: + ``` + train_acc2 = accuracy_score(y_train, train_preds2) + test_acc2 = accuracy_score(y_test, test_preds2) + ``` + + +14. Print the accuracy scores: `train_acc` and + `test_acc`: + + ``` + print(train_acc2) + print(test_acc2) + ``` + + + You should get the following output: + + +![](./images/B15019_04_28.jpg) + + + + +Minimum Sample in Leaf +====================== + + +It would be great if we could let the model know to not create such +specific rules that happen quite infrequently. Luckily, +`RandomForest` has such a hyperparameter and, you guessed it, +it is `min_samples_leaf`. This hyperparameter will specify the +minimum number of observations (or samples) that will have to fall under +a leaf node to be considered in the tree. For instance, if we set +`min_samples_leaf` to `3`, then +`RandomForest` will only consider a split that leads to at +least three observations on both the left and right leaf nodes. If this +condition is not met for a split, the model will not consider it and +will exclude it from the tree. The default value in `sklearn` +for this hyperparameter is `1`. 
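Before trying different values, it can be instructive to see how this hyperparameter changes the size of the trees that are built. The snippet below is a small illustrative sketch, not part of the original example: it assumes the `X_train` and `y_train` variables from the Activity Recognition dataset are still in memory, and that you are running scikit-learn 0.21 or later, where each fitted tree exposes a `get_n_leaves()` method:

```
from sklearn.ensemble import RandomForestClassifier
import numpy as np

for leaf_size in [1, 25]:
    model = RandomForestClassifier(random_state=1, n_estimators=50, \
                                   min_samples_leaf=leaf_size)
    model.fit(X_train, y_train)
    # average number of leaf nodes across the 50 trees in the forest
    avg_leaves = np.mean([tree.get_n_leaves() \
                          for tree in model.estimators_])
    print(f"min_samples_leaf={leaf_size}: "
          f"{avg_leaves:.0f} leaves per tree on average")
```

You should see far fewer leaves per tree with the higher value, which means the trees are built from broader, less specific rules.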
Let\'s try to find the optimal +value for `min_samples_leaf` for the Activity Recognition +dataset: + +``` +rf_model7 = RandomForestClassifier(random_state=1, \ + n_estimators=50, \ + max_depth=10, \ + min_samples_leaf=3) +rf_model7.fit(X_train, y_train) +preds7 = rf_model7.predict(X_train) +test_preds7 = rf_model7.predict(X_test) +print(accuracy_score(y_train, preds7)) +print(accuracy_score(y_test, test_preds7)) +``` + +The output will be as follows: + +![](./images/B15019_04_29.jpg) + +Caption: Accuracy scores for the training and testing sets for +min\_samples\_leaf=3 + +With `min_samples_leaf=3`, the accuracy for both the training +and testing sets didn\'t change much compared to the best model we found +in the previous section. Let\'s try increasing it to `10`: + +``` +rf_model8 = RandomForestClassifier(random_state=1, \ + n_estimators=50, \ + max_depth=10, \ + min_samples_leaf=10) +rf_model8.fit(X_train, y_train) +preds8 = rf_model8.predict(X_train) +test_preds8 = rf_model8.predict(X_test) +print(accuracy_score(y_train, preds8)) +print(accuracy_score(y_test, test_preds8)) +``` + +The output will be as follows: + +![Caption: Accuracy scores for the training and testing sets for +min\_samples\_leaf=10 ](./images/B15019_04_30.jpg) + +Caption: Accuracy scores for the training and testing sets for +min\_samples\_leaf=10 + +Now the accuracy of the training set dropped a bit but increased for the +testing set and their difference is smaller now. So, our model is +overfitting less. Let\'s try another value for this hyperparameter -- +`25`: + +``` +rf_model9 = RandomForestClassifier(random_state=1, \ + n_estimators=50, \ + max_depth=10, \ + min_samples_leaf=25) +rf_model9.fit(X_train, y_train) +preds9 = rf_model9.predict(X_train) +test_preds9 = rf_model9.predict(X_test) +print(accuracy_score(y_train, preds9)) +print(accuracy_score(y_test, test_preds9)) +``` + +The output will be as follows: + +![Caption: Accuracy scores for the training and testing sets for +min\_samples\_leaf=25 ](./images/B15019_04_31.jpg) + +Caption: Accuracy scores for the training and testing sets for +min\_samples\_leaf=25 + +Both accuracies for the training and testing sets decreased but they are +quite close to each other now. So, we will keep this value +(`25`) as the optimal one for this dataset as the performance +is still OK and we are not overfitting too much. + +When choosing the optimal value for this hyperparameter, you need to be +careful: a value that\'s too low will increase the chance of the model +overfitting, but on the other hand, setting a very high value will lead +to underfitting (the model will not accurately predict the right +outcome). + +For instance, if you have a dataset of `1000` rows, if you set +`min_samples_leaf` to `400`, then the model will not +be able to find good splits to predict `5` different classes. +In this case, the model can only create one single split and the model +will only be able to predict two different classes instead of +`5`. It is good practice to start with low values first and +then progressively increase them until you reach satisfactory +performance. + + + +Exercise 4.04: Tuning min\_samples\_leaf +---------------------------------------- + +In this exercise, we will keep tuning our Random Forest classifier that +predicts animal type by trying two different values for the +`min_samples_leaf` hyperparameter: + +We will be using the same zoo dataset as in the previous exercise. + +1. Open a new Colab notebook. + +2. 
Import the `pandas` package, `train_test_split`, + `RandomForestClassifier`, and `accuracy_score` + from `sklearn`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestClassifier + from sklearn.metrics import accuracy_score + ``` + + +3. Create a variable called `file_url` that contains the URL + to the dataset: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab04/Dataset/openml_phpZNNasq.csv' + ``` + + +4. Load the dataset into a DataFrame using the `.read_csv()` + method from `pandas`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Remove the `animal` column using `.drop()` and + then extract the `type` target variable into a new + variable called `y` using `.pop()`: + ``` + df.drop(columns='animal', inplace=True) + y = df.pop('type') + ``` + + +6. Split the data into training and testing sets with + `train_test_split()` and the parameters + `test_size=0.4` and `random_state=188`: + ``` + X_train, X_test, \ + y_train, y_test = train_test_split(df, y, test_size=0.4, \ + random_state=188) + ``` + + +7. Instantiate `RandomForestClassifier` with + `random_state=42`, `n_estimators=30`, + `max_depth=2`, and `min_samples_leaf=3`, and + then fit the model with the training set: + + ``` + rf_model = RandomForestClassifier(random_state=42, \ + n_estimators=30, \ + max_depth=2, \ + min_samples_leaf=3) + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_32.jpg) + + + Caption: Logs of RandomForest + +8. Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds` and `test_preds`: + ``` + train_preds = rf_model.predict(X_train) + test_preds = rf_model.predict(X_test) + ``` + + +9. Calculate the accuracy score for the training and testing sets and + save the results in two new variables called `train_acc` + and `test_acc`: + ``` + train_acc = accuracy_score(y_train, train_preds) + test_acc = accuracy_score(y_test, test_preds) + ``` + + +10. Print the accuracy score -- `train_acc` and + `test_acc`: + + ``` + print(train_acc) + print(test_acc) + ``` + + + You should get the following output: + + +![](./images/B15019_04_33.jpg) + + + Caption: Accuracy scores for the training and testing sets + + The accuracy score decreased for both the training and testing sets + compared to the best result we got in the previous exercise. Now the + difference between the training and testing sets\' accuracy scores + is much smaller so our model is overfitting less. + +11. Instantiate another `RandomForestClassifier` with + `random_state=42`, `n_estimators=30`, + `max_depth=2`, and `min_samples_leaf=7`, and + then fit the model with the training set: + + ``` + rf_model2 = RandomForestClassifier(random_state=42, \ + n_estimators=30, \ + max_depth=2, \ + min_samples_leaf=7) + rf_model2.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_34.jpg) + + + Caption: Logs of RandomForest with max\_depth=2 + +12. Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds2` and `test_preds2`: + ``` + train_preds2 = rf_model2.predict(X_train) + test_preds2 = rf_model2.predict(X_test) + ``` + + +13. 
Calculate the accuracy score for the training and testing sets and + save the results in two new variables called `train_acc2` + and `test_acc2`: + ``` + train_acc2 = accuracy_score(y_train, train_preds2) + test_acc2 = accuracy_score(y_test, test_preds2) + ``` + + +14. Print the accuracy scores: `train_acc` and + `test_acc`: + + ``` + print(train_acc2) + print(test_acc2) + ``` + + + You should get the following output: + + +![](./images/B15019_04_35.jpg) + + + + +Maximum Features +================ + + +We are getting close to the end of this lab. You have already +learned how to tune several of the most important hyperparameters for +RandomForest. In this section, we will present you with another +extremely important one: `max_features`. + +Earlier, we learned that `RandomForest` builds multiple trees +and takes the average to make predictions. This is why it is called a +forest, but we haven\'t really discussed the \"random\" part yet. Going +through this lab, you may have asked yourself: how does building +multiple trees help to get better predictions, and won\'t all the trees +look the same given that the input data is the same? + +Before answering these questions, let\'s use the analogy of a court +trial. In some countries, the final decision of a trial is either made +by a judge or a jury. A judge is a person who knows the law in detail +and can decide whether a person has broken the law or not. On the other +hand, a jury is composed of people from different backgrounds who don\'t +know each other or any of the parties involved in the trial and have +limited knowledge of the legal system. In this case, we are asking +random people who are not expert in the law to decide the outcome of a +case. This sounds very risky at first. The risk of one person making the +wrong decision is very high. But in fact, the risk of 10 or 20 people +all making the wrong decision is relatively low. + +But there is one condition that needs to be met for this to work: +randomness. If all the people in the jury come from the same background, +work in the same industry, or live in the same area, they may share the +same way of thinking and make similar decisions. For instance, if a +group of people were raised in a community where you only drink hot +chocolate at breakfast and one day you ask them if it is OK to drink +coffee at breakfast, they would all say no. + +On the other hand, say you got another group of people from different +backgrounds with different habits: some drink coffee, others tea, a few +drink orange juice, and so on. If you asked them the same question, you +would end up with the majority of them saying yes. Because we randomly +picked these people, they have less bias as a group, and this therefore +lowers the risk of them making a wrong decision. + +RandomForest actually applies the same logic: it builds a number of +trees independently of each other by randomly sampling the data. A tree +may see `60%` of the training data, another one +`70%`, and so on. By doing so, there is a high chance that the +trees are absolutely different from each other and don\'t share the same +bias. This is the secret of RandomForest: building multiple random trees +leads to higher accuracy. + +But it is not the only way RandomForest creates randomness. It does so +also by randomly sampling columns. Each tree will only see a subset of +the features rather than all of them. And this is exactly what the +`max_features` hyperparameter is for: it will set the maximum +number of features a tree is allowed to see. 
In `sklearn`, you can specify the value of this hyperparameter
as:

- The maximum number of features, as an integer.
- A ratio, as the percentage of allowed features.
- The `sqrt` function (the default value in
  `sklearn`, which stands for square root), which will use
  the square root of the number of features as the maximum value. If,
  for a dataset, there are `25` features, its square root
  will be `5` and this will be the value for
  `max_features`.
- The `log2` function, which will use the log base 2 of the
  number of features as the maximum value. If, for a dataset, there
  are eight features, its `log2` will be `3` and
  this will be the value for `max_features`.
- The `None` value, which means Random Forest will use all
  the features available.

Let\'s try three different values on the activity dataset. First, we
will specify the maximum number of features as two:

```
rf_model10 = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    max_depth=10, \
                                    min_samples_leaf=25, \
                                    max_features=2)
rf_model10.fit(X_train, y_train)
preds10 = rf_model10.predict(X_train)
test_preds10 = rf_model10.predict(X_test)
print(accuracy_score(y_train, preds10))
print(accuracy_score(y_test, test_preds10))
```

The output will be as follows:

![Caption: Accuracy scores for the training and testing sets for
max\_features=2 ](./images/B15019_04_36.jpg)

Caption: Accuracy scores for the training and testing sets for
max\_features=2

We got results similar to those of the best model we trained in the
previous section. This is not really surprising as we were using the
default value of `max_features` at that time, which is
`sqrt`. The square root of `6` (the number of
features in this dataset) equals `2.45`, which is quite close
to `2`. This time, let\'s try with the ratio `0.7`:

```
rf_model11 = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    max_depth=10, \
                                    min_samples_leaf=25, \
                                    max_features=0.7)
rf_model11.fit(X_train, y_train)
preds11 = rf_model11.predict(X_train)
test_preds11 = rf_model11.predict(X_test)
print(accuracy_score(y_train, preds11))
print(accuracy_score(y_test, test_preds11))
```

The output will be as follows:

![Caption: Accuracy scores for the training and testing sets for
max\_features=0.7 ](./images/B15019_04_37.jpg)

Caption: Accuracy scores for the training and testing sets for
max\_features=0.7

With this ratio, both accuracy scores increased for the training and
testing sets and the difference between them is less. Our model is
overfitting less now and has slightly improved its predictive power.
Let\'s give it a shot with the `log2` option:

```
rf_model12 = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    max_depth=10, \
                                    min_samples_leaf=25, \
                                    max_features='log2')
rf_model12.fit(X_train, y_train)
preds12 = rf_model12.predict(X_train)
test_preds12 = rf_model12.predict(X_test)
print(accuracy_score(y_train, preds12))
print(accuracy_score(y_test, test_preds12))
```

The output will be as follows:

![Caption: Accuracy scores for the training and testing sets for
max\_features=\'log2\' ](./images/B15019_04_38.jpg)

Caption: Accuracy scores for the training and testing sets for
max\_features=\'log2\'

We got results similar to those for the default value (`sqrt`)
and for `2`. Again, this is normal as the `log2` of
`6` equals `2.58`. So, the optimal value we found
for the `max_features` hyperparameter is `0.7` for
this dataset. 
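Rather than instantiating a new model by hand for every value you want to test, you can run the comparison in a loop. The following is a minimal sketch, not the lab's official method: it assumes the `X_train`, `X_test`, `y_train`, and `y_test` splits of the Activity Recognition dataset are still in memory and reuses the other hyperparameter values chosen earlier (`n_estimators=50`, `max_depth=10`, and `min_samples_leaf=25`):

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for max_feat in [2, 0.7, 'sqrt', 'log2', None]:
    model = RandomForestClassifier(random_state=1, n_estimators=50, \
                                   max_depth=10, min_samples_leaf=25, \
                                   max_features=max_feat)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"max_features={max_feat}: "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

The same pattern will come in handy for the activity at the end of this lab, where you need to evaluate a range of hyperparameter combinations.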
+ + + +Exercise 4.05: Tuning max\_features +----------------------------------- + +In this exercise, we will keep tuning our RandomForest classifier that +predicts animal type by trying two different values for the +`max_features` hyperparameter: + +We will be using the same zoo dataset as in the previous exercise. + +1. Open a new Colab notebook. + +2. Import the `pandas` package, `train_test_split`, + `RandomForestClassifier`, and `accuracy_score` + from `sklearn`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestClassifier + from sklearn.metrics import accuracy_score + ``` + + +3. Create a variable called `file_url` that contains the URL + to the dataset: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab04/Dataset/openml_phpZNNasq.csv' + ``` + + +4. Load the dataset into a DataFrame using the `.read_csv()` + method from `pandas`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Remove the `animal` column using `.drop()` and + then extract the `type` target variable into a new + variable called `y` using `.pop()`: + ``` + df.drop(columns='animal', inplace=True) + y = df.pop('type') + ``` + + +6. Split the data into training and testing sets with + `train_test_split()` and the parameters + `test_size=0.4` and `random_state=188`: + ``` + X_train, X_test, \ + y_train, y_test = train_test_split(df, y, test_size=0.4, \ + random_state=188) + ``` + + +7. Instantiate `RandomForestClassifier` with + `random_state=42`, `n_estimators=30`, + `max_depth=2`, `min_samples_leaf=7`, and + `max_features=10`, and then fit the model with the + training set: + + ``` + rf_model = RandomForestClassifier(random_state=42, \ + n_estimators=30, \ + max_depth=2, \ + min_samples_leaf=7, \ + max_features=10) + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_39.jpg) + + + Caption: Logs of RandomForest + +8. Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds` and `test_preds`: + ``` + train_preds = rf_model.predict(X_train) + test_preds = rf_model.predict(X_test) + ``` + + +9. Calculate the accuracy scores for the training and testing sets and + save the results in two new variables called `train_acc` + and `test_acc`: + ``` + train_acc = accuracy_score(y_train, train_preds) + test_acc = accuracy_score(y_test, test_preds) + ``` + + +10. Print the accuracy scores: `train_acc` and + `test_acc`: + + ``` + print(train_acc) + print(test_acc) + ``` + + + You should get the following output: + + +![](./images/B15019_04_40.jpg) + + + Caption: Accuracy scores for the training and testing sets + +11. Instantiate another `RandomForestClassifier` with + `random_state=42`, `n_estimators=30`, + `max_depth=2`, `min_samples_leaf=7`, and + `max_features=0.2`, and then fit the model with the + training set: + + ``` + rf_model2 = RandomForestClassifier(random_state=42, \ + n_estimators=30, \ + max_depth=2, \ + min_samples_leaf=7, \ + max_features=0.2) + rf_model2.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_04_41.jpg) + + + Caption: Logs of RandomForest with max\_features = 0.2 + +12. 
Make predictions on the training and testing sets with + `.predict()` and save the results into two new variables + called `train_preds2` and `test_preds2`: + ``` + train_preds2 = rf_model2.predict(X_train) + test_preds2 = rf_model2.predict(X_test) + ``` + + +13. Calculate the accuracy score for the training and testing sets and + save the results in two new variables called `train_acc2` + and `test_acc2`: + ``` + train_acc2 = accuracy_score(y_train, train_preds2) + test_acc2 = accuracy_score(y_test, test_preds2) + ``` + + +14. Print the accuracy scores: `train_acc` and + `test_acc`: + + ``` + print(train_acc2) + print(test_acc2) + ``` + + + You should get the following output: + + +![](./images/B15019_04_42.jpg) + + + + + +Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset +--------------------------------------------------------------------- + +You are working for a technology company and they are planning to launch +a new voice assistant product. You have been tasked with building a +classification model that will recognize the letters spelled out by a +user based on the signal frequencies captured. Each sound can be +captured and represented as a signal composed of multiple frequencies. + + +The following steps will help you to complete this activity: + +1. Download and load the dataset using `.read_csv()` from + `pandas`. +2. Extract the response variable using `.pop()` from + `pandas`. +3. Split the dataset into training and test sets using + `train_test_split()` from + `sklearn.model_selection`. +4. Create a function that will instantiate and fit a + `RandomForestClassifier` using `.fit()` from + `sklearn.ensemble`. +5. Create a function that will predict the outcome for the training and + testing sets using `.predict()`. +6. Create a function that will print the accuracy score for the + training and testing sets using `accuracy_score()` from + `sklearn.metrics`. +7. Train and get the accuracy score for a range of different + hyperparameters. Here are some options you can try: + - `n_estimators = 20` and `50` + - `max_depth = 5` and `10` + - `min_samples_leaf = 10` and `50` + - `max_features = 0.5` and `0.3` +8. Select the best hyperparameter value. + +These are the accuracy scores for the best model we trained: + +![](./images/B15019_04_43.jpg) + + + + +Summary +======= + + +We have finally reached the end of this lab on multiclass +classification with Random Forest. We learned that multiclass +classification is an extension of binary classification: instead of +predicting only two classes, target variables can have many more values. +We saw how we can train a Random Forest model in just a few lines of +code and assess its performance by calculating the accuracy score for +the training and testing sets. Finally, we learned how to tune some of +its most important hyperparameters: `n_estimators`, +`max_depth`, `min_samples_leaf`, and +`max_features`. We also saw how their values can have a +significant impact on the predictive power of a model but also on its +ability to generalize to unseen data. diff --git a/lab_guides/Lab_5.md b/lab_guides/Lab_5.md new file mode 100644 index 0000000..25bba0e --- /dev/null +++ b/lab_guides/Lab_5.md @@ -0,0 +1,2228 @@ + +5. Performing Your First Cluster Analysis +========================================= + + + +Overview + +This lab will introduce you to unsupervised learning tasks, where +algorithms have to automatically learn patterns from data by themselves +as no target variables are defined beforehand. 
We will focus +specifically on the k-means algorithm, and see how to standardize and +process data for use in cluster analysis. + +By the end of this lab, you will be able to load and visualize data +and clusters with scatter plots; prepare data for cluster analysis; +perform centroid clustering with k-means; interpret clustering results +and determine the optimal number of clusters for a given dataset. + + +Clustering with k-means +======================= + + +We will perform cluster analysis on this dataset for two specific +variables (or columns): `Average net tax` and +`Average total deductions`. Our objective is to find groups +(or clusters) of postcodes sharing similar patterns in terms of tax +received and money deducted. Here is a scatter plot of these two +variables: + +![](./images/B15019_05_03.jpg) + +Caption: Scatter plot of the ATO dataset + + + +Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset +--------------------------------------------------------------------------- + +In this exercise, we will be using k-means clustering on the ATO dataset +and observing the different clusters that the dataset divides itself +into, after which we will conclude by analyzing the output: + +1. Open a new Colab notebook. + +2. Next, load the required Python packages: `pandas` and + `KMeans` from `sklearn.cluster`. + + We will be using the `import` function from Python: + + Note + + You can create short aliases for the packages you will be calling + quite often in your script with the function mentioned in the + following code snippet. + + ``` + import pandas as pd + from sklearn.cluster import KMeans + ``` + + + Note + + We will be looking into `KMeans` (from + `sklearn.cluster`), which you have used in the code here, + later in the lab for a more detailed explanation of it. + +3. Next, create a variable containing the link to the file. We will + call this variable `file_url`: + + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab05/DataSet/taxstats2015.csv' + ``` + + + In the next step, we will use the `pandas` package to load + our data into a DataFrame (think of it as a table, like on an Excel + spreadsheet, with a row index and column names). + + Our input file is in `CSV` format, and `pandas` + has a method that can directly read this format, which is + `.read_csv()`. + +4. Use the `usecols` parameter to subset only the columns we + need rather than loading the entire dataset. We just need to provide + a list of the column names we are interested in, which are mentioned + in the following code snippet: + + ``` + df = pd.read_csv(file_url, \ + usecols=['Postcode', \ + 'Average net tax', \ + 'Average total deductions']) + ``` + + + Now we have loaded the data into a `pandas` DataFrame. + +5. Next, let\'s display the first 5 rows of this DataFrame , using the + method `.head()`: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_05_04.jpg) + + + Caption: The first five rows of the ATO DataFrame + +6. Now, to output the last 5 rows, we use `.tail()`: + + ``` + df.tail() + ``` + + + You should get the following output: + + +![](./images/B15019_05_05.jpg) + + + Caption: The last five rows of the ATO DataFrame + + Now that we have our data, let\'s jump straight to what we want to + do: find clusters. 
+ + As you saw in the previous labs, `sklearn` provides + the exact same APIs for training different machine learning + algorithms, such as: + + - Instantiate an algorithm with the specified hyperparameters + (here it will be KMeans(hyperparameters)). + + - Fit the model with the training data with the method + `.fit()`. + + - Predict the result with the given input data with the method + `.predict()`. + + Note + + Here, we will use all the default values for the k-means + hyperparameters except for the `random_state` one. + Specifying a fixed random state (also called a **seed**) will + help us to get reproducible results every time we have to rerun + our code. + +7. Instantiate k-means with a random state of `42` and save + it into a variable called `kmeans`: + ``` + kmeans = KMeans(random_state=42) + ``` + + +8. Now feed k-means with our training data. To do so, we need to get + only the variables (or columns) used for fitting the model. In our + case, the variables are `'Average net tax'` and + `'Average total deductions'`, and they are saved in a new + variable called `X`: + ``` + X = df[['Average net tax', 'Average total deductions']] + ``` + + +9. Now fit `kmeans` with this training data: + + ``` + kmeans.fit(X) + ``` + + + You should get the following output: + + +![](./images/B15019_05_06.jpg) + + + Caption: Summary of the fitted kmeans and its hyperparameters + + We just ran our first clustering algorithm in just a few lines of + code. + +10. See which cluster each data point belongs to by using the + `.predict()` method: + + ``` + y_preds = kmeans.predict(X) + y_preds + ``` + + + You should get the following output: + + +![](./images/B15019_05_07.jpg) + + + Caption: Output of the k-means predictions + + Note + + Although we set a `random_state` value, you may still get + an output with different cluster numbers than the one shown above. + This will depend on the version of scikit-learn you are using. The + output above was generated using version 0.22.2. You can find out + which version you are using by executing the following code: + + `import sklearn` + + `sklearn.__version__` + +11. Now, add these predictions into the original DataFrame and take a + look at the first five postcodes: + + ``` + df['cluster'] = y_preds + df.head() + ``` + + + Note + + The predictions from the sklearn `predict()` method are in + the exact same order as the input data. So, the first prediction + will correspond to the first row of your DataFrame. + + You should get the following output: + + +![](./images/B15019_05_08.jpg) + + +Caption: Cluster number assigned to the first five postcodes + + +Interpreting k-means Results +============================ + + +After training our k-means algorithm, we will likely be interested in +analyzing its results in more detail. Remember, the objective of cluster +analysis is to group observations with similar patterns together. But +how can we see whether the groupings found by the algorithm are +meaningful? We will be looking at this in this section by using the +dataset results we just generated. + +One way of investigating this is to analyze the dataset row by row with +the assigned cluster for each observation. This can be quite tedious, +especially if the size of your dataset is quite big, so it would be +better to have a kind of summary of the cluster results. + +If you are familiar with Excel spreadsheets, you are probably thinking +about using a pivot table to get the average of the variables for each +cluster. 
In SQL, you would have probably used a `GROUP BY` +statement. If you are not familiar with either of these, you may think +of grouping each cluster together and then calculating the average for +each of them. The good news is that this can be easily achieved with the +`pandas` package in Python. Let\'s see how this can be done +with an example. + +To create a pivot table similar to an Excel one, we will be using the +`pivot_table()` method from `pandas`. We need to +specify the following parameters for this method: + +- `values`: This parameter corresponds to the numerical + columns you want to calculate summaries for (or aggregations), such + as getting averages or counts. In an Excel pivot table, it is also + called `values`. In our dataset, we will use the + `Average net tax` and `Average total deductions` + variables. + +- `index`: This parameter is used to specify the columns you + want to see summaries for. In our case, it will be the + `cluster` column. In a pivot table in Excel, this + corresponds with the `Rows` field. + +- `aggfunc`: This is where you will specify the aggregation + functions you want to summarize the data with, such as getting + averages or counts. In Excel, this is the `Summarize by` + option in the `values` field. An example of how to use the + `aggfunc` method is shown below. + + Note + + Run the code below in the same notebook as you used for the previous + exercise. + +``` +import numpy as np +df.pivot_table(values=['Average net tax', \ + 'Average total deductions'], \ + index='cluster', aggfunc=np.mean) +``` +Note + +We will be using the `numpy` implementation of +`mean()` as it is more optimized for pandas DataFrames. + +![](./images/B15019_05_09.jpg) + +Caption: Output of the pivot\_table function + +In this summary, we can see that the algorithm has grouped the data into +eight clusters (clusters 0 to 7). Cluster 0 has the lowest average net +tax and total deductions amounts among all the clusters, while cluster 4 +has the highest values. With this pivot table, we are able to compare +clusters between them using their summarized values. + +Using an aggregated view of clusters is a good way of seeing the +difference between them, but it is not the only way. Another possibility +is to visualize clusters in a graph. This is exactly what we are going +to do now. + +You may have heard of different visualization packages, such as +`matplotlib`, `seaborn`, and `bokeh`, but +in this lab, we will be using the `altair` package because +it is quite simple to use (its API is very similar to +`sklearn`). Let\'s import it first: + +``` +import altair as alt +``` + +Then, we will instantiate a `Chart()` object with our +DataFrame and save it into a variable called `chart`: + +``` +chart = alt.Chart(df) +``` +Now we will specify the type of graph we want, a scatter plot, with the +`.mark_circle()` method and will save it into a new variable +called `scatter_plot`: + +``` +scatter_plot = chart.mark_circle() +``` +Finally, we need to configure our scatter plot by specifying the names +of the columns that will be our `x`- and `y`-axes on +the graph. We also tell the scatter plot to color each point according +to its cluster value with the `color` option: + +``` +scatter_plot.encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster:N') +``` +Note + +You may have noticed that we added `:N` at the end of the +`cluster` column name. This extra parameter is used in +`altair` to specify the type of value for this column. 
+`:N` means the information contained in this column is +categorical. `altair` automatically defines the color scheme +to be used depending on the type of a column. + +You should get the following output: + +![](./images/B15019_05_10.jpg) + +Caption: Scatter plot of the clusters + + + +Let\'s say we want to add a tooltip that will display the values for the +two columns of interest: the postcode and the assigned cluster. With +`altair`, we just need to add a parameter called +`tooltip` in the `encode()` method with a list of +corresponding column names and call the `interactive()` method +just after, as seen in the following code snippet: + +``` +scatter_plot.encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster:N', \ + tooltip=['Postcode', \ + 'cluster', 'Average net tax', \ + 'Average total deductions'])\ + .interactive() +``` +You should get the following output: + +![](./images/B15019_05_11.jpg) + +Caption: Interactive scatter plot of the clusters with tooltip + +Now we can easily hover over and inspect the data points near the +cluster boundaries and find out that the threshold used to differentiate +the purple cluster (6) from the red one (2) is close to 32,000 in +`'Average Net Tax'`. + + + +Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses +------------------------------------------------------------------------------ + +In this exercise, we will learn how to perform clustering analysis with +k-means and visualize its results based on postcode values sorted by +business income and expenses. The following steps will help you complete +this exercise: + +1. Open a new Colab notebook for this exercise. + +2. Now `import` the required packages (`pandas`, + `sklearn`, `altair`, and `numpy`): + ``` + import pandas as pd + from sklearn.cluster import KMeans + import altair as alt + import numpy as np + ``` + + +3. Assign the link to the ATO dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab05/DataSet/taxstats2015.csv' + ``` + + +4. Using the `read_csv` method from the pandas package, load + the dataset with only the following columns with the + `use_cols` parameter: `'Postcode'`, + `'Average total business income'`, and + `'Average total business expenses'`: + ``` + df = pd.read_csv(file_url, \ + usecols=['Postcode', \ + 'Average total business income', \ + 'Average total business expenses']) + ``` + + +5. Display the last 10 rows from the ATO dataset using the + `.tail()` method from pandas: + + ``` + df.tail(10) + ``` + + + You should get the following output: + + +![](./images/B15019_05_12.jpg) + + + Caption: The last 10 rows of the ATO dataset + +6. Extract the `'Average total business income'` and + `'Average total business expenses'` columns using the + following pandas column subsetting syntax: + `dataframe_name[]`. Then, save them into + a new variable called `X`: + ``` + X = df[['Average total business income', \ + 'Average total business expenses']] + ``` + + +7. Now fit `kmeans` with this new variable using a value of + `8` for the `random_state` hyperparameter: + + ``` + kmeans = KMeans(random_state=8) + kmeans.fit(X) + ``` + + + You should get the following output: + + +![](./images/B15019_05_13.jpg) + + + Caption: Summary of the fitted kmeans and its hyperparameters + +8. 
Using the `predict` method from the `sklearn` + package, predict the clustering assignment from the input variable, + `(X)`, save the results into a new variable called + `y_preds`, and display the last `10` + predictions: + + ``` + y_preds = kmeans.predict(X) + y_preds[-10:] + ``` + + + You should get the following output: + + +![Caption: Results of the clusters assigned to the last 10 + observations ](./images/B15019_05_14.jpg) + + + Caption: Results of the clusters assigned to the last 10 + observations + +9. Save the predicted clusters back to the DataFrame by creating a new + column called `'cluster'` and print the last + `10` rows of the DataFrame using the `.tail()` + method from the `pandas` package: + + ``` + df['cluster'] = y_preds + df.tail(10) + ``` + + + You should get the following output: + + +![Caption: The last 10 rows of the ATO dataset with the added + cluster column ](./images/B15019_05_15.jpg) + + + Caption: The last 10 rows of the ATO dataset with the added + cluster column + +10. Generate a pivot table with the averages of the two columns for each + cluster value using the `pivot_table` method from the + `pandas` package with the following parameters: + + Provide the names of the columns to be aggregated, + `'Average total business income'` + and` 'Average total business expenses'`, to the parameter + values. + + Provide the name of the column to be grouped, `'cluster'`, + to the parameter index. + + Use the `.mean` method from NumPy (`np`) as the + aggregation function for the `aggfunc` parameter: + + ``` + df.pivot_table(values=['Average total business income', \ + 'Average total business expenses'], \ + index='cluster', aggfunc=np.mean) + ``` + + + You should get the following output: + + +![](./images/B15019_05_16.jpg) + + + Caption: Output of the pivot\_table function + +11. Now let\'s plot the clusters using an interactive scatter plot. + First, use `Chart()` and `mark_circle()` from + the `altair` package to instantiate a scatter plot graph: + ``` + scatter_plot = alt.Chart(df).mark_circle() + ``` + + +12. Use the `encode` and `interactive` methods from + `altair` to specify the display of the scatter plot and + its interactivity options with the following parameters: + + Provide the name of the `'Average total business income'` + column to the `x` parameter (the x-axis). + + Provide the name of the + `'Average total business expenses'` column to the + `y` parameter (the y-axis). + + Provide the name of the `cluster:N` column to the + `color` parameter (providing a different color for each + group). + + Provide these column names -- `'Postcode'`, + `'cluster'`, `'Average total business income'`, + and `'Average total business expenses'` -- to the + `'tooltip'` parameter (this being the information + displayed by the tooltip): + + ``` + scatter_plot.encode(x='Average total business income', \ + y='Average total business expenses', \ + color='cluster:N', tooltip = ['Postcode', \ + 'cluster', \ + 'Average total business income', \ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + + +![](./images/B15019_05_17.jpg) + + +Caption: Interactive scatter plot of the clusters + + + +Choosing the Number of Clusters +=============================== + + +In the previous sections, we saw how easy it is to fit the k-means +algorithm on a given dataset. In our ATO dataset, we found 8 different +clusters that were mainly defined by the values of the +`Average net tax` variable. 
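If you want to double-check how many groups the model actually produced and
how many postcodes ended up in each of them, a quick sanity check such as the
following can help. This is only a sketch: it assumes the fitted `kmeans`
model and the `df` DataFrame with its `'cluster'` column from the previous
steps are still in memory.

```
# Number of centroids found by the model (equal to the n_clusters
# hyperparameter, which defaults to 8 in scikit-learn)
print(kmeans.cluster_centers_.shape[0])

# Number of postcodes assigned to each cluster
print(df['cluster'].value_counts().sort_index())
```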
But you may have asked yourself: \"*Why 8 clusters? Why not 3 or 15
clusters?*\" These are indeed excellent questions. The short answer is
that we used the default value of k-means\' `n_clusters`
hyperparameter, which defines the number of clusters to be found, and
that default is 8.

As you will recall from *Lab 2*, *Regression*, and *Lab 4*,
*Multiclass Classification with RandomForest*, the value of a
hyperparameter isn\'t learned by the algorithm but has to be set
arbitrarily by you prior to training. For k-means, `n_clusters`
is one of the most important hyperparameters you will have to tune.
Choosing a low value will lead k-means to group many data points
together, even though they are very different from each other. On the
other hand, choosing a high value may force the algorithm to split close
observations into multiple ones, even though they are very similar.

Looking at the scatter plot from the ATO dataset, eight clusters seems
to be a lot. On the graph, some of the clusters look very close to each
other and have similar values. Intuitively, just by looking at the plot,
you could have said that there were between two and four different
clusters. As you can see, this is quite subjective, and it would be
great if there were a function that could help us to define the right
number of clusters for a dataset. Such a method does indeed exist, and
it is called the **Elbow** method.

This method assesses the compactness of clusters, the objective being to
minimize a value known as **inertia**. More details and an explanation
about this will be provided later in this lab. For now, think of
inertia as a value that measures, for a group of data points, how close
to or how far from each other they are.

Let\'s apply this method to our ATO dataset. First, we will define the
range of cluster numbers we want to evaluate (from 1 to 9) and save
them in a DataFrame called `clusters`. We will also create an
empty list called `inertia`, where we will store our
calculated values.

Note

Open the notebook you were using for *Exercise 5.01*, *Performing Your
First Clustering Analysis on the ATO Dataset*, execute the code you
already entered, and then continue at the end of the notebook with the
following code.

```
clusters = pd.DataFrame()
clusters['cluster_range'] = range(1, 10)
inertia = []
```
Next, we will create a `for` loop that will iterate over the
range, fit a k-means model with the specified number of
`clusters`, extract the `inertia` value, and store
it in our list, as in the following code snippet:

```
for k in clusters['cluster_range']:
    kmeans = KMeans(n_clusters=k, random_state=8).fit(X)
    inertia.append(kmeans.inertia_)
```
Now we can use our list of `inertia` values in the
`clusters` DataFrame:

```
clusters['inertia'] = inertia
clusters
```
You should get the following output:

![](./images/B15019_05_18.jpg)

Caption: DataFrame containing inertia values for our clusters

Then, we need to plot a line chart using `altair` with the
`mark_line()` method. 
We will specify the +`'cluster_range'` column as our x-axis and +`'inertia'` as our y-axis, as in the following code snippet: + +``` +alt.Chart(clusters).mark_line()\ + .encode(x='cluster_range', y='inertia') +``` +You should get the following output: + +![](./images/B15019_05_19.jpg) + +Caption: Plotting the Elbow method + +Note + +You don\'t have to save each of the `altair` objects in a +separate variable; you can just append the methods one after the other +with \"`.".` + +Now that we have plotted the inertia value against the number of +clusters, we need to find the optimal number of clusters. What we need +to do is to find the inflection point in the graph, where the inertia +value starts to decrease more slowly (that is, where the slope of the +line almost reaches a 45-degree angle). Finding the right **inflection +point** can be a bit tricky. If you picture this line chart as an arm, +what we want is to find the center of the Elbow (now you know where the +name for this method comes from). So, looking at our example, we will +say that the optimal number of clusters is three. If we kept adding more +clusters, the inertia would not decrease drastically and add any value. +This is the reason why we want to find the middle of the Elbow as the +inflection point. + +Now let\'s retrain our `Kmeans` with this hyperparameter and +plot the clusters as shown in the following code snippet: + +``` +kmeans = KMeans(random_state=42, n_clusters=3) +kmeans.fit(X) +df['cluster2'] = kmeans.predict(X) +scatter_plot.encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster2:N', \ + tooltip=['Postcode', 'cluster', \ + 'Average net tax', \ + 'Average total deductions'])\ + .interactive() +``` +You should get the following output: + +![](./images/B15019_05_20.jpg) + +Caption: Scatter plot of the three clusters + +This is very different compared to our initial results. Looking at the +three clusters, we can see that: + +- The first cluster (red) represents postcodes with low values for + both average net tax and total deductions. + +- The second cluster (blue) is for medium average net tax and low + average total deductions. + +- The third cluster (orange) is grouping all postcodes with average + net tax values above 35,000. + + Note + + It is worth noticing that the data points are more spread in the + third cluster; this may indicate that there are some outliers in + this group. + +This example showed us how important it is to define the right number of +clusters before training a k-means algorithm if we want to get +meaningful groups from data. We used a method called the Elbow method to +find this optimal number. + + + +Exercise 5.03: Finding the Optimal Number of Clusters +----------------------------------------------------- + +In this exercise, we will apply the Elbow method to the same data as in +*Exercise 5.02*, *Clustering Australian Postcodes by Business Income and +Expenses*, to find the optimal number of clusters, before fitting a +k-means model: + +1. Open a new Colab notebook for this exercise. + +2. Now `import` the required packages (`pandas`, + `sklearn`, and `altair`): + + ``` + import pandas as pd + from sklearn.cluster import KMeans + import altair as alt + ``` + + + Next, we will load the dataset and select the same columns as in + *Exercise 5.02*, *Clustering Australian Postcodes by Business Income + and Expenses*, and print the first five rows. + +3. 
Assign the link to the ATO dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab05/DataSet/taxstats2015.csv' + ``` + + +4. Using the `.read_csv()` method from the pandas package, + load the dataset with only the following columns using the + `use_cols` parameter: `'Postcode'`, + `'Average total business income'`, and + `'Average total business expenses'`: + ``` + df = pd.read_csv(file_url, \ + usecols=['Postcode', \ + 'Average total business income', \ + 'Average total business expenses']) + ``` + + +5. Display the first five rows of the DataFrame with the + `.head()` method from the pandas package: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_05_21.jpg) + + + Caption: The first five rows of the ATO DataFrame + +6. Assign the `'Average total business income'` and + `'Average total business expenses'` columns to a new + variable called `X`: + ``` + X = df[['Average total business income', \ + 'Average total business expenses']] + ``` + + +7. Create an empty pandas DataFrame called `clusters` and an + empty list called `inertia`: + + ``` + clusters = pd.DataFrame() + inertia = [] + ``` + + + Now, use the `range` function to generate a list + containing the range of cluster numbers, from `1` to + `15`, and assign it to a new column called + `'cluster_range'` from the `'clusters'` + DataFrame: + + ``` + clusters['cluster_range'] = range(1, 15) + ``` + + +8. Create a `for` loop to go through each cluster number and + fit a k-means model accordingly, then append the `inertia` + values using the `'inertia_'` parameter with the + `'inertia'` list: + ``` + for k in clusters['cluster_range']: + kmeans = KMeans(n_clusters=k).fit(X) + inertia.append(kmeans.inertia_) + ``` + + +9. Assign the `inertia` list to a new column called + `'inertia'` from the `clusters` DataFrame and + display its content: + + ``` + clusters['inertia'] = inertia + clusters + ``` + + + You should get the following output: + + +![](./images/B15019_05_22.jpg) + + + Caption: Plotting the Elbow method + +10. Now use `mark_line()` and `encode()` from the + `altair` package to plot the Elbow graph with + `'cluster_range'` as the x-axis and `'inertia'` + as the y-axis: + + ``` + alt.Chart(clusters).mark_line()\ + .encode(alt.X('cluster_range'), alt.Y('inertia')) + ``` + + + You should get the following output: + + +![](./images/B15019_05_23.jpg) + + + Caption: Plotting the Elbow method + +11. Looking at the Elbow plot, identify the optimal number of clusters, + and assign this value to a variable called + `optim_cluster`: + ``` + optim_cluster = 4 + ``` + + +12. Train a k-means model with this number of clusters and a + `random_state` value of `42` using the + `fit` method from `sklearn`: + ``` + kmeans = KMeans(random_state=42, n_clusters=optim_cluster) + kmeans.fit(X) + ``` + + +13. Now, using the `predict` method from `sklearn`, + get the predicted assigned cluster for each data point contained in + the `X` variable and save the results into a new column + called `'cluster2'` from the `df` DataFrame: + ``` + df['cluster2'] = kmeans.predict(X) + ``` + + +14. Display the first five rows of the `df` DataFrame using + the `head` method from the `pandas` package: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_05_24.jpg) + + + Caption: The first five rows with the cluster predictions + +15. 
Now plot the scatter plot using the `mark_circle()` and + `encode()` methods from the `altair` package. + Also, to add interactiveness, use the `tooltip` parameter + and the `interactive()` method from the `altair` + package as shown in the following code snippet: + + ``` + alt.Chart(df).mark_circle()\ + .encode\ + (x='Average total business income', \ + y='Average total business expenses', \ + color='cluster2:N', \ + tooltip=['Postcode', 'cluster2', \ + 'Average total business income',\ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + + +![](./images/B15019_05_25.jpg) + + + + +Initializing Clusters +===================== + + +Since the beginning of this lab, we\'ve been referring to k-means +every time we\'ve fitted our clustering algorithms. But you may have +noticed in each model summary that there was a hyperparameter called +`init` with the default value as k-means++. We were, in fact, +using k-means++ all this time. + +The difference between k-means and k-means++ is in how they initialize +clusters at the start of the training. k-means randomly chooses the +center of each cluster (called the **centroid**) and then assigns each +data point to its nearest cluster. If this cluster initialization is +chosen incorrectly, this may lead to non-optimal grouping at the end of +the training process. For example, in the following graph, we can +clearly see the three natural groupings of the data, but the algorithm +didn\'t succeed in identifying them properly: + +![](./images/B15019_05_26.jpg) + +Caption: Example of non-optimal clusters being found + +k-means++ is an attempt to find better clusters at initialization time. +The idea behind it is to choose the first cluster randomly and then pick +the next ones, those further away, using a probability distribution from +the remaining data points. Even though k-means++ tends to get better +results compared to the original k-means, in some cases, it can still +lead to non-optimal clustering. + +Another hyperparameter data scientists can use to lower the risk of +incorrect clusters is `n_init`. This corresponds to the number +of times k-means is run with different initializations, the final model +being the best run. So, if you have a high number for this +hyperparameter, you will have a higher chance of finding the optimal +clusters, but the downside is that the training time will be longer. So, +you have to choose this value carefully, especially if you have a large +dataset. + +Let\'s try this out on our ATO dataset by having a look at the following +example. + +Note + +Open the notebook you were using for *Exercise 5.01*, *Performing Your +First Clustering Analysis on the ATO Dataset,* and earlier examples. +Execute the code you already entered, and then continue at the end of +the notebook with the following code. 
+ +First, let\'s run only one iteration using random initialization: + +``` +kmeans = KMeans(random_state=14, n_clusters=3, \ + init='random', n_init=1) +kmeans.fit(X) +``` +As usual, we want to visualize our clusters with a scatter plot, as +defined in the following code snippet: + +``` +df['cluster3'] = kmeans.predict(X) +alt.Chart(df).mark_circle()\ + .encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster3:N', \ + tooltip=['Postcode', 'cluster', \ + 'Average net tax', \ + 'Average total deductions']) \ + .interactive() +``` +You should get the following output: + +![](./images/B15019_05_27.jpg) + +Caption: Clustering results with n\_init as 1 and init as random + +Overall, the result is very close to that of our previous run. It is +worth noticing that the boundaries between the clusters are slightly +different. + +Now let\'s try with five iterations (using the `n_init` +hyperparameter) and k-means++ initialization (using the `init` +hyperparameter): + +``` +kmeans = KMeans(random_state=14, n_clusters=3, \ + init='k-means++', n_init=5) +kmeans.fit(X) +df['cluster4'] = kmeans.predict(X) +alt.Chart(df).mark_circle()\ + .encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster4:N', \ + tooltip=['Postcode', 'cluster', \ + 'Average net tax', \ + 'Average total deductions'])\ + .interactive() +``` +You should get the following output: + +![Caption: Clustering results with n\_init as 5 and init as +k-means++ ](./images/B15019_05_28.jpg) + +Caption: Clustering results with n\_init as 5 and init as k-means++ + +Here, the results are very close to the original run with 10 iterations. +This means that we didn\'t have to run so many iterations for k-means to +converge and could have saved some time with a lower number. + + + +Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome +-------------------------------------------------------------------------------------- + +In this exercise, we will use the same data as in *Exercise 5.02*, +*Clustering Australian Postcodes by Business Income and Expenses*, and +try different values for the `init` and `n_init` +hyperparameters and see how they affect the final clustering result: + +1. Open a new Colab notebook. + +2. Import the required packages, which are `pandas`, + `sklearn`, and `altair`: + ``` + import pandas as pd + from sklearn.cluster import KMeans + import altair as alt + ``` + + +3. Assign the link to the ATO dataset to a variable called + `file_url`: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab05/DataSet/taxstats2015.csv' + ``` + + +4. Load the dataset and select the same columns as in *Exercise 5.02*, + *Clustering Australian Postcodes by Business Income and Expenses*, + and *Exercise 5.03*, *Finding the Optimal Number of Clusters*, using + the `read_csv()` method from the `pandas` + package: + ``` + df = pd.read_csv(file_url, \ + usecols=['Postcode', \ + 'Average total business income', \ + 'Average total business expenses']) + ``` + + +5. Assign the `'Average total business income'` and + `'Average total business expenses'` columns to a new + variable called `X`: + ``` + X = df[['Average total business income', \ + 'Average total business expenses']] + ``` + + +6. Fit a k-means model with `n_init` equal to `1` + and a random `init`: + ``` + kmeans = KMeans(random_state=1, n_clusters=4, \ + init='random', n_init=1) + kmeans.fit(X) + ``` + + +7. 
Using the `predict` method from the `sklearn` + package, predict the clustering assignment from the input variable, + `(X)`, and save the results into a new column called + `'cluster3'` in the DataFrame: + ``` + df['cluster3'] = kmeans.predict(X) + ``` + + +8. Plot the clusters using an interactive scatter plot. First, use + `Chart()` and `mark_circle()` from the + `altair` package to instantiate a scatter plot graph, as + shown in the following code snippet: + ``` + scatter_plot = alt.Chart(df).mark_circle() + ``` + + +9. Use the `encode` and `interactive` methods from + `altair` to specify the display of the scatter plot and + its interactivity options with the following parameters: + + Provide the name of the `'Average total business income'` + column to the `x` parameter (x-axis). + + Provide the name of the + `'Average total business expenses'` column to the + `y` parameter (y-axis). + + Provide the name of the `'cluster3:N'` column to the + `color` parameter (which defines the different colors for + each group). + + Provide these column names -- `'Postcode'`, + `'cluster3'`, `'Average total business income'`, + and `'Average total business expenses'` -- to the + `tooltip` parameter: + + ``` + scatter_plot.encode(x='Average total business income', \ + y='Average total business expenses', \ + color='cluster3:N', \ + tooltip=['Postcode', 'cluster3', \ + 'Average total business income', \ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + + +![Caption: Clustering results with n\_init as 1 and init as + random ](./images/B15019_05_29.jpg) + + + Caption: Clustering results with n\_init as 1 and init as random + +10. Repeat *Steps 5* to *8* but with different k-means hyperparameters, + `n_init=10` and random `init`, as shown in the + following code snippet: + + ``` + kmeans = KMeans(random_state=1, n_clusters=4, \ + init='random', n_init=10) + kmeans.fit(X) + df['cluster4'] = kmeans.predict(X) + scatter_plot = alt.Chart(df).mark_circle() + scatter_plot.encode(x='Average total business income', \ + y='Average total business expenses', \ + color='cluster4:N', + tooltip=['Postcode', 'cluster4', \ + 'Average total business income', \ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + + +![Caption: Clustering results with n\_init as 10 and init as + random ](./images/B15019_05_30.jpg) + + + Caption: Clustering results with n\_init as 10 and init as + random + +11. Again, repeat *Steps 5* to *8* but with different k-means + hyperparameters -- `n_init=100` and random + `init`: + + ``` + kmeans = KMeans(random_state=1, n_clusters=4, \ + init='random', n_init=100) + kmeans.fit(X) + df['cluster5'] = kmeans.predict(X) + scatter_plot = alt.Chart(df).mark_circle() + scatter_plot.encode(x='Average total business income', \ + y='Average total business expenses', \ + color='cluster5:N', \ + tooltip=['Postcode', 'cluster5', \ + 'Average total business income', \ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + +![](./images/B15019_05_31.jpg) + +Caption: Clustering results with n\_init as 10 and init as random + + + +Calculating the Distance to the Centroid +======================================== + + +We\'ve talked a lot about similarities between data points in the +previous sections, but we haven\'t really defined what this means. 
You +have probably guessed that it has something to do with how close or how +far observations are from each other. You are heading in the right +direction. It has to do with some sort of distance measure between two +points. The one used by k-means is called **squared Euclidean distance** +and its formula is: + +![](./images/B15019_05_32.jpg) + +Caption: The squared Euclidean distance formula + +If you don\'t have a statistical background, this formula may look +intimidating, but it is actually very simple. It is the sum of the +squared difference between the data coordinates. Here, *x* and *y* are +two data points and the index, *i*, represents the number of +coordinates. If the data has two dimensions, *i* equals 2. Similarly, if +there are three dimensions, then *i* will be 3. + +Let\'s apply this formula to the ATO dataset. + +First, we will grab the values needed -- that is, the coordinates from +the first two observations -- and print them: + +Note + +Open the notebook you were using for *Exercise 5.01*, *Performing Your +First Clustering Analysis on the ATO Dataset*, and earlier examples. +Execute the code you already entered, and then continue at the end of +the notebook with the following code. + +``` +x = X.iloc[0,].values +y = X.iloc[1,].values +print(x) +print(y) +``` +You should get the following output: + +![Caption: Extracting the first two observations from the ATO +dataset ](./images/B15019_05_33.jpg) + +Caption: Extracting the first two observations from the ATO dataset + +Note + +In pandas, the `iloc` method is used to subset the rows or +columns of a DataFrame by index. For instance, if we wanted to grab row +number 888 and column number 6, we would use the following syntax: +`dataframe.iloc[888, 6]`. + +The coordinates for `x` are `(27555, 2071)` and the +coordinates for `y` are `(28142, 3804)`. Here, the +formula is telling us to calculate the squared difference between each +axis of the two data points and sum them: + +``` +squared_euclidean = (x[0] - y[0])**2 + (x[1] - y[1])**2 +print(squared_euclidean) +``` +You should get the following output: + +``` +3347858 +``` +k-means uses this metric to calculate the distance between each data +point and the center of its assigned cluster (also called the centroid). +Here is the basic logic behind this algorithm: + +1. Choose the centers of the clusters (the centroids) randomly. +2. Assign each data point to the nearest centroid using the squared + Euclidean distance. +3. Update each centroid\'s coordinates to the newly calculated center + of the data points assigned to it. +4. Repeat *Steps 2* and *3* until the clusters converge (that is, until + the cluster assignment doesn\'t change anymore) or until the maximum + number of iterations has been reached. + +That\'s it. The k-means algorithm is as simple as that. We can extract +the centroids after fitting a k-means model with +`cluster_centers_`. + +Let\'s see how we can plot the centroids in an example. 
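Before moving on to the plot, here is a compact, from-scratch sketch of the
loop described in *Steps 1* to *4* above, written with NumPy. It is purely
illustrative: the function and variable names are our own, it runs for a fixed
number of iterations instead of checking for convergence, and it ignores edge
cases such as empty clusters.

```
import numpy as np

def simple_kmeans(data, k, n_iterations=10, seed=42):
    # Step 1: pick k observations at random as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iterations):
        # Step 2: assign each point to its nearest centroid using
        # the squared Euclidean distance
        distances = ((data[:, None, :] - centroids[None, :, :]) ** 2)\
                    .sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        centroids = np.array([data[labels == i].mean(axis=0)
                              for i in range(k)])
    # Step 4 (simplified): stop after a fixed number of iterations
    return labels, centroids

# Example usage with the X variable from the previous examples:
# labels, centroids = simple_kmeans(X.to_numpy(dtype=float), k=3)
```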
+ +First, we fit a k-means model as shown in the following code snippet: + +``` +kmeans = KMeans(random_state=42, n_clusters=3, \ + init='k-means++', n_init=5) +kmeans.fit(X) +df['cluster6'] = kmeans.predict(X) +``` +Now extract the `centroids` into a DataFrame and print them: + +``` +centroids = kmeans.cluster_centers_ +centroids = pd.DataFrame(centroids, \ + columns=['Average net tax', \ + 'Average total deductions']) +print(centroids) +``` +You should get the following output: + +![](./images/B15019_05_34.jpg) + +Caption: Coordinates of the three centroids + +We will plot the usual scatter plot but will assign it to a variable +called `chart1`: + +``` +chart1 = alt.Chart(df).mark_circle()\ + .encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster6:N', \ + tooltip=['Postcode', 'cluster6', \ + 'Average net tax', \ + 'Average total deductions'])\ + .interactive() +chart1 +``` +You should get the following output: + +![](./images/B15019_05_35.jpg) + +Caption: Scatter plot of the clusters + +Now, to create a second scatter plot only for the centroids called +`chart2`: + +``` +chart2 = alt.Chart(centroids).mark_circle(size=100)\ + .encode(x='Average net tax', \ + y='Average total deductions', \ + color=alt.value('black'), \ + tooltip=['Average net tax', \ + 'Average total deductions'])\ + .interactive() +chart2 +``` +You should get the following output: + +![](./images/B15019_05_36.jpg) + +Caption: Scatter plot of the centroids + +And now we combine the two charts, which is extremely easy with +`altair`: + +``` +chart1 + chart2 +``` +You should get the following output: + +![](./images/B15019_05_37.jpg) + +Caption: Scatter plot of the clusters and their centroids + +Now we can easily see which centroids the observations are closest to. + + + +Exercise 5.05: Finding the Closest Centroids in Our Dataset +----------------------------------------------------------- + +In this exercise, we will be coding the first iteration of k-means in +order to assign data points to their closest cluster centroids. The +following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Now `import` the required packages, which are + `pandas`, `sklearn`, and `altair`: + ``` + import pandas as pd + from sklearn.cluster import KMeans + import altair as alt + ``` + + +3. Load the dataset and select the same columns as in *Exercise 5.02*, + *Clustering Australian Postcodes by Business Income and Expenses*, + using the `read_csv()` method from the `pandas` + package: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab05/DataSet/taxstats2015.csv' + df = pd.read_csv(file_url, \ + usecols=['Postcode', \ + 'Average total business income', \ + 'Average total business expenses']) + ``` + + +4. Assign the `'Average total business income'` and + `'Average total business expenses'` columns to a new + variable called `X`: + ``` + X = df[['Average total business income', \ + 'Average total business expenses']] + ``` + + +5. Now, calculate the minimum and maximum using the `min()` + and `max()` values of the + `'Average total business income'` and + `'Average total business income'` variables, as shown in + the following code snippet: + ``` + business_income_min = df['Average total business income'].min() + business_income_max = df['Average total business income'].max() + business_expenses_min = df['Average total business expenses']\ + .min() + business_expenses_max = df['Average total business expenses']\ + .max() + ``` + + +6. 
Print the values of these four variables, which are the minimum and + maximum values of the two variables: + + ``` + print(business_income_min) + print(business_income_max) + print(business_expenses_min) + print(business_expenses_max) + ``` + + + You should get the following output: + + ``` + 0 + 876324 + 0 + 884659 + ``` + + +7. Now import the `random` package and use the + `seed()` method to set a seed of `42`, as shown + in the following code snippet: + ``` + import random + random.seed(42) + ``` + + +8. Create an empty pandas DataFrame and assign it to a variable called + `centroids`: + ``` + centroids = pd.DataFrame() + ``` + + +9. Generate four random values using the `sample()` method + from the `random` package with possible values between the + minimum and maximum values of the + `'Average total business expenses'` column using + `range()` and store the results in a new column called + `'Average total business income'` from the + `centroids` DataFrame: + ``` + centroids\ + ['Average total business income'] = random.sample\ + (range\ + (business_income_min, \ + business_income_max), 4) + ``` + + +10. Repeat the same process to generate `4` random values for + `'Average total business expenses'`: + ``` + centroids\ + ['Average total business expenses'] = random.sample\ + (range\ + (business_expenses_min,\ + business_expenses_max), 4) + ``` + + +11. Create a new column called `'cluster'` from the + `centroids` DataFrame using the + `.index `attributes from the pandas package and print this + DataFrame: + + ``` + centroids['cluster'] = centroids.index + centroids + ``` + + + You should get the following output: + + +![](./images/B15019_05_38.jpg) + + + Caption: Coordinates of the four random centroids + +12. Create a scatter plot with the `altair` package to display + the data contained in the `df` DataFrame and save it in a + variable called `'chart1'`: + ``` + chart1 = alt.Chart(df.head()).mark_circle()\ + .encode(x='Average total business income', \ + y='Average total business expenses', \ + color=alt.value('orange'), \ + tooltip=['Postcode', \ + 'Average total business income', \ + 'Average total business expenses'])\ + .interactive() + ``` + + +13. Now create a second scatter plot using the `altair` + package to display the centroids and save it in a variable called + `'chart2'`: + ``` + chart2 = alt.Chart(centroids).mark_circle(size=100)\ + .encode(x='Average total business income', \ + y='Average total business expenses', \ + color=alt.value('black'), \ + tooltip=['cluster', \ + 'Average total business income',\ + 'Average total business expenses'])\ + .interactive() + ``` + + +14. Display the two charts together using the altair syntax: + ` + `: + + ``` + chart1 + chart2 + ``` + + + You should get the following output: + + +![Caption: Scatter plot of the random centroids and the first + five observations ](./images/B15019_05_39.jpg) + + + Caption: Scatter plot of the random centroids and the first five + observations + +15. Define a function that will calculate the + `squared_euclidean` distance and return its value. This + function will take the `x` and `y` coordinates + of a data point and a centroid: + ``` + def squared_euclidean(data_x, data_y, \ + centroid_x, centroid_y, ): + return (data_x - centroid_x)**2 + (data_y - centroid_y)**2 + ``` + + +16. 
Using the `.at` method from the pandas package, extract + the first row\'s `x` and `y` coordinates and + save them in two variables called `data_x` and + `data_y`: + ``` + data_x = df.at[0, 'Average total business income'] + data_y = df.at[0, 'Average total business expenses'] + ``` + + +17. Using a `for` loop or list comprehension, calculate the + `squared_euclidean` distance of the first observation + (using its `data_x` and `data_y` coordinates) + against the `4` different centroids contained in + `centroids`, save the result in a variable called + `distance`, and display it: + + ``` + distances = [squared_euclidean\ + (data_x, data_y, centroids.at\ + [i, 'Average total business income'], \ + centroids.at[i, \ + 'Average total business expenses']) \ + for i in range(4)] + distances + ``` + + + You should get the following output: + + ``` + [215601466600, 10063365460, 34245932020, 326873037866] + ``` + + +18. Use the `index` method from the list containing the + `squared_euclidean` distances to find the cluster with the + shortest distance, as shown in the following code snippet: + ``` + cluster_index = distances.index(min(distances)) + ``` + + +19. Save the `cluster` index in a column called + `'cluster'` from the `df` DataFrame for the + first observation using the `.at` method from the pandas + package: + ``` + df.at[0, 'cluster'] = cluster_index + ``` + + +20. Display the first five rows of `df` using the + `head()` method from the `pandas` package: + + ``` + df.head() + ``` + + + You should get the following output: + + +![Caption: The first five rows of the ATO DataFrame with the + assigned cluster number for the first row](./images/B15019_05_40.jpg) + + + Caption: The first five rows of the ATO DataFrame with the + assigned cluster number for the first row + +21. Repeat *Steps 15* to *19* for the next `4` rows to + calculate their distances from the centroids and find the cluster + with the smallest distance value: + + ``` + distances = [squared_euclidean\ + (df.at[1, 'Average total business income'], \ + df.at[1, 'Average total business expenses'], \ + centroids.at[i, 'Average total business income'],\ + centroids.at[i, \ + 'Average total business expenses'])\ + for i in range(4)] + df.at[1, 'cluster'] = distances.index(min(distances)) + distances = [squared_euclidean\ + (df.at[2, 'Average total business income'], \ + df.at[2, 'Average total business expenses'], \ + centroids.at[i, 'Average total business income'],\ + centroids.at[i, \ + 'Average total business expenses'])\ + for i in range(4)] + df.at[2, 'cluster'] = distances.index(min(distances)) + distances = [squared_euclidean\ + (df.at[3, 'Average total business income'], \ + df.at[3, 'Average total business expenses'], \ + centroids.at[i, 'Average total business income'],\ + centroids.at[i, \ + 'Average total business expenses'])\ + for i in range(4)] + df.at[3, 'cluster'] = distances.index(min(distances)) + distances = [squared_euclidean\ + (df.at[4, 'Average total business income'], \ + df.at[4, 'Average total business expenses'], \ + centroids.at[i, \ + 'Average total business income'], \ + centroids.at[i, \ + 'Average total business expenses']) \ + for i in range(4)] + df.at[4, 'cluster'] = distances.index(min(distances)) + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_05_41.jpg) + + + Caption: The first five rows of the ATO DataFrame and their + assigned clusters + +22. 
Finally, plot the centroids and the first `5` rows of the + dataset using the `altair` package as in *Steps 12* to + *13*: + + ``` + chart1 = alt.Chart(df.head()).mark_circle()\ + .encode(x='Average total business income', \ + y='Average total business expenses', \ + color='cluster:N', \ + tooltip=['Postcode', 'cluster', \ + 'Average total business income', \ + 'Average total business expenses'])\ + .interactive() + chart2 = alt.Chart(centroids).mark_circle(size=100)\ + .encode(x='Average total business income', \ + y='Average total business expenses', \ + color=alt.value('black'), \ + tooltip=['cluster', \ + 'Average total business income',\ + 'Average total business expenses'])\ + .interactive() + chart1 + chart2 + ``` + + + You should get the following output: + +![Caption: Scatter plot of the random centroids and the first five](./images/B15019_05_42.jpg) + +Caption: Scatter plot of the random centroids and the first fiveobservations + + +Standardizing Data +================== + + +You\'ve already learned a lot about the k-means algorithm, and we are +close to the end of this lab. In this final section, we will not +talk about another hyperparameter (you\'ve already been through the main +ones) but a very important topic: **data processing**. + +Fitting a k-means algorithm is extremely easy. The trickiest part is +making sure the resulting clusters are meaningful for your project, and +we have seen how we can tune some hyperparameters to ensure this. But +handling input data is as important as all the steps you have learned +about so far. If your dataset is not well prepared, even if you find the +best hyperparameters, you will still get some bad results. + +Let\'s have another look at our ATO dataset. In the previous section, +*Calculating the Distance to the Centroid*, we found three different +clusters, and they were mainly defined by the +`'Average net tax'` variable. It was as if k-means didn\'t +take into account the second variable, +`'Average total deductions'`, at all. This is in fact due to +these two variables having very different ranges of values and the way +that squared Euclidean distance is calculated. + +Squared Euclidean distance is weighted more toward high-value variables. +Let\'s take an example to illustrate this point with two data points +called A and B with respective x and y coordinates of (1, 50000) and +(100, 100000). The squared Euclidean distance between A and B will be +(100000 - 50000)\^2 + (100 - 1)\^2. We can clearly see that the result +will be mainly driven by the difference between 100,000 and 50,000: +50,000\^2. The difference of 100 minus 1 (99\^2) will account for very +little in the final result. + +But if you look at the ratio between 100,000 and 50,000, it is a factor +of 2 (100,000 / 50,000 = 2), while the ratio between 100 and 1 is a +factor of 100 (100 / 1 = 100). Does it make sense for the higher-value +variable to \"dominate\" the clustering result? It really depends on +your project, and this situation may be intended. But if you want things +to be fair between the different axes, it\'s preferable to bring them +all into a similar range of values before fitting a k-means model. This +is the reason why you should always consider standardizing your data +before running your k-means algorithm. + +There are multiple ways to standardize data, and we will have a look at +the two most popular ones: **min-max scaling** and **z-score**. Luckily +for us, the `sklearn` package has an implementation for both +methods. 
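To make the point about points A and B concrete before we look at the
formulas, here is a quick check with NumPy. This is just a sketch; the two
points are the illustrative ones from the paragraph above, not values from the
ATO dataset.

```
import numpy as np

# The two illustrative points from the example above
A = np.array([1.0, 50_000.0])
B = np.array([100.0, 100_000.0])

# Raw squared Euclidean distance: almost entirely driven by the second axis
print((A - B) ** 2)           # roughly [9.8e+03, 2.5e+09]
print(((A - B) ** 2).sum())   # 2500009801.0

# Rescale each axis to the [0, 1] range using that axis's min and max
data = np.vstack([A, B])
scaled = (data - data.min(axis=0)) \
         / (data.max(axis=0) - data.min(axis=0))

# After scaling, both axes contribute equally to the distance
print(((scaled[0] - scaled[1]) ** 2).sum())   # 2.0
```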
+ +The formula for min-max scaling is very simple: on each axis, you need +to remove the minimum value for each data point and divide the result by +the difference between the maximum and minimum values. The scaled data +will have values ranging between 0 and 1: + +![](./images/B15019_05_43.jpg) + +Caption: Min-max scaling formula + +Let\'s look at min-max scaling with `sklearn` in the following +example. + +Note + +Open the notebook you were using for *Exercise 5.01*, *Performing Your +First Clustering Analysis on the ATO Dataset*, and earlier examples. +Execute the code you already entered, and then continue at the end of +the notebook with the following code. + +First, we import the relevant class and instantiate an object: + +``` +from sklearn.preprocessing import MinMaxScaler +min_max_scaler = MinMaxScaler() +``` + +Then, we fit it to our dataset: + +``` +min_max_scaler.fit(X) +``` +You should get the following output: + +![](./images/B15019_05_44.jpg) + +Caption: Min-max scaling summary + +And finally, call the `transform()` method to standardize the +data: + +``` +X_min_max = min_max_scaler.transform(X) +X_min_max +``` +You should get the following output: + +![](./images/B15019_05_45.jpg) + +Caption: Min-max-scaled data + +Now we print the minimum and maximum values of the min-max-scaled data +for both axes: + +``` +X_min_max[:,0].min(), X_min_max[:,0].max(), \ +X_min_max[:,1].min(), X_min_max[:,1].max() +``` +You should get the following output: + +![](./images/B15019_05_46.jpg) + +Caption: Minimum and maximum values of the min-max-scaled data + +We can see that both axes now have their values sitting between 0 and 1. + +The **z-score** is calculated by removing the overall average from the +data point and dividing the result by the standard deviation for each +axis. The distribution of the standardized data will have a mean of 0 +and a standard deviation of 1: + +![](./images/B15019_05_47.jpg) + +Caption: Z-score formula + +To apply it with `sklearn`, first, we have to import the +relevant `StandardScaler` class and instantiate an object: + +``` +from sklearn.preprocessing import StandardScaler +standard_scaler = StandardScaler() +``` +This time, instead of calling `fit()` and then +`transform()`, we use the `fit_transform()` method: + +``` +X_scaled = standard_scaler.fit_transform(X) +X_scaled +``` +You should get the following output: + +![](./images/B15019_05_48.jpg) + +Caption: Z-score-standardized data + +Now we\'ll look at the minimum and maximum values for each axis: + +``` +X_scaled[:,0].min(), X_scaled[:,0].max(), \ +X_scaled[:,1].min(), X_scaled[:,1].max() +``` +You should get the following output: + +![Caption: Minimum and maximum values of the z-score-standardized +data ](./images/B15019_05_49.jpg) + +Caption: Minimum and maximum values of the z-score-standardized data + +The value ranges for both axes are much lower now and we can see that +their maximum values are around 9 and 18, which indicates that there are +some extreme outliers in the data. 
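If you want to check the scaled outputs against the formulas above, both
scaler objects expose the statistics they learned from the data. The following
is a small sketch; it assumes the `min_max_scaler`, `standard_scaler`,
`X`, and `X_scaled` variables created just above are still in memory.

```
# Per-column statistics learned by each scaler
print(min_max_scaler.data_min_, min_max_scaler.data_max_)
print(standard_scaler.mean_, standard_scaler.scale_)

# Recompute the z-score of the first row manually and compare it
# with the first row produced by the scaler
manual = (X.iloc[0].to_numpy() - standard_scaler.mean_) \
         / standard_scaler.scale_
print(manual)
print(X_scaled[0])   # should match the manual calculation
```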
+ +Now, to fit a k-means model and plot a scatter plot on the +z-score-standardized data with the following code snippet: + +``` +kmeans = KMeans(random_state=42, n_clusters=3, \ + init='k-means++', n_init=5) +kmeans.fit(X_scaled) +df['cluster7'] = kmeans.predict(X_scaled) +alt.Chart(df).mark_circle()\ + .encode(x='Average net tax', \ + y='Average total deductions', \ + color='cluster7:N', \ + tooltip=['Postcode', 'cluster7', \ + 'Average net tax', \ + 'Average total deductions'])\ + .interactive() +``` +You should get the following output: + +![](./images/B15019_05_50.jpg) + +Caption: Scatter plot of the standardized data + +k-means results are very different from the standardized data. Now we +can see that there are two main clusters (blue and red) and their +boundaries are not straight vertical lines anymore but diagonal. So, +k-means is actually taking into consideration both axes now. The orange +cluster contains much fewer data points compared to previous iterations, +and it seems it is grouping all the extreme outliers with high values +together. If your project was about detecting anomalies, you would have +found a way here to easily separate outliers from \"normal\" +observations. + + + +Exercise 5.06: Standardizing the Data from Our Dataset +------------------------------------------------------ + +In this final exercise, we will standardize the data using min-max +scaling and the z-score and fit a k-means model for each method and see +their impact on k-means: + +1. Open a new Colab notebook. + +2. Now import the required `pandas`, `sklearn`, and + `altair` packages: + ``` + import pandas as pd + from sklearn.cluster import KMeans + import altair as alt + ``` + + +3. Load the dataset and select the same columns as in *Exercise 5.02*, + *Clustering Australian Postcodes by Business Income and Expenses*, + using the `read_csv()` method from the `pandas` + package: + ``` + file_url = 'https://raw.githubusercontent.com'\ + '/fenago/data-science'\ + '/master/Lab05/DataSet/taxstats2015.csv' + df = pd.read_csv(file_url, \ + usecols=['Postcode', \ + 'Average total business income', \ + 'Average total business expenses']) + ``` + + +4. Assign the `'Average total business income'` and + `'Average total business expenses'` columns to a new + variable called `X`: + ``` + X = df[['Average total business income', \ + 'Average total business expenses']] + ``` + + +5. Import the `MinMaxScaler` and `StandardScaler` + classes from `sklearn`: + ``` + from sklearn.preprocessing import MinMaxScaler + from sklearn.preprocessing import StandardScaler + ``` + + +6. Instantiate and fit `MinMaxScaler` with the data: + + ``` + min_max_scaler = MinMaxScaler() + min_max_scaler.fit(X) + ``` + + + You should get the following output: + + +![](./images/B15019_05_51.jpg) + + + Caption: Summary of the min-max scaler + +7. Perform the min-max scaling transformation and save the data into a + new variable called `X_min_max`: + + ``` + X_min_max = min_max_scaler.transform(X) + X_min_max + ``` + + + You should get the following output: + + +![](./images/B15019_05_52.jpg) + + + Caption: Min-max-scaled data + +8. Fit a k-means model on the scaled data with the following + hyperparameters: `random_state=1`, + `n_clusters=4, init='k-means++', n_init=5`, as shown in + the following code snippet: + ``` + kmeans = KMeans(random_state=1, n_clusters=4, \ + init='k-means++', n_init=5) + kmeans.fit(X_min_max) + ``` + + +9. 
Assign the k-means predictions of each value of `X` in a + new column called `'cluster8'` in the `df` + DataFrame: + ``` + df['cluster8'] = kmeans.predict(X_min_max) + ``` + + +10. Plot the k-means results into a scatter plot using the + `altair` package: + + ``` + scatter_plot = alt.Chart(df).mark_circle() + scatter_plot.encode(x='Average total business income', \ + y='Average total business expenses',\ + color='cluster8:N',\ + tooltip=['Postcode', 'cluster8', \ + 'Average total business income',\ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + + +![Caption: Scatter plot of k-means results using the + min-max-scaled data ](./images/B15019_05_53.jpg) + + + Caption: Scatter plot of k-means results using the + min-max-scaled data + +11. Re-train the k-means model but on the z-score-standardized data with + the same hyperparameter values, + `random_state=1, n_clusters=4, init='k-means++', n_init=5`: + ``` + standard_scaler = StandardScaler() + X_scaled = standard_scaler.fit_transform(X) + kmeans = KMeans(random_state=1, n_clusters=4, \ + init='k-means++', n_init=5) + kmeans.fit(X_scaled) + ``` + + +12. Assign the k-means predictions of each value of `X_scaled` + in a new column called `'cluster9' `in the `df` + DataFrame: + ``` + df['cluster9'] = kmeans.predict(X_scaled) + ``` + + +13. Plot the k-means results in a scatter plot using the + `altair` package: + + ``` + scatter_plot = alt.Chart(df).mark_circle() + scatter_plot.encode(x='Average total business income', \ + y='Average total business expenses', \ + color='cluster9:N', \ + tooltip=['Postcode', 'cluster9', \ + 'Average total business income',\ + 'Average total business expenses'])\ + .interactive() + ``` + + + You should get the following output: + + +![Caption: Scatter plot of k-means results using the](./images/B15019_05_54.jpg) + + + + +Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means +----------------------------------------------------------------------------- + +You are working for an international bank. The credit department is +reviewing its offerings and wants to get a better understanding of its +current customers. You have been tasked with performing customer +segmentation analysis. You will perform cluster analysis with k-means to +identify groups of similar customers. + +The following steps will help you complete this activity: + +1. Download the dataset and load it into Python. + +2. Read the CSV file using the `read_csv()` method. + + Note + + This dataset is in the `.dat` file format. You can still + load the file using `read_csv()` but you will need to + specify the following parameter: + `header=None, sep= '\s\s+' and prefix='X'`. + +3. You will be using the fourth and tenth columns (`X3` and + `X9`). Extract these. + +4. Perform data standardization by instantiating a + `StandardScaler` object. + +5. Analyze and define the optimal number of clusters. + +6. Fit a k-means algorithm with the number of clusters you\'ve defined. + +7. Create a scatter plot of the clusters. + + Note + + This is the German Credit Dataset from the UCI Machine Learning + Repository.Even though all the columns in this + dataset are integers, most of them are actually categorical + variables. The data in these columns is not continuous. Only two + variables are really numeric. Those are the ones you will use for + your clustering. 
You should get something similar to the following output:

![](./images/B15019_05_55.jpg)

Caption: Scatter plot of the four clusters found


Summary
=======


You are now ready to perform cluster analysis with the k-means algorithm
on your own dataset. This type of analysis is very popular in the
industry for segmenting customer profiles as well as detecting
suspicious transactions or anomalies.

We learned about a lot of different concepts, such as centroids and
squared Euclidean distance. We went through the main k-means
hyperparameters: `init` (initialization method),
`n_init` (number of initialization runs),
`n_clusters` (number of clusters), and
`random_state` (specified seed). We also discussed the
importance of choosing the optimal number of clusters, initializing
centroids properly, and standardizing data. You have learned how to use
the `pandas`, `altair`, and `sklearn`
packages, along with the `KMeans` class.

In this lab, we only looked at k-means, but it is not the only
clustering algorithm. There are quite a lot of algorithms that take
different approaches, such as hierarchical clustering and Gaussian
mixture models, as well as related techniques such as principal
component analysis, which is often used to reduce dimensionality before
clustering. If you are interested in this field, you now have all the
basic knowledge you need to explore these other algorithms on your own.

Next, you will see how we can assess the performance of these models and
what tools can be used to make them even better. 
diff --git a/lab_guides/Lab_6.md b/lab_guides/Lab_6.md new file mode 100644 index 0000000..00e5436 --- /dev/null +++ b/lab_guides/Lab_6.md @@ -0,0 +1,2357 @@

6. How to Assess Performance
============================



Overview

This lab will introduce you to model evaluation, where you evaluate
or assess the performance of each model that you train before you decide
to put it into production. By the end of this lab, you will be able
to create an evaluation dataset. You will be equipped to assess the
performance of linear regression models using **mean absolute error**
(**MAE**) and **mean squared error** (**MSE**). You will also be able to
evaluate the performance of logistic regression models using accuracy,
precision, recall, and F1 score.


Introduction
============


When you assess the performance of a model, you look at certain
measurements or values that tell you how well the model is performing
under certain conditions, and this helps you make an informed decision
about whether or not to make use of the model that you have trained in
the real world. Some of the measurements you will encounter in this
lab are MAE, precision, recall, and the R² score.

You learned how to train a regression model in *Lab 2, Regression*,
and how to train classification models in *Lab 3, Binary
Classification*. Consider the task of predicting whether or not a
customer is likely to purchase a term deposit, which you addressed in
*Lab 3, Binary Classification*. You have learned how to train a
model to perform this sort of classification. You are now concerned with
how useful this model might be. You might start by training one model,
and then evaluating how often the predictions from that model are
correct. You might then proceed to train more models and evaluate
whether they perform better than previous models you have trained.
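To give you a first taste of what evaluating how often predictions are correct
can look like in code, here is a minimal sketch using the `accuracy_score`
function from `sklearn.metrics`. The labels below are made up purely for
illustration; the metrics used in this lab are introduced properly in the
following sections.

```
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels and the predictions from two models
y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1, 0, 1]
model_b = [1, 0, 1, 1, 0, 1, 0, 1]

# Fraction of predictions that match the ground truth
print(accuracy_score(y_true, model_a))   # 0.75
print(accuracy_score(y_true, model_b))   # 0.875
```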
+ +You have already seen an example of splitting data using +`train_test_split` in *Exercise 3.06*, *A Logistic Regression +Model for Predicting the Propensity of Term Deposit Purchases in a +Bank*. You will go further into the necessity and application of +splitting data in *Lab 7, The Generalization of Machine Learning +Models*, but for now, you should note that it is important to split your +data into one set that is used for training a model, and a second set +that is used for validating the model. It is this validation step that +helps you decide whether or not to put a model into production. + + +Splitting Data +============== + + +You will learn more about splitting data in *Lab 7, The +Generalization of Machine Learning Models*, where we will cover the +following: + +- Simple data splits using `train_test_split` +- Multiple data splits using cross-validation + +For now, you will learn how to split data using a function from +`sklearn` called `train_test_split`. + +It is very important that you do not use all of your data to train a +model. You must set aside some data for validation, and this data must +not have been used previously for training. When you train a model, it +tries to generate an equation that fits your data. The longer you train, +the more complex the equation becomes so that it passes through as many +of the data points as possible. + +When you shuffle the data and set some aside for validation, it ensures +that the model learns to not overfit the hypotheses you are trying to +generate. + + + +Exercise 6.01: Importing and Splitting Data +------------------------------------------- + +In this exercise, you will import data from a repository and split it +into a training and an evaluation set to train a model. Splitting your +data is required so that you can evaluate the model later. This exercise +will get you familiar with the process of splitting data; this is +something you will be doing frequently. + +Note + +The Car dataset that you will be using in this lab was taken from the UCI Machine Learning Repository. + +This dataset is about cars. A text file is provided with the following +information: + +- `buying` -- the cost of purchasing this vehicle +- `maint` -- the maintenance cost of the vehicle +- `doors` -- the number of doors the vehicle has +- `persons` -- the number of persons the vehicle is capable + of transporting +- `lug_boot` -- the cargo capacity of the vehicle +- `safety` -- the safety rating of the vehicle +- `car` -- this is the category that the model attempts to + predict + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Import the required libraries: + + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + ``` + + + You started by importing a library called `pandas` in the + first line. This library is useful for reading files into a data + structure that is called a `DataFrame`, which you have + used in previous labs. This structure is like a spreadsheet or a + table with rows and columns that we can manipulate. Because you + might need to reference the library lots of times, we have created + an alias for it, `pd`. + + In the second line, you import a function called + `train_test_split` from a module called + `model_selection`, which is within `sklearn`. + This function is what you will make use of to split the data that + you read in using `pandas`. + +3. 
Create a Python list: + + ``` + # data doesn't have headers, so let's create headers + _headers = ['buying', 'maint', 'doors', 'persons', \ + 'lug_boot', 'safety', 'car'] + ``` + + + The data that you are reading in is stored as a CSV file. + + The browser will download the file to your computer. You can open + the file using a text editor. If you do, you will see something + similar to the following: + + +![](./images/B15019_06_01.jpg) + + + Caption: The car dataset without headers + + Note + + Alternatively, you can enter the dataset URL in the browser to view + the dataset. + + `CSV` files normally have the name of each column written + in the first row of the data. For instance, have a look at this + dataset\'s CSV file, which you used in *Lab 3, Binary + Classification*: + + +![](./images/B15019_06_02.jpg) + + + Caption: CSV file without headers + + But, in this case, the column name is missing. That is not a + problem, however. The code in this step creates a Python list called + `_headers` that contains the name of each column. You will + supply this list when you read in the data in the next step. + +4. Read the data: + + ``` + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/car.data', \ + names=_headers, index_col=None) + ``` + + + In this step, the code reads in the file using a function called + `read_csv`. The first parameter, + `'https://raw.githubusercontent.com/fenago/data-science/master/Lab06/Dataset/car.data'`, + is mandatory and is the location of the file. In our case, the file + is on the internet. It can also be optionally downloaded, and we can + then point to the local file\'s location. + + The second parameter (`names=_headers`) asks the function + to add the row headers to the data after reading it in. The third + parameter (`index_col=None`) asks the function to generate + a new index for the table because the data doesn\'t contain an + index. The function will produce a DataFrame, which we assign to a + variable called `df`. + +5. Print out the top five records: + + ``` + df.head() + ``` + + + The code in this step is used to print the top five rows of the + DataFrame. The output from that operation is shown in the following + screenshot: + + +![](./images/B15019_06_03.jpg) + + + Caption: The top five rows of the DataFrame + +6. Create a training and an evaluation DataFrame: + + ``` + training, evaluation = train_test_split(df, test_size=0.3, \ + random_state=0) + ``` + + + The preceding code will split the DataFrame containing your data + into two new DataFrames. The first is called `training` + and is used for training the model. The second is called + `evaluation` and will be further split into two in the + next step. We mentioned earlier that you must separate your dataset + into a training and an evaluation dataset, the former for training + your model and the latter for evaluating your model. + + At this point, the `train_test_split` function takes two + parameters. The first parameter is the data we want to split. The + second is the ratio we would like to split it by. What we have done + is specified that we want our evaluation data to be 30% of our data. + + Note + + The third parameter random\_state is set to 0 to ensure + reproducibility of results. + +7. Create a validation and test dataset: + + ``` + validation, test = train_test_split(evaluation, test_size=0.5, \ + random_state=0) + ``` + + + This code is similar to the code in *Step 6*. 
In this step, the code + splits our evaluation data into two equal parts because we specified + `0.5`, which means `50%`. + + +Assessing Model Performance for Regression Models +================================================= + + +When you create a regression model, you create a model that predicts a +continuous numerical variable, as you learned in *Lab 2, +Regression*. When you set aside your evaluation dataset, you have +something that you can use to compare the quality of your model. + +What you need to do to assess your model quality is compare the quality +of your prediction to what is called the ground truth, which is the +actual observed value that you are trying to predict. Take a look at +*Figure 6.4*, in which the first column contains the ground truth +(called actuals) and the second column contains the predicted values: + +![](./images/B15019_06_04.jpg) + +Caption: Actual versus predicted values + +Line `0` in the output compares the actual value in our +evaluation dataset to what our model predicted. The actual value from +our evaluation dataset is `4.891`. The value that the model +predicted is `4.132270`. + +Line `1` compares the actual value of `4.194` to +what the model predicted, which is `4.364320`. + +In practice, the evaluation dataset will contain a lot of records, so +you will not be making this comparison visually. Instead, you will make +use of some equations. + +You would carry out this comparison by computing the loss. The loss is +the difference between the actuals and the predicted values in the +preceding screenshot. In data mining, it is called a **distance +measure**. There are various approaches to computing distance measures +that give rise to different loss functions. Two of these are: + +- Manhattan distance +- Euclidean distance + +There are various loss functions for regression, but in this book, we +will be looking at two of the commonly used loss functions for +regression, which are: + +- Mean absolute error (MAE) -- this is based on Manhattan distance +- Mean squared error (MSE) -- this is based on Euclidean distance + +The goal of these functions is to measure the usefulness of your models +by giving you a numerical value that shows how much deviation there is +between the ground truths and the predicted values from your models. + +Your mission is to train new models with consistently lower errors. +Before we do that, let\'s have a quick introduction to some data +structures. + + + +Data Structures -- Vectors and Matrices +--------------------------------------- + +In this section, we will look at different data structures, as follows. + + + +### Scalars + +A scalar variable is a simple number, such as 23. Whenever you make use +of numbers on their own, they are scalars. You assign them to variables, +such as in the following expression: + +``` +temperature = 23 +``` +If you had to store the temperature for 5 days, you would need to store +the values in 5 different values, such as in the following code snippet: + +``` +temp_1 = 23 +temp_2 = 24 +temp_3 = 23 +temp_4 = 22 +temp_5 = 22 +``` + +In data science, you will frequently work with a large number of data +points, such as hourly temperature measurements for an entire year. A +more efficient way of storing lots of values is called a vector. Let\'s +look at vectors in the next topic. + + + +### Vectors + +A vector is a collection of scalars. Consider the five temperatures in +the previous code snippet. 
A vector is a data type that lets you collect all of the previous temperatures in one variable that supports arithmetic operations. Vectors look similar to Python lists and can be created from Python lists. Consider the following code snippet for creating a Python list:

```
temps_list = [23, 24, 23, 22, 22]
```
You can create a vector from the list using the `array()` function from `numpy` by first importing `numpy` and then using the following snippet:

```
import numpy as np
temps_ndarray = np.array(temps_list)
```
You can proceed to verify the data type using the following code snippet:

```
print(type(temps_ndarray))
```

The code snippet will cause the interpreter to print out the following:

![](./images/B15019_06_05.jpg)

Caption: The temps\_ndarray vector data type

You may inspect the contents of the vector using the following code snippet:

```
print(temps_ndarray)
```
This generates the following output:

![](./images/B15019_06_06.jpg)

Caption: The temps\_ndarray vector

Note that the output contains single square brackets, `[` and `]`, and the numbers are separated by spaces. This is different from the output of a Python list, which you can obtain using the following code snippet:

```
print(temps_list)
```

The code snippet yields the following output:

![](./images/B15019_06_07.jpg)

Caption: List of elements in temps\_list

Note that the output contains single square brackets, `[` and `]`, and the numbers are separated by commas.

Vectors have a shape and a dimension. Both of these can be determined by using the following code snippet:

```
print(temps_ndarray.shape)
```

The output is a Python data structure called a **tuple** and looks like this:

![](./images/B15019_06_08.jpg)

Caption: Shape of the temps\_ndarray vector

Notice that the output consists of brackets, `(` and `)`, with a number and a comma. The single number followed by a comma implies that this object has only one dimension. The value of the number is the number of elements. The output is read as \"a vector with five elements.\" This is very important because it is very different from a matrix, which we will discuss next.



### Matrices

A matrix is also made up of scalars, but unlike a vector, a matrix is arranged into both rows and columns.

There are times when you need to convert between vectors and matrices. Let\'s revisit `temps_ndarray`. You may recall that it has five elements because the shape was `(5,)`. To convert it into a matrix with five rows and one column, you would use the following snippet:

```
temps_matrix = temps_ndarray.reshape(-1, 1)
```

The code snippet makes use of the `.reshape()` method. The first parameter, `-1`, instructs the interpreter to infer the size of that dimension from the number of elements, which is five in this case. The second parameter, `1`, instructs the interpreter to add a new dimension of size one. This new dimension is the column. To see the new shape, use the following snippet:

```
print(temps_matrix.shape)
```
You will get the following output:

![](./images/B15019_06_09.jpg)

Caption: Shape of the matrix

Notice that the tuple now has two numbers, `5` and `1`. The first number, `5`, represents the rows, and the second number, `1`, represents the columns.
You can print +out the value of the matrix using the following snippet: + +``` +print(temps_matrix) +``` + +The output of the code is as follows: + +![](./images/B15019_06_10.jpg) + +Caption: Elements of the matrix + +Notice that the output is different from that of the vector. First, we +have an outer set of square brackets. Then, each row has its element +enclosed in square brackets. Each row contains only one number because +the matrix has only one column. + +You may reshape the matrix to contain `1` row and +`5` columns and print out the value using the following code +snippet: + +``` +print(temps_matrix.reshape(1,5)) +``` + +The output will be as follows: + +![](./images/B15019_06_11.jpg) + +Caption: Reshaping the matrix + +Notice that you now have all the numbers on one row because this matrix +has one row and five columns. The outer square brackets represent the +matrix, while the inner square brackets represent the row. + +Finally, you can convert the matrix back into a vector by dropping the +column using the following snippet: + +``` +vector = temps_matrix.reshape(-1) +``` +You can print out the value of the vector to confirm that you get the +following: + +![](./images/B15019_06_12.jpg) + +Caption: The value of the vector + +Notice that you now have only one set of square brackets. You still have +the same number of elements. + + + + +Exercise 6.02: Computing the R[2] Score of a Linear Regression Model +---------------------------------------------------------------------------------- + +As mentioned in the preceding sections, R[2] score is an +important factor in evaluating the performance of a model. Thus, in this +exercise, we will be creating a linear regression model and then +calculating the R[2] score for it. + + + +The following attributes are useful for our task: + +- CIC0: information indices +- SM1\_Dz(Z): 2D matrix-based descriptors +- GATS1i: 2D autocorrelations +- NdsCH: Pimephales promelas +- NdssC: atom-type counts +- MLOGP: molecular properties +- Quantitative response, LC50 \[-LOG(mol/L)\]: This attribute + represents the concentration that causes death in 50% of test fish + over a test duration of 96 hours. + +The following steps will help you to complete the exercise: + +1. Open a new Colab notebook to write and execute your code. + +2. Next, import the libraries mentioned in the following code snippet: + + ``` + # import libraries + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LinearRegression + ``` + + + In this step, you import `pandas`, which you will use to + read your data. You also import `train_test_split()`, + which you will use to split your data into training and validation + sets, and you import `LinearRegression`, which you will + use to train your model. + +3. Now, read the data from the dataset: + + ``` + # column headers + _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \ + 'MLOGP', 'response'] + # read in data + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/'\ + 'qsar_fish_toxicity.csv', \ + names=_headers, sep=';') + ``` + + + In this step, you create a Python list to hold the names of the + columns in your data. You do this because the CSV file containing + the data does not have a first row that contains the column headers. + You proceed to read in the file and store it in a variable called + `df` using the `read_csv()` method in pandas. 
+ You specify the list containing column headers by passing it into + the `names` parameter. This CSV uses semi-colons as column + separators, so you specify that using the `sep` parameter. + You can use `df.head()` to see what the DataFrame looks + like: + + +![](./images/B15019_06_13.jpg) + + + Caption: The first five rows of the DataFrame + +4. Split the data into features and labels and into training and + evaluation datasets: + + ``` + # Let's split our data + features = df.drop('response', axis=1).values + labels = df[['response']].values + X_train, X_eval, y_train, y_eval = train_test_split\ + (features, labels, \ + test_size=0.2, \ + random_state=0) + X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\ + random_state=0) + ``` + + + In this step, you create two `numpy` arrays called + `features` and `labels`. You then proceed to + split them twice. The first split produces a `training` + set and an `evaluation` set. The second split creates a + `validation` set and a `test` set. + +5. Create a linear regression model: + + ``` + model = LinearRegression() + ``` + + + In this step, you create an instance of `LinearRegression` + and store it in a variable called `model`. You will make + use of this to train on the training dataset. + +6. Train the model: + + ``` + model.fit(X_train, y_train) + ``` + + + In this step, you train the model using the `fit()` method + and the training dataset that you made in *Step 4*. The first + parameter is the `features` NumPy array, and the second + parameter is `labels`. + + You should get an output similar to the following: + + +![](./images/B15019_06_14.jpg) + + + Caption: Training the model + +7. Make a prediction, as shown in the following code snippet: + + ``` + y_pred = model.predict(X_val) + ``` + + + In this step, you make use of the validation dataset to make a + prediction. This is stored in `y_pred`. + +8. Compute the R[2] score: + + ``` + r2 = model.score(X_val, y_val) + print('R^2 score: {}'.format(r2)) + ``` + + + In this step, you compute `r2`, which is the + R[2] score of the model. The R[2] score + is computed using the `score()` method of the model. The + next line causes the interpreter to print out the R[2] + score. + + The output is similar to the following: + + +![](./images/B15019_06_15.jpg) + + + Caption: R2 score + + Note + + The MAE and R[2] score may vary depending on the + distribution of the datasets. + +9. You see that the R[2] score we achieved is + `0.56238`, which is not close to 1. In the next step, we + will be making comparisons. + +10. Compare the predictions to the actual ground truth: + + ``` + _ys = pd.DataFrame(dict(actuals=y_val.reshape(-1), \ + predicted=y_pred.reshape(-1))) + _ys.head() + ``` + + + + The output looks similar to the following: + + +![](./images/B15019_06_16.jpg) + + + + + +Mean Absolute Error +------------------- + +The **mean absolute error** (**MAE**) is an evaluation metric for +regression models that measures the absolute distance between your +predictions and the ground truth. The absolute distance is the distance +regardless of the sign, whether positive or negative. For example, if +the ground truth is 6 and you predict 5, the distance is 1. However, if +you predict 7, the distance becomes -1. The absolute distance, without +taking the signs into consideration, is 1 in both cases. This is called +the **magnitude**. The MAE is computed by summing all of the magnitudes +and dividing by the number of observations. 
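
If you would like to see this definition in action before working through the next exercise, the following minimal sketch computes the MAE by hand on a handful of made-up ground truths and predictions and compares the result to the `mean_absolute_error` function from `sklearn.metrics`, which you will use shortly. The values in `y_true` and `y_pred` below are invented purely for illustration:

```
import numpy as np
from sklearn.metrics import mean_absolute_error

# made-up ground truths and predictions, for illustration only
y_true = np.array([4.9, 4.2, 5.1, 3.8])
y_pred = np.array([4.1, 4.4, 5.0, 4.3])

# MAE by hand: sum the magnitudes of the errors and divide by the number of observations
manual_mae = np.mean(np.abs(y_true - y_pred))

# the same value computed by scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)  # both print roughly 0.4
```

Both approaches produce the same number; the manual version simply makes the definition explicit.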
+ + + +Exercise 6.03: Computing the MAE of a Model +------------------------------------------- + +The goal of this exercise is to find the score and loss of a model using +the same dataset as *Exercise 6.02*, *Computing the R2 Score of a Linear +Regression Model*. + +In this exercise, we will be calculating the MAE of a model. + +The following steps will help you with this exercise: + +1. Open a new Colab notebook file. + +2. Import the necessary libraries: + + ``` + # Import libraries + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LinearRegression + from sklearn.metrics import mean_absolute_error + ``` + + + In this step, you import the function called + `mean_absolute_error` from `sklearn.metrics`. + +3. Import the data: + + ``` + # column headers + _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \ + 'MLOGP', 'response'] + # read in data + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/'\ + 'qsar_fish_toxicity.csv', \ + names=_headers, sep=';') + ``` + + + In the preceding code, you read in your data. This data is hosted + online and contains some information about fish toxicity. The data + is stored as a CSV but does not contain any headers. Also, the + columns in this file are not separated by a comma, but rather by a + semi-colon. The Python list called `_headers` contains the + names of the column headers. + + In the next line, you make use of the function called + `read_csv`, which is contained in the `pandas` + library, to load the data. The first parameter specifies the file + location. The second parameter specifies the Python list that + contains the names of the columns in the data. The third parameter + specifies the character that is used to separate the columns in the + data. + +4. Split the data into `features` and `labels` and + into training and evaluation sets: + + ``` + # Let's split our data + features = df.drop('response', axis=1).values + labels = df[['response']].values + X_train, X_eval, y_train, y_eval = train_test_split\ + (features, labels, \ + test_size=0.2, \ + random_state=0) + X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\ + random_state=0) + ``` + + + In this step, you split your data into training, validation, and + test datasets. In the first line, you create a `numpy` + array in two steps. In the first step, the `drop` method + takes a parameter with the name of the column to drop from the + DataFrame. In the second step, you use `values` to convert + the DataFrame into a two-dimensional `numpy` array that is + a tabular structure with rows and columns. This array is stored in a + variable called `features`. + + In the second line, you convert the column into a `numpy` + array that contains the label that you would like to predict. You do + this by picking out the column from the DataFrame and then using + `values` to convert it into a `numpy` array. + + In the third line, you split the `features` and + `labels` using `train_test_split` and a ratio of + 80:20. The training data is contained in `X_train` for the + features and `y_train` for the labels. The evaluation + dataset is contained in `X_eval` and `y_eval`. + + In the fourth line, you split the evaluation dataset into validation + and testing using `train_test_split`. Because you don\'t + specify the `test_size`, a value of `25%` is + used. 
The validation data is stored in `X_val` and `y_val`, while the test data is stored in `X_test` and `y_test`.

5. Create a simple linear regression model and train it:

    ```
    # create a simple Linear Regression model
    model = LinearRegression()
    # train the model
    model.fit(X_train, y_train)
    ```

    In this step, you make use of your training data to train a model. In the first line, you create an instance of `LinearRegression`, which you call `model`. In the second line, you train the model using `X_train` and `y_train`. `X_train` contains the `features`, while `y_train` contains the `labels`.

6. Now predict the values of our validation dataset:

    ```
    # let's use our model to predict on our validation dataset
    y_pred = model.predict(X_val)
    ```

    At this point, your model is ready to use. You make use of the `predict` method to predict on your data. In this case, you are passing `X_val` as a parameter to the function. Recall that `X_val` is your validation dataset. The result is assigned to a variable called `y_pred` and will be used in the next step to compute the MAE of the model.

7. Compute the MAE:

    ```
    # Let's compute our MEAN ABSOLUTE ERROR
    mae = mean_absolute_error(y_val, y_pred)
    print('MAE: {}'.format(mae))
    ```

    In this step, you compute the MAE of the model by using the `mean_absolute_error` function and passing in `y_val` and `y_pred`. `y_val` contains the actual labels from the validation dataset, and `y_pred` is the prediction from the model. The preceding code should give you an MAE value of \~ 0.72434:

    ![](./images/B15019_06_17.jpg)

    Caption: MAE score

8. Compute the R[2] score of the model:

    ```
    # Let's get the R2 score
    r2 = model.score(X_val, y_val)
    print('R^2 score: {}'.format(r2))
    ```

    You should get an output similar to the following:

    ![](./images/B15019_06_18.jpg)

In this exercise, we have calculated the MAE, which is a significant metric when it comes to evaluating models.

You will now train a second model and compare its R[2] score and MAE to the first model to evaluate which is the better performing model.



Exercise 6.04: Computing the Mean Absolute Error of a Second Model
------------------------------------------------------------------

In this exercise, we will be engineering new features and finding the score and loss of a new model.

The following steps will help you with this exercise:

1. Open a new Colab notebook file.

2. Import the required libraries:

    ```
    # Import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    # pipeline
    from sklearn.pipeline import Pipeline
    # preprocessing
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import PolynomialFeatures
    ```

    In the first step, you will import libraries such as `train_test_split`, `LinearRegression`, and `mean_absolute_error`. We make use of a pipeline to quickly transform our features and engineer new features using `MinMaxScaler` and `PolynomialFeatures`. `MinMaxScaler` rescales your data so that all values fall within the range of 0 to 1. It does this by subtracting the minimum value of each column and then dividing by that column\'s range, which is the maximum value minus the minimum value.
+ `PolynomialFeatures` will engineer new features by raising + the values in a column up to a certain power and creating new + columns in your DataFrame to accommodate them. + +3. Read in the data from the dataset: + + ``` + # column headers + _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \ + 'MLOGP', 'response'] + # read in data + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/'\ + 'qsar_fish_toxicity.csv', \ + names=_headers, sep=';') + ``` + + + In this step, you will read in your data. While the data is stored + in a CSV, it doesn\'t have a first row that lists the names of the + columns. The Python list called `_headers` will hold the + column names that you will supply to the `pandas` method + called `read_csv`. + + In the next line, you call the `read_csv` + `pandas` method and supply the location and name of the + file to be read in, along with the header names and the file + separator. Columns in the file are separated with a semi-colon. + +4. Split the data into training and evaluation sets: + + ``` + # Let's split our data + features = df.drop('response', axis=1).values + labels = df[['response']].values + X_train, X_eval, y_train, y_eval = train_test_split\ + (features, labels, \ + test_size=0.2, \ + random_state=0) + X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\ + random_state=0) + ``` + + + In this step, you begin by splitting the DataFrame called + `df` into two. The first DataFrame is called + `features` and contains all of the independent variables + that you will use to make your predictions. The second is called + `labels` and contains the values that you are trying to + predict. + + In the third line, you split `features` and + `labels` into four sets using + `train_test_split`. `X_train` and + `y_train` contain 80% of the data and are used for + training your model. `X_eval` and `y_eval` + contain the remaining 20%. + + In the fourth line, you split `X_eval` and + `y_eval` into two additional sets. `X_val` and + `y_val` contain 75% of the data because you did not + specify a ratio or size. `X_test` and `y_test` + contain the remaining 25%. + +5. Create a pipeline: + + ``` + # create a pipeline and engineer quadratic features + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(2)),\ + ('model', LinearRegression())] + ``` + + + In this step, you begin by creating a Python list called + `steps`. The list contains three tuples, each one + representing a transformation of a model. The first tuple represents + a scaling operation. The first item in the tuple is the name of the + step, which you call `scaler`. This uses + `MinMaxScaler` to transform the data. The second, called + `poly`, creates additional features by crossing the + columns of data up to the degree that you specify. In this case, you + specify `2`, so it crosses these columns up to a power + of 2. Next comes your `LinearRegression` model. + +6. Create a pipeline: + + ``` + # create a simple Linear Regression model with a pipeline + model = Pipeline(steps) + ``` + + + In this step, you create an instance of `Pipeline` and + store it in a variable called `model`. + `Pipeline` performs a series of transformations, which are + specified in the steps you defined in the previous step. This + operation works because the transformers (`MinMaxScaler` + and `PolynomialFeatures`) implement two methods called + `fit()` and `fit_transform()`. 
You may recall + from previous examples that models are trained using the + `fit()` method that `LinearRegression` + implements. + +7. Train the model: + + ``` + # train the model + model.fit(X_train, y_train) + ``` + + + On the next line, you call the `fit` method and provide + `X_train` and `y_train` as parameters. Because + the model is a pipeline, three operations will happen. First, + `X_train` will be scaled. Next, additional features will + be engineered. Finally, training will happen using the + `LinearRegression` model. The output from this step is + similar to the following: + + +![](./images/B15019_06_19.jpg) + + + Caption: Training the model + +8. Predict using the validation dataset: + ``` + # let's use our model to predict on our validation dataset + y_pred = model.predict(X_val) + ``` + + +9. Compute the MAE of the model: + + ``` + # Let's compute our MEAN ABSOLUTE ERROR + mae = mean_absolute_error(y_val, y_pred) + print('MAE: {}'.format(mae)) + ``` + + + In the first line, you make use of `mean_absolute_error` + to compute the mean absolute error. You supply `y_val` and + `y_pred`, and the result is stored in the `mae` + variable. In the following line, you print out `mae`: + + +![](./images/B15019_06_20.jpg) + + + Caption: MAE score + + The loss that you compute at this step is called a validation loss + because you make use of the validation dataset. This is different + from a training loss that is computed using the training dataset. + This distinction is important to note as you study other + documentation or books, which might refer to both. + +10. Compute the R[2] score: + + ``` + # Let's get the R2 score + r2 = model.score(X_val, y_val) + print('R^2 score: {}'.format(r2)) + ``` + + + In the final two lines, you compute the R[2] score and + also display it, as shown in the following screenshot: + + +![](./images/B15019_06_21.jpg) + + + +Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics +------------------------------------------------------------------------------- + +In this exercise, you will create a classification model that you will +make use of later on for model assessment. + +You will make use of the cars dataset from the UCI Machine Learning +Repository. You will use this dataset to classify cars as either +acceptable or unacceptable based on the following categorical features: + +- `buying`: the purchase price of the car + +- `maint`: the maintenance cost of the car + +- `doors`: the number of doors on the car + +- `persons`: the carrying capacity of the vehicle + +- `lug_boot`: the size of the luggage boot + +- `safety`: the estimated safety of the car + + + +The following steps will help you achieve the task: + +1. Open a new Colab notebook. + +2. Import the libraries you will need: + + ``` + # import libraries + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LogisticRegression + ``` + + + In this step, you import `pandas` and alias it as + `pd`. `pandas` is needed for reading data into a + DataFrame. You also import `train_test_split`, which is + needed for splitting your data into training and evaluation + datasets. Finally, you also import the + `LogisticRegression` class. + +3. 
Import your data: + + ``` + # data doesn't have headers, so let's create headers + _headers = ['buying', 'maint', 'doors', 'persons', \ + 'lug_boot', 'safety', 'car'] + # read in cars dataset + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/car.data', \ + names=_headers, index_col=None) + df.head() + ``` + + + In this step, you create a Python list called `_headers` + to hold the names of the columns in the file you will be importing + because the file doesn\'t have a header. You  then proceed to read + the file into a DataFrame named `df` by using + `pd.read_csv` and specifying the file location as well as + the list containing the file headers. Finally, you display the first + five rows using `df.head()`. + + You should get an output similar to the following: + + +![](./images/B15019_06_22.jpg) + + + Caption: Inspecting the DataFrame + +4. Encode categorical variables as shown in the following code snippet: + + ``` + # encode categorical variables + _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\ + 'persons', 'lug_boot', \ + 'safety']) + _df.head() + ``` + + + In this step, you convert categorical columns into numeric columns + using a technique called one-hot encoding. You saw an example of + this in *Step 13* of *Exercise 3.04*, *Feature Engineering -- + Creating New Features from Existing Ones*. You need to do this + because the inputs to your model must be numeric. You get numeric + variables from categorical variables using `get_dummies` + from the `pandas` library. You provide your DataFrame as + input and specify the columns to be encoded. You assign the result + to a new DataFrame called `_df`, and then inspect the + result using `head()`. + + The output should now resemble the following screenshot: + + +![](./images/B15019_06_23.jpg) + + + Caption: Encoding categorical variables + + +5. Split the data into training and validation sets: + + ``` + # split data into training and evaluation datasets + features = _df.drop('car', axis=1).values + labels = _df['car'].values + X_train, X_eval, y_train, y_eval = train_test_split\ + (features, labels, \ + test_size=0.3, \ + random_state=0) + X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\ + test_size=0.5, \ + random_state=0) + ``` + + + In this step, you begin by extracting your feature columns and your + labels into two NumPy arrays called `features` and + `labels`. You then proceed to extract 70% into + `X_train` and `y_train`, with the remaining 30% + going into `X_eval` and `y_eval`. You then + further split `X_eval` and `y_eval` into two + equal parts and assign those to `X_val` and + `y_val` for validation, and `X_test` and + `y_test` for testing much later. + +6. Train a logistic regression model: + + ``` + # train a Logistic Regression model + model = LogisticRegression() + model.fit(X_train, y_train) + ``` + + + In this step, you create an instance of + `LogisticRegression` and train the model on your training + data by passing in `X_train` and `y_train` to + the `fit` method. + + You should get an output that looks similar to the following: + + +![](./images/B15019_06_24.jpg) + + + Caption: Training a logistic regression model + +7. Make a prediction: + + ``` + # make predictions for the validation set + y_pred = model.predict(X_val) + ``` + + + In this step, you make a prediction on the validation dataset, + `X_val`, and store the result in `y_pred`. 
A look at the first nine predictions (by executing `y_pred[0:9]`) should provide an output similar to the following:

![](./images/B15019_06_25.jpg)

Caption: Prediction for the validation set



The Confusion Matrix
====================


You encountered the confusion matrix in *Lab 3, Binary Classification*. You may recall that the confusion matrix compares the classes that the model predicted against the actual occurrences of those classes in the validation dataset. The output is a square matrix that has the number of rows and columns equal to the number of classes you are predicting. In the matrix that `confusion_matrix` from `sklearn.metrics` produces, the rows represent the actual values, while the columns represent the predictions.



Exercise 6.06: Generating a Confusion Matrix for the Classification Model
-------------------------------------------------------------------------

The goal of this exercise is to create a confusion matrix for the classification model you trained in *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*.

Note

You should continue this exercise in the same notebook as that used in *Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.* If you wish to use a new notebook, make sure you copy and run the entire code from *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*, and then begin with the execution of the code of this exercise.

The following steps will help you achieve the task:

1. Continue in the same Colab notebook that you used for *Exercise 6.05*.

2. Import `confusion_matrix`:

    ```
    from sklearn.metrics import confusion_matrix
    ```

    In this step, you import `confusion_matrix` from `sklearn.metrics`. This function will let you generate a confusion matrix.

3. Generate a confusion matrix:

    ```
    confusion_matrix(y_val, y_pred)
    ```

    In this step, you generate a confusion matrix by supplying `y_val`, the actual classes, and `y_pred`, the predicted classes.

    The output should look similar to the following:

    ![](./images/B15019_06_26.jpg)



More on the Confusion Matrix
----------------------------

The confusion matrix helps you analyze the impact of the choices you would have to make if you put the model into production. Let\'s consider the example of predicting the presence of a disease based on the inputs to the model. This is a binary classification problem, where 1 implies that the disease is present and 0 implies the disease is absent. The confusion matrix for this model would have two columns and two rows.

The first row would show the items that actually belong to class **0**. Within that row, the first column would show the items that were correctly classified as **0**; these are called `true negatives`. The second column would show the items that were wrongly classified as **1** but should have been **0**. These are `false positives`.

The second row would show the items that actually belong to class **1**. The first column would show the items that were wrongly classified as **0** when they should have been **1**; these are called `false negatives`. Finally, the second column shows the items that were correctly classified as **1** and are called `true positives`.

False positives are the cases in which the samples were wrongly predicted to be infected when they are actually healthy. The implication of this is that these cases would be treated for a disease that they do not have.

False negatives are the cases that were wrongly predicted to be healthy when they actually have the disease. The implication of this is that these cases would not be treated for a disease that they actually have.

The question you need to ask about this model depends on the nature of the disease and requires domain expertise about the disease. For example, if the disease is contagious, then the untreated cases will be released into the general population and could infect others. What would be the implication of this versus placing cases into quarantine and observing them for symptoms?

On the other hand, if the disease is not contagious, the question becomes that of the implications of treating people for a disease they do not have versus the implications of not treating cases of a disease.

It should be clear that there isn\'t a definite answer to these questions. The model would need to be tuned to provide performance that is acceptable to the users.



Precision
---------

Precision was introduced in *Lab 3, Binary Classification*; however, we will be looking at it in more detail in this lab. The precision is the total number of cases that were correctly classified as positive (called **true positives** and abbreviated as **TP**) divided by the total number of cases that were predicted as positive (that is, the total number of entries in that prediction\'s column of the confusion matrix, both correctly classified (TP) and wrongly classified (FP)). Suppose 10 entries were classified as positive. If 7 of the entries were actually positive, then TP would be 7 and FP would be 3. The precision would, therefore, be 0.7. The equation is given as follows:

![](./images/B15019_06_27.jpg)

Caption: Equation for precision

In the preceding equation:

- `tp` is true positive -- the number of predictions that were correctly classified as belonging to that class.
- `fp` is false positive -- the number of predictions that were wrongly classified as belonging to that class.

The function in `sklearn.metrics` to compute precision is called `precision_score`. Go ahead and give it a try.



Exercise 6.07: Computing Precision for the Classification Model
---------------------------------------------------------------

In this exercise, you will be computing the precision for the classification model you trained in *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*.

Note

You should continue this exercise in the same notebook as that used in *Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.* If you wish to use a new notebook, make sure you copy and run the entire code from *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*, and then begin with the execution of the code of this exercise.

The following steps will help you achieve the task:

1. Import the required libraries:

    ```
    from sklearn.metrics import precision_score
    ```

    In this step, you import `precision_score` from `sklearn.metrics`.

2. Next, compute the precision score as shown in the following code snippet:

    ```
    precision_score(y_val, y_pred, average='macro')
    ```

    In this step, you compute the precision score using `precision_score`.

    The output is a floating-point number between 0 and 1.
It might look like this:

    ![](./images/B15019_06_28.jpg)



Recall
------

Recall is the total number of true positives divided by the total number of cases that actually belong to that class, both correctly classified (TP) and wrongly classified (FN). Think of it as the true positives divided by the sum of the entries in that class\'s row of the confusion matrix. The equation is given as follows:

![](./images/B15019_06_29.jpg)

Caption: Equation for recall

The function for this is `recall_score`, which is available from `sklearn.metrics`.



Exercise 6.08: Computing Recall for the Classification Model
------------------------------------------------------------

The goal of this exercise is to compute the recall for the classification model you trained in *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*.

Note

You should continue this exercise in the same notebook as that used in *Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.* If you wish to use a new notebook, make sure you copy and run the entire code from *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*, and then begin with the execution of the code of this exercise.

The following steps will help you accomplish the task:

1. Continue in the same Colab notebook that you used for *Exercise 6.05*.

2. Now, import the required libraries:

    ```
    from sklearn.metrics import recall_score
    ```

    In this step, you import `recall_score` from `sklearn.metrics`. This is the function that you will make use of in the next step.

3. Compute the recall:

    ```
    recall_score(y_val, y_pred, average='macro')
    ```

    In this step, you compute the recall by using `recall_score`. You need to specify `y_val` and `y_pred` as parameters to the function. The documentation for `recall_score` explains the values that you can supply to `average`. If your model does binary prediction and the labels are `0` and `1`, you can set `average` to `binary`. Other options are `micro`, `macro`, `weighted`, and `samples`. You should read the documentation to see what they do.

    You should get an output that looks like the following:

    ![](./images/B15019_06_30.jpg)

Caption: Recall score

Note

The recall score can vary, depending on the data.

As you can see, we have calculated the recall score in the exercise, which is `0.622`. This means that, averaged across the classes, about `62%` of the actual members of each class were correctly identified. On its own, this value might not mean much until it is compared to the recall score from another model.

Let\'s now move toward calculating the F1 score, which also helps greatly in evaluating model performance, which in turn aids in making better decisions when choosing models.



F1 Score
--------

The F1 score is another important metric that helps us to evaluate model performance. It considers the contribution of both precision and recall using the following equation:

![](./images/B15019_06_31.jpg)

Caption: F1 score

The F1 score ranges from 0 to 1, with 1 being the best possible score. You compute the F1 score using `f1_score` from `sklearn.metrics`.



Exercise 6.09: Computing the F1 Score for the Classification Model
------------------------------------------------------------------

In this exercise, you will compute the F1 score for the classification model you trained in *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*.

Note

You should continue this exercise in the same notebook as that used in *Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.* If you wish to use a new notebook, make sure you copy and run the entire code from *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*, and then begin with the execution of the code of this exercise.

The following steps will help you accomplish the task:

1. Continue in the same Colab notebook that you used for *Exercise 6.05*.

2. Import the necessary modules:

    ```
    from sklearn.metrics import f1_score
    ```

    In this step, you import the `f1_score` function from `sklearn.metrics`. This function will let you compute the F1 score.

3. Compute the F1 score:

    ```
    f1_score(y_val, y_pred, average='macro')
    ```

    In this step, you compute the F1 score by passing in `y_val` and `y_pred`. You also specify `average='macro'` because this is not binary classification.

    You should get an output similar to the following:

    ![](./images/B15019_06_32.jpg)

Caption: F1 score


By the end of this exercise, you will see that the `F1` score we achieved is `0.6746`. There is a lot of room for improvement, and you would engineer new features and train a new model to try and get a better F1 score.



Accuracy
--------

Accuracy is an evaluation metric that is applied to classification models. It is computed by counting the number of labels that were correctly predicted (meaning that the predicted label is exactly the same as the ground truth) and dividing that count by the total number of predictions. The `accuracy_score()` function exists in `sklearn.metrics` to provide this value.



Exercise 6.10: Computing Model Accuracy for the Classification Model
--------------------------------------------------------------------

The goal of this exercise is to compute the accuracy score of the model trained in *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*.

Note

You should continue this exercise in the same notebook as that used in *Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.* If you wish to use a new notebook, make sure you copy and run the entire code from *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*, and then begin with the execution of the code of this exercise.

The following steps will help you accomplish the task:

1. Continue from where the code for *Exercise 6.05*, *Creating a Classification Model for Computing Evaluation Metrics*, ends in your notebook.

2. Import `accuracy_score()`:

    ```
    from sklearn.metrics import accuracy_score
    ```

    In this step, you import `accuracy_score()`, which you will use to compute the model accuracy.

3. Compute the accuracy:

    ```
    _accuracy = accuracy_score(y_val, y_pred)
    print(_accuracy)
    ```

    In this step, you compute the model accuracy by passing in `y_val` and `y_pred` as parameters to `accuracy_score()`. The interpreter assigns the result to a variable called `_accuracy`. The `print()` method causes the interpreter to render the value of `_accuracy`.

    The result is similar to the following:

    ![](./images/B15019_06_33.jpg)

Thus, we have successfully calculated the accuracy of the model as being `0.876`. The goal of this exercise is to show you how to compute the accuracy of a model and to compare this accuracy value to that of another model that you will train in the future.
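
Because accuracy is simply the fraction of predictions that exactly match the ground truth, you can verify what `accuracy_score()` does with a few lines of NumPy. The following minimal sketch uses a handful of invented labels purely for illustration:

```
import numpy as np
from sklearn.metrics import accuracy_score

# made-up ground truths and predictions for a multi-class problem
y_true = np.array(['unacc', 'acc', 'good', 'unacc', 'acc', 'unacc'])
y_pred = np.array(['unacc', 'acc', 'acc', 'unacc', 'unacc', 'unacc'])

# accuracy by hand: count the exact matches and divide by the number of labels
manual_accuracy = np.mean(y_true == y_pred)

# the same value computed by scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)  # both print roughly 0.667 (4 of 6 correct)
```

The two values agree, which is a quick way to convince yourself of what the metric measures before you use it to compare models.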
+ + + +Logarithmic Loss +---------------- + +The logarithmic loss (or log loss) is the loss function for categorical +models. It is also called categorical cross-entropy. It seeks to +penalize incorrect predictions. The `sklearn` documentation +defines it as \"the negative log-likelihood of the true values given +your model predictions.\" + + + +Exercise 6.11: Computing the Log Loss for the Classification Model +------------------------------------------------------------------ + +The goal of this exercise is to predict the log loss of the model +trained in *Exercise 6.05*, *Creating a Classification Model for +Computing Evaluation Metrics*. + +Note + +You should continue this exercise in the same notebook as that used in +*Exercise 6.05, Creating a Classification Model for Computing Evaluation +Metrics.* If you wish to use a new notebook, make sure you copy and run +the entire code from *Exercise 6.05* and then begin with the execution +of the code of this exercise. + +The following steps will help you accomplish the task: + +1. Open your Colab notebook and continue from where *Exercise 6.05*, + *Creating a Classification Model for Computing Evaluation Metrics*, + stopped. + +2. Import the required libraries: + + ``` + from sklearn.metrics import log_loss + ``` + + + In this step, you import `log_loss()` from + `sklearn.metrics`. + +3. Compute the log loss: + ``` + _loss = log_loss(y_val, model.predict_proba(X_val)) + print(_loss) + ``` + + +In this step, you compute the log loss and store it in a variable called +`_loss`. You need to observe something very important: +previously, you made use of `y_val`, the ground truths, and +`y_pred`, the predictions. + +In this step, you do not make use of predictions. Instead, you make use +of predicted probabilities. You see that in the code where you specify +`model.predict_proba()`. You specify the validation dataset +and it returns the predicted probabilities. + +The `print()` function causes the interpreter to render the +log loss. + +This should look like the following: + +![](./images/B15019_06_34.jpg) + + + + +Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem +----------------------------------------------------------------------------------- + +The goal of this exercise is to plot the ROC curve for a binary +classification problem. The data for this problem is used to predict +whether or not a mother will require a caesarian section to give birth. + + + +From the UCI Machine Learning Repository, the abstract for this dataset +follows: \"This dataset contains information about caesarian section +results of 80 pregnant women with the most important characteristics of +delivery problems in the medical field.\" The attributes of interest are +age, delivery number, delivery time, blood pressure, and heart status. + +The following steps will help you accomplish this task: + +1. Open a Colab notebook file. + +2. Import the required libraries: + + ``` + # import libraries + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import roc_curve + from sklearn.metrics import auc + ``` + + + In this step, you import `pandas`, which you will use to + read in data. You also import `train_test_split` for + creating training and validation datasets, and + `LogisticRegression` for creating a model. + +3. 
Read in the data: + + ``` + # data doesn't have headers, so let's create headers + _headers = ['Age', 'Delivery_Nbr', 'Delivery_Time', \ + 'Blood_Pressure', 'Heart_Problem', 'Caesarian'] + # read in cars dataset + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/caesarian.csv.arff',\ + names=_headers, index_col=None, skiprows=15) + df.head() + # target column is 'Caesarian' + ``` + + + +![](./images/B15019_06_35.jpg) + + + Caption: Reading the dataset + + You will need to do a few things to work with this file. Skip 15 + rows and specify the column headers and read the file without an + index. + + The code shows how you do that by creating a Python list to hold + your column headers and then read in the file using + `read_csv()`. The parameters that you pass in are the + file\'s location, the column headers as a Python list, the name of + the index column (in this case, it is None), and the number of rows + to skip. + + The `head()` method will print out the top five rows and + should look similar to the following: + + +![](./images/B15019_06_36.jpg) + + + Caption: The top five rows of the DataFrame + +4. Split the data: + + ``` + # target column is 'Caesarian' + features = df.drop(['Caesarian'], axis=1).values + labels = df[['Caesarian']].values + # split 80% for training and 20% into an evaluation set + X_train, X_eval, y_train, y_eval = train_test_split\ + (features, labels, \ + test_size=0.2, \ + random_state=0) + """ + further split the evaluation set into validation and test sets + of 10% each + """ + X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\ + test_size=0.5, \ + random_state=0) + ``` + + + In this step, you begin by creating two `numpy` arrays, + which you call `features` and `labels`. You then + split these arrays into a `training` and an + `evaluation` dataset. You further split the + `evaluation` dataset into `validation` and + `test` datasets. + +5. Now, train and fit a logistic regression model: + + ``` + model = LogisticRegression() + model.fit(X_train, y_train) + ``` + + + In this step, you begin by creating an instance of a logistic + regression model. You then proceed to train or fit the model on the + training dataset. + + The output should be similar to the following: + + +![](./images/B15019_06_37.jpg) + + + Caption: Training a logistic regression model + +6. Predict the probabilities, as shown in the following code snippet: + + ``` + y_proba = model.predict_proba(X_val) + ``` + + + In this step, the model predicts the probabilities for each entry in + the validation dataset. It stores the results in + `y_proba`. + +7. Compute the true positive rate, the false positive rate, and the + thresholds: + + ``` + _false_positive, _true_positive, _thresholds = roc_curve\ + (y_val, \ + y_proba[:, 0]) + ``` + + + In this step, you make a call to `roc_curve()` and specify + the ground truth and the first column of the predicted + probabilities. The result is a tuple of false positive rate, true + positive rate, and thresholds. + +8. Explore the false positive rates: + + ``` + print(_false_positive) + ``` + + + In this step, you instruct the interpreter to print out the false + positive rate. The output should be similar to the following: + + +![](./images/B15019_06_38.jpg) + + + Caption: False positive rates + + Note + + The false positive rates can vary, depending on the data. + +9. 
Explore the true positive rates: + + ``` + print(_true_positive) + ``` + + + In this step, you instruct the interpreter to print out the true + positive rates. This should be similar to the following: + + +![](./images/B15019_06_39.jpg) + + + Caption: True positive rates + +10. Explore the thresholds: + + ``` + print(_thresholds) + ``` + + + In this step, you instruct the interpreter to display the + thresholds. The output should be similar to the following: + + +![](./images/B15019_06_40.jpg) + + + Caption: Thresholds + +11. Now, plot the ROC curve: + + ``` + # Plot the RoC + import matplotlib.pyplot as plt + %matplotlib inline + plt.plot(_false_positive, _true_positive, lw=2, \ + label='Receiver Operating Characteristic') + plt.xlim(0.0, 1.2) + plt.ylim(0.0, 1.2) + plt.xlabel('False Positive Rate') + plt.ylabel('True Positive Rate') + plt.title('Receiver Operating Characteristic') + plt.show() + ``` + + The output should look similar to the following: + + +![](./images/B15019_06_41.jpg) + + +Caption: ROC curve + + + +Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset +-------------------------------------------------------------- + +The goal of this exercise is to compute the ROC AUC for the binary +classification model that you trained in *Exercise 6.12*, *Computing and +Plotting ROC Curve for a Binary Classification Problem*. + +Note + +You should continue this exercise in the same notebook as that used in +*Exercise 6.12, Computing and Plotting ROC Curve for a Binary +Classification Problem.* If you wish to use a new notebook, make sure +you copy and run the entire code from *Exercise 6.12* and then begin +with the execution of the code of this exercise. + +The following steps will help you accomplish the task: + +1. Open a Colab notebook to the code for *Exercise 6.12*, *Computing + and Plotting ROC Curve for a Binary Classification Problem,* and + continue writing your code. + +2. Predict the probabilities: + + ``` + y_proba = model.predict_proba(X_val) + ``` + + + In this step, you compute the probabilities of the classes in the + validation dataset. You store the result in `y_proba`. + +3. Compute the ROC AUC: + + ``` + from sklearn.metrics import roc_auc_score + _auc = roc_auc_score(y_val, y_proba[:, 0]) + print(_auc) + ``` + + + In this step, you compute the ROC AUC and store the result in + `_auc`. You then proceed to print this value out. The + result should look similar to the following: + + +![](./images/B15019_06_42.jpg) + + +Caption: Computing the ROC AUC + +Note + +The AUC can be different, depending on the data. + + + +Saving and Loading Models +========================= + + +You will eventually need to transfer some of the models you have trained +to a different computer so they can be put into production. There are +various utilities for doing this, but the one we will discuss is called +`joblib`. + +`joblib` supports saving and loading models, and it saves the +models in a format that is supported by other machine learning +architectures, such as `ONNX`. + +`joblib` is found in the `sklearn.externals` module. + + + +Exercise 6.14: Saving and Loading a Model +----------------------------------------- + +In this exercise, you will train a simple model and use it for +prediction. You will then proceed to save the model and then load it +back in. You will use the loaded model for a second prediction, and then +compare the predictions from the first model to those from the second +model. You will make use of the car dataset for this exercise. 
+ +The following steps will guide you toward the goal: + +1. Open a Colab notebook. + +2. Import the required libraries: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LinearRegression + ``` + + +3. Read in the data: + ``` + _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \ + 'MLOGP', 'response'] + # read in data + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab06/Dataset/'\ + 'qsar_fish_toxicity.csv', \ + names=_headers, sep=';') + ``` + + +4. Inspect the data: + + ``` + df.head() + ``` + + + The output should be similar to the following: + + +![](./images/B15019_06_43.jpg) + + + Caption: Inspecting the first five rows of the DataFrame + +5. Split the data into `features` and `labels`, and + into training and validation sets: + ``` + features = df.drop('response', axis=1).values + labels = df[['response']].values + X_train, X_eval, y_train, y_eval = train_test_split\ + (features, labels, \ + test_size=0.2, \ + random_state=0) + X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\ + random_state=0) + ``` + + +6. Create a linear regression model: + + ``` + model = LinearRegression() + print(model) + ``` + + + The output will be as follows: + + +![](./images/B15019_06_44.jpg) + + + Caption: Training a linear regression model + +7. Fit the training data to the model: + ``` + model.fit(X_train, y_train) + ``` + + +8. Use the model for prediction: + ``` + y_pred = model.predict(X_val) + ``` + + +9. Import `joblib`: + ``` + from sklearn.externals import joblib + ``` + + +10. Save the model: + + ``` + joblib.dump(model, './model.joblib') + ``` + + + The output should be similar to the following: + + +![](./images/B15019_06_45.jpg) + + + Caption: Saving the model + +11. Load it as a new model: + ``` + m2 = joblib.load('./model.joblib') + ``` + + +12. Use the new model for predictions: + ``` + m2_preds = m2.predict(X_val) + ``` + + +13. Compare the predictions: + + ``` + ys = pd.DataFrame(dict(predicted=y_pred.reshape(-1), \ + m2=m2_preds.reshape(-1))) + ys.head() + ``` + + + The output should be similar to the following: + + +![](./images/B15019_06_46.jpg) + + +Caption: Comparing predictions + + + +Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model +-------------------------------------------------------------------------------------------------------- + +You work as a data scientist at a bank. The bank would like to implement +a model that predicts the likelihood of a customer purchasing a term +deposit. The bank provides you with a dataset, which is the same as the +one in *Lab 3*, *Binary Classification*. You have previously learned +how to train a logistic regression model for binary classification. +You have also heard about other non-parametric modeling techniques and +would like to try out a decision tree as well as a random forest to see +how well they perform against the logistic regression models you have +been training. + +In this activity, you will train a logistic regression model and compute +a classification report. You will then proceed to train a decision tree +classifier and compute a classification report. You will compare the +models using the classification reports. Finally, you will train a +random forest classifier and generate the classification report. 
You +will then compare the logistic regression model with the random forest +using the classification reports to determine which model you should put +into production. + +The steps to accomplish this task are: + +1. Open a Colab notebook. + +2. Load the necessary libraries. + +3. Read in the data. + +4. Explore the data. + +5. Convert categorical variables using + `pandas.get_dummies()`. + +6. Prepare the `X` and `y` variables. + +7. Split the data into training and evaluation sets. + +8. Create an instance of `LogisticRegression`. + +9. Fit the training data to the `LogisticRegression` model. + +10. Use the evaluation set to make a prediction. + +11. Use the prediction from the `LogisticRegression` model to + compute the classification report. + +12. Create an instance of `DecisionTreeClassifier`: + ``` + dt_model = DecisionTreeClassifier(max_depth= 6) + ``` + + +13. Fit the training data to the `DecisionTreeClassifier` + model: + ``` + dt_model.fit(train_X, train_y) + ``` + + +14. Using the `DecisionTreeClassifier` model, make a + prediction on the evaluation dataset: + ``` + dt_preds = dt_model.predict(val_X) + ``` + + +15. Use the prediction from the `DecisionTreeClassifier` model + to compute the classification report: + + ``` + dt_report = classification_report(val_y, dt_preds) + print(dt_report) + ``` + + + Note + + We will be studying decision trees in detail in *Lab 7, The + Generalization of Machine Learning Models*. + +16. Compare the classification report from the linear regression model + and the classification report from the decision tree classifier to + determine which is the better model. + +17. Create an instance of `RandomForestClassifier`. + +18. Fit the training data to the `RandomForestClassifier` + model. + +19. Using the `RandomForestClassifier` model, make a + prediction on the evaluation dataset. + +20. Using the prediction from the random forest classifier, compute the + classification report. + +21. Compare the classification report from the linear regression model + with the classification report from the random forest classifier to + decide which model to keep or improve upon. + +22. Compare the R[2] scores of all three models. The + output should be similar to the following: + +![](./images/B15019_06_47.jpg) + + + + +Summary +======= + +In this lab we observed that some of the evaluation metrics for +classification models require a binary classification model. We saw that +when we worked with more than two classes, we were required to use the +one-versus-all approach. The one-versus-all approach builds one model +for each class and tries to predict the probability that the input +belongs to a specific class. We saw that once this was done, we then +predicted that the input belongs to the class where the model has the +highest prediction probability. We also split our evaluation dataset +into two, it\'s because `X_test` and `y_test` are +used once for a final evaluation of the model\'s performance. You +can make use of them before putting your model into production to see +how the model would perform in a production environment. diff --git a/lab_guides/Lab_7.md b/lab_guides/Lab_7.md new file mode 100644 index 0000000..1a89366 --- /dev/null +++ b/lab_guides/Lab_7.md @@ -0,0 +1,2919 @@ + +7. 
The Generalization of Machine Learning Models +================================================ + + + +Overview + +This lab will teach you how to make use of the data you have to +train better models by either splitting your data if it is sufficient or +making use of cross-validation if it is not. By the end of this lab, +you will know how to split your data into training, validation, and test +datasets. You will be able to identify the ratio in which data has to be +split and also consider certain features while splitting. You will also +be able to implement cross-validation to use limited data for testing +and use regularization to reduce overfitting in models. + + +Introduction +============ + + +In the previous lab, you learned about model assessment using +various metrics such as R2 score, MAE, and accuracy. These metrics help +you decide which models to keep and which ones to discard. In this +lab, you will learn some more techniques for training better models. + +Generalization deals with getting your models to perform well enough on +data points that they have not encountered in the past (that is, during +training). We will address two specific areas: + +- How to make use of as much of your data as possible to train a model +- How to reduce overfitting in a model + + +Overfitting +=========== + + +A model is said to overfit the training data when it generates a +hypothesis that accounts for every example. What this means is that it +correctly predicts the outcome of every example. The problem with this +scenario is that the model equation becomes extremely complex, and such +models have been observed to be incapable of correctly predicting new +observations. + +Overfitting occurs when a model has been over-engineered. Two of the +ways in which this could occur are: + +- The model is trained on too many features. +- The model is trained for too long. + +We\'ll discuss each of these two points in the following sections. + + + +Training on Too Many Features +----------------------------- + +When a model trains on too many features, the hypothesis becomes +extremely complicated. Consider a case in which you have one column of +features and you need to generate a hypothesis. This would be a simple +linear equation, as shown here: + +![](./images/B15019_07_01.jpg) + +Caption: Equation for a hypothesis for a line + +Now, consider a case in which you have two columns, and in which you +cross the columns by multiplying them. The hypothesis becomes the +following: + +![](./images/B15019_07_02.jpg) + +Caption: Equation for a hypothesis for a curve + +While the first equation yields a line, the second equation yields a +curve, because it is now a quadratic equation. But the same two features +could become even more complicated depending on how you engineer your +features. Consider the following equation: + +![](./images/B15019_07_03.jpg) + +Caption: Cubic equation for a hypothesis + +The same set of features has now given rise to a cubic equation. This +equation will have the property of having a large number of weights, for +example: + +- The simple linear equation has one weight and one bias. +- The quadratic equation has three weights and one bias. +- The cubic equation has five weights and one bias. + +One solution to overfitting as a result of too many features is to +eliminate certain features. The technique for this is called lasso +regression. + +A second solution to overfitting as a result of too many features is to +provide more data to the model. 
This might not always be a feasible +option, but where possible, it is always a good idea to do so. + + + +Training for Too Long +--------------------- + +The model starts training by initializing the vector of weights such +that all values are equal to zero. During training, the weights are +updated according to the gradient update rule. This systematically adds +or subtracts a small value to each weight. As training progresses, the +magnitude of the weights increases. If the model trains for too long, +these model weights become too large. + +The solution to overfitting as a result of large weights is to reduce +the magnitude of the weights to as close to zero as possible. The +technique for this is called ridge regression. + + +Underfitting +============ + + +Consider an alternative situation in which the data has 10 features, but +you only make use of 1 feature. Your model hypothesis would still be the +following: + +![](./images/B15019_07_04.jpg) + +Caption: Equation for a hypothesis for a line + +However, that is the equation of a straight line, but your model is +probably ignoring a lot of information. The model is over-simplified and +is said to underfit the data. + +The solution to underfitting is to provide the model with more features, +or conversely, less data to train on; but more features is the better +approach. + + +Data +==== + + +In the world of machine learning, the data that you have is not used in +its entirety to train your model. Instead, you need to separate your +data into three sets, as mentioned here: + +- A training dataset, which is used to train your model and measure + the training loss. +- An evaluation or validation dataset, which you use to measure the + validation loss of the model to see whether the validation loss + continues to reduce as well as the training loss. +- A test dataset for final testing to see how well the model performs + before you put it into production. + + + +The Ratio for Dataset Splits +---------------------------- + +The evaluation dataset is set aside from your entire training data and +is never used for training. There are various schools of thought around +the particular ratio that is set aside for evaluation, but it generally +ranges from a high of 30% to a low of 10%. This evaluation dataset is +normally further split into a validation dataset that is used during +training and a test dataset that is used at the end for a sanity check. +If you are using 10% for evaluation, you might set 5% aside for +validation and the remaining 5% for testing. If using 30%, you might set +20% aside for validation and 10% for testing. + +To summarize, you might split your data into 70% for training, 20% for +validation, and 10% for testing, or you could split your data into 80% +for training, 15% for validation, and 5% for test. Or, finally, you +could split your data into 90% for training, 5% for validation, and 5% +for testing. + +The choice of what ratio to use is dependent on the amount of data that +you have. If you are working with 100,000 records, for example, then 20% +validation would give you 20,000 records. However, if you were working +with 100,000,000 records, then 5% would give you 5 million records for +validation, which would be more than sufficient. + + + +Creating Dataset Splits +----------------------- + +At a very basic level, splitting your data involves random sampling. +Let\'s say you have 10 items in a bowl. To get 30% of the items, you +would reach in and take any 3 items at random. 
+ +In the same way, because you are writing code, you could do the +following: + +1. Create a Python list. +2. Place 10 numbers in the list. +3. Generate 3 non-repeating random whole numbers from 0 to 9. +4. Pick items whose indices correspond to the random numbers + previously generated. + +![](./images/B15019_07_05.jpg) + + +Caption: Visualization of data splitting + +This is something you will only do once for a particular dataset. You +might write a function for it. If it is something that you need to do +repeatedly and you also need to handle advanced functionality, you might +want to write a class for it. + +`sklearn` has a class called `train_test_split`, +which provides the functionality for splitting data. It is available as +`sklearn.model_selection.train_test_split`. This function will +let you split a DataFrame into two parts. + +Have a look at the following exercise on importing and splitting data. + + + +Exercise 7.01: Importing and Splitting Data +------------------------------------------- + +The goal of this exercise is to import data from a repository and to +split it into a training and an evaluation set. +We will be using the Cars dataset from the UCI Machine Learning +Repository. + +This dataset is about the cost of owning cars with certain attributes. +The abstract from the website states: \"*Derived from simple +hierarchical decision model, this database may be useful for testing +constructive induction and structure discovery methods*.\" Here are some +of the key attributes of this dataset: + +``` +CAR car acceptability +. PRICE overall price +. . buying buying price +. . maint price of the maintenance +. TECH technical characteristics +. . COMFORT comfort +. . . doors number of doors +. . . persons capacity in terms of persons to carry +. . . lug_boot the size of luggage boot +. . safety estimated safety of the car +``` + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook file. + +2. Import the necessary libraries: + + ``` + # import libraries + import pandas as pd + from sklearn.model_selection import train_test_split + ``` + + + In this step, you have imported `pandas` and aliased it as + `pd`. As you know, `pandas` is required to read + in the file. You also import `train_test_split` from + `sklearn.model_selection` to split the data into two + parts. + +3. Before reading the file into your notebook, open and inspect the + file (`car.data`) with an editor. You should see an output + similar to the following: + + +![](./images/B15019_07_06.jpg) + + + Caption: Car data + + You will notice from the preceding screenshot that the file doesn\'t + have a first row containing the headers. + +4. Create a Python list to hold the headers for the data: + ``` + # data doesn't have headers, so let's create headers + _headers = ['buying', 'maint', 'doors', 'persons', \ + 'lug_boot', 'safety', 'car'] + ``` + + +5. Now, import the data as shown in the following code snippet: + + ``` + # read in cars dataset + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab07/Dataset/car.data', \ + names=_headers, index_col=None) + ``` + + + You then proceed to import the data into a variable called + `df` by using `pd.read_csv`. You specify the + location of the data file, as well as the list of column headers. + You also specify that the data does not have a column index. + +6. 
Show the top five records: + + ``` + df.info() + ``` + + + In order to get information about the columns in the data as well as + the number of records, you make use of the `info()` + method. You should get an output similar to the following: + + +![](./images/B15019_07_07.jpg) + + + Caption: The top five records of the DataFrame + + The `RangeIndex` value shows the number of records, which + is `1728`. + +7. Now, you need to split the data contained in `df` into a + training dataset and an evaluation dataset: + + ``` + #split the data into 80% for training and 20% for evaluation + training_df, eval_df = train_test_split(df, train_size=0.8, \ + random_state=0) + ``` + + + In this step, you make use of `train_test_split` to create + two new DataFrames called `training_df` and + `eval_df`. + + You specify a value of `0.8` for `train_size` so + that `80%` of the data is assigned to + `training_df`. + + `random_state` ensures that your experiments are + reproducible. Without `random_state`, the data is split + differently every time using a different random number. With + `random_state`, the data is split the same way every time. + We will be studying `random_state` in depth in the next + lab. + +8. Check the information of `training_df`: + + ``` + training_df.info() + ``` + + + In this step, you make use of `.info()` to get the details + of `training_df`. This will print out the column names as + well as the number of records. + + You should get an output similar to the following: + + +![](./images/B15019_07_08.jpg) + + + Caption: Information on training\_df + + You should observe that the column names match those in + `df`, but you should have `80%` of the records + that you did in `df`, which is `1382` out of + `1728`. + +9. Check the information on `eval_df`: + + ``` + eval_df.info() + ``` + + + In this step, you print out the information about + `eval_df`. This will give you the column names and the + number of records. The output should be similar to the following: + + +![](./images/B15019_07_09.jpg) + + +Caption: Information on eval\_df + + + +**Random State** + +![](./images/B15019_07_10.jpg) + +Caption: Numbers generated using random state + + + +Exercise 7.02: Setting a Random State When Splitting Data +--------------------------------------------------------- + +The goal of this exercise is to have a reproducible way of splitting the +data that you imported in *Exercise 7.01*, *Importing and Splitting +Data*. + +Note + +We going to refactor the code from the previous exercise. Hence, if you +are using a new Colab notebook then make sure you copy the code from the +previous exercise. Alternatively, you can make a copy of the notebook +used in *Exercise 7.01* and use the revised the code as suggested in the +following steps. + +The following steps will help you complete the exercise: + +1. Continue from the previous *Exercise 7.01* notebook. + +2. Set the random state as `1` and split the data: + + ``` + """ + split the data into 80% for training and 20% for evaluation + using a random state + """ + training_df, eval_df = train_test_split(df, train_size=0.8, \ + random_state=1) + ``` + + + In this step, you specify a `random_state` value of 1 to + the `train_test_split` function. + +3. Now, view the top five records in `training_df`: + + ``` + #view the head of training_eval + training_df.head() + ``` + + + In this step, you print out the first five records in + `training_df`. 
+ + The output should be similar to the following: + + +![](./images/B15019_07_11.jpg) + + + Caption: The top five rows for the training evaluation set + +4. View the top five records in `eval_df`: + + ``` + #view the top of eval_df + eval_df.head() + ``` + + + In this step, you print out the first five records in + `eval_df`. + + The output should be similar to the following: + + +![](./images/B15019_07_12.jpg) + + + + +Cross-Validation +================ + + +Consider an example where you split your data into five parts of 20% +each. You would then make use of four parts for training and one part +for evaluation. Because you have five parts, you can make use of the +data five times, each time using one part for validation and the +remaining data for training. + +![](./images/B15019_07_13.jpg) + +Caption: Cross-validation + + +Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset +------------------------------------------------------------ + +The goal of this exercise is to create a five-fold cross-validation +dataset from the data that you imported in *Exercise 7.01*, *Importing +and Splitting Data*. + +Note + +If you are using a new Colab notebook then make sure you copy the code +from *Exercise 7.01*, *Importing and Splitting Data*. Alternatively, you +can make a copy of the notebook used in *Exercise 7.01* and then use the +code as suggested in the following steps. + +The following steps will help you complete the exercise: + +1. Continue from the notebook file of *Exercise 7.01.* + +2. Import all the necessary libraries: + + ``` + from sklearn.model_selection import KFold + ``` + + + In this step, you import `KFold` from + `sklearn.model_selection`. + +3. Now create an instance of the class: + + ``` + _kf = KFold(n_splits=5) + ``` + + + In this step, you create an instance of `KFold` and assign + it to a variable called `_kf`. You specify a value of + `5` for the `n_splits` parameter so that it + splits the dataset into five parts. + +4. Now split the data as shown in the following code snippet: + + ``` + indices = _kf.split(df) + ``` + + + In this step, you call the `split` method, which is + `.split()` on `_kf`. The result is stored in a + variable called `indices`. + +5. Find out what data type `indices` has: + + ``` + print(type(indices)) + ``` + + + In this step, you inspect the call to split the output returns. + + The output should be a `generator`, as seen in the + following output: + + +![](./images/B15019_07_14.jpg) + + + Caption: Data type for indices + +6. Get the first set of indices: + + ``` + #first set + train_indices, val_indices = next(indices) + ``` + + + In this step, you make use of the `next()` Python function + on the generator function. Using `next()` is the way that + you get a generator to return results to you. You asked for five + splits, so you can call `next()` five times on this + particular generator. Calling `next()` a sixth time will + cause the Python runtime to raise an exception. + + The call to `next()` yields a tuple. In this case, it is a + pair of indices. The first one contains your training indices and + the second one contains your validation indices. You assign these to + `train_indices` and `val_indices`. + +7. Create a training dataset as shown in the following code snippet: + + ``` + train_df = df.drop(val_indices) + train_df.info() + ``` + + + In this step, you create a new DataFrame called `train_df` + by dropping the validation indices from `df`, the + DataFrame that contains all of the data. 
This is a subtractive + operation similar to what is done in set theory. The `df` + set is a union of `train` and `val`. Once you + know what `val` is, you can work backward to determine + `train` by subtracting `val` from + `df`. If you consider `df` to be a set called + `A`, `val` to be a set called `B`, and + train to be a set called `C`, then the following holds + true: + + +![](./images/B15019_07_15.jpg) + + + Caption: Dataframe A + + Similarly, set `C` can be the difference between set + `A` and set `B`, as depicted in the following: + + +![](./images/B15019_07_16.jpg) + + + Caption: Dataframe C + + The way to accomplish this with a pandas DataFrame is to drop the + rows with the indices of the elements of `B` from + `A`, which is what you see in the preceding code snippet. + + You can see the result of this by calling the `info()` + method on the new DataFrame. + + The result of that call should be similar to the following + screenshot: + + +![](./images/B15019_07_17.jpg) + + + Caption: Information on the new dataframe + +8. Create a validation dataset: + + ``` + val_df = df.drop(train_indices) + val_df.info() + ``` + + + In this step, you create the `val_df` validation dataset + by dropping the training indices from the `df` DataFrame. + Again, you can see the details of this new DataFrame by calling the + `info()` method. + + The output should be similar to the following: + + +![](./images/B15019_07_18.jpg) + + +Caption: Information for the validation dataset + + +Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls +----------------------------------------------------------------------------------- + +The goal of this exercise is to create a five-fold cross-validation +dataset from the data that you imported in *Exercise 7.01*, *Importing +and Splitting Data*. You will make use of a loop for calls to the +generator function. + + +The following steps will help you complete this exercise: + +1. Open a new Colab notebook and repeat the steps you used to import + data in *Exercise 7.01*, *Importing and Splitting Data*. + +2. Define the number of splits you would like: + + ``` + from sklearn.model_selection import KFold + #define number of splits + n_splits = 5 + ``` + + + In this step, you set the number of splits to `5`. You + store this in a variable called `n_splits`. + +3. Create an instance of `Kfold`: + + ``` + #create an instance of KFold + _kf = KFold(n_splits=n_splits) + ``` + + + In this step, you create an instance of `Kfold`. You + assign this instance to a variable called `_kf`. + +4. Generate the split indices: + + ``` + #create splits as _indices + _indices = _kf.split(df) + ``` + + + In this step, you call the `split()` method on + `_kf`, which is the instance of `KFold` that you + defined earlier. You provide `df` as a parameter so that + the splits are performed on the data contained in the DataFrame + called `df`. The resulting generator is stored as + `_indices`. + +5. Create two Python lists: + + ``` + _t, _v = [], [] + ``` + + + In this step, you create two Python lists. The first is called + `_t` and holds the training DataFrames, and the second is + called `_v` and holds the validation DataFrames. + +6. 
Iterate over the generator and create DataFrames called + `train_idx`, `val_idx`, `_train_df` + and `_val_df`: + + ``` + #iterate over _indices + for i in range(n_splits): + train_idx, val_idx = next(_indices) + _train_df = df.drop(val_idx) + _t.append(_train_df) + _val_df = df.drop(train_idx) + _v.append(_val_df) + ``` + + + In this step, you create a loop using `range` to determine + the number of iterations. You specify the number of iterations by + providing `n_splits` as a parameter to + `range()`. On every iteration, you execute + `next()` on the `_indices` generator and store + the results in `train_idx` and `val_idx`. You + then proceed to create `_train_df` by dropping the + validation indices, `val_idx`, from `df`. You + also create `_val_df` by dropping the training indices + from `df`. + +7. Iterate over the training list: + + ``` + for d in _t: + print(d.info()) + ``` + + + In this step, you verify that the compiler created the DataFrames. + You do this by iterating over the list and using the + `.info()` method to print out the details of each element. + The output is similar to the following screenshot, which is + incomplete due to the size of the output. Each element in the list + is a DataFrame with 1,382 entries: + + +![](./images/B15019_07_19.jpg) + + + Caption: Iterating over the training list + + Note + + The preceding output is a truncated version of the actual output. + +8. Iterate over the validation list: + + ``` + for d in _v: + print(d.info()) + ``` + + + In this step, you iterate over the validation list and make use of + `.info()` to print out the details of each element. The + output is similar to the following screenshot, which is incomplete + due to the size. Each element is a DataFrame with 346 entries: + + +![](./images/B15019_07_20.jpg) + + + + +Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation +----------------------------------------------------------------- + +The goal of this exercise is to create a five-fold cross-validation +dataset from the data that you imported in *Exercise 7.01*, *Importing +and Splitting Data*. You will then use `cross_val_score` to +get the scores of models trained on those datasets. + + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook and repeat *steps 1-6* that you took to + import data in *Exercise 7.01*, *Importing and Splitting Data*. + +2. Encode the categorical variables in the dataset: + + ``` + # encode categorical variables + _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors', \ + 'persons', 'lug_boot', \ + 'safety']) + _df.head() + ``` + + + In this step, you make use of `pd.get_dummies()` to + convert categorical variables into an encoding. You store the result + in a new DataFrame variable called `_df`. You then proceed + to take a look at the first five records. + + The result should look similar to the following: + + +![](./images/B15019_07_21.jpg) + + + Caption: Encoding categorical variables + +3. Split the data into features and labels: + + ``` + # separate features and labels DataFrames + features = _df.drop(['car'], axis=1).values + labels = _df[['car']].values + ``` + + + In this step, you create a `features` DataFrame by + dropping `car` from `_df`. You also create + `labels` by selecting only `car` in a new + DataFrame. Here, a feature and a label are similar in the Cars + dataset. + +4. 
Create an instance of the `LogisticRegression` class to be + used later: + + ``` + from sklearn.linear_model import LogisticRegression + # create an instance of LogisticRegression + _lr = LogisticRegression() + ``` + + + In this step, you import `LogisticRegression` from + `sklearn.linear_model`. We use + `LogisticRegression` because it lets us create a + classification model, as you learned in *Lab 3, Binary + Classification*. You then proceed to create an instance and store it + as `_lr`. + +5. Import the `cross_val_score` function: + + ``` + from sklearn.model_selection import cross_val_score + ``` + + + In this step now, you import `cross_val_score`, which you + will make use of to compute the scores of the models. + +6. Compute the cross-validation scores: + + ``` + _scores = cross_val_score(_lr, features, labels, cv=5) + ``` + + + In this step, you the compute cross-validation scores and store the + result in a Python list, which you call `_scores`. You do + this using `cross_cal_score`. The function requires the + following four parameters: the model to make use of (in our case, + it\'s called `_lr`); the features of the dataset; the + labels of the dataset; and the number of cross-validation splits to + create (five, in our case). + +7. Now, display the scores as shown in the following code snippet: + + ``` + print(_scores) + ``` + + + In this step, you display the scores using `print()`. + + The output should look similar to the following: + + +![](./images/B15019_07_22.jpg) + + +Caption: Printing the cross-validation scores + + + +LogisticRegressionCV +==================== + + +`LogisticRegressionCV` is a class that implements +cross-validation inside it. This class will train multiple +`LogisticRegression` models and return the best one. + + + +Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation +-------------------------------------------------------------------------- + +The goal of this exercise is to train a logistic regression model using +cross-validation and get the optimal R2 result. We will be making use of +the Cars dataset that you worked with previously. + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Import the necessary libraries: + + ``` + # import libraries + import pandas as pd + from sklearn.model_selection import train_test_split + ``` + + + In this step, you import `pandas` and alias it as + `pd`. You will make use of pandas to read in the file you + will be working with. + +3. Create headers for the data: + + ``` + # data doesn't have headers, so let's create headers + _headers = ['buying', 'maint', 'doors', 'persons', \ + 'lug_boot', 'safety', 'car'] + ``` + + + In this step, you start by creating a Python list to hold the + `headers` column for the file you will be working with. + You store this list as `_headers`. + +4. Read the data: + + ``` + # read in cars dataset + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab07/Dataset/car.data', \ + names=_headers, index_col=None) + ``` + + + You then proceed to read in the file and store it as `df`. + This is a DataFrame. + +5. Print out the top five records: + + ``` + df.info() + ``` + + + Finally, you look at the summary of the DataFrame using + `.info()`. + + The output looks similar to the following: + + +![](./images/B15019_07_23.jpg) + + + Caption: The top five records of the dataframe + +6. 
Encode the categorical variables as shown in the following code + snippet: + + ``` + # encode categorical variables + _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors', \ + 'persons', 'lug_boot', \ + 'safety']) + _df.head() + ``` + + + In this step, you convert categorical variables into encodings using + the `get_dummies()` method from pandas. You supply the + original DataFrame as a parameter and also specify the columns you + would like to encode. + + Finally, you take a peek at the top five rows. The output looks + similar to the following: + + +![](./images/B15019_07_24.jpg) + + + Caption: Encoding categorical variables + +7. Split the DataFrame into features and labels: + + ``` + # separate features and labels DataFrames + features = _df.drop(['car'], axis=1).values + labels = _df[['car']].values + ``` + + + In this step, you create two NumPy arrays. The first, called + `features`, contains the independent variables. The + second, called `labels`, contains the values that the + model learns to predict. These are also called `targets`. + +8. Import logistic regression with cross-validation: + + ``` + from sklearn.linear_model import LogisticRegressionCV + ``` + + + In this step, you import the `LogisticRegressionCV` class. + +9. Instantiate `LogisticRegressionCV` as shown in the + following code snippet: + + ``` + model = LogisticRegressionCV(max_iter=2000, multi_class='auto',\ + cv=5) + ``` + + + In this step, you create an instance of + `LogisticRegressionCV`. You specify the following + parameters: + + `max_iter` : You set this to `2000` so that the + trainer continues training for `2000` iterations to find + better weights. + + `multi_class`: You set this to `auto` so that + the model automatically detects that your data has more than two + classes. + + `cv`: You set this to `5`, which is the number + of cross-validation sets you would like to train on. + +10. Now fit the model: + + ``` + model.fit(features, labels.ravel()) + ``` + + + In this step, you train the model. You pass in `features` + and `labels`. Because `labels` is a 2D array, + you make use of `ravel()` to convert it into a 1D array + or vector. + + The interpreter produces an output similar to the following: + + +![](./images/B15019_07_25.jpg) + + + Caption: Fitting the model + + In the preceding output, you see that the model fits the training + data. The output shows you the parameters that were used in + training, so you are not taken by surprise. Notice, for example, + that `max_iter` is `2000`, which is the value + that you set. Other parameters you didn\'t set make use of default + values, which you can find out more about from the documentation. + +11. Evaluate the training R2: + + ``` + print(model.score(features, labels.ravel())) + ``` + + + In this step, we make use of the training dataset to compute the R2 + score. While we didn\'t set aside a specific validation dataset, it + is important to note that the model only saw 80% of our training + data, so it still has new data to work with for this evaluation. + + The output looks similar to the following: + + +![](./images/B15019_07_26.jpg) + + +Caption: Computing the R2 score + + + +Hyperparameter Tuning with GridSearchCV +======================================= + + +`GridSearchCV` will take a model and parameters and train one +model for each permutation of the parameters. At the end of the +training, it will provide access to the parameters and the model scores. 
+This is called hyperparameter tuning and you will be looking at this in +much more depth in *Lab 8, Hyperparameter Tuning*. + +The usual practice is to make use of a small training set to find the +optimal parameters using hyperparameter tuning and then to train a final +model with all of the data. + +Before the next exercise, let\'s take a brief look at decision trees, +which are a type of model or estimator. + + + +Decision Trees +-------------- + +A decision tree works by generating a separating hyperplane or a +threshold for the features in data. It does this by considering every +feature and finding the correlation between the spread of the values in +that feature and the label that you are trying to predict. + +Consider the following data about balloons. The label you need to +predict is called `inflated`. This dataset is used for +predicting whether the balloon is inflated or deflated given the +features. The features are: + +- `color` +- `size` +- `act` +- `age` + +The following table displays the distribution of features: + +![](./images/B15019_07_27.jpg) + +Caption: Tabular data for balloon features + +Now consider the following charts, which are visualized depending on the +spread of the features against the label: + +- If you consider the `Color` feature, the values are + `PURPLE` and `YELLOW`, but the number of + observations is the same, so you can\'t infer whether the balloon is + inflated or not based on the color, as you can see in the following + figure: + +![](./images/B15019_07_28.jpg) + + +Caption: Barplot for the color feature + +- The `Size` feature has two values: `LARGE` and + `SMALL`. These are equally spread, so we can\'t infer + whether the balloon is inflated or not based on the color, as you + can see in the following figure: + +![](./images/B15019_07_29.jpg) + + +Caption: Barplot for the size feature + +- The `Act` feature has two values: `DIP` and + `STRETCH`. You can see from the chart that the majority of + the `STRETCH` values are inflated. If you had to make a + guess, you could easily say that if `Act` is + `STRETCH`, then the balloon is inflated. Consider the + following figure: + +![](./images/B15019_07_30.jpg) + + +Caption: Barplot for the act feature + +- Finally, the `Age` feature also has two values: + `ADULT` and `CHILD`. It\'s also visible from the + chart that the `ADULT` value constitutes the majority of + inflated balloons: + +![](./images/B15019_07_31.jpg) + + +Caption: Barplot for the age feature + +The two features that are useful to the decision tree are +`Act` and `Age`. The tree could start by considering +whether `Act` is `STRETCH`. If it is, the prediction +will be true. This tree would look like the following figure: + +![](./images/B15019_07_32.jpg) + +Caption: Decision tree with depth=1 + +The left side evaluates to the condition being false, and the right side +evaluates to the condition being true. This tree has a depth of 1. F +means that the prediction is false, and T means that the prediction is +true. + +To get better results, the decision tree could introduce a second level. +The second level would utilize the `Age` feature and evaluate +whether the value is `ADULT`. It would look like the following +figure: + +![](./images/B15019_07_33.jpg) + +Caption: Decision tree with depth=2 + +This tree has a depth of 2. At the first level, it predicts true if +`Act` is `STRETCH`. If `Act` is not +`STRETCH`, it checks whether `Age` is +`ADULT`. If it is, it predicts true, otherwise, it predicts +false. 
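
You can watch this logic emerge from data by fitting a shallow `DecisionTreeClassifier` yourself. The sketch below uses a handful of made-up balloon records (not the full dataset) that follow the rule just described, encodes the categorical features with `get_dummies()` as in the earlier exercises, and prints the learned splits with `export_text()`:

```
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# made-up records: inflated is T when Act is STRETCH,
# or when Act is DIP and Age is ADULT
balloons = pd.DataFrame(
    {'Color': ['YELLOW', 'YELLOW', 'YELLOW', 'YELLOW',
               'PURPLE', 'PURPLE', 'PURPLE', 'PURPLE'],
     'Size': ['SMALL', 'SMALL', 'SMALL', 'SMALL',
              'LARGE', 'LARGE', 'LARGE', 'LARGE'],
     'Act': ['STRETCH', 'STRETCH', 'DIP', 'DIP',
             'STRETCH', 'STRETCH', 'DIP', 'DIP'],
     'Age': ['ADULT', 'CHILD', 'ADULT', 'CHILD',
             'ADULT', 'CHILD', 'ADULT', 'CHILD'],
     'Inflated': ['T', 'T', 'T', 'F', 'T', 'T', 'T', 'F']})

# encode the categorical features, just as in the previous exercises
X = pd.get_dummies(balloons.drop('Inflated', axis=1))
y = balloons['Inflated']

# limit the tree to a depth of 2, like the tree in the figure above
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# print the splits the tree learned
print(export_text(tree, feature_names=list(X.columns)))
```

The printed tree splits only on the `Act` and `Age` indicator columns (the order of the two splits may vary), mirroring the depth-2 tree in the figure: `Color` and `Size` carry no information about the label, so the tree ignores them.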
+ +The decision tree can have as many levels as you like but starts to +overfit at a certain point. As with everything in data science, the +optimal depth depends on the data and is a hyperparameter, meaning you +need to try different values to find the optimal one. + +In the following exercise, we will be making use of grid search with +cross-validation to find the best parameters for a decision tree +estimator. + + + +Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model +---------------------------------------------------------------------------------------------- + +The goal of this exercise is to make use of grid search to find the best +parameters for a `DecisionTree` classifier. We will be making +use of the Cars dataset that you worked with previously. + +The following steps will help you complete the exercise: + +1. Open a Colab notebook file. + +2. Import `pandas`: + + ``` + import pandas as pd + ``` + + + In this step, you import `pandas`. You alias it as + `pd`. `Pandas` is used to read in the data you + will work with subsequently. + +3. Create `headers`: + ``` + _headers = ['buying', 'maint', 'doors', 'persons', \ + 'lug_boot', 'safety', 'car'] + ``` + + +4. Read in the `headers`: + ``` + # read in cars dataset + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab07/Dataset/car.data', \ + names=_headers, index_col=None) + ``` + + +5. Inspect the top five records: + + ``` + df.info() + ``` + + + The output looks similar to the following: + + +![](./images/B15019_07_34.jpg) + + + Caption: The top five records of the dataframe + +6. Encode the categorical variables: + + ``` + _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\ + 'persons', 'lug_boot', \ + 'safety']) + _df.head() + ``` + + + In this step, you utilize `.get_dummies()` to convert the + categorical variables into encodings. The `.head()` method + instructs the Python interpreter to output the top five columns. + + The output is similar to the following: + + +![](./images/B15019_07_35.jpg) + + + Caption: Encoding categorical variables + +7. Separate `features` and `labels`: + + ``` + features = _df.drop(['car'], axis=1).values + labels = _df[['car']].values + ``` + + + In this step, you create two `numpy` arrays, + `features` and `labels`, the first containing + independent variables or predictors, and the second containing + dependent variables or targets. + +8. Import more libraries -- `numpy`, + `DecisionTreeClassifier`, and `GridSearchCV`: + + ``` + import numpy as np + from sklearn.tree import DecisionTreeClassifier + from sklearn.model_selection import GridSearchCV + ``` + + + In this step, you import `numpy`. NumPy is a numerical + computation library. You alias it as `np`. You also import + `DecisionTreeClassifier`, which you use to create decision + trees. Finally, you import `GridSearchCV`, which will use + cross-validation to train multiple models. + +9. Instantiate the decision tree: + + ``` + clf = DecisionTreeClassifier() + ``` + + + In this step, you create an instance of + `DecisionTreeClassifier` as `clf`. This instance + will be used repeatedly by the grid search. + +10. Create parameters -- `max_depth`: + + ``` + params = {'max_depth': np.arange(1, 8)} + ``` + + + In this step, you create a dictionary of parameters. There are two + parts to this dictionary: + + The key of the dictionary is a parameter that is passed into the + model. 
In this case, `max_depth` is a parameter that + `DecisionTreeClassifier` takes. + + The value is a Python list that grid search iterates over and passes + to the model. In this case, we create an array that starts at 1 and + ends at 7, inclusive. + +11. Instantiate the grid search as shown in the following code snippet: + + ``` + clf_cv = GridSearchCV(clf, param_grid=params, cv=5) + ``` + + + In this step, you create an instance of `GridSearchCV`. + The first parameter is the model to train. The second parameter is + the parameters to search over. The third parameter is the number of + cross-validation splits to create. + +12. Now train the models: + + ``` + clf_cv.fit(features, labels) + ``` + + + In this step, you train the models using the features and labels. + Depending on the type of model, this could take a while. Because we + are using a decision tree, it trains quickly. + + The output is similar to the following: + + +![](./images/B15019_07_36.jpg) + + + Caption: Training the model + + You can learn a lot by reading the output, such as the number of + cross-validation datasets created (called `cv` and equal + to `5`), the estimator used + (`DecisionTreeClassifier`), and the parameter search space + (called `param_grid`). + +13. Print the best parameter: + + ``` + print("Tuned Decision Tree Parameters: {}"\ + .format(clf_cv.best_params_)) + ``` + + + In this step, you print out what the best parameter is. In this + case, what we were looking for was the best `max_depth`. + The output looks like the following: + + +![](./images/B15019_07_37.jpg) + + + Caption: Printing the best parameter + + In the preceding output, you see that the best performing model is + one with a `max_depth` of `2`. + + Accessing `best_params_` lets you train another model with + the best-known parameters using a larger training dataset. + +14. Print the best `R2`: + + ``` + print("Best score is {}".format(clf_cv.best_score_)) + ``` + + + In this step, you print out the `R2` score of the best + performing model. + + The output is similar to the following: + + ``` + Best score is 0.7777777777777778 + ``` + + + In the preceding output, you see that the best performing model has + an `R2` score of `0.778`. + +15. Access the best model: + + ``` + model = clf_cv.best_estimator_ + model + ``` + + + In this step, you access the best model (or estimator) using + `best_estimator_`. This will let you analyze the model, or + optionally use it to make predictions and find other metrics. + Instructing the Python interpreter to print the best estimator will + yield an output similar to the following: + + +![](./images/B15019_07_38.jpg) + + +Caption: Accessing the model + +In the preceding output, you see that the best model is +`DecisionTreeClassifier` with a `max_depth` of +`2`. + + + +Hyperparameter Tuning with RandomizedSearchCV +============================================= + + +Grid search goes over the entire search space and trains a model or +estimator for every combination of parameters. Randomized search goes +over only some of the combinations. This is a more optimal use of +resources and still provides the benefits of hyperparameter tuning and +cross-validation. You will be looking at this in depth in *Lab 8, +Hyperparameter Tuning*. + +Have a look at the following exercise. 
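
In code, the practical difference is the `n_iter` argument of `RandomizedSearchCV`, which caps how many parameter combinations are sampled from the search space (it defaults to 10). The brief sketch below uses the same style of search space as the exercise that follows; the `features` and `labels` arrays are assumed to have been prepared as in the earlier exercises:

```
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# a search space with 3 x 7 = 21 possible combinations
params = {'n_estimators': [500, 1000, 2000], \
          'max_depth': np.arange(1, 8)}

# sample only 5 of the 21 combinations instead of trying them all
clf_cv = RandomizedSearchCV(RandomForestClassifier(), \
                            param_distributions=params, \
                            n_iter=5, cv=5, random_state=0)

# clf_cv.fit(features, labels.ravel()) would run 5 x 5 = 25
# cross-validation fits, versus 21 x 5 = 105 for a full grid search
```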
+ + + +Exercise 7.08: Using Randomized Search for Hyperparameter Tuning +---------------------------------------------------------------- + +The goal of this exercise is to perform hyperparameter tuning using +randomized search and cross-validation. + +The following steps will help you complete this exercise: + +1. Open a new Colab notebook file. + +2. Import `pandas`: + + ``` + import pandas as pd + ``` + + + In this step, you import `pandas`. You will make use of it + in the next step. + +3. Create `headers`: + ``` + _headers = ['buying', 'maint', 'doors', 'persons', \ + 'lug_boot', 'safety', 'car'] + ``` + + +4. Read in the data: + ``` + # read in cars dataset + df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab07/Dataset/car.data', \ + names=_headers, index_col=None) + ``` + + +5. Check the first five rows: + + ``` + df.info() + ``` + + + You need to provide a Python list of column headers because the data + does not contain column headers. You also inspect the DataFrame that + you created. + + The output is similar to the following: + + +![](./images/B15019_07_39.jpg) + + + Caption: The top five rows of the DataFrame + +6. Encode categorical variables as shown in the following code snippet: + + ``` + _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\ + 'persons', 'lug_boot', \ + 'safety']) + _df.head() + ``` + + + In this step, you find a numerical representation of text data using + one-hot encoding. The operation results in a new DataFrame. You will + see that the resulting data structure looks similar to the + following: + + +![](./images/B15019_07_40.jpg) + + + Caption: Encoding categorical variables + +7. Separate the data into independent and dependent variables, which + are the `features` and `labels`: + + ``` + features = _df.drop(['car'], axis=1).values + labels = _df[['car']].values + ``` + + + In this step, you separate the DataFrame into two `numpy` + arrays called `features` and `labels`. + `Features` contains the independent variables, while + `labels` contains the target or dependent variables. + +8. Import additional libraries -- `numpy`, + `RandomForestClassifier`, and + `RandomizedSearchCV`: + + ``` + import numpy as np + from sklearn.ensemble import RandomForestClassifier + from sklearn.model_selection import RandomizedSearchCV + ``` + + + In this step, you import `numpy` for numerical + computations, `RandomForestClassifier` to create an + ensemble of estimators, and `RandomizedSearchCV` to + perform a randomized search with cross-validation. + +9. Create an instance of `RandomForestClassifier`: + + ``` + clf = RandomForestClassifier() + ``` + + + In this step, you instantiate `RandomForestClassifier`. A + random forest classifier is a voting classifier. It makes use of + multiple decision trees, which are trained on different subsets of + the data. The results from the trees contribute to the output of the + random forest by using a voting mechanism. + +10. Specify the parameters: + + ``` + params = {'n_estimators':[500, 1000, 2000], \ + 'max_depth': np.arange(1, 8)} + ``` + + + `RandomForestClassifier` accepts many parameters, but we + specify two: the number of trees in the forest, called + `n_estimators`, and the depth of the nodes in each tree, + called `max_depth`. + +11. 
Instantiate a randomized search: + + ``` + clf_cv = RandomizedSearchCV(clf, param_distributions=params, \ + cv=5) + ``` + + + In this step, you specify three parameters when you instantiate the + `clf` class, the estimator, or model to use, which is a + random forest classifier, `param_distributions`, the + parameter search space, and `cv`, the number of + cross-validation datasets to create. + +12. Perform the search: + + ``` + clf_cv.fit(features, labels.ravel()) + ``` + + + In this step, you perform the search by calling `fit()`. + This operation trains different models using the cross-validation + datasets and various combinations of the hyperparameters. The output + from this operation is similar to the following: + + +![](./images/B15019_07_41.jpg) + + + Caption: Output of the search operation + + In the preceding output, you see that the randomized search will be + carried out using cross-validation with five splits + (`cv=5`). The estimator to be used is + `RandomForestClassifier`. + +13. Print the best parameter combination: + + ``` + print("Tuned Random Forest Parameters: {}"\ + .format(clf_cv.best_params_)) + ``` + + + In this step, you print out the best hyperparameters. + + The output is similar to the following: + + +![](./images/B15019_07_42.jpg) + + + Caption: Printing the best parameter combination + + In the preceding output, you see that the best estimator is a Random + Forest classifier with 1,000 trees (`n_estimators=1000`) + and `max_depth=5`. You can print the best score by + executing + `print("Best score is {}".format(clf_cv.best_score_))`. + For this exercise, this value is \~ `0.76`. + +14. Inspect the best model: + + ``` + model = clf_cv.best_estimator_ + model + ``` + + + In this step, you find the best performing estimator (or model) and + print out its details. The output is similar to the following: + + +![](./images/B15019_07_43.jpg) + + +Caption: Inspecting the model + +In the preceding output, you see that the best estimator is +`RandomForestClassifier` with `n_estimators=1000` +and `max_depth=5`. + + +In this exercise, you learned to make use of cross-validation and random +search to find the best model using a combination of hyperparameters. +This process is called hyperparameter tuning, in which you find the best +combination of hyperparameters to use to train the model that you will +put into production. + + +Model Regularization with Lasso Regression +========================================== + + +As mentioned at the beginning of this lab models can overfit +training data. One reason for this is having too many features with +large coefficients (also called weights). The key to solving this type +of overfitting problem is reducing the magnitude of the coefficients. + +You may recall that weights are optimized during model training. One +method for optimizing weights is called gradient descent. The gradient +update rule makes use of a differentiable loss function. Examples of +differentiable loss functions are: + +- Mean Absolute Error (MAE) +- Mean Squared Error (MSE) + +For lasso regression, a penalty is introduced in the loss function. The +technicalities of this implementation are hidden by the class. The +penalty is also called a regularization parameter. + +Consider the following exercise in which you over-engineer a model to +introduce overfitting, and then use lasso regression to get better +results. 
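
In scikit-learn, this penalty is exposed through the `alpha` parameter of the `Lasso` class. The toy sketch below (synthetic data, unrelated to the exercise dataset) fits the same data with increasing values of `alpha` so you can watch the coefficients shrink toward zero as the penalty grows:

```
import numpy as np
from sklearn.linear_model import Lasso

# synthetic data: the target depends only on the first of five features
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 0.1 * rng.randn(100)

# a larger alpha means a stronger penalty and smaller coefficients
for alpha in [0.001, 0.1, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coef, 3))
```

This shrinking of the weights is exactly what will tame the over-engineered polynomial model in the exercise below.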
+ + + +Exercise 7.09: Fixing Model Overfitting Using Lasso Regression +-------------------------------------------------------------- + +The goal of this exercise is to teach you how to identify when your +model starts overfitting, and to use lasso regression to fix overfitting +in your model. + + +The attribute information states \"Features consist of hourly average +ambient variables: + +- Temperature (T) in the range 1.81°C and 37.11°C, +- Ambient Pressure (AP) in the range 992.89-1033.30 millibar, +- Relative Humidity (RH) in the range 25.56% to 100.16% +- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg +- Net hourly electrical energy output (EP) 420.26-495.76 MW + +The averages are taken from various sensors located around the plant +that record the ambient variables every second. The variables are given +without normalization.\" + +The following steps will help you complete the exercise: + +1. Open a Colab notebook. + +2. Import the required libraries: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LinearRegression, Lasso + from sklearn.metrics import mean_squared_error + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import MinMaxScaler, \ + PolynomialFeatures + ``` + + +3. Read in the data: + ``` + _df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab07/Dataset/ccpp.csv') + ``` + + +4. Inspect the DataFrame: + + ``` + _df.info() + ``` + + + The `.info()` method prints out a summary of the + DataFrame, including the names of the columns and the number of + records. The output might be similar to the following: + + +![](./images/B15019_07_44.jpg) + + + Caption: Inspecting the dataframe + + You can see from the preceding figure that the DataFrame has 5 + columns and 9,568 records. You can see that all columns contain + numeric data and that the columns have the following names: + `AT`, `V`, `AP`, `RH`, and + `PE`. + +5. Extract features into a column called `X`: + ``` + X = _df.drop(['PE'], axis=1).values + ``` + + +6. Extract labels into a column called `y`: + ``` + y = _df['PE'].values + ``` + + +7. Split the data into training and evaluation sets: + ``` + train_X, eval_X, train_y, eval_y = train_test_split\ + (X, y, train_size=0.8, \ + random_state=0) + ``` + + +8. Create an instance of a `LinearRegression` model: + ``` + lr_model_1 = LinearRegression() + ``` + + +9. Fit the model on the training data: + + ``` + lr_model_1.fit(train_X, train_y) + ``` + + + The output from this step should look similar to the following: + + +![](./images/B15019_07_45.jpg) + + + Caption: Fitting the model on training data + +10. Use the model to make predictions on the evaluation dataset: + ``` + lr_model_1_preds = lr_model_1.predict(eval_X) + ``` + + +11. Print out the `R2` score of the model: + + ``` + print('lr_model_1 R2 Score: {}'\ + .format(lr_model_1.score(eval_X, eval_y))) + ``` + + + The output of this step should look similar to the following: + + +![](./images/B15019_07_46.jpg) + + + Caption: Printing the R2 score + + You will notice that the `R2` score for this model is + `0.926`. You will make use of this figure to compare with + the next model you train. Recall that this is an evaluation metric. + +12. 
Print out the Mean Squared Error (MSE) of this model: + + ``` + print('lr_model_1 MSE: {}'\ + .format(mean_squared_error(eval_y, lr_model_1_preds))) + ``` + + + The output of this step should look similar to the following: + + +![](./images/B15019_07_47.jpg) + + + Caption: Printing the MSE + + You will notice that the MSE is `21.675`. This is an + evaluation metric that you will use to compare this model to + subsequent models. + + The first model was trained on four features. You will now train a + new model on four cubed features. + +13. Create a list of tuples to serve as a pipeline: + + ``` + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(degree=3)),\ + ('lr', LinearRegression())] + ``` + + + In this step, you create a list with three tuples. The first tuple + represents a scaling operation that makes use of + `MinMaxScaler`. The second tuple represents a feature + engineering step and makes use of `PolynomialFeatures`. + The third tuple represents a `LinearRegression` model. + + The first element of the tuple represents the name of the step, + while the second element represents the class that performs a + transformation or an estimator. + +14. Create an instance of a pipeline: + ``` + lr_model_2 = Pipeline(steps) + ``` + + +15. Train the instance of the pipeline: + + ``` + lr_model_2.fit(train_X, train_y) + ``` + + + The pipeline implements a `.fit()` method, which is also + implemented in all instances of transformers and estimators. The + `.fit()` method causes `.fit_transform()` to be + called on transformers, and causes `.fit()` to be called + on estimators. The output of this step is similar to the following: + + +![](./images/B15019_07_48.jpg) + + + Caption: Training the instance of the pipeline + + You can see from the output that a pipeline was trained. You can see + that the steps are made up of `MinMaxScaler` and + `PolynomialFeatures`, and that the final step is made up + of `LinearRegression`. + +16. Print out the `R2` score of the model: + + ``` + print('lr_model_2 R2 Score: {}'\ + .format(lr_model_2.score(eval_X, eval_y))) + ``` + + + The output is similar to the following: + + +![](./images/B15019_07_49.jpg) + + + Caption: The R2 score of the model + + You can see from the preceding that the `R2` score is + `0.944`, which is better than the `R2` score of + the first model, which was `0.932`. You can start to + observe that the metrics suggest that this model is better than the + first one. + +17. Use the model to predict on the evaluation data: + ``` + lr_model_2_preds = lr_model_2.predict(eval_X) + ``` + + +18. Print the MSE of the second model: + + ``` + print('lr_model_2 MSE: {}'\ + .format(mean_squared_error(eval_y, lr_model_2_preds))) + ``` + + + The output is similar to the following: + + +![](./images/B15019_07_50.jpg) + + + Caption: The MSE of the second model + + You can see from the output that the MSE of the second model is + `16.27`. This is less than the MSE of the first model, + which is `19.73`. You can safely conclude that the second + model is better than the first. + +19. Inspect the model coefficients (also called weights): + + ``` + print(lr_model_2[-1].coef_) + ``` + + + In this step, you will note that `lr_model_2` is a + pipeline. The final object in this pipeline is the model, so you + make use of list addressing to access this by setting the index of + the list element to `-1`. + + Once you have the model, which is the final element in the pipeline, + you make use of `.coef_` to get the model coefficients. 
+ The output is similar to the following: + + +![](./images/B15019_07_51.jpg) + + + Caption: Print the model coefficients + + You will note from the preceding output that the majority of the + values are in the tens, some values are in the hundreds, and one + value has a really small magnitude. + +20. Check for the number of coefficients in this model: + + ``` + print(len(lr_model_2[-1].coef_)) + ``` + + + The output for this step is similar to the following: + + ``` + 35 + ``` + + + You can see from the preceding screenshot that the second model has + `35` coefficients. + +21. Create a `steps` list with `PolynomialFeatures` + of degree `10`: + ``` + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(degree=10)),\ + ('lr', LinearRegression())] + ``` + + +22. Create a third model from the preceding steps: + ``` + lr_model_3 = Pipeline(steps) + ``` + + +23. Fit the third model on the training data: + + ``` + lr_model_3.fit(train_X, train_y) + ``` + + + The output from this step is similar to the following: + + +![](./images/B15019_07_52.jpg) + + + Caption: Fitting the third model on the data + + You can see from the output that the pipeline makes use of + `PolynomialFeatures` of degree `10`. You are + doing this in the hope of getting a better model. + +24. Print out the `R2` score of this model: + + ``` + print('lr_model_3 R2 Score: {}'\ + .format(lr_model_3.score(eval_X, eval_y))) + ``` + + + The output of this model is similar to the following: + + +![](./images/B15019_07_53.jpg) + + + Caption: R2 score of the model + + You can see from the preceding figure that the R2 score is now + `0.56`. The previous model had an `R2` score of + `0.944`. This model has an R2 score that is considerably + worse than the one of the previous model, `lr_model_2`. + This happens when your model is overfitting. + +25. Use `lr_model_3` to predict on evaluation data: + ``` + lr_model_3_preds = lr_model_3.predict(eval_X) + ``` + + +26. Print out the MSE for `lr_model_3`: + + ``` + print('lr_model_3 MSE: {}'\ + .format(mean_squared_error(eval_y, lr_model_3_preds))) + ``` + + + The output for this step might be similar to the following: + + +![](./images/B15019_07_54.jpg) + + + Caption: The MSE of the model + + You can see from the preceding figure that the MSE is also + considerably worse. The MSE is `126.25`, as compared to + `16.27` for the previous model. + +27. Print out the number of coefficients (also called weights) in this + model: + + ``` + print(len(lr_model_3[-1].coef_)) + ``` + + + The output might resemble the following: + + +![](./images/B15019_07_55.jpg) + + + Caption: Printing the number of coefficients + + You can see that the model has 1,001 coefficients. + +28. Inspect the first 35 coefficients to get a sense of the individual + magnitudes: + + ``` + print(lr_model_3[-1].coef_[:35]) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_56.jpg) + + + Caption: Inspecting the first 35 coefficients + + You can see from the output that the coefficients have significantly + larger magnitudes than the coefficients from `lr_model_2`. + + In the next steps, you will train a lasso regression model on the + same set of features to reduce overfitting. + +29. Create a list of steps for the pipeline you will create later on: + + ``` + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(degree=10)),\ + ('lr', Lasso(alpha=0.01))] + ``` + + + You create a list of steps for the pipeline you will create. 
Note + that the third step in this list is an instance of lasso. The + parameter called `alpha` in the call to + `Lasso()` is the regularization parameter. You can play + around with any values from 0 to 1 to see how it affects the + performance of the model that you train. + +30. Create an instance of a pipeline: + ``` + lasso_model = Pipeline(steps) + ``` + + +31. Fit the pipeline on the training data: + + ``` + lasso_model.fit(train_X, train_y) + ``` + + + The output from this operation might be similar to the following: + + +![](./images/B15019_07_57.jpg) + + + Caption: Fitting the pipeline on the training data + + You can see from the output that the pipeline trained a lasso model + in the final step. The regularization parameter was `0.01` + and the model trained for a maximum of 1,000 iterations. + +32. Print the `R2` score of `lasso_model`: + + ``` + print('lasso_model R2 Score: {}'\ + .format(lasso_model.score(eval_X, eval_y))) + ``` + + + The output of this step might be similar to the following: + + +![](./images/B15019_07_58.jpg) + + + Caption: R2 score + + You can see that the `R2` score has climbed back up to + `0.94`, which is considerably better than the score of + `0.56` that `lr_model_3` had. This is already + looking like a better model. + +33. Use `lasso_model` to predict on the evaluation data: + ``` + lasso_preds = lasso_model.predict(eval_X) + ``` + + +34. Print the MSE of `lasso_model`: + + ``` + print('lasso_model MSE: {}'\ + .format(mean_squared_error(eval_y, lasso_preds))) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_59.jpg) + + + Caption: MSE of lasso model + + You can see from the output that the MSE is `17.01`, which + is way lower than the MSE value of `126.25` that + `lr_model_3` had. You can safely conclude that this is a + much better model. + +35. Print out the number of coefficients in `lasso_model`: + + ``` + print(len(lasso_model[-1].coef_)) + ``` + + + The output might be similar to the following: + + ``` + 1001 + ``` + + + You can see that this model has 1,001 coefficients, which is the + same number of coefficients that `lr_model_3` had. + +36. Print out the values of the first 35 coefficients: + + ``` + print(lasso_model[-1].coef_[:35]) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_60.jpg) + + +Caption: Printing the values of 35 coefficients + +You can see from the preceding output that some of the coefficients are +set to `0`. This has the effect of ignoring the corresponding +column of data in the input. You can also see that the remaining +coefficients have magnitudes of less than 100. This goes to show that +the model is no longer overfitting. + +This exercise taught you how to fix overfitting by using +`LassoRegression` to train a new model. + +In the next section, you will learn about using ridge regression to +solve overfitting in a model. + + +Ridge Regression +================ + + +You just learned about lasso regression, which introduces a penalty and +tries to eliminate certain features from the data. Ridge regression +takes an alternative approach by introducing a penalty that penalizes +large weights. As a result, the optimization process tries to reduce the +magnitude of the coefficients without completely eliminating them. 
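As a rough sketch (again with purely illustrative names, and ignoring
the exact scaling used internally by scikit-learn's `Ridge` class), the
ridge penalty replaces the L1 term with the sum of squared coefficients,
which shrinks the weights toward zero without usually setting them to
exactly zero:

```
import numpy as np

def ridge_style_loss(y_true, y_pred, coefficients, alpha):
    """Illustrative only: MSE plus an L2 (squared) penalty on the coefficients."""
    errors = np.asarray(y_true) - np.asarray(y_pred)
    mse = np.mean(errors ** 2)
    l2_penalty = alpha * np.sum(np.square(coefficients))
    return mse + l2_penalty
```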
+ + + +Exercise 7.10: Fixing Model Overfitting Using Ridge Regression +-------------------------------------------------------------- + +The goal of this exercise is to teach you how to identify when your +model starts overfitting, and to use ridge regression to fix overfitting +in your model. + +Note + +You will be using the same dataset as in *Exercise 7.09*, *Fixing Model +Overfitting Using Lasso Regression.* + +The following steps will help you complete the exercise: + +1. Open a Colab notebook. + +2. Import the required libraries: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.linear_model import LinearRegression, Ridge + from sklearn.metrics import mean_squared_error + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import MinMaxScaler, \ + PolynomialFeatures + ``` + + +3. Read in the data: + ``` + _df = pd.read_csv('https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab07/Dataset/ccpp.csv') + ``` + + +4. Inspect the DataFrame: + + ``` + _df.info() + ``` + + + The `.info()` method prints out a summary of the + DataFrame, including the names of the columns and the number of + records. The output might be similar to the following: + + +![](./images/B15019_07_61.jpg) + + + Caption: Inspecting the dataframe + + You can see from the preceding figure that the DataFrame has 5 + columns and 9,568 records. You can see that all columns contain + numeric data and that the columns have the names: `AT`, + `V`, `AP`, `RH`, and `PE`. + +5. Extract features into a column called `X`: + ``` + X = _df.drop(['PE'], axis=1).values + ``` + + +6. Extract labels into a column called `y`: + ``` + y = _df['PE'].values + ``` + + +7. Split the data into training and evaluation sets: + ``` + train_X, eval_X, train_y, eval_y = train_test_split\ + (X, y, train_size=0.8, \ + random_state=0) + ``` + + +8. Create an instance of a `LinearRegression` model: + ``` + lr_model_1 = LinearRegression() + ``` + + +9. Fit the model on the training data: + + ``` + lr_model_1.fit(train_X, train_y) + ``` + + + The output from this step should look similar to the following: + + +![](./images/B15019_07_62.jpg) + + + Caption: Fitting the model on data + +10. Use the model to make predictions on the evaluation dataset: + ``` + lr_model_1_preds = lr_model_1.predict(eval_X) + ``` + + +11. Print out the `R2` score of the model: + + ``` + print('lr_model_1 R2 Score: {}'\ + .format(lr_model_1.score(eval_X, eval_y))) + ``` + + + The output of this step should look similar to the following: + + +![](./images/B15019_07_63.jpg) + + + Caption: R2 score + + You will notice that the R2 score for this model is + `0.933`. You will make use of this figure to compare it + with the next model you train. Recall that this is an evaluation + metric. + +12. Print out the MSE of this model: + + ``` + print('lr_model_1 MSE: {}'\ + .format(mean_squared_error(eval_y, lr_model_1_preds))) + ``` + + + The output of this step should look similar to the following: + + +![](./images/B15019_07_64.jpg) + + + Caption: The MSE of the model + + You will notice that the MSE is `19.734`. This is an + evaluation metric that you will use to compare this model to + subsequent models. + + The first model was trained on four features. You will now train a + new model on four cubed features. + +13. 
Create a list of tuples to serve as a pipeline: + + ``` + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(degree=3)),\ + ('lr', LinearRegression())] + ``` + + + In this step, you create a list with three tuples. The first tuple + represents a scaling operation that makes use of + `MinMaxScaler`. The second tuple represents a feature + engineering step and makes use of `PolynomialFeatures`. + The third tuple represents a `LinearRegression` model. + + The first element of the tuple represents the name of the step, + while the second element represents the class that performs a + transformation or an estimation. + +14. Create an instance of a pipeline: + ``` + lr_model_2 = Pipeline(steps) + ``` + + +15. Train the instance of the pipeline: + + ``` + lr_model_2.fit(train_X, train_y) + ``` + + + The pipeline implements a `.fit()` method, which is also + implemented in all instances of transformers and estimators. The + `.fit()` method causes `.fit_transform()` to be + called on transformers, and causes `.fit()` to be called + on estimators. The output of this step is similar to the following: + + +![](./images/B15019_07_65.jpg) + + + Caption: Training the instance of a pipeline + + You can see from the output that a pipeline was trained. You can see + that the steps are made up of `MinMaxScaler` and + `PolynomialFeatures`, and that the final step is made up + of `LinearRegression`. + +16. Print out the `R2` score of the model: + + ``` + print('lr_model_2 R2 Score: {}'\ + .format(lr_model_2.score(eval_X, eval_y))) + ``` + + + The output is similar to the following: + + +![](./images/B15019_07_66.jpg) + + + Caption: R2 score + + You can see from the preceding that the R2 score is + `0.944`, which is better than the R2 score of the first + model, which was `0.933`. You can start to observe that + the metrics suggest that this model is better than the first one. + +17. Use the model to predict on the evaluation data: + ``` + lr_model_2_preds = lr_model_2.predict(eval_X) + ``` + + +18. Print the MSE of the second model: + + ``` + print('lr_model_2 MSE: {}'\ + .format(mean_squared_error(eval_y, lr_model_2_preds))) + ``` + + + The output is similar to the following: + + +![](./images/B15019_07_67.jpg) + + + Caption: The MSE of the model + + You can see from the output that the MSE of the second model is + `16.272`. This is less than the MSE of the first model, + which is `19.734`. You can safely conclude that the second + model is better than the first. + +19. Inspect the model coefficients (also called weights): + + ``` + print(lr_model_2[-1].coef_) + ``` + + + In this step, you will note that `lr_model_2` is a + pipeline. The final object in this pipeline is the model, so you + make use of list addressing to access this by setting the index of + the list element to `-1`. + + Once you have the model, which is the final element in the pipeline, + you make use of `.coef_` to get the model coefficients. + The output is similar to the following: + + +![](./images/B15019_07_68.jpg) + + + Caption: Printing model coefficients + + You will note from the preceding output that the majority of the + values are in the tens, some values are in the hundreds, and one + value has a really small magnitude. + +20. 
Check the number of coefficients in this model: + + ``` + print(len(lr_model_2[-1].coef_)) + ``` + + + The output of this step is similar to the following: + + +![](./images/B15019_07_69.jpg) + + + Caption: Checking the number of coefficients + + You will see from the preceding that the second model has 35 + coefficients. + +21. Create a `steps` list with `PolynomialFeatures` + of degree `10`: + ``` + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(degree=10)),\ + ('lr', LinearRegression())] + ``` + + +22. Create a third model from the preceding steps: + ``` + lr_model_3 = Pipeline(steps) + ``` + + +23. Fit the third model on the training data: + + ``` + lr_model_3.fit(train_X, train_y) + ``` + + + The output from this step is similar to the following: + + +![](./images/B15019_07_70.jpg) + + + Caption: Fitting lr\_model\_3 on the training data + + You can see from the output that the pipeline makes use of + `PolynomialFeatures` of degree `10`. You are + doing this in the hope of getting a better model. + +24. Print out the `R2` score of this model: + + ``` + print('lr_model_3 R2 Score: {}'\ + .format(lr_model_3.score(eval_X, eval_y))) + ``` + + + The output of this model is similar to the following: + + +![](./images/B15019_07_71.jpg) + + + Caption: R2 score + + You can see from the preceding figure that the `R2` score + is now `0.568` The previous model had an `R2` + score of `0.944`. This model has an `R2` score + that is worse than the one of the previous model, + `lr_model_2`. This happens when your model is overfitting. + +25. Use `lr_model_3` to predict on evaluation data: + ``` + lr_model_3_preds = lr_model_3.predict(eval_X) + ``` + + +26. Print out the MSE for `lr_model_3`: + + ``` + print('lr_model_3 MSE: {}'\ + .format(mean_squared_error(eval_y, lr_model_3_preds))) + ``` + + + The output of this step might be similar to the following: + + +![](./images/B15019_07_72.jpg) + + + Caption: The MSE of lr\_model\_3 + + You can see from the preceding figure that the MSE is also worse. + The MSE is `126.254`, as compared to `16.271` + for the previous model. + +27. Print out the number of coefficients (also called weights) in this + model: + + ``` + print(len(lr_model_3[-1].coef_)) + ``` + + + The output might resemble the following: + + ``` + 1001 + ``` + + + You can see that the model has `1,001` coefficients. + +28. Inspect the first `35` coefficients to get a sense of the + individual magnitudes: + + ``` + print(lr_model_3[-1].coef_[:35]) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_73.jpg) + + + Caption: Inspecting 35 coefficients + + You can see from the output that the coefficients have significantly + larger magnitudes than the coefficients from `lr_model_2`. + + In the next steps, you will train a ridge regression model on the + same set of features to reduce overfitting. + +29. Create a list of steps for the pipeline you will create later on: + + ``` + steps = [('scaler', MinMaxScaler()),\ + ('poly', PolynomialFeatures(degree=10)),\ + ('lr', Ridge(alpha=0.9))] + ``` + + + You create a list of steps for the pipeline you will create. Note + that the third step in this list is an instance of + `Ridge`. The parameter called `alpha` in the + call to `Ridge()` is the regularization parameter. You can + play around with any values from 0 to 1 to see how it affects the + performance of the model that you train. + +30. Create an instance of a pipeline: + ``` + ridge_model = Pipeline(steps) + ``` + + +31. 
Fit the pipeline on the training data: + + ``` + ridge_model.fit(train_X, train_y) + ``` + + + The output of this operation might be similar to the following: + + +![](./images/B15019_07_74.jpg) + + + Caption: Fitting the pipeline on training data + + You can see from the output that the pipeline trained a ridge model + in the final step. The regularization parameter was `0`. + +32. Print the R2 score of `ridge_model`: + + ``` + print('ridge_model R2 Score: {}'\ + .format(ridge_model.score(eval_X, eval_y))) + ``` + + + The output of this step might be similar to the following: + + +![](./images/B15019_07_75.jpg) + + + Caption: R2 score + + You can see that the R2 score has climbed back up to + `0.945`, which is way better than the score of + `0.568` that `lr_model_3` had. This is already + looking like a better model. + +33. Use `ridge_model` to predict on the evaluation data: + ``` + ridge_model_preds = ridge_model.predict(eval_X) + ``` + + +34. Print the MSE of `ridge_model`: + + ``` + print('ridge_model MSE: {}'\ + .format(mean_squared_error(eval_y, ridge_model_preds))) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_76.jpg) + + + Caption: The MSE of ridge\_model + + You can see from the output that the MSE is `16.030`, + which is lower than the MSE value of `126.254` that + `lr_model_3` had. You can safely conclude that this is a + much better model. + +35. Print out the number of coefficients in `ridge_model`: + + ``` + print(len(ridge_model[-1].coef_)) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_77.jpg) + + + Caption: The number of coefficients in the ridge model + + You can see that this model has `1001` coefficients, which + is the same number of coefficients that `lr_model_3` had. + +36. Print out the values of the first 35 coefficients: + + ``` + print(ridge_model[-1].coef_[:35]) + ``` + + + The output might be similar to the following: + + +![](./images/B15019_07_78.jpg) + + +Caption: The values of the first 35 coefficients + + +This exercise taught you how to fix overfitting by using +`RidgeRegression` to train a new model. + + + +Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors +------------------------------------------------------------------------------------------------ + +You work as a data scientist for a cable manufacturer. Management has +decided to start shipping low-resistance cables to clients around the +world. To ensure that the right cables are shipped to the right +countries, they would like to predict the critical temperatures of +various cables based on certain observed readings. + +In this activity, you will train a linear regression model and compute +the R2 score and the MSE. You will proceed to engineer new features +using polynomial features of degree 3. You will compare the R2 score and +MSE of this new model to those of the first model to determine +overfitting. You will then use regularization to train a model that +generalizes to previously unseen data. + + + +The steps to accomplish this task are: + +1. Open a Colab notebook. + +2. Load the necessary libraries. + +3. Read in the data from the `superconduct` folder. + +4. Prepare the `X` and `y` variables. + +5. Split the data into training and evaluation sets. + +6. Create a baseline linear regression model. + +7. Print out the R2 score and MSE of the model. + +8. Create a pipeline to engineer polynomial features and train a linear + regression model. + +9. 
Print out the R2 score and MSE. + +10. Determine that this new model is overfitting. + +11. Create a pipeline to engineer polynomial features and train a ridge + or lasso model. + +12. Print out the R2 score and MSE. + + The output will be as follows: + + +![](./images/B15019_07_79.jpg) + + + Caption: The R2 score and MSE of the ridge model + +13. Determine that this model is no longer overfitting. This is the + model to put into production. + + The coefficients for the ridge model are as shown in the following + figure: + + +![](./images/B15019_07_80.jpg) + + +Caption: The coefficients for the ridge model + + + +Summary +======= + + +In this lab, we studied the importance of withholding some of the +available data to evaluate models. We also learned how to make use of +all of the available data with a technique called cross-validation to +find the best performing model from a set of models you are training. We +also made use of evaluation metrics to determine when a model starts to +overfit and made use of ridge and lasso regression to fix a model that +is overfitting. + +In the next lab, we will go into hyperparameter tuning in depth. You +will learn about various techniques for finding the best hyperparameters +to train your models. diff --git a/lab_guides/Lab_8.md b/lab_guides/Lab_8.md new file mode 100644 index 0000000..f911134 --- /dev/null +++ b/lab_guides/Lab_8.md @@ -0,0 +1,1761 @@ + +8. Hyperparameter Tuning +======================== + + + +Overview + +In this lab, each hyperparameter tuning strategy will be first +broken down into its key steps before any high-level scikit-learn +implementations are demonstrated. This is to ensure that you fully +understand the concept behind each of the strategies before jumping to +the more automated methods. + +By the end of this lab, you will be able to find further predictive +performance improvements via the systematic evaluation of estimators +with different hyperparameters. You will successfully deploy manual, +grid, and random search strategies to find the optimal hyperparameters. +You will be able to parameterize **k-nearest neighbors** (**k-NN**), +**support vector machines** (**SVMs**), ridge regression, and random +forest classifiers to optimize model performance. + + +Introduction +============ + + +In previous labs, we discussed several methods to arrive at a model +that performs well. These include transforming the data via +preprocessing, feature engineering and scaling, or simply choosing an +appropriate estimator (algorithm) type from the large set of possible +estimators made available to the users of scikit-learn. + +Depending on which estimator you eventually select, there may be +settings that can be adjusted to improve overall predictive performance. +These settings are known as hyperparameters, and deriving the best +hyperparameters is known as tuning or optimizing. Properly tuning your +hyperparameters can result in performance improvements well into the +double-digit percentages, so it is well worth doing in any modeling +exercise. + +This lab will discuss the concept of hyperparameter tuning and will +present some simple strategies that you can use to help find the best +hyperparameters for your estimators. + +In previous labs, we have seen some exercises that use a range of +estimators, but we haven\'t conducted any hyperparameter tuning. After +reading this lab, we recommend you revisit these exercises, apply +the techniques taught, and see if you can improve the results. + + +What Are Hyperparameters? 
+========================= + + +Hyperparameters can be thought of as a set of dials and switches for +each estimator that change how the estimator works to explain +relationships in the data. + +Have a look at *Figure 8.1*: + +![](./images/B15019_08_01.jpg) + +Caption: How hyperparameters work + +If you read from left to right in the preceding figure, you can see that +during the tuning process we change the value of the hyperparameter, +which results in a change to the estimator. This in turn causes a change +in model performance. Our objective is to find hyperparameterization +that leads to the best model performance. This will be the *optimal* +hyperparameterization. + +Estimators can have hyperparameters of varying quantities and types, +which means that sometimes you can be faced with a very large number of +possible hyperparameterizations to choose for an estimator. + +For instance, scikit-learn\'s implementation of the SVM classifier +(`sklearn.svm.SVC`), which you will be introduced to later in +the lab, is an estimator that has multiple possible +hyperparameterizations. We will test out only a small subset of these, +namely using a linear kernel or a polynomial kernel of degree 2, 3, or +4. + +Some of these hyperparameters are continuous in nature, while others are +discrete, and the presence of continuous hyperparameters means that the +number of possible hyperparameterizations is theoretically infinite. Of +course, when it comes to producing a model with good predictive +performance, some hyperparameterizations are much better than others, +and it is your job as a data scientist to find them. + +In the next section, we will be looking at setting these hyperparameters +in more detail. But first, some clarification of terms. + + + +Difference between Hyperparameters and Statistical Model Parameters +------------------------------------------------------------------- + +In your reading on data science, particularly in the area of statistics, +you will come across terms such as \"model parameters,\" \"parameter +estimation,\" and \"(non)-parametric models.\" These terms relate to the +parameters that feature in the mathematical formulation of models. The +simplest example is that of the single variable linear model with no +intercept term that takes the following form: + +![](./images/B15019_08_02.jpg) + +Caption: Equation for a single variable linear model + +Here, 𝛽 is the statistical model parameter, and if this formulation is +chosen, it is the data scientist\'s job to use data to estimate what +value it takes. This could be achieved using **Ordinary Least Squares** +(**OLS**) regression modeling, or it could be achieved through a method +called median regression. + +Hyperparameters are different in that they are external to the +mathematical form. An example of a hyperparameter in this case is the +way in which 𝛽 will be estimated (OLS, or median regression). In some +cases, hyperparameters can change the algorithm completely (that is, +generating a completely different mathematical form). You will see +examples of this occurring throughout this lab. + +In the next section, you will be looking at how to set a hyperparameter. + + + +Setting Hyperparameters +----------------------- + +In *Lab 7*, *The Generalization of Machine Learning Models*, you +were introduced to the k-NN model for classification and you saw how +varying k, the number of nearest neighbors, resulted in changes in model +performance with respect to the prediction of class labels. 
Here, k is a +hyperparameter, and the act of manually trying different values of k is +a simple form of hyperparameter tuning. + +Each time you initialize a scikit-learn estimator, it will take on a +hyperparameterization as determined by the values you set for its +arguments. If you specify no values, then the estimator will take on a +default hyperparameterization. If you would like to see how the +hyperparameters have been set for your estimator, and what +hyperparameters you can adjust, simply print the output of the +`estimator.get_params()` method. + +For instance, say we initialize a k-NN estimator without specifying any +arguments (empty brackets). To see the default hyperparameterization, we +can run: + +``` +from sklearn import neighbors +# initialize with default hyperparameters +knn = neighbors.KNeighborsClassifier() +# examine the defaults +print(knn.get_params()) +``` +You should get the following output: + +``` +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, + 'p': 2, 'weights': 'uniform'} +``` +A dictionary of all the hyperparameters is now printed to the screen, +revealing their default settings. Notice `k`, our number of +nearest neighbors, is set to `5`. + +To get more information as to what these parameters mean, how they can +be changed, and what their likely effect may be, you can run the +following command and view the help file for the estimator in question. + +For our k-NN estimator: + +``` +?knn +``` + +The output will be as follows: + +![](./images/B15019_08_03.jpg) + +Caption: Help file for the k-NN estimator + +If you look closely at the help file, you will see the default +hyperparameterization for the estimator under the +`String form` heading, along with an explanation of what each +hyperparameter means under the `Parameters` heading. + +Coming back to our example, if we want to change the +hyperparameterization from `k = 5` to `k = 15`, just +re-initialize the estimator and set the `n_neighbors` argument +to `15`, which will override the default: + +``` +""" +initialize with k = 15 and all other hyperparameters as default +""" +knn = neighbors.KNeighborsClassifier(n_neighbors=15) +# examine +print(knn.get_params()) +``` +You should get the following output: + +``` +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 15, + 'p': 2, 'weights': 'uniform'} +``` +You may have noticed that k is not the only hyperparameter available for +k-NN classifiers. Setting multiple hyperparameters is as easy as +specifying the relevant arguments. For example, let\'s increase the +number of neighbors from `5` to `15` and force the +algorithm to take the distance of points in the neighborhood, rather +than a simple majority vote, into account when training. For more +information, see the description for the `weights` argument in +the help file (`?knn`): + +``` +""" +initialize with k = 15, weights = distance and all other +hyperparameters as default +""" +knn = neighbors.KNeighborsClassifier(n_neighbors=15, \ + weights='distance') +# examine +print(knn.get_params()) +``` + +The output will be as follows: + +``` +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 15, + 'p': 2, 'weights': 'distance'} +``` + +In the output, you can see `n_neighbors` (`k`) is +now set to `15`, and `weights` is now set to +`distance`, rather than `uniform`. 
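As a side note, you do not have to re-initialize an estimator to change
its settings. Every scikit-learn estimator also exposes a
`set_params()` method that updates hyperparameters in place; a quick
sketch of this is shown below:

```
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=15, weights='distance')
# switch back to the defaults without creating a new object
knn.set_params(n_neighbors=5, weights='uniform')
print(knn.get_params())
```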
+ + + +A Note on Defaults +------------------ + +Generally, efforts have been made by the developers of machine learning +libraries to set sensible default hyperparameters for estimators. That +said, for certain datasets, significant performance improvements may be +achieved through tuning. + + +Finding the Best Hyperparameterization +====================================== + + +The best hyperparameterization depends on your overall objective in +building a machine learning model in the first place. In most cases, +this is to find the model that has the highest predictive performance on +unseen data, as measured by its ability to correctly label data points +(classification) or predict a number (regression). + +The prediction of unseen data can be simulated using hold-out test sets +or cross-validation, the former being the method used in this lab. +Performance is evaluated differently in each case, for instance, **Mean +Squared Error** (**MSE**) for regression and accuracy for +classification. We seek to reduce the MSE or increase the accuracy of +our predictions. + +Let\'s implement manual hyperparameterization in the following exercise. + + + +Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier +----------------------------------------------------------------- + +In this exercise, we will manually tune a k-NN classifier, which was +covered in *Lab 7, The Generalization of Machine Learning Models*, +our goal being to predict incidences of malignant or benign breast +cancer based on cell measurements sourced from the affected breast +sample. + + +These are the important attributes of the dataset: + +- ID number +- Diagnosis (M = malignant, B = benign) +- 3-32) + +10 real-valued features are computed for each cell nucleus as follows: + +- Radius (mean of distances from the center to points on the + perimeter) + +- Texture (standard deviation of grayscale values) + +- Perimeter + +- Area + +- Smoothness (local variation in radius lengths) + +- Compactness (perimeter\^2 / area - 1.0) + +- Concavity (severity of concave portions of the contour) + +- Concave points (number of concave portions of the contour) + +- Symmetry + +- Fractal dimension (refers to the complexity of the tissue + architecture; \"coastline approximation\" - 1) + + +The following steps will help you complete this exercise: + +1. Create a new notebook in Google Colab. + +2. Next, import `neighbors`, `datasets`, and + `model_selection` from scikit-learn: + ``` + from sklearn import neighbors, datasets, model_selection + ``` + + +3. Load the data. We will call this object `cancer`, and + isolate the target `y`, and the features, `X`: + ``` + # dataset + cancer = datasets.load_breast_cancer() + # target + y = cancer.target + # features + X = cancer.data + ``` + + +4. Initialize a k-NN classifier with its default hyperparameterization: + ``` + # no arguments specified + knn = neighbors.KNeighborsClassifier() + ``` + + +5. Feed this classifier into a 10-fold cross-validation + (`cv`), calculating the precision score for each fold. + Assume that maximizing precision (the proportion of true positives + in all positive classifications) is the primary objective of this + exercise: + ``` + # 10 folds, scored on precision + cv = model_selection.cross_val_score(knn, X, y, cv=10,\ + scoring='precision') + ``` + + +6. 
Printing `cv` shows the precision score calculated for + each fold: + + ``` + # precision scores + print(cv) + ``` + + + You will see the following output: + + ``` + [0.91666667 0.85 0.91666667 0.94736842 0.94594595 + 0.94444444 0.97222222 0.92105263 0.96969697 0.97142857] + ``` + + +7. Calculate and print the mean precision score for all folds. This + will give us an idea of the overall performance of the model, as + shown in the following code snippet: + + ``` + # average over all folds + print(round(cv.mean(), 2)) + ``` + + + You should get the following output: + + ``` + 0.94 + ``` + + + You should see the mean score is close to 94%. Can this be improved + upon? + +8. Run everything again, this time setting hyperparameter `k` + to `15`. You can see that the result is actually + marginally worse (1% lower): + + ``` + # k = 15 + knn = neighbors.KNeighborsClassifier(n_neighbors=15) + cv = model_selection.cross_val_score(knn, X, y, cv=10, \ + scoring='precision') + print(round(cv.mean(), 2)) + ``` + + + The output will be as follows: + + ``` + 0.93 + ``` + + +9. Try again with `k` = `7`, `3`, and + `1`. In this case, it seems reasonable that the default + value of 5 is the best option. To avoid repetition, you may like to + define and call a Python function as follows: + + ``` + def evaluate_knn(k): + knn = neighbors.KNeighborsClassifier(n_neighbors=k) + cv = model_selection.cross_val_score(knn, X, y, cv=10, \ + scoring='precision') + print(round(cv.mean(), 2)) + evaluate_knn(k=7) + evaluate_knn(k=3) + evaluate_knn(k=1) + ``` + + + The output will be as follows: + + ``` + 0.93 + 0.93 + 0.92 + ``` + + + Nothing beats 94%. + +10. Let\'s alter a second hyperparameter. Setting `k = 5`, + what happens if we change the k-NN weighing system to depend on + `distance` rather than having `uniform` weights? + Run all code again, this time with the following + hyperparameterization: + + ``` + # k =5, weights evaluated using distance + knn = neighbors.KNeighborsClassifier(n_neighbors=5, \ + weights='distance') + cv = model_selection.cross_val_score(knn, X, y, cv=10, \ + scoring='precision') + print(round(cv.mean(), 2)) + ``` + + + Did performance improve? + + You should see no further improvement on the default + hyperparameterization because the output is: + + ``` + 0.93 + ``` + + +We therefore conclude that the default hyperparameterization is the +optimal one in this case. + + + + +Simple Demonstration of the Grid Search Strategy +------------------------------------------------ + + +This time, instead of manually fitting models with different values of +`k` we just define the `k` values we would like to +try, that is, `k = 1, 3, 5, 7` in a Python dictionary. This +dictionary will be the grid we will search through to find the optimal +hyperparameterization. + + +The code will be as follows: + +``` +from sklearn import neighbors, datasets, model_selection +# load data +cancer = datasets.load_breast_cancer() +# target +y = cancer.target +# features +X = cancer.data +# hyperparameter grid +grid = {'k': [1, 3, 5, 7]} +``` + +In the code snippet, we have used a dictionary `{}` and set +the `k` values in a Python dictionary. + +In the next part of the code snippet, to conduct the search, we iterate +through the grid, fitting a model for each value of `k`, each +time evaluating the model through 10-fold cross-validation. 
+ +At the end of each iteration, we extract, format, and report back the +mean precision score after cross-validation via the `print` +method: + +``` +# for every value of k in the grid +for k in grid['k']: + # initialize the knn estimator + knn = neighbors.KNeighborsClassifier(n_neighbors=k) + # conduct a 10-fold cross-validation + cv = model_selection.cross_val_score(knn, X, y, cv=10, \ + scoring='precision') + # calculate the average precision value over all folds + cv_mean = round(cv.mean(), 3) + # report the result + print('With k = {}, mean precision = {}'.format(k, cv_mean)) +``` + +The output will be as follows: + +![](./images/B15019_08_04.jpg) + +Caption: Average precisions for all folds + +We can see from the output that `k = 5` is the best +hyperparameterization found, with a mean precision score of roughly 94%. +Increasing `k` to `7` didn\'t significantly improve +performance. It is important to note that the only parameter we are +changing here is k and that each time the k-NN estimator is initialized, +it is done with the remaining hyperparameters set to their default +values. + +To make this point clear, we can run the same loop, this time just +printing the hyperparameterization that will be tried: + +``` +# for every value of k in the grid +for k in grid['k']: + # initialize the knn estimator + knn = neighbors.KNeighborsClassifier(n_neighbors=k) + # print the hyperparameterization + print(knn.get_params()) +``` + +The output will be as follows: + +``` +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 1, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 7, + 'p': 2, 'weights': 'uniform'} +``` +You can see from the output that the only parameter we are changing is +k; everything else remains the same in each iteration. + +Simple, single-loop structures are fine for a grid search of a single +hyperparameter, but what if we would like to try a second one? Remember +that for k-NN we also have weights that can take values +`uniform` or `distance`, the choice of which +influences how k-NN learns how to classify points. 
+ +To proceed, all we need to do is create a dictionary containing both the +values of k and the weight functions we would like to try as separate +key/value pairs: + +``` +# hyperparameter grid +grid = {'k': [1, 3, 5, 7],\ + 'weight_function': ['uniform', 'distance']} +# for every value of k in the grid +for k in grid['k']: + # and every possible weight_function in the grid + for weight_function in grid['weight_function']: + # initialize the knn estimator + knn = neighbors.KNeighborsClassifier\ + (n_neighbors=k, \ + weights=weight_function) + # conduct a 10-fold cross-validation + cv = model_selection.cross_val_score(knn, X, y, cv=10, \ + scoring='precision') + # calculate the average precision value over all folds + cv_mean = round(cv.mean(), 3) + # report the result + print('With k = {} and weight function = {}, '\ + 'mean precision = {}'\ + .format(k, weight_function, cv_mean)) +``` + +The output will be as follows: + +![Caption: Average precision values for all folds for different +values of k ](./images/B15019_08_05.jpg) + +Caption: Average precision values for all folds for different values +of k + +You can see that when `k = 5`, the weight function is not +based on distance and all the other hyperparameters are kept as their +default values, and the mean precision comes out highest. As we +discussed earlier, if you would like to see the full set of +hyperparameterizations evaluated for k-NN, just add +`print(knn.get_params())` inside the `for` loop +after the estimator is initialized: + +``` +# for every value of k in the grid +for k in grid['k']: + # and every possible weight_function in the grid + for weight_function in grid['weight_function']: + # initialize the knn estimator + knn = neighbors.KNeighborsClassifier\ + (n_neighbors=k, \ + weights=weight_function) + # print the hyperparameterizations + print(knn.get_params()) +``` + +The output will be as follows: + +``` +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 1, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 1, + 'p': 2, 'weights': 'distance'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3, + 'p': 2, 'weights': 'distance'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, + 'p': 2, 'weights': 'distance'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 7, + 'p': 2, 'weights': 'uniform'} +{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', + 'metric_params': None, 'n_jobs': None, 'n_neighbors': 7, + 'p': 2, 'weights': 'distance'} +``` +This implementation, while great for demonstrating how the grid search +process works, may not practical when trying to evaluate estimators that +have `3`, `4`, or even `10` different +types of hyperparameters, each with a multitude of possible settings. + +To carry on in this way will mean writing and keeping track of multiple +`for` loops, which can be tedious. 
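One way to avoid hand-writing a separate loop for every hyperparameter
is to iterate over all combinations with `itertools.product`. The sketch
below assumes the same `grid` dictionary, imports, and data
(`X`, `y`) used above:

```
from itertools import product

# every (k, weight_function) combination in the grid
for k, weight_function in product(grid['k'], grid['weight_function']):
    # initialize the knn estimator with this combination
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, \
                                         weights=weight_function)
    # conduct a 10-fold cross-validation
    cv = model_selection.cross_val_score(knn, X, y, cv=10, \
                                         scoring='precision')
    # report the result
    print('k = {}, weight function = {}, mean precision = {}'\
          .format(k, weight_function, round(cv.mean(), 3)))
```

Even then, you still have to collect, rank, and compare the results
yourself.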
Thankfully, `scikit-learn`'s `model_selection` module gives us
a method called `GridSearchCV` that is much more
user-friendly. We will be looking at this next.


GridSearchCV
============


`GridSearchCV` is a tuning method in which a model is built and
evaluated for every combination of hyperparameters specified in a grid.
It automates the manual grid search process we just stepped through.



Tuning using GridSearchCV
-------------------------

We can conduct a grid search much more easily in practice by leveraging
`model_selection.GridSearchCV`.

For the sake of comparison, we will use the same breast cancer dataset
and k-NN classifier as before:

```
from sklearn import model_selection, datasets, neighbors
# load the data
cancer = datasets.load_breast_cancer()
# target
y = cancer.target
# features
X = cancer.data
```

The next thing we need to do after loading the data is to initialize the
class of the estimator we would like to evaluate under different
hyperparameterizations:

```
# initialize the estimator
knn = neighbors.KNeighborsClassifier()
```
We then define the grid:

```
# grid contains k and the weight function
grid = {'n_neighbors': [1, 3, 5, 7],\
        'weights': ['uniform', 'distance']}
```
To set up the search, we pass the freshly initialized estimator and our
grid of hyperparameters to `model_selection.GridSearchCV()`.
We must also specify a scoring metric, which is the method that will be
used to evaluate the performance of the various hyperparameterizations
tried during the search.

The last thing to do is set the number of splits to be used in
cross-validation via the `cv` argument. We will set this to
`10`, thereby conducting 10-fold cross-validation:

```
"""
set up the grid search with scoring on precision and
number of folds = 10
"""
gscv = model_selection.GridSearchCV(estimator=knn, \
                                    param_grid=grid, \
                                    scoring='precision', cv=10)
```

The last step is to feed data to this object via its `fit()`
method. Once this has been done, the grid search process will be
kick-started:

```
# start the search
gscv.fit(X, y)
```
By default, information relating to the search will be printed to the
screen, allowing you to see the exact estimator parameterizations that
will be evaluated for the k-NN estimator:

![](./images/B15019_08_06.jpg)

Caption: Estimator parameterizations for the k-NN estimator

Once the search is complete, we can examine the results by accessing and
printing the `cv_results_` attribute. `cv_results_`
is a dictionary containing helpful information regarding model
performance under each hyperparameterization, such as the mean test-set
value of your scoring metric (`mean_test_score`; because we are scoring
on precision here, the higher this value, the better), the complete list
of hyperparameterizations tried (`params`), and the model ranks as they
relate to the `mean_test_score` (`rank_test_score`).

The best model found will have rank = 1, the second-best model will have
rank = 2, and so on, as you can see in *Figure 8.8*. The model fitting
times are reported through `mean_fit_time`.
+ +Although not usually a consideration for smaller datasets, this value +can be important because in some cases you may find that a marginal +increase in model performance through a certain hyperparameterization is +associated with a significant increase in model fit time, which, +depending on the computing resources you have available, may render that +hyperparameterization infeasible because it will take too long to fit: + +``` +# view the results +print(gscv.cv_results_) +``` + +The output will be as follows: + +![](./images/B15019_08_07.jpg) + +Caption: GridsearchCV results + +The model ranks can be seen in the following image: + +![](./images/B15019_08_08.jpg) + +Caption: Model ranks + + + +For example, say we are only interested in each hyperparameterization +(`params`) and mean cross-validated test score +(`mean_test_score`) for the top five high - performing models: + +``` +import pandas as pd +# convert the results dictionary to a dataframe +results = pd.DataFrame(gscv.cv_results_) +""" +select just the hyperparameterizations tried, +the mean test scores, order by score and show the top 5 models +""" +print(results.loc[:,['params','mean_test_score']]\ + .sort_values('mean_test_score', ascending=False).head(5)) +``` +Running this code produces the following output: + +![](./images/B15019_08_09.jpg) + +Caption: mean\_test\_score for top 5 models + +We can also use pandas to produce visualizations of the result as +follows: + +``` +# visualise the result +results.loc[:,['params','mean_test_score']]\ + .plot.barh(x = 'params') +``` + +The output will be as follows: + +![](./images/B15019_08_10.jpg) + +Caption: Using pandas to visualize the output + + + +Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM +----------------------------------------------------------- + +In this exercise, we will employ a class of estimator called an SVM +classifier and tune its hyperparameters using a grid search strategy. + +The supervised learning objective we will focus on here is the +classification of handwritten digits (0-9) based solely on images. The +dataset we will use contains 1,797 labeled images of handwritten digits. + + + +1. Create a new notebook in Google Colab. + +2. Import `datasets`, `svm`, and + `model_selection` from scikit-learn: + ``` + from sklearn import datasets, svm, model_selection + ``` + + +3. Load the data. We will call this object images, and then we\'ll + isolate the target `y` and the features `X`. In + the training step, the SVM classifier will learn how `y` + relates to `X` and will therefore be able to predict new + `y` values when given new `X` values: + ``` + # load data + digits = datasets.load_digits() + # target + y = digits.target + # features + X = digits.data + ``` + + +4. Initialize the estimator as a multi-class SVM classifier and set the + `gamma` argument to `scale`: + + ``` + # support vector machine classifier + clr = svm.SVC(gamma='scale') + ``` + + +5. Define our grid to cover four distinct hyperparameterizations of the + classifier with a linear kernel and with a polynomial kernel of + degrees `2`, `3,` and `4`. We want to + see which of the four hyperparameterizations leads to more accurate + predictions: + ``` + # hyperparameter grid. contains linear and polynomial kernels + grid = [{'kernel': ['linear']},\ + {'kernel': ['poly'], 'degree': [2, 3, 4]}] + ``` + + +6. Set up grid search k-fold cross-validation with `10` folds + and a scoring measure of accuracy. 
Make sure it has our + `grid` and `estimator` objects as inputs: + ``` + """ + setting up the grid search to score on accuracy and + evaluate over 10 folds + """ + cv_spec = model_selection.GridSearchCV\ + (estimator=clr, param_grid=grid, \ + scoring='accuracy', cv=10) + ``` + + +7. Start the search by providing data to the `.fit()` method. + Details of the process, including the hyperparameterizations tried + and the scoring method selected, will be printed to the screen: + + ``` + # start the grid search + cv_spec.fit(X, y) + ``` + + + You should see the following output: + + +![](./images/B15019_08_11.jpg) + + + Caption: Grid Search using the .fit() method + +8. To examine all of the results, simply print + `cv_spec.cv_results_` to the screen. You will see that the + results are structured as a dictionary, allowing you to access the + information you require using the keys: + + ``` + # what is the available information + print(cv_spec.cv_results_.keys()) + ``` + + + You will see the following information: + + +![](./images/B15019_08_12.jpg) + + + Caption: Results as a dictionary + +9. For this exercise, we are primarily concerned with the test-set + performance of each distinct hyperparameterization. You can see the + first hyperparameterization through + `cv_spec.cv_results_['mean_test_score']`, and the second + through `cv_spec.cv_results_['params']`. + + Let\'s convert the results dictionary to a `pandas` + DataFrame and find the best hyperparameterization: + + ``` + import pandas as pd + # convert the dictionary of results to a pandas dataframe + results = pd.DataFrame(cv_spec.cv_results_) + # show hyperparameterizations + print(results.loc[:,['params','mean_test_score']]\ + .sort_values('mean_test_score', ascending=False)) + ``` + + + You will see the following results: + + +![](./images/B15019_08_13.jpg) + + + Caption: Parameterization results + + Note + + You may get slightly different results. However, the values you + obtain should largely agree with those in the preceding output. + +10. It is best practice to visualize any results you produce. + `pandas` makes this easy. Run the following code to + produce a visualization: + + ``` + # visualize the result + (results.loc[:,['params','mean_test_score']]\ + .sort_values('mean_test_score', ascending=True)\ + .plot.barh(x='params', xlim=(0.8))) + ``` + + + The output will be as follows: + + +![](./images/B15019_08_14.jpg) + + +Caption: Using pandas to visualize the results + + + +Advantages and Disadvantages of Grid Search +------------------------------------------- + +The primary advantage of the grid search compared to a manual search is +that it is an automated process that one can simply set and forget. +Additionally, you have the power to dictate the exact +hyperparameterizations evaluated, which can be a good thing when you +have prior knowledge of what kind of hyperparameterizations might work +well in your context. It is also easy to understand exactly what will +happen during the search thanks to the explicit definitions of the grid. + +The major drawback of the grid search strategy is that it is +computationally very expensive; that is, when the number of +hyperparameterizations to try increases substantially, processing times +can be very slow. Also, when you define your grid, you may inadvertently +omit an hyperparameterization that would in fact be optimal. If it is +not specified in your grid, it will never be tried + +To overcome these drawbacks, we will be looking at random search in the +next section. 
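Before moving on, here is a quick, hedged illustration of why grid
search cost grows so fast: the number of candidate hyperparameterizations
is the product of the number of values supplied for each hyperparameter.
The grid values below are made up purely for illustration:

```
import numpy as np

# a hypothetical k-NN grid with four hyperparameters
grid = {'n_neighbors': [1, 3, 5, 7, 9, 11],\
        'weights': ['uniform', 'distance'],\
        'p': [1, 2],\
        'leaf_size': [10, 20, 30, 40, 50]}
# candidates to evaluate = product of the number of options per hyperparameter
n_candidates = np.prod([len(values) for values in grid.values()])
# with 10-fold cross-validation, each candidate is fitted 10 times
print(n_candidates, n_candidates * 10)  # prints: 120 1200
```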
+


Random Search
=============


Instead of searching through every hyperparameterization in a
pre-defined set, as is the case with a grid search, in a random search
we sample from a distribution of possibilities by assuming each
hyperparameter to be a random variable. Before we go through the process
in depth, it will be helpful to briefly review what random variables are
and what we mean by a distribution.



Random Variables and Their Distributions
----------------------------------------

A random variable is non-constant (its value can change), and its
variability can be described in terms of a distribution. There are many
different types of distributions, but each falls into one of two broad
categories: discrete and continuous. We use discrete distributions to
describe random variables whose values can take only whole numbers, such
as counts.

An example is the count of visitors to a theme park in a day, or the
number of attempted shots it takes a golfer to get a hole-in-one.

We use continuous distributions to describe random variables whose
values lie along a continuum made up of infinitely small increments.
Examples include human height or weight, or outside air temperature.
Distributions often have parameters that control their shape.

Discrete distributions can be described mathematically using what\'s
called a probability mass function, which defines the exact probability
of the random variable taking a certain value. Common notation for the
left-hand side of this function is `P(X=x)`, which in plain
English means that the probability that the random variable
`X` equals a certain value `x` is `P`.
Remember that probabilities range between `0` (impossible) and
`1` (certain).

By definition, the summation of `P(X=x)` over all possible
`x`\'s will be equal to 1, or, expressed another way, the
probability that `X` will take some value is 1. A simple
example of this kind of distribution is the discrete uniform
distribution, where the random variable `X` will take only one
of a finite range of values and the probability of it taking any
particular value is the same for all values, hence the term uniform.

For example, if there are 10 possible values, the probability that
`X` is any particular value is exactly 1/10. If there were 6
possible values, as in the case of a standard 6-sided die, the
probability would be 1/6, and so on. The probability mass function for
the discrete uniform distribution is:

![](./images/B15019_08_15.jpg)

Caption: Probability mass function for the discrete uniform
distribution

The following code will allow us to see the form of this distribution
with 10 possible values of X.

First, we create a list of all the possible values `X` can
take:

```
# list of all xs
X = list(range(1, 11))
print(X)
```

The output will be as follows:

```
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
We then calculate the probability that `X` takes any
particular value of `x`, that is, `P(X=x)`:

```
# pmf: each of the n values has probability 1/n
p_X_x = [1/len(X)] * len(X)
# sums to 1
print(p_X_x)
```
As discussed, the summation of probabilities will equal 1, and this is
the case with any distribution. 
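
As a quick sanity check, we can confirm this numerically using the
`p_X_x` list we just built:

```
import math
# the probabilities of a valid distribution sum to 1
# (math.isclose allows for floating-point rounding)
print(math.isclose(sum(p_X_x), 1.0))
```

Running this prints `True`.
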
We now have everything we need to
visualize the distribution:

```
import matplotlib.pyplot as plt
plt.bar(X, p_X_x)
plt.xlabel('X')
plt.ylabel('P(X=x)')
```

The output will be as follows:

![](./images/B15019_08_16.jpg)

Caption: Visualizing the bar chart

In the visual output, we see that the probability of `X` being
a specific whole number between 1 and 10 is equal to 1/10.

Note

Other discrete distributions you commonly see include the binomial,
negative binomial, geometric, and Poisson distributions, all of which we
encourage you to investigate. Type these terms into a search engine to
find out more.

Distributions of continuous random variables are a bit more challenging
in that we cannot calculate an exact `P(X=x)` directly because
`X` lies on a continuum. We can, however, use integration to
calculate probabilities over a range of values, but this is beyond
the scope of this book. The relationship between `X` and
probability is described using a probability density function,
`P(X)`. Perhaps the most well-known continuous distribution is
the normal distribution, which visually takes the form of a bell.

The normal distribution has two parameters that describe its shape: the
mean (𝜇) and the variance (𝜎²). The probability density function for
the normal distribution is:

![](./images/B15019_08_17.jpg)

Caption: Probability density function for the normal distribution

The following code shows two normal distributions with the same mean
(𝜇 = 0) but different variance parameters (𝜎² = 1 and 𝜎² = 2.25).
Let\'s first generate 100 evenly spaced values from `-10` to
`10` using NumPy\'s `.linspace` method:

```
import numpy as np
# range of xs
x = np.linspace(-10, 10, 100)
```
We then calculate the probability densities of `X` for both
normal distributions.

Using `scipy.stats` is a good way to work with distributions,
and its `pdf` method allows us to easily visualize the shape
of probability density functions:

```
import scipy.stats as stats
# first normal distribution with mean = 0, variance = 1
# (standard deviation = 1.0)
p_X_1 = stats.norm.pdf(x=x, loc=0.0, scale=1.0)
# second normal distribution with mean = 0, variance = 2.25
# (standard deviation = 1.5)
p_X_2 = stats.norm.pdf(x=x, loc=0.0, scale=1.5)
```
Note

In this case, `loc` corresponds to 𝜇, while `scale` corresponds to the
standard deviation 𝜎, which is the square root of the variance 𝜎².
This is why we pass 1.0 and 1.5 rather than the variances themselves.

We then visualize the result. Notice that 𝜎² controls how spread out
the distribution is and therefore how variable the random variable is:

```
plt.plot(x, p_X_1, color='blue')
plt.plot(x, p_X_2, color='orange')
plt.xlabel('X')
plt.ylabel('P(X)')
```

The output will be as follows:

![](./images/B15019_08_18.jpg)

Caption: Visualizing the normal distribution



Simple Demonstration of the Random Search Process
-------------------------------------------------

Again, before we get to the scikit-learn implementation of random search
parameter tuning, we will step through the process using simple Python
tools. Up until this point, we have only been using classification
problems to demonstrate tuning concepts, but now we will look at a
regression problem. Can we find a model that\'s able to predict the
progression of diabetes in patients based on characteristics such as BMI
and age? 
+ + +We first load the data: + +``` +from sklearn import datasets, linear_model, model_selection +# load the data +diabetes = datasets.load_diabetes() +# target +y = diabetes.target +# features +X = diabetes.data +``` +To get a feel for the data, we can examine the disease progression for +the first patient: + +``` +# the first patient has index 0 +print(y[0]) +``` + +The output will be as follows: + +``` + 151.0 +``` +Let\'s now examine their characteristics: + +``` +# let's look at the first patients data +print(dict(zip(diabetes.feature_names, X[0]))) +``` +We should see the following: + +![](./images/B15019_08_19.jpg) + +Caption: Dictionary for patient characteristics + + + + +For ridge regression, we believe the optimal 𝛼 to be somewhere near 1, +becoming less likely as you move away from 1. A parameterization of the +gamma distribution that reflects this idea is where k and 𝜃 are both +equal to 1. To visualize the form of this distribution, we can run the +following: + +``` +import numpy as np +from scipy import stats +import matplotlib.pyplot as plt +# values of alpha +x = np.linspace(1, 20, 100) +# probabilities +p_X = stats.gamma.pdf(x=x, a=1, loc=1, scale=2) +plt.plot(x,p_X) +plt.xlabel('alpha') +plt.ylabel('P(alpha)') +``` + +The output will be as follows: + +![](./images/B15019_08_20.jpg) + +Caption: Visualization of probabilities + +In the graph, you can see how probability decays sharply for smaller +values of 𝛼, then decays more slowly for larger values. + +The next step in the random search process is to sample n values from +the chosen distribution. In this example, we will draw 100 𝛼 values. +Remember that the probability of drawing out a particular value of 𝛼 is +related to its probability as defined by this distribution: + +``` +# n sample values +n_iter = 100 +# sample from the gamma distribution +samples = stats.gamma.rvs(a=1, loc=1, scale=2, \ + size=n_iter, random_state=100) +``` +Note + +We set a random state to ensure reproducible results. + +Plotting a histogram of the sample, as shown in the following figure, +reveals a shape that approximately conforms to the distribution that we +have sampled from. Note that as your sample sizes increases, the more +the histogram conforms to the distribution: + +``` +# visualize the sample distribution +plt.hist(samples) +plt.xlabel('alpha') +plt.ylabel('sample count') +``` + +The output will be as follows: + +![](./images/B15019_08_21.jpg) + +Caption: Visualization of the sample distribution + +A model will then be fitted for each value of 𝛼 sampled and assessed for +performance. As we have seen with the other approaches to hyperparameter +tuning in this lab, performance will be assessed using k-fold +cross-validation (with `k =10`) but because we are dealing +with a regression problem, the performance metric will be the test-set +negative MSE. + +Using this metric means larger values are better. 
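
If the sign convention seems confusing, the following minimal sketch
(using made-up true and predicted values that have nothing to do with
the diabetes data) shows how the negated MSE relates to the ordinary
MSE:

```
from sklearn.metrics import mean_squared_error
# hypothetical true and predicted values, for illustration only
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]
mse = mean_squared_error(y_true, y_pred)
# ordinary MSE: lower is better
print(mse)
# negated MSE, as returned when scoring='neg_mean_squared_error':
# higher (closer to zero) is better
print(-mse)
```

The closer the negative MSE is to zero, the better the model.
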
We will store the +results in a dictionary with each 𝛼 value as the key and the +corresponding cross-validated negative MSE as the value: + +``` +# we will store the results inside a dictionary +result = {} +# for each sample +for sample in samples: + """ + initialize a ridge regression estimator with alpha set + to the sample value + """ + reg = linear_model.Ridge(alpha=sample) + """ + conduct a 10-fold cross validation scoring on + negative mean squared error + """ + cv = model_selection.cross_val_score\ + (reg, X, y, cv=10, \ + scoring='neg_mean_squared_error') + # retain the result in the dictionary + result[sample] = [cv.mean()] +``` + +Instead of examining the raw dictionary of results, we will convert it +to a pandas DataFrame, transpose it, and give the columns names. Sorting +by descending negative mean squared error reveals that the optimal level +of regularization for this problem is actually when 𝛼 is approximately +1, meaning that we did not find evidence to suggest regularization is +necessary for this problem and that the OLS linear model will suffice: + +``` +import pandas as pd +""" +convert the result dictionary to a pandas dataframe, +transpose and reset the index +""" +df_result = pd.DataFrame(result).T.reset_index() +# give the columns sensible names +df_result.columns = ['alpha', 'mean_neg_mean_squared_error'] +print(df_result.sort_values('mean_neg_mean_squared_error', \ + ascending=False).head()) +``` + +The output will be as follows: + +![](./images/B15019_08_22.jpg) + +Caption: Output for the random search process + +Note + +The results will be different, depending on the data used. + +It is always beneficial to visualize results where possible. Plotting 𝛼 +by negative mean squared error as a scatter plot makes it clear that +venturing away from 𝛼 = 1 does not result in improvements in predictive +performance: + +``` +plt.scatter(df_result.alpha, \ + df_result.mean_neg_mean_squared_error) +plt.xlabel('alpha') +plt.ylabel('-MSE') +``` + +The output will be as follows: + +![](./images/B15019_08_23.jpg) + +Caption: Plotting the scatter plot + +The fact that we found the optimal 𝛼 to be 1 (its default value) is a +special case in hyperparameter tuning in that the optimal +hyperparameterization is the default one. + + + +Tuning Using RandomizedSearchCV +------------------------------- + +In practice, we can use the `RandomizedSearchCV` method inside +scikit-learn\'s `model_selection` module to conduct the +search. All you need to do is pass in your estimator, the +hyperparameters you wish to tune along with their distributions, the +number of samples you would like to sample from each distribution, and +the metric by which you would like to assess model performance. These +correspond to the `param_distributions`, `n_iter`, +and `scoring` arguments respectively. For the sake of +demonstration, let\'s conduct the search we completed earlier using +`RandomizedSearchCV`. 
First, we load the data and initialize +our ridge regression estimator: + +``` +from sklearn import datasets, model_selection, linear_model +# load the data +diabetes = datasets.load_diabetes() +# target +y = diabetes.target +# features +X = diabetes.data +# initialise the ridge regression +reg = linear_model.Ridge() +``` +We then specify that the hyperparameter we would like to tune is +`alpha` and that we would like 𝛼 to be distributed +`gamma`, with `k = 1` and +`𝜃`` = 1`: + +``` +from scipy import stats +# alpha ~ gamma(1,1) +param_dist = {'alpha': stats.gamma(a=1, loc=1, scale=2)} +``` +Next, we set up and run the random search process, which will sample 100 +values from our `gamma(1,1)` distribution, fit the ridge +regression, and evaluate its performance using cross-validation scored +on the negative mean squared error metric: + +``` +""" +set up the random search to sample 100 values and +score on negative mean squared error +""" +rscv = model_selection.RandomizedSearchCV\ + (estimator=reg, param_distributions=param_dist, \ + n_iter=100, scoring='neg_mean_squared_error') +# start the search +rscv.fit(X,y) +``` +After completing the search, we can extract the results and generate a +pandas DataFrame, as we have done previously. Sorting by +`rank_test_score` and viewing the first five rows aligns with +our conclusion that alpha should be set to 1 and regularization does not +seem to be required for this problem: + +``` +import pandas as pd +# convert the results dictionary to a pandas data frame +results = pd.DataFrame(rscv.cv_results_) +# show the top 5 hyperparamaterizations +print(results.loc[:,['params','rank_test_score']]\ + .sort_values('rank_test_score').head(5)) +``` + +The output will be as follows: + +![](./images/B15019_08_24.jpg) + +Caption: Output for tuning using RandomizedSearchCV + +Note + +The preceding results may vary, depending on the data. + + + +Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier +--------------------------------------------------------------------------------- + +In this exercise, we will revisit the handwritten digit classification +problem, this time using a random forest classifier with hyperparameters +tuned using a random search strategy. The random forest is a popular +method used for both single-class and multi-class classification +problems. It learns by growing `n` simple tree models that +each progressively split the dataset into areas that best separate the +points of different classes. + +The final model produced can be thought of as the average of each of the +n tree models. In this way, the random forest is an `ensemble` +method. The parameters we will tune in this exercise are +`criterion` and `max_features`. + +`criterion` refers to the way in which each split is evaluated +from a class purity perspective (the purer the splits, the better) and +`max_features` is the maximum number of features the random +forest can use when finding the best splits. + +The following steps will help you complete the exercise. + +1. Create a new notebook in Google Colab. + +2. Import the data and isolate the features `X` and the + target `y`: + ``` + from sklearn import datasets + # import data + digits = datasets.load_digits() + # target + y = digits.target + # features + X = digits.data + ``` + + +3. Initialize the random forest classifier estimator. We will set the + `n_estimators` hyperparameter to `100`, which + means the predictions of the final model will essentially be an + average of `100` simple tree models. 
Note the use of a + random state to ensure the reproducibility of results: + ``` + from sklearn import ensemble + # an ensemble of 100 estimators + rfc = ensemble.RandomForestClassifier(n_estimators=100, \ + random_state=100) + ``` + + +4. One of the parameters we will be tuning is `max_features`. + Let\'s find out the maximum value this could take: + + ``` + # how many features do we have in our dataset? + n_features = X.shape[1] + print(n_features) + ``` + + + You should see that we have 64 features: + + ``` + 64 + ``` + + + Now that we know the maximum value of `max_features` we + are free to define our hyperparameter inputs to the randomized + search process. At this point, we have no reason to believe any + particular value of `max_features` is more optimal. + +5. Set a discrete uniform distribution covering the range `1` + to `64`. Remember the probability mass function, + `P(X=x) = 1/n`, for this distribution, so + `P(X=x) = 1/64` in our case. Because `criterion` + has only two discrete options, this will also be sampled as a + discrete uniform distribution with `P(X=x) = ½`: + ``` + from scipy import stats + """ + we would like to smaple from criterion and + max_features as discrete uniform distributions + """ + param_dist = {'criterion': ['gini', 'entropy'],\ + 'max_features': stats.randint(low=1, \ + high=n_features)} + ``` + + +6. We now have everything we need to set up the randomized search + process. As before, we will use accuracy as the metric of model + evaluation. Note the use of a random state: + ``` + from sklearn import model_selection + """ + setting up the random search sampling 50 times and + conducting 5-fold cross-validation + """ + rscv = model_selection.RandomizedSearchCV\ + (estimator=rfc, param_distributions=param_dist, \ + n_iter=50, cv=5, scoring='accuracy' , random_state=100) + ``` + + +7. Let\'s kick off the process with the. `fit` method. Please + note that both fitting random forests and cross-validation are + computationally expensive processes due to their internal processes + of iteration. Generating a result may take some time: + + ``` + # start the process + rscv.fit(X,y) + ``` + + + You should see the following: + + +![](./images/B15019_08_25.jpg) + + + Caption: RandomizedSearchCV results + +8. Next, you need to examine the results. Create a `pandas` + DataFrame from the `results` attribute, order by the + `rank_test_score`, and look at the top five model + hyperparameterizations. Note that because the random search draws + samples of hyperparameterizations at random, it is possible to have + duplication. We remove the duplicate entries from the DataFrame: + + ``` + import pandas as pd + # convert the dictionary of results to a pandas dataframe + results = pd.DataFrame(rscv.cv_results_) + # removing duplication + distinct_results = results.loc[:,['params',\ + 'mean_test_score']] + # convert the params dictionaries to string data types + distinct_results.loc[:,'params'] = distinct_results.loc\ + [:,'params'].astype('str') + # remove duplicates + distinct_results.drop_duplicates(inplace=True) + # look at the top 5 best hyperparamaterizations + distinct_results.sort_values('mean_test_score', \ + ascending=False).head(5) + ``` + + + You should get the following output: + + +![](./images/B15019_08_26.jpg) + + + Caption: Top five hyperparameterizations + + Note + + You may get slightly different results. However, the values you + obtain should largely agree with those in the preceding output. + +9. The last step is to visualize the result. 
Including every + parameterization will result in a cluttered plot, so we will filter + on parameterizations that resulted in a mean test score \> 0.93: + + ``` + # top performing models + distinct_results[distinct_results.mean_test_score > 0.93]\ + .sort_values('mean_test_score')\ + .plot.barh(x='params', xlim=(0.9)) + ``` + + + The output will be as follows: + + +![Caption: Visualizing the test scores of the top-performing + models ](./images/B15019_08_27.jpg) + + +Caption: Visualizing the test scores of the top-performing models + + + +Advantages and Disadvantages of a Random Search +----------------------------------------------- + +Because a random search takes a finite sample from a range of possible +hyperparameterizations (`n_iter` in +`model_selection.RandomizedSearchCV`), it is feasible to +expand the range of your hyperparameter search beyond what would be +practical with a grid search. This is because a grid search has to try +everything in the range, and setting a large range of values may be too +slow to process. Searching this wider range gives you the chance of +discovering a truly optimal solution. + +Compared to the manual and grid search strategies, you do sacrifice a +level of control to obtain this benefit. The other consideration is that +setting up random search is a bit more involved than other options in +that you have to specify distributions. There is always a chance of +getting this wrong. That said, if you are unsure about what +distributions to use, stick with discrete or continuous uniform for the +respective variable types as this will assign an equal probability of +selection to all options. + + + +Activity 8.01: Is the Mushroom Poisonous? +----------------------------------------- + +Imagine you are a data scientist working for the biology department at +your local university. Your colleague who is a mycologist (a biologist +who specializes in fungi) has requested that you help her develop a +machine learning model capable of discerning whether a particular +mushroom species is poisonous or not given attributes relating to its +appearance. + +The objective of this activity is to employ the grid and randomized +search strategies to find an optimal model for this purpose. + + + +1. Load the data into Python using the `pandas.read_csv()` + method, calling the object `mushrooms`. + + Hint: The dataset is in CSV format and has no header. Set + `header=None` in `pandas.read_csv()`. + +2. Separate the target, `y` and features, `X` from + the dataset. + + Hint: The target can be found in the first column + (`mushrooms.iloc[:,0]`) and the features in the remaining + columns (`mushrooms.iloc[:,1:]`). + +3. Recode the target, `y`, so that poisonous mushrooms are + represented as `1` and edible mushrooms as `0`. + +4. Transform the columns of the feature set `X` into a + `numpy` array with a binary representation. This is known + as one-hot encoding. + + Hint: Use `preprocessing.OneHotEncoder()` to transform + `X`. + +5. Conduct both a grid and random search to find an optimal + hyperparameterization for a random forest classifier. Use accuracy + as your method of model evaluation. Make sure that when you + initialize the classifier and when you conduct your random search, + `random_state = 100`. 
+ + For the grid search, use the following: + + ``` + {'criterion': ['gini', 'entropy'],\ + 'max_features': [2, 4, 6, 8, 10, 12, 14]} + ``` + + + For the randomized search, use the following: + + ``` + {'criterion': ['gini', 'entropy'],\ + 'max_features': stats.randint(low=1, high=max_features)} + ``` + + +6. Plot the mean test score versus hyperparameterization for the top 10 + models found using random search. + + You should see a plot similar to the following: + +![](./images/B15019_08_28.jpg) + +Caption: Mean test score plot + + +Summary +======= + + +In this lab, we have covered three strategies for hyperparameter +tuning based on searching for estimator hyperparameterizations that +improve performance. + + +The grid search is an automated method that is the most systematic of +the three but can be very computationally intensive to run when the +range of possible hyperparameterizations increases. +The random search, while the most complicated to set up, is based on +sampling from distributions of hyperparameters. \ No newline at end of file diff --git a/lab_guides/Lab_9.md b/lab_guides/Lab_9.md new file mode 100644 index 0000000..0b8ad5f --- /dev/null +++ b/lab_guides/Lab_9.md @@ -0,0 +1,1565 @@ + +9. Interpreting a Machine Learning Model +======================================== + + + +Overview + +This lab will show you how to interpret a machine learning model\'s +results and get deeper insights into the patterns it found. By the end +of the lab, you will be able to analyze weights from linear models +and variable importance for `RandomForest`. You will be able +to implement variable importance via permutation to analyze feature +importance. You will use a partial dependence plot to analyze single +variables and make use of the lime package for local interpretation. + + +Introduction +============ + + +In the previous lab, you saw how to find the optimal hyperparameters +of some of the most popular machine learning algorithms in order to get +better predictive performance (that is, more accurate predictions). + +Machine learning algorithms are always referred to as black box where we +can only see the inputs and outputs and the implementation inside the +algorithm is quite opaque, so people don\'t know what is happening +inside. + +With each day that passes, we can sense the elevated need for more +transparency in machine learning models. In the last few years, we have +seen some cases where algorithms have been accused of discriminating +against groups of people. For instance, a few years ago, a +not-for-profit news organization called ProPublica highlighted bias in +the COMPAS algorithm, built by the Northpointe company. The objective of +the algorithm is to assess the likelihood of re-offending for a +criminal. It was shown that the algorithm was predicting a higher level +of risk for specific groups of people based on their demographics rather +than other features. This example highlighted the importance of +interpreting the results of your model and its logic properly and +clearly. + +Luckily, some machine learning algorithms provide methods to understand +the parameters they learned for a given task and dataset. There are also +some functions that are model-agnostic and can help us to better +understand the predictions made. So, there are different techniques that +are either model-specific or model-agnostic for interpreting a model. + +These techniques can also differ in their scope. In the literature, we +either have a global or local interpretation. 
A global interpretation
means we are looking at the variables for all observations in a
dataset, and we want to understand which features have the biggest
overall influence on the target variable. For instance, if you are
predicting customer churn for a telco company, you may find that the
most important features for your model are customer usage and the
average monthly amount paid. Local interpretation, on the other hand,
focuses on a single observation and analyzes the impact of the
different variables on that one prediction. We look at a single
specific case and see what led the model to its final prediction. For
example, you might look at a specific customer who is predicted to
churn and discover that they usually buy the new iPhone model every
year, in September.

In this lab, we will go through some techniques for interpreting
your models and their results.


Linear Model Coefficients
=========================


In *Lab 2, Regression*, and *Lab 3, Binary Classification*, you
saw that linear regression models learn function parameters in the form
of the following:

![](./images/B15019_09_01.jpg)


In `sklearn`, it is extremely easy to get the coefficients of a
linear model; you just need to call the `coef_` attribute.
Let\'s implement this on a real example with the Diabetes dataset from
`sklearn`:

```
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
data = load_diabetes()
# fit a linear regression model to the data
lr_model = LinearRegression()
lr_model.fit(data.data, data.target)
lr_model.coef_
```

The output will be as follows:

![](./images/B15019_09_02.jpg)

Caption: Coefficients of the linear regression parameters

Let\'s create a DataFrame with these values and column names:

```
import pandas as pd
coeff_df = pd.DataFrame()
coeff_df['feature'] = data.feature_names
coeff_df['coefficient'] = lr_model.coef_
coeff_df.head()
```

The output will be as follows:

![](./images/B15019_09_03.jpg)

Caption: Coefficients of the linear regression model

A large positive or a large negative number for a feature coefficient
means it has a strong influence on the outcome. On the other hand, if
the coefficient is close to 0, the variable does not have much impact
on the prediction.

From this table, we can see that the `s1` column has a large negative
coefficient (`-792.184162`), so it negatively influences the final
prediction: if `s1` increases by one unit, the predicted value will
decrease by 792.184162. On the other hand, `bmi` has a large positive
coefficient (`519.839787`), so the risk of diabetes is highly linked to
this feature: an increase in body mass index (BMI) means a significant
increase in the risk of diabetes.



Exercise 9.01: Extracting the Linear Regression Coefficient
-----------------------------------------------------------

In this exercise, we will train a linear regression model to predict the
customer drop-out ratio and extract its coefficients.


The following steps will help you complete the exercise:

1. Open a new Colab notebook.

2. 
Import the following packages: `pandas`, + `train_test_split` from + `sklearn.model_selection`, `StandardScaler` from + `sklearn.preprocessing`, `LinearRegression` from + `sklearn.linear_model`, `mean_squared_error` + from `sklearn.metrics`, and `altair`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.preprocessing import StandardScaler + from sklearn.linear_model import LinearRegression + from sklearn.metrics import mean_squared_error + import altair as alt + ``` + + +3. Create a variable called `file_url` that contains the URL + to the dataset: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab09/Dataset/phpYYZ4Qc.csv' + ``` + + +4. Load the dataset into a DataFrame called `df` using + `.read_csv()`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Print the first five rows of the DataFrame: + + ``` + df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_09_04.jpg) + + + Caption: First five rows of the loaded DataFrame + + +6. Extract the `rej` column using `.pop()` and save + it into a variable called `y`: + ``` + y = df.pop('rej') + ``` + + +7. Print the summary of the DataFrame using `.describe()`. + + ``` + df.describe() + ``` + + + You should get the following output: + + +![](./images/B15019_09_05.jpg) + + + Caption: Statistical measures of the DataFrame + + Note + + The preceding figure is a truncated version of the output. + + From this output, we can see the data is not standardized. The + variables have different scales. + +8. Split the DataFrame into training and testing sets using + `train_test_split()` with `test_size=0.3` and + `random_state = 1`: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.3, \ + random_state=1) + ``` + + +9. Instantiate `StandardScaler`: + ``` + scaler = StandardScaler() + ``` + + +10. Train `StandardScaler` on the training set and standardize + it using `.fit_transform()`: + ``` + X_train = scaler.fit_transform(X_train) + ``` + + +11. Standardize the testing set using `.transform()`: + ``` + X_test = scaler.transform(X_test) + ``` + + +12. Instantiate `LinearRegression` and save it to a variable + called `lr_model`: + ``` + lr_model = LinearRegression() + ``` + + +13. Train the model on the training set using `.fit()`: + + ``` + lr_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_09_06.jpg) + + + Caption: Logs of LinearRegression + +14. Predict the outcomes of the training and testing sets using + `.predict()`: + ``` + preds_train = lr_model.predict(X_train) + preds_test = lr_model.predict(X_test) + ``` + + +15. Calculate the mean squared error on the training set and print its + value: + + ``` + train_mse = mean_squared_error(y_train, preds_train) + train_mse + ``` + + + You should get the following output: + + +![](./images/B15019_09_07.jpg) + + + Caption: MSE score of the training set + + We achieved quite a low MSE score on the training set. + +16. Calculate the mean squared error on the testing set and print its + value: + + ``` + test_mse = mean_squared_error(y_test, preds_test) + test_mse + ``` + + + You should get the following output: + + +![](./images/B15019_09_08.jpg) + + + Caption: MSE score of the testing set + + We also have a low MSE score on the testing set that is very similar + to the training one. So, our model is not overfitting. + + Note + + You may get slightly different outputs than those present here. 
+ However, the values you would obtain should largely agree with those + obtained in this exercise. + +17. Print the coefficients of the linear regression model using + `.coef_`: + + ``` + lr_model.coef_ + ``` + + + You should get the following output: + + +![](./images/B15019_09_09.jpg) + + + Caption: Coefficients of the linear regression model + +18. Create an empty DataFrame called `coef_df`: + ``` + coef_df = pd.DataFrame() + ``` + + +19. Create a new column called `feature` for this DataFrame + with the name of the columns of `df` using + `.columns`: + ``` + coef_df['feature'] = df.columns + ``` + + +20. Create a new column called `coefficient` for this + DataFrame with the coefficients of the linear regression model using + `.coef_`: + ``` + coef_df['coefficient'] = lr_model.coef_ + ``` + + +21. Print the first five rows of `coef_df`: + + ``` + coef_df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_09_10.jpg) + + + Caption: The first five rows of coef\_df + + From this output, we can see the variables `a1sx` and + `a1sy` have the lowest value (the biggest negative value) + so they are contributing more to the prediction than the three other + variables shown here. + +22. Plot a bar chart with Altair using `coef_df` and + `coefficient` as the `x` axis and + `feature` as the `y` axis: + + ``` + alt.Chart(coef_df).mark_bar().encode(x='coefficient',\ + y="feature") + ``` + + + You should get the following output: + + +![Caption: Graph showing the coefficients of the linear + regression model ](./images/B15019_09_11.jpg) + + + +RandomForest Variable Importance +================================ + + +After training `RandomForest`, you can assess its variable +importance (or feature importance) with the +`feature_importances_` attribute. + +Let\'s see how to extract this information from the Breast Cancer +dataset from `sklearn`: + +``` +from sklearn.datasets import load_breast_cancer +from sklearn.ensemble import RandomForestClassifier +data = load_breast_cancer() +X, y = data.data, data.target +rf_model = RandomForestClassifier(random_state=168) +rf_model.fit(X, y) +rf_model.feature_importances_ +``` + +The output will be as shown in the following figure: + +![](./images/B15019_09_12.jpg) + +Caption: Feature importance of a Random Forest model + +Note + +Due to randomization, you may get a slightly different result. + +It might be a little difficult to evaluate which importance value +corresponds to which variable from this output. Let\'s create a +DataFrame that will contain these values with the name of the columns: + +``` +import pandas as pd +varimp_df = pd.DataFrame() +varimp_df['feature'] = data.feature_names +varimp_df['importance'] = rf_model.feature_importances_ +varimp_df.head() +``` + +The output will be as follows: + +![](./images/B15019_09_13.jpg) + +Caption: RandomForest variable importance for the first five +features of the Breast Cancer dataset + +From this output, we can see that `mean radius` and +`mean perimeter` have the highest scores, which means they are +the most important in predicting the target variable. The +`mean smoothness` column has a very low value, so it seems it +doesn\'t influence the model much to predict the output. + +Note + +The range of values of variable importance is different for datasets; it +is not a standardized measure. 
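
Because the raw importance values are easier to read as a ranking, it
can also help to sort them before plotting. This is an optional step
that simply reuses the `varimp_df` DataFrame created above:

```
# rank the features from most to least important
print(varimp_df.sort_values('importance', ascending=False).head(10))
```

Sorting does not change the values themselves; it only makes the
ranking easier to read.
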
+ +Let\'s plot these variable importance values onto a graph using +`altair`: + +``` +import altair as alt +alt.Chart(varimp_df).mark_bar().encode(x='importance',\ + y="feature") +``` + +The output will be as follows: + +![](./images/B15019_09_14.jpg) + +Caption: Graph showing RandomForest variable importance + + +Exercise 9.02: Extracting RandomForest Feature Importance +--------------------------------------------------------- + +In this exercise, we will extract the feature importance of a Random +Forest classifier model trained to predict the customer drop-out ratio. + +We will be using the same dataset as in the previous exercise. + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Import the following packages: `pandas`, + `train_test_split` from + `sklearn.model_selection`, and + `RandomForestRegressor` from `sklearn.ensemble`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestRegressor + from sklearn.metrics import mean_squared_error + import altair as alt + ``` + + +3. Create a variable called `file_url` that contains the URL + to the dataset: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab09/Dataset/phpYYZ4Qc.csv' + ``` + + +4. Load the dataset into a DataFrame called `df` using + `.read_csv()`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Extract the `rej` column using `.pop()` and save + it into a variable called `y`: + ``` + y = df.pop('rej') + ``` + + +6. Split the DataFrame into training and testing sets using + `train_test_split()` with `test_size=0.3` and + `random_state = 1`: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.3, \ + random_state=1) + ``` + + +7. Instantiate `RandomForestRegressor` with + `random_state=1`, `n_estimators=50`, + `max_depth=6`, and `min_samples_leaf=60`: + ``` + rf_model = RandomForestRegressor(random_state=1, \ + n_estimators=50, max_depth=6,\ + min_samples_leaf=60) + ``` + + +8. Train the model on the training set using `.fit()`: + + ``` + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_09_15.jpg) + + + Caption: Logs of the Random Forest model + +9. Predict the outcomes of the training and testing sets using + `.predict()`: + ``` + preds_train = rf_model.predict(X_train) + preds_test = rf_model.predict(X_test) + ``` + + +10. Calculate the mean squared error on the training set and print its + value: + + ``` + train_mse = mean_squared_error(y_train, preds_train) + train_mse + ``` + + + You should get the following output: + + +![](./images/B15019_09_16.jpg) + + + Caption: MSE score of the training set + + We achieved quite a low MSE score on the training set. + +11. Calculate the MSE on the testing set and print its value: + + ``` + test_mse = mean_squared_error(y_test, preds_test) + test_mse + ``` + + + You should get the following output: + + +![](./images/B15019_09_17.jpg) + + + Caption: MSE score of the testing set + + We also have a low MSE score on the testing set that is very similar + to the training one. So, our model is not overfitting. + +12. Print the variable importance using + `.feature_importances_`: + + ``` + rf_model.feature_importances_ + ``` + + + You should get the following output: + + +![](./images/B15019_09_18.jpg) + + + Caption: MSE score of the testing set + +13. Create an empty DataFrame called `varimp_df`: + ``` + varimp_df = pd.DataFrame() + ``` + + +14. 
Create a new column called `feature` for this DataFrame + with the name of the columns of `df`, using + `.columns`: + ``` + varimp_df['feature'] = df.columns + varimp_df['importance'] = rf_model.feature_importances_ + ``` + + +15. Print the first five rows of `varimp_df`: + + ``` + varimp_df.head() + ``` + + + You should get the following output: + + +![](./images/B15019_09_19.jpg) + + + Caption: Variable importance of the first five variables + + From this output, we can see the variables `a1cy` and + `a1sy` have the highest value, so they are more important + for predicting the target variable than the three other variables + shown here. + +16. Plot a bar chart with Altair using `coef_df` and + `importance` as the `x` axis and + `feature` as the `y` axis: + + ``` + alt.Chart(varimp_df).mark_bar().encode(x='importance',\ + y="feature") + ``` + + + You should get the following output: + + +![](./images/B15019_09_20.jpg) + + +Caption: Graph showing the variable importance of the first five +variables + +From this output, we can see the variables that impact the prediction +the most for this Random Forest model are `a2pop`, +`a1pop`, `a3pop`, `b1eff`, and +`temp`, by decreasing order of importance. + + + +Variable Importance via Permutation +=================================== + + +In the previous section, we saw how to extract feature importance for +RandomForest. There is actually another technique that shares the same +name, but its underlying logic is different and can be applied to any +algorithm, not only tree-based ones. + +This technique can be referred to as variable importance via +permutation. Let\'s say we trained a model to predict a target variable +with five classes and achieved an accuracy of 0.95. One way to assess +the importance of one of the features is to remove and train a model and +see the new accuracy score. If the accuracy score dropped significantly, +then we could infer that this variable has a significant impact on the +prediction. On the other hand, if the score slightly decreased or stayed +the same, we could say this variable is not very important and doesn\'t +influence the final prediction much. So, we can use this difference +between the model\'s performance to assess the importance of a variable. + +The drawback of this method is that you need to retrain a new model for +each variable. If it took you a few hours to train the original model +and you have 100 different features, it would take quite a while to +compute the importance of each variable. It would be great if we didn\'t +have to retrain different models. So, another solution would be to +generate noise or new values for a given column and predict the final +outcomes from this modified data and compare the accuracy score. For +example, if you have a column with values between 0 and 100, you can +take the original data and randomly generate new values for this column +(keeping all other variables the same) and predict the class for them. + +This option also has a catch. The randomly generated values can be very +different from the original data. Going back to the same example we saw +before, if the original range of values for a column is between 0 and +100 and we generate values that can be negative or take a very high +value, it is not very representative of the real distribution of the +original data. So, we will need to understand the distribution of each +variable before generating new values. 
+ +Rather than generating random values, we can simply swap (or permute) +values of a column between different rows and use these modified cases +for predictions. Then, we can calculate the related accuracy score and +compare it with the original one to assess the importance of this +variable. For example, we have the following rows in the original +dataset: + +![](./images/B15019_09_21.jpg) + +Caption: Example of the dataset + +We can swap the values for the X1 column and get a new dataset: + +![](./images/B15019_09_22.jpg) + +Caption: Example of a swapped column from the dataset + +The `mlxtend` package provides a function to perform variable +permutation and calculate variable importance values: +`feature_importance_permutation`. Let\'s see how to use it +with the Breast Cancer dataset from `sklearn`. + +First, let\'s load the data and train a Random Forest model: + +``` +from sklearn.datasets import load_breast_cancer +from sklearn.ensemble import RandomForestClassifier + +data = load_breast_cancer() +X, y = data.data, data.target +rf_model = RandomForestClassifier(random_state=168) +rf_model.fit(X, y) +``` + +Then, we will call the `feature_importance_permutation` +function from `mlxtend.evaluate`. This function takes the +following parameters: + +- `predict_method`: A function that will be called for model + prediction. Here, we will provide the `predict` method + from our trained `rf_model` model. +- `X`: The features from the dataset. It needs to be in + NumPy array form. +- `y`: The target variable from the dataset. It needs to be + in `Numpy` array form. +- `metric`: The metric used for comparing the performance of + the model. For the classification task, we will use accuracy. +- `num_round`: The number of rounds `mlxtend` will + perform permutation on the data and assess the performance change. +- `seed`: The seed set for getting reproducible results. + +Consider the following code snippet: + +``` +from mlxtend.evaluate import feature_importance_permutation +imp_vals, _ = feature_importance_permutation\ + (predict_method=rf_model.predict, X=X, y=y, \ + metric='r2', num_rounds=1, seed=2) +imp_vals +``` + +The output should be as follows: + +![](./images/B15019_09_23.jpg) + +Caption: Variable importance by permutation + +Let\'s create a DataFrame containing these values and the names of the +features and plot them on a graph with `altair`: + +``` +import pandas as pd +varimp_df = pd.DataFrame() +varimp_df['feature'] = data.feature_names +varimp_df['importance'] = imp_vals +varimp_df.head() +import altair as alt +alt.Chart(varimp_df).mark_bar().encode(x='importance',\ + y="feature") +``` + +The output should be as follows: + +![](./images/B15019_09_24.jpg) + +Caption: Graph showing variable importance by permutation + +These results are different from the ones we got from +`RandomForest` in the previous section. Here, worst concave +points is the most important, followed by worst area, and worst +perimeter has a higher value than mean radius. So, we got the same list +of the most important variables but in a different order. This confirms +these three features are indeed the most important in predicting whether +a tumor is malignant or not. The variable importance from +`RandomForest` and the permutation have different logic, +therefore, you might get different outputs when you run the code given +in the preceding section. 
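
As a side note, recent versions of scikit-learn ship a similar utility,
`permutation_importance`, in the `sklearn.inspection` module. Assuming
your installed version includes it, a minimal sketch reusing the
`rf_model`, `X`, `y`, and `data` objects from the preceding code could
look like this:

```
from sklearn.inspection import permutation_importance
# permute each feature 5 times and measure the average drop in accuracy
perm = permutation_importance(rf_model, X, y, scoring='accuracy',\
                              n_repeats=5, random_state=2)
# pair each feature name with its mean importance
for name, imp in zip(data.feature_names, perm.importances_mean):
    print(f'{name}: {imp:.4f}')
```

Both functions follow the same underlying idea of shuffling one column
at a time and measuring how much the model\'s score degrades.
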
+ + + +Exercise 9.03: Extracting Feature Importance via Permutation +------------------------------------------------------------ + +In this exercise, we will compute and extract feature importance by +permutating a Random Forest classifier model trained to predict the +customer drop-out ratio. + +We will using the same dataset as in the previous exercise. + +The following steps will help you complete the exercise: + +1. Open a new Colab notebook. + +2. Import the following packages: `pandas`, + `train_test_split` from + `sklearn.model_selection`, + `RandomForestRegressor` from `sklearn.ensemble`, + `feature_importance_permutation` from + `mlxtend.evaluate`, and `altair`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestRegressor + from mlxtend.evaluate import feature_importance_permutation + import altair as alt + ``` + + +3. Create a variable called `file_url` that contains the URL + of the dataset: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab09/Dataset/phpYYZ4Qc.csv' + ``` + + +4. Load the dataset into a DataFrame called `df` using + `.read_csv()`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Extract the `rej` column using `.pop()` and save + it into a variable called `y`: + ``` + y = df.pop('rej') + ``` + + +6. Split the DataFrame into training and testing sets using + `train_test_split()` with `test_size=0.3` and + `random_state = 1`: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.3, \ + random_state=1) + ``` + + +7. Instantiate `RandomForestRegressor` with + `random_state=1`, `n_estimators=50`, + `max_depth=6`, and `min_samples_leaf=60`: + ``` + rf_model = RandomForestRegressor(random_state=1, \ + n_estimators=50, max_depth=6, \ + min_samples_leaf=60) + ``` + + +8. Train the model on the training set using `.fit()`: + + ``` + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_09_25.jpg) + + + Caption: Logs of RandomForest + +9. Extract the feature importance via permutation using + `feature_importance_permutation` from `mlxtend` + with the Random Forest model, the testing set, `r2` as the + metric used, `num_rounds=1`, and `seed=2`. Save + the results into a variable called `imp_vals` and print + its values: + + ``` + imp_vals, _ = feature_importance_permutation\ + (predict_method=rf_model.predict, \ + X=X_test.values, y=y_test.values, \ + metric='r2', num_rounds=1, seed=2) + imp_vals + ``` + + + You should get the following output: + + +![](./images/B15019_09_26.jpg) + + + Caption: Variable importance by permutation + + It is quite hard to interpret the raw results. Let\'s plot the + variable importance by permutating the model on a graph. + +10. Create a DataFrame called `varimp_df` with two columns: + `feature` containing the name of the columns of + `df`, using `.columns` and + `'importance'` containing the values of + `imp_vals`: + ``` + varimp_df = pd.DataFrame({'feature': df.columns, \ + 'importance': imp_vals}) + ``` + + +11. 
Plot a bar chart with Altair using `coef_df` and + `importance` as the `x` axis and + `feature` as the `y` axis: + + ``` + alt.Chart(varimp_df).mark_bar().encode(x='importance',\ + y="feature") + ``` + + + You should get the following output: + + +![](./images/B15019_09_27.jpg) + + +Caption: Graph showing the variable importance by permutation + + + +Partial Dependence Plots +======================== + + +Another tool that is model-agnostic is a partial dependence plot. It is +a visual tool for analyzing the effect of a feature on the target +variable. To achieve this, we can plot the values of the feature we are +interested in analyzing on the `x`-axis and the target +variable on the `y`-axis and then show all the observations +from the dataset on this graph. Let\'s try it on the Breast Cancer +dataset from `sklearn`: + +``` +from sklearn.datasets import load_breast_cancer +import pandas as pd +data = load_breast_cancer() +df = pd.DataFrame(data.data, columns=data.feature_names) +df['target'] = data.target +``` +Now that we have loaded the data and converted it to a DataFrame, let\'s +have a look at the worst concave points column: + +``` +import altair as alt +alt.Chart(df).mark_circle(size=60)\ + .encode(x='worst concave points', y='target') +``` + +The resulting plot is as follows: + +![Caption: Scatter plot of the worst concave points and target +variables ](./images/B15019_09_28.jpg) + +Caption: Scatter plot of the worst concave points and target +variables + +Note + +The preceding code and figure are just examples. We encourage you to +analyze different features by changing the values assigned to +`x` and `y` in the preceding code. For example, you +can possibly analyze worst concavity versus worst perimeter by setting +`x='worst concavity'` and `y='worst perimeter'` in +the preceding code. + +From this plot, we can see: + +- Most cases with 1 for the target variable have values under 0.16 for + the worst concave points column. +- Cases with a 0 value for the target have values of over 0.08 for + worst concave points. + +With this plot, we are not too sure about which outcome (0 or 1) we will +get for the values between 0.08 and 0.16 for worst concave points. There +are multiple possible reasons why the outcome of the observations within +this range of values is uncertain, such as the fact that there are not +many records that fall into this case, or other variables might +influence the outcome for these cases. This is where a partial +dependence plot can help. + +The logic is very similar to variable importance via permutation but +rather than randomly replacing the values in a column, we will test +every possible value within that column for all observations and see +what predictions it leads to. If we take the example from figure 9.21, +from the three observations we had originally, this method will create +six new observations by keeping columns `X2` and +`X3` as they were and replacing the values of `X1`: + +![](./images/B15019_09_29.jpg) + +Caption: Example of records generated from a partial dependence plot + +With this new data, we can see, for instance, whether the value 12 +really has a strong influence on predicting 1 for the target variable. +The original records, with the values 42 and 1 for the `X1` +column, lead to outcome 0 and only value 12 generated a prediction of 1. +But if we take the same observations for `X1`, equal to 42 and +1, and replace that value with 12, we can see whether the new +predictions will lead to 1 for the target variable. 
This is exactly the +logic behind a partial dependence plot, and it will assess all the +permutations possible for a column and plot the average of +the predictions. + +`sklearn` provides a function called +`plot_partial_dependence()` to display the partial dependence +plot for a given feature. Let\'s see how to use it on the Breast Cancer +dataset. First, we need to get the index of the column we are interested +in. We will use the `.get_loc()` method from +`pandas` to get the index for the +`worst concave points` column: + +``` +import altair as alt +from sklearn.inspection import plot_partial_dependence +feature_index = df.columns.get_loc("worst concave points") +``` +Now we can call the `plot_partial_dependence()` function. We +need to provide the following parameters: the trained model, the +training set, and the indices of the features to be analyzed: + +``` +plot_partial_dependence(rf_model, df, \ + features=[feature_index]) +``` +![Caption: Partial dependence plot for the worst concave points +column ](./images/B15019_09_30.jpg) + +Caption: Partial dependence plot for the worst concave points column + +This partial dependence plot shows us that, on average, all the +observations with a value under 0.17 for the worst concave points column +will most likely lead to a prediction of 1 for the target (probability +over 0.5) and all the records over 0.17 will have a prediction of 0 +(probability under 0.5). + + + +Exercise 9.04: Plotting Partial Dependence +------------------------------------------ + +In this exercise, we will plot partial dependence plots for two +variables, `a1pop` and `temp`, from a Random Forest +classifier model trained to predict the customer drop-out ratio. + +We will using the same dataset as in the previous exercise. + +1. Open a new Colab notebook. + +2. Import the following packages: `pandas`, + `train_test_split` from + `sklearn.model_selection`, + `RandomForestRegressor` from `sklearn.ensemble`, + `plot_partial_dependence` from + `sklearn.inspection`, and `altair`: + ``` + import pandas as pd + from sklearn.model_selection import train_test_split + from sklearn.ensemble import RandomForestRegressor + from sklearn.inspection import plot_partial_dependence + import altair as alt + ``` + + +3. Create a variable called `file_url` that contains the URL + for the dataset: + ``` + file_url = 'https://raw.githubusercontent.com/'\ + 'fenago/data-science/'\ + 'master/Lab09/Dataset/phpYYZ4Qc.csv' + ``` + + +4. Load the dataset into a DataFrame called `df` using + `.read_csv()`: + ``` + df = pd.read_csv(file_url) + ``` + + +5. Extract the `rej` column using `.pop()` and save + it into a variable called `y`: + ``` + y = df.pop('rej') + ``` + + +6. Split the DataFrame into training and testing sets using + `train_test_split()` with `test_size=0.3` and + `random_state = 1`: + ``` + X_train, X_test, y_train, y_test = train_test_split\ + (df, y, test_size=0.3, \ + random_state=1) + ``` + + +7. Instantiate `RandomForestRegressor` with + `random_state=1`, `n_estimators=50`, + `max_depth=6`, and `min_samples_leaf=60`: + ``` + rf_model = RandomForestRegressor(random_state=1, \ + n_estimators=50, max_depth=6,\ + min_samples_leaf=60) + ``` + + +8. Train the model on the training set using `.fit()`: + + ``` + rf_model.fit(X_train, y_train) + ``` + + + You should get the following output: + + +![](./images/B15019_09_31.jpg) + + + Caption: Logs of RandomForest + +9. 
Plot the partial dependence plot using `plot_partial_dependence()` from
   `sklearn` with the Random Forest model, the testing set, and the
   index of the `a1pop` column:

    ```
    plot_partial_dependence(rf_model, X_test, \
                            features=[df.columns.get_loc('a1pop')])
    ```

    You should get the following output:

![](./images/B15019_09_32.jpg)

    Caption: Partial dependence plot for a1pop

    This partial dependence plot shows that, on average, the `a1pop`
    variable doesn't affect the target variable much when its value is
    below 2, but from there the target increases linearly by 0.04 for
    each unit increase of `a1pop`. This means that if the population
    size of area 1 is below 2, the risk of churn is almost zero. Above
    this threshold, every unit increase in the population size of area 1
    increases the chance of churn by 4%.

10. Plot the partial dependence plot using `plot_partial_dependence()`
    from `sklearn` with the Random Forest model, the testing set, and
    the index of the `temp` column:

    ```
    plot_partial_dependence(rf_model, X_test, \
                            features=[df.columns.get_loc('temp')])
    ```

    You should get the following output:

![](./images/B15019_09_33.jpg)

Caption: Partial dependence plot for temp

This partial dependence plot shows that, on average, the `temp` variable
has a negative linear impact on the target variable: when `temp`
increases by 1, the target variable decreases by 0.12. This means that
if the temperature increases by one degree, the chance of leaving the
queue decreases by 12%.



Local Interpretation with LIME
==============================


LIME (Local Interpretable Model-agnostic Explanations) is one way to get
more visibility into how a model arrived at the prediction for a single
observation. The underlying logic of LIME is to approximate the original
nonlinear model with a linear one. It then uses the coefficients of that
linear model to explain the contribution of each variable, as we saw
earlier in this lab with the coefficients of linear models. But rather
than trying to approximate the entire model for the whole dataset, LIME
tries to approximate it locally, around the observation you are
interested in. LIME uses the trained model to predict new data points
near your observation and then fits a linear regression on those
predictions.

Let's see how we can use it on the Breast Cancer dataset. 
First, we will load the data and train a Random Forest model:

```
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split\
                                   (X, y, test_size=0.3, \
                                    random_state=1)
rf_model = RandomForestClassifier(random_state=168)
rf_model.fit(X_train, y_train)
```

The `lime` package is not pre-installed on Google Colab, so we need to
install it manually with the following command:

```
!pip install lime
```

The output will be as follows:

![](./images/B15019_09_34.jpg)

Caption: Installation logs for the lime package

Once it is installed, we will instantiate the `LimeTabularExplainer`
class by providing the training data, the names of the features, the
names of the classes to be predicted, and the task type (in this
example, it is `classification`):

```
from lime.lime_tabular import LimeTabularExplainer
lime_explainer = LimeTabularExplainer\
                 (X_train, feature_names=data.feature_names,\
                  class_names=data.target_names,\
                  mode='classification')
```

Then, we will call the `.explain_instance()` method with the observation
we are interested in (here, the second observation from the testing set)
and the function that predicts the outcome probabilities (here,
`.predict_proba()`). Finally, we will call the `.show_in_notebook()`
method to display the results from `lime`:

```
exp = lime_explainer.explain_instance\
      (X_test[1], rf_model.predict_proba, num_features=10)
exp.show_in_notebook()
```

The output will be as follows:

![](./images/B15019_09_35.jpg)

Caption: Output of LIME

Note

Your output may differ slightly. This is due to the random sampling
process of LIME.

There is a lot of information in the preceding output. Let's go through
it a bit at a time. The left-hand side shows the prediction
probabilities for the two classes of the target variable. For this
observation, the model estimates a 0.85 probability that the predicted
value will be malignant:

![](./images/B15019_09_36.jpg)

Caption: Prediction probabilities from LIME

The right-hand side shows the value of each feature for this
observation. Each feature is color-coded to highlight its contribution
toward the possible classes of the target variable. The list sorts the
features by decreasing importance. In this example, the mean perimeter,
mean area, and area error pushed the model toward predicting class 1.
All the other features influenced the model toward predicting
outcome 0:

![](./images/B15019_09_37.jpg)

Caption: Value of each feature for the observation of interest

Finally, the central part shows how each variable contributed to the
final prediction. In this example, the `worst concave points` and
`worst compactness` variables increased the probability of predicting
outcome 0 by 0.10 and 0.05, respectively. On the other hand,
`mean perimeter` and `mean area` each increased the probability of
predicting class 1 by 0.03:

![](./images/B15019_09_38.jpg)

Caption: Contribution of each feature to the final prediction

It's as simple as that. With LIME, we can easily see how each variable
impacted the probabilities of predicting the different outcomes of the
model. 
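If you prefer to work with the underlying numbers rather than the visual
widget, the explanation object also exposes them programmatically. As a
small sketch, reusing the `exp` object created above, `.as_list()`
returns pairs of feature rules and their estimated contributions:

```
# Reusing the `exp` object from the previous snippet: print each
# feature rule together with its estimated contribution.
for feature_rule, contribution in exp.as_list():
    print(f"{feature_rule}: {contribution:+.3f}")
```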
As you saw, the LIME package not only computes the local approximation
but also provides a visual representation of its results. This is much
easier to interpret than looking at raw outputs. It is also very useful
for presenting your findings and illustrating how different features
influenced the prediction for a single observation.



Exercise 9.05: Local Interpretation with LIME
---------------------------------------------

In this exercise, we will use LIME to analyze some predictions from a
Random Forest model trained to predict the customer drop-out ratio.

We will be using the same dataset as in the previous exercise.

1. Open a new Colab notebook.

2. Import the following packages: `pandas`,
   `train_test_split` from `sklearn.model_selection`, and
   `RandomForestRegressor` from `sklearn.ensemble`:
    ```
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    ```

3. Create a variable called `file_url` that contains the URL of the
   dataset:
    ```
    file_url = 'https://raw.githubusercontent.com/'\
               'fenago/data-science/'\
               'master/Lab09/Dataset/phpYYZ4Qc.csv'
    ```

4. Load the dataset into a DataFrame called `df` using `.read_csv()`:
    ```
    df = pd.read_csv(file_url)
    ```

5. Extract the `rej` column using `.pop()` and save it into a variable
   called `y`:
    ```
    y = df.pop('rej')
    ```

6. Split the DataFrame into training and testing sets using
   `train_test_split()` with `test_size=0.3` and `random_state=1`:
    ```
    X_train, X_test, y_train, y_test = train_test_split\
                                       (df, y, test_size=0.3, \
                                        random_state=1)
    ```

7. Instantiate `RandomForestRegressor` with `random_state=1`,
   `n_estimators=50`, `max_depth=6`, and `min_samples_leaf=60`:
    ```
    rf_model = RandomForestRegressor(random_state=1, \
                                     n_estimators=50, max_depth=6,\
                                     min_samples_leaf=60)
    ```

8. Train the model on the training set using `.fit()`:

    ```
    rf_model.fit(X_train, y_train)
    ```

    You should get the following output:

![](./images/B15019_09_39.jpg)

    Caption: Logs of RandomForest

9. Install the `lime` package using the `!pip install` command:
    ```
    !pip install lime
    ```

10. Import `LimeTabularExplainer` from `lime.lime_tabular`:
    ```
    from lime.lime_tabular import LimeTabularExplainer
    ```

11. Instantiate `LimeTabularExplainer` with the training set, its column
    names as `feature_names`, and `mode='regression'`:
    ```
    lime_explainer = LimeTabularExplainer\
                     (X_train.values, \
                      feature_names=X_train.columns, \
                      mode='regression')
    ```

12. Display the LIME analysis for the first row of the testing set using
    `.explain_instance()` and `.show_in_notebook()`:

    ```
    exp = lime_explainer.explain_instance\
          (X_test.values[0], rf_model.predict)
    exp.show_in_notebook()
    ```

    You should get the following output:

![Caption: LIME output for the first observation of the testing
  set ](./images/B15019_09_40.jpg)

    Caption: LIME output for the first observation of the testing set

    This output shows that the predicted value for this observation is a
    0.02 chance of customer drop-out, and that it was mainly influenced
    by the `a1pop`, `a3pop`, `a2pop`, and `b2eff` features. For
    instance, the fact that `a1pop` was under 0.87 decreased the value
    of the target variable by 0.01.

13. 
Display the LIME analysis for the third row of the testing set using
    `.explain_instance()` and `.show_in_notebook()`:

    ```
    exp = lime_explainer.explain_instance\
          (X_test.values[2], rf_model.predict)
    exp.show_in_notebook()
    ```

    You should get the following output:

![Caption: LIME output for the third observation of the testing
  set ](./images/B15019_09_41.jpg)

Caption: LIME output for the third observation of the testing set


You have completed the last exercise of this lab. You saw how to use
LIME to interpret the predictions for single observations. We learned
that the `a1pop`, `a2pop`, and `a3pop` features have a strong negative
impact on the predictions for the first and third observations of the
testing set.



Activity 9.01: Train and Analyze a Network Intrusion Detection Model
--------------------------------------------------------------------

You are working for a cybersecurity company and have been tasked with
building a model that can recognize network intrusions, then analyzing
its feature importance, plotting partial dependence, and performing
local interpretation of a single observation using LIME.

The dataset provided contains data from 7 weeks of network traffic.


The following steps will help you to complete this activity:

1. Download and load the dataset using `.read_csv()` from `pandas`.

2. Extract the response variable using `.pop()` from `pandas`.

3. Split the dataset into training and test sets using
   `train_test_split()` from `sklearn.model_selection`.

4. Create a function that will instantiate and fit
   `RandomForestClassifier` using `.fit()` from `sklearn.ensemble`.

5. Create a function that will predict the outcome for the training and
   testing sets using `.predict()`.

6. Create a function that will print the accuracy score for the training
   and testing sets using `accuracy_score()` from `sklearn.metrics`.

7. Compute the feature importance via permutation with
   `feature_importance_permutation()` and display it on a bar chart
   using `altair`.

8. Plot the partial dependence plot using `plot_partial_dependence` on
   the `src_bytes` variable.

9. Install `lime` using `!pip install`.

10. Perform a LIME analysis on row `99893` with `explain_instance()`.

    The output should be as follows:

![](./images/B15019_09_42.jpg)



Summary
=======


In this lab, we learned a few techniques for interpreting machine
learning models. We saw that some techniques are specific to the model
used: coefficients for linear models and variable importance for
tree-based models. There are also methods that are model-agnostic, such
as variable importance via permutation.
diff --git a/lab_guides/logo.png b/lab_guides/logo.png
new file mode 100644
index 0000000..f30cbd1
Binary files /dev/null and b/lab_guides/logo.png differ
diff --git a/lab_guides/lab_overview.md b/lab_overview.md
similarity index 100%
rename from lab_guides/lab_overview.md
rename to lab_overview.md
diff --git a/logo.png b/logo.png
new file mode 100644
index 0000000..f30cbd1
Binary files /dev/null and b/logo.png differ