diff --git a/Lab01/Data-Science-in-Python-the-Simple-Way.iml b/Lab01/Data-Science-in-Python-the-Simple-Way.iml
deleted file mode 100644
index a42dedc..0000000
--- a/Lab01/Data-Science-in-Python-the-Simple-Way.iml
+++ /dev/null
@@ -1,14 +0,0 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/Lab01/misc.xml b/Lab01/misc.xml
deleted file mode 100644
index 6ab0bd6..0000000
--- a/Lab01/misc.xml
+++ /dev/null
@@ -1,7 +0,0 @@
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/Lab01/modules.xml b/Lab01/modules.xml
deleted file mode 100644
index 53fb054..0000000
--- a/Lab01/modules.xml
+++ /dev/null
@@ -1,8 +0,0 @@
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/Lab01/vcs.xml b/Lab01/vcs.xml
deleted file mode 100644
index 94a25f7..0000000
--- a/Lab01/vcs.xml
+++ /dev/null
@@ -1,6 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/lab_guides/Lab_1.md b/lab_guides/Lab_1.md
new file mode 100644
index 0000000..a9259e0
--- /dev/null
+++ b/lab_guides/Lab_1.md
@@ -0,0 +1,1365 @@
+
+1. Introduction to Data Science in Python
+=========================================
+
+
+
+Overview
+
+This very first lab will introduce you to the field of data science
+and walk you through an overview of Python\'s core concepts and their
+application in the world of data science.
+
+By the end of this lab, you will be able to explain what data
+science is and distinguish between supervised and unsupervised learning.
+You will also be able to explain what machine learning is and
+distinguish between regression, classification, and clustering problems.
+You\'ll have learnt to create and manipulate different types of Python
+variables, including numeric and text variables, lists, and dictionaries. You\'ll be
+able to build a `for` loop, print results using f-strings,
+define functions, import Python packages and load data in different
+formats using `pandas`. You will also have had your first
+taste of training a model using scikit-learn.
+
+
+Introduction
+============
+
+
+Welcome to the fascinating world of data science! We are sure you must
+be pretty excited to start your journey and learn interesting and
+exciting techniques and algorithms. This is exactly what this book is
+intended for.
+
+But before diving into it, let\'s define what data science is: it is a
+combination of multiple disciplines, including business, statistics, and
+programming, that intends to extract meaningful insights from data by
+running controlled experiments similar to scientific research.
+
+The objective of any data science project is to derive valuable
+knowledge for the business from data in order to make better decisions.
+It is the responsibility of data scientists to define the goals to be
+achieved for a project. This requires business knowledge and expertise.
+In this book, you will be exposed to some examples of data science tasks
+from real-world datasets.
+
+Statistics is a mathematical field used for analyzing and finding
+patterns from data. A lot of the newest and most advanced techniques
+still rely on core statistical approaches. This book will present to you
+the basic techniques required to understand the concepts we will be
+covering.
+
+With an exponential increase in data generation, more computational
+power is required for processing it efficiently. This is the reason why
+programming is a required skill for data scientists. You may wonder why
+we chose Python for this Workshop. That\'s because Python is one of the
+most popular programming languages for data science. It is extremely
+easy to learn how to code in Python thanks to its simple and easily
+readable syntax. It also has an incredible number of packages available
+to anyone for free, such as pandas, scikit-learn, TensorFlow, and
+PyTorch. Its community is expanding at an incredible rate, adding more
+and more new functionalities and improving its performance and
+reliability. It\'s no wonder companies such as Facebook, Airbnb, and
+Google are using it as one of their main stacks. No prior knowledge of
+Python is required for this book. If you do have some experience with
+Python or other programming languages, then this will be an advantage,
+but all concepts will be fully explained, so don\'t worry if you are new
+to programming.
+
+
+Application of Data Science
+===========================
+
+
+As mentioned in the introduction, data science is a multidisciplinary
+approach to analyzing and identifying complex patterns and extracting
+valuable insights from data. Running a data science project usually
+involves multiple steps, including the following:
+
+1. Defining the business problem to be solved
+2. Collecting or extracting existing data
+3. Analyzing, visualizing, and preparing data
+4. Training a model to spot patterns in data and make predictions
+5. Assessing a model\'s performance and making improvements
+6. Communicating and presenting findings and gained insights
+7. Deploying and maintaining a model
+
+As its name implies, data science projects require data, but it is
+actually more important to have defined a clear business problem to
+solve first. If it\'s not framed correctly, a project may lead to
+incorrect results as you may have used the wrong information, not
+prepared the data properly, or led a model to learn the wrong patterns.
+So, it is absolutely critical to properly define the scope and objective
+of a data science project with your stakeholders.
+
+There are a lot of data science applications in real-world situations or
+in business environments. For example, healthcare providers may train a
+model for predicting a medical outcome or its severity based on medical
+measurements, or a high school may want to predict which students are at
+risk of dropping out within a year\'s time based on their historical
+grades and past behaviors. Corporations may be interested in knowing the
+likelihood of a customer buying a certain product based on their
+past purchases. They may also need to better understand which customers
+are more likely to stop using existing services and churn. These are
+examples where data science can be used to achieve a clearly defined
+goal, such as increasing the number of patients detected with a heart
+condition at an early stage or reducing the number of customers
+canceling their subscriptions after six months. That sounds exciting,
+right? Soon enough, you will be working on such interesting projects.
+
+
+
+What Is Machine Learning?
+-------------------------
+
+When we mention data science, we usually think about machine learning,
+and some people may not understand the difference between them. Machine
+learning is the field of building algorithms that can learn patterns by
+themselves without being programmed explicitly. So machine learning is a
+family of techniques that can be used at the modeling stage of a data
+science project.
+
+Machine learning is composed of three different types of learning:
+
+- Supervised learning
+- Unsupervised learning
+- Reinforcement learning
+
+
+
+### Supervised Learning
+
+Supervised learning refers to a type of task where an algorithm is
+trained to learn patterns based on prior knowledge. That means this kind
+of learning requires the labeling of the outcome (also called the
+response variable, dependent variable, or target variable) to be
+predicted beforehand. For instance, if you want to train a model that
+will predict whether a customer will cancel their subscription, you will
+need a dataset with a column (or variable) that already contains the
+churn outcome (cancel or not cancel) for past or existing customers.
+This outcome has to be labeled by someone prior to the training of a
+model. If this dataset contains 5,000 observations, then all of them
+need to have the outcome populated. The objective of the model is
+to learn the relationship between this outcome column and the other
+features (also called independent variables or predictor variables).
+Following is an example of such a dataset:
+
+
+
+Caption: Example of customer churn dataset
+
+The `Cancel` column is the response variable. This is the
+column you are interested in, and you want the model to predict
+accurately the outcome for new input data (in this case, new customers).
+All the other columns are the predictor variables.
+
+The model, after being trained, may find the following pattern: a
+customer is more likely to cancel their subscription after 12 months of
+tenure and if their average monthly spend is over `$50`. So, if a new
+customer has gone through 15 months of subscription and is spending \$85
+per month, the model will predict this customer will cancel their
+contract in the future.
+
+When the response variable contains a limited number of possible values
+(or classes), it is a classification problem (you will learn more about
+this in *Lab 3, Binary Classification*, and *Lab 4, Multiclass
+Classification with RandomForest*). The model will learn how to predict
+the right class given the values of the independent variables. The churn
+example we just mentioned is a classification problem as the response
+variable can only take two different values: `yes` or
+`no`.
+
+On the other hand, if the response variable can have a value from an
+infinite number of possibilities, it is called a regression problem.
+
+An example of a regression problem is where you are trying to predict
+the exact number of mobile phones produced every day for some
+manufacturing plants. This value can potentially range from 0 to an
+infinite number (or a number big enough to have a large range of
+potential values), as shown in *Figure 1.2*.
+
+
+
+Caption: Example of a mobile phone production dataset
+
+In the preceding figure, you can see that the values for
+`Daily output` can take any value from `15000` to
+more than `50000`. This is a regression problem, which we will
+look at in *Lab 2, Regression*.
+
+
+
+### Unsupervised Learning
+
+Unsupervised learning is a type of algorithm that doesn\'t require any
+response variables at all. In this case, the model will learn patterns
+from the data by itself. You may ask what kind of pattern it can find if
+there is no target specified beforehand.
+
+This type of algorithm can usually detect similarities between variables
+or records, so it will try to group those that are very close to each
+other. This kind of algorithm can be used for clustering (grouping
+records) or dimensionality reduction (reducing the number of variables).
+Clustering is very popular for performing customer segmentation, where
+the algorithm will look to group customers with similar behaviors
+together from the data. *Lab 5*, *Performing Your First Cluster
+Analysis*, will walk you through an example of clustering analysis.
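+
+As a small preview (clustering will be covered properly in *Lab 5*), the
+following minimal example groups a few made-up customer records using the
+`KMeans` algorithm from scikit-learn. The values and variable names are
+purely illustrative; notice that no target variable is provided:
+
+```
+from sklearn.cluster import KMeans
+
+# A few made-up customer records: [age, average monthly spend]
+customers = [[25, 20], [27, 22], [60, 80], [62, 85]]
+
+# Ask the algorithm to find 2 groups by itself
+kmeans = KMeans(n_clusters=2, random_state=1)
+labels = kmeans.fit_predict(customers)
+
+# Each customer is assigned to one of the 2 clusters, for example [0 0 1 1]
+print(labels)
+```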
+
+
+
+### Reinforcement Learning
+
+Reinforcement learning is another type of algorithm that learns how to
+act in a specific environment based on the feedback it receives. You may
+have seen some videos where algorithms are trained to play Atari games
+by themselves. Reinforcement learning techniques are being used to teach
+the agent how to act in the game based on the rewards or penalties it
+receives from the game.
+
+For instance, in the game Pong, the agent will learn to not let the ball
+drop after multiple rounds of training in which it receives high
+penalties every time the ball drops.
+
+Note
+
+Reinforcement learning algorithms are out of scope and will not be
+covered in this book.
+
+
+Overview of Python
+==================
+
+
+As mentioned earlier, Python is one of the most popular programming
+languages for data science. But before diving into Python\'s data
+science applications, let\'s have a quick introduction to some core
+Python concepts.
+
+
+
+Types of Variable
+-----------------
+
+In Python, you can handle and manipulate different types of variables.
+Each has its own specificities and benefits. We will not go through
+every single one of them but rather focus on the main ones that you will
+have to use in this book. For each of the following code examples, you
+can run the code in Google Colab to view the given output.
+
+
+
+### Numeric Variables
+
+The most basic variable type is numeric. This can contain integer or
+decimal (or float) numbers, and some mathematical operations can be
+performed on top of them.
+
+Let\'s use an integer variable called `var1` that will take
+the value `8` and another one called `var2` with the
+value `160.88`, and add them together with the `+`
+operator, as shown here:
+
+```
+var1 = 8
+var2 = 160.88
+var1 + var2
+```
+You should get the following output:
+
+
+
+Caption: Output of the addition of two variables
+
+In Python, you can perform other mathematical operations on numerical
+variables, such as multiplication (with the `*` operator) and
+division (with `/`).
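+
+As a quick illustration, here is a minimal example reusing the
+`var1` and `var2` variables defined above:
+
+```
+# Multiplication and division on the variables defined earlier
+print(var1 * var2)   # 8 * 160.88 -> 1287.04
+print(var2 / var1)   # 160.88 / 8 -> 20.11
+```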
+
+
+
+### Text Variables
+
+Another interesting type of variable is `string`, which
+contains textual information. You can create a variable with some
+specific text using the single or double quote, as shown in the
+following example:
+
+```
+var3 = 'Hello, '
+var4 = 'World'
+```
+
+In order to display the content of a variable, you can call the
+`print()` function:
+
+```
+print(var3)
+print(var4)
+```
+You should get the following output:
+
+
+
+Caption: Printing the two text variables
+
+Python also provides an interface called f-strings for printing text
+with the value of defined variables. It is very handy when you want to
+print results with additional text to make it more readable and
+interpret results. It is also quite common to use f-strings to print
+logs. You will need to add `f` before the quotes (or double
+quotes) to specify that the text will be an f-string. Then you can add
+an existing variable inside the quotes and display the text with the
+value of this variable. You need to wrap the variable with curly
+brackets, `{}`.
+
+For instance, if we want to print `Text:` before the values of
+`var3` and `var4`, we will write the following code:
+
+```
+print(f"Text: {var3} {var4}!")
+```
+You should get the following output:
+
+
+
+Caption: Printing with f-strings
+
+You can also perform some text-related transformations with string
+variables, such as capitalizing or replacing characters. For instance,
+you can concatenate the two variables together with the `+`
+operator:
+
+```
+var3 + var4
+```
+You should get the following output:
+
+
+
+Caption: Concatenation of the two text variables
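+
+The capitalization and replacement transformations mentioned above rely on
+standard string methods; here is a minimal example using `.upper()` and
+`.replace()` on the `var4` variable:
+
+```
+# Standard string methods: upper-casing and replacing characters
+print(var4.upper())            # WORLD
+print(var4.replace('o', '0'))  # W0rld
+```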
+
+
+
+### Python List
+
+Another very useful type of variable is the list. It is a collection of
+items that can be changed (you can add, update, or remove items). To
+declare a list, you will need to use square brackets, `[]`,
+like this:
+
+```
+var5 = ['I', 'love', 'data', 'science']
+print(var5)
+```
+You should get the following output:
+
+
+
+Caption: List containing only string items
+
+A list can have different item types, so you can mix numerical and text
+variables in it:
+
+```
+var6 = ['Fenago', 15019, 2020, 'Data Science']
+print(var6)
+```
+
+
+An item in a list can be accessed by its index (its position in the
+list). To access the first (index 0) and third elements (index 2) of a
+list, you do the following:
+
+```
+print(var6[0])
+print(var6[2])
+```
+Note
+
+In Python, all indexes start at `0`.
+
+
+Python provides slicing syntax to access a range of items using the
+`:` operator. You just need to specify the starting index on
+the left side of the operator and the ending index on the right side.
+The ending index is always excluded from the range. So, if you want to
+get the first three items (index 0 to 2), you should do as follows:
+
+```
+print(var6[0:3])
+```
+
+You can also iterate through every item of a list using a
+`for` loop. If you want to print every item of the
+`var6` list, you should do this:
+
+```
+for item in var6:
+ print(item)
+```
+You should get the following output:
+
+
+
+You can add an item at the end of the list using the
+`.append()` method:
+
+```
+var6.append('Python')
+print(var6)
+```
+
+
+
+To delete an item from the list, you use the `.remove()`
+method:
+
+```
+var6.remove(15019)
+print(var6)
+```
+
+
+### Python Dictionary
+
+A dictionary contains multiple elements, like a **list**, but each element
+is organized as a key-value pair. A dictionary is not indexed by numbers
+but by keys. So, to access a specific value, you will have to call the
+item by its corresponding key. To define a dictionary in Python, you
+will use curly brackets, `{}`, and specify the keys and values
+separated by `:`, as shown here:
+
+```
+var7 = {'Topic': 'Data Science', 'Language': 'Python'}
+print(var7)
+```
+You should get the following output:
+
+
+
+Caption: Output of var7
+
+To access a specific value, you need to provide the corresponding key
+name. For instance, if you want to get the value `Python`, you
+do this:
+
+```
+var7['Language']
+```
+You should get the following output:
+
+
+
+Caption: Value for the \'Language\' key
+
+Note
+
+Each key in a dictionary needs to be unique.
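+
+A small example of what this implies in practice: assigning a value to an
+existing key overwrites the previous value rather than creating a duplicate
+entry:
+
+```
+# Re-using an existing key overwrites its value instead of duplicating the key
+var7['Language'] = 'R'
+print(var7)                  # {'Topic': 'Data Science', 'Language': 'R'}
+var7['Language'] = 'Python'  # restore the original value
+```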
+
+Python provides a method to access all the key names from a dictionary,
+`.keys()`, which is used as shown in the following code
+snippet:
+
+```
+var7.keys()
+```
+You should get the following output:
+
+
+
+Caption: List of key names
+
+There is also a method called `.values()`, which is used to
+access all the values of a dictionary:
+
+```
+var7.values()
+```
+You should get the following output:
+
+
+
+Caption: List of values
+
+You can iterate through all items from a dictionary using a
+`for` loop and the `.items()` method, as shown in
+the following code snippet:
+
+```
+for key, value in var7.items():
+ print(key)
+ print(value)
+```
+You should get the following output:
+
+
+
+Caption: Output after iterating through the items of a dictionary
+
+You can add a new element in a dictionary by providing the key name like
+this:
+
+```
+var7['Publisher'] = 'Fenago'
+print(var7)
+```
+
+
+You can delete an item from a dictionary with the `del`
+command:
+
+```
+del var7['Publisher']
+print(var7)
+```
+You should get the following output:
+
+
+
+Caption: Output of a dictionary after removing an item
+
+In *Exercise 1.01*, *Creating a Dictionary That Will Contain Machine
+Learning Algorithms*, we will be looking to use these concepts that
+we\'ve just looked at.
+
+
+
+Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
+----------------------------------------------------------------------------------
+
+In this exercise, we will create a dictionary using Python that will
+contain a collection of different machine learning algorithms that will
+be covered in this book.
+
+The following steps will help you complete the exercise:
+
+Note
+
+Every exercise and activity in this book is to be executed on Google
+Colab.
+
+1. Open a new Colab notebook.
+
+2. Create a list called `algorithm` that will contain the
+ following elements: `Linear Regression`,
+ `Logistic Regression`, `RandomForest`, and
+ `a3c`:
+
+ ```
+ algorithm = ['Linear Regression', 'Logistic Regression', \
+ 'RandomForest', 'a3c']
+ ```
+
+
+ Note
+
+ The code snippet shown above uses a backslash ( `\` ) to
+ split the logic across multiple lines. When the code is executed,
+ Python will ignore the backslash, and treat the code on the next
+ line as a direct continuation of the current line.
+
+3. Now, create a list called `learning` that will contain the
+ following elements: `Supervised`, `Supervised`,
+ `Supervised`, and `Reinforcement`:
+ ```
+ learning = ['Supervised', 'Supervised', 'Supervised', \
+ 'Reinforcement']
+ ```
+
+
+4. Create a list called `algorithm_type` that will contain
+ the following elements: `Regression`,
+ `Classification`,
+ `Regression or Classification`, and `Game AI`:
+ ```
+ algorithm_type = ['Regression', 'Classification', \
+ 'Regression or Classification', 'Game AI']
+ ```
+
+
+5. Add an item called `k-means` into the
+ `algorithm` list using the `.append()` method:
+ ```
+ algorithm.append('k-means')
+ ```
+
+
+6. Display the content of `algorithm` using the
+ `print()` function:
+
+ ```
+ print(algorithm)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of \'algorithm\'
+
+ From the preceding output, we can see that we added the
+ `k-means` item to the list.
+
+7. Now, add the `Unsupervised` item into the
+ `learning` list using the `.append()` method:
+ ```
+ learning.append('Unsupervised')
+ ```
+
+
+8. Display the content of `learning` using the
+ `print()` function:
+
+ ```
+ print(learning)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of \'learning\'
+
+ From the preceding output, we can see that we added the
+ `Unsupervised` item into the list.
+
+9. Add the `Clustering` item into the
+ `algorithm_type` list using the `.append()`
+ method:
+ ```
+ algorithm_type.append('Clustering')
+ ```
+
+
+10. Display the content of `algorithm_type` using the
+ `print()` function:
+
+ ```
+ print(algorithm_type)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of \'algorithm\_type\'
+
+ From the preceding output, we can see that we added the
+ `Clustering` item into the list.
+
+11. Create an empty dictionary called `machine_learning` using
+ curly brackets, `{}`:
+ ```
+ machine_learning = {}
+ ```
+
+
+12. Create a new item in `machine_learning` with the key as
+ `algorithm` and the value as all the items from the
+ `algorithm` list:
+ ```
+ machine_learning['algorithm'] = algorithm
+ ```
+
+
+13. Display the content of `machine_learning` using the
+ `print()` function.
+
+ ```
+ print(machine_learning)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of \'machine\_learning\'
+
+ From the preceding output, we notice that we have created a
+ dictionary from the `algorithm` list.
+
+14. Create a new item in `machine_learning` with the key as
+ `learning` and the value as all the items from the
+ `learning` list:
+ ```
+ machine_learning['learning'] = learning
+ ```
+
+
+15. Now, create a new item in `machine_learning` with the key
+ as `algorithm_type` and the value as all the items from
+ the algorithm\_type list:
+ ```
+ machine_learning['algorithm_type'] = algorithm_type
+ ```
+
+
+16. Display the content of `machine_learning` using the
+ `print()` function.
+
+ ```
+ print(machine_learning)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of \'machine\_learning\'
+
+17. Remove the `a3c` item from the `algorithm` key
+ using the `.remove()` method:
+ ```
+ machine_learning['algorithm'].remove('a3c')
+ ```
+
+
+18. Display the content of the `algorithm` item from the
+ `machine_learning` dictionary using the
+ `print()` function:
+
+ ```
+ print(machine_learning['algorithm'])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of \'algorithm\' from \'machine\_learning\'
+
+19. Remove the `Reinforcement` item from the
+ `learning` key using the `.remove()` method:
+ ```
+ machine_learning['learning'].remove('Reinforcement')
+ ```
+
+
+20. Remove the `Game AI` item from the
+ `algorithm_type` key using the `.remove()`
+ method:
+ ```
+ machine_learning['algorithm_type'].remove('Game AI')
+ ```
+
+
+21. Display the content of `machine_learning` using the
+ `print()` function:
+
+ ```
+ print(machine_learning)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Output of \'machine\_learning\'
+
+
+
+Python for Data Science
+=======================
+
+
+In this section, we will present two of the most popular Python packages
+for data science: `pandas` and `scikit-learn`.
+
+
+
+The pandas Package
+------------------
+
+The pandas package provides an incredible amount of APIs for
+manipulating data structures. The two main data structures defined in
+the `pandas` package are `DataFrame` and
+`Series`.
+
+
+
+### DataFrame and Series
+
+
+
+
+Caption: Components of a DataFrame
+
+
+In pandas, a DataFrame is represented by the `DataFrame`
+class. A `pandas` DataFrame is composed of `pandas`
+Series, which are 1-dimensional arrays. A `pandas` Series is
+basically a single column in a DataFrame.
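+
+To make this concrete, the following minimal example (using made-up values
+mirroring the algorithms from the previous exercise, with illustrative
+variable names) builds a DataFrame and extracts one of its columns as a
+Series:
+
+```
+import pandas as pd
+
+# Build a DataFrame from a dictionary of lists
+example_df = pd.DataFrame({'algorithm': ['Linear Regression', 'k-means'],
+                           'learning': ['Supervised', 'Unsupervised']})
+
+print(type(example_df))               # a pandas DataFrame
+print(type(example_df['algorithm']))  # a single column is a pandas Series
+```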
+
+
+### CSV Files
+
+CSV files use the comma character---`,`---to separate columns
+and newlines for a new row. The previous example of a DataFrame would
+look like this in a CSV file:
+
+```
+algorithm,learning,type
+Linear Regression,Supervised,Regression
+Logistic Regression,Supervised,Classification
+RandomForest,Supervised,Regression or Classification
+k-means,Unsupervised,Clustering
+```
+
+In Python, you need to first import the packages you require before
+being able to use them. To do so, you will have to use the
+`import` command. You can create an alias of each imported
+package using the `as` keyword. It is quite common to import
+the `pandas` package with the alias `pd`:
+
+```
+import pandas as pd
+```
+`pandas` provides a `.read_csv()` method to easily
+load a CSV file directly into a DataFrame. You just need to provide the
+path or the URL to the CSV file, as shown below.
+
+Note
+
+Watch out for the slashes in the string below. Remember that the
+backslashes ( `\` ) are used to split the code across multiple
+lines, while the forward slashes ( `/` ) are part of the URL.
+
+```
+pd.read_csv('https://raw.githubusercontent.com/fenago'\
+ '/data-science/master/Lab01/'\
+ 'Dataset/csv_example.csv')
+```
+You should get the following output:
+
+
+
+
+
+
+### Excel Spreadsheets
+
+Excel is a Microsoft tool and is very popular in the industry. It has
+its own internal structure for recording additional information, such as
+the data type of each cell or even Excel formulas. There is a specific
+method in `pandas` to load Excel spreadsheets called
+`.read_excel()`:
+
+```
+pd.read_excel('https://github.com/fenago'\
+ '/data-science/blob/master'\
+ '/Lab01/Dataset/excel_example.xlsx?raw=true')
+```
+You should get the following output:
+
+
+
+Caption: Dataframe after loading an Excel spreadsheet
+
+
+
+### JSON
+
+JSON is a very popular file format, mainly used for transferring data
+from web APIs. Its structure is very similar to that of a Python
+dictionary with key-value pairs. The example DataFrame we used before
+would look like this in JSON format:
+
+```
+{
+ "algorithm":{
+ "0":"Linear Regression",
+ "1":"Logistic Regression",
+ "2":"RandomForest",
+ "3":"k-means"
+ },
+ "learning":{
+ "0":"Supervised",
+ "1":"Supervised",
+ "2":"Supervised",
+ "3":"Unsupervised"
+ },
+ "type":{
+ "0":"Regression",
+ "1":"Classification",
+ "2":"Regression or Classification",
+ "3":"Clustering"
+ }
+}
+```
+As you may have guessed, there is a `pandas` method for
+reading JSON data as well, and it is called `.read_json()`:
+
+```
+pd.read_json('https://raw.githubusercontent.com/fenago'\
+ '/data-science/master/Lab01'\
+ '/Dataset/json_example.json')
+```
+
+You should get the following output:
+
+
+
+Caption: Dataframe after loading JSON data
+
+
+
+Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
+------------------------------------------------------------------------
+
+In this exercise, we will practice loading different data formats, such
+as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use
+is the Top 10 Postcodes for the First Home Owner Grants dataset (this is
+a grant provided by the Australian government to help first-time real
+estate buyers). It lists the 10 postcodes (also known as zip codes) with
+the highest number of First Home Owner grants.
+
+In this dataset, you will find the number of First Home Owner grant
+applications for each postcode and the corresponding suburb.
+
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the pandas package, as shown in the following code snippet:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Create a new variable called `csv_url` containing the URL
+ to the raw CSV file:
+ ```
+ csv_url = 'https://raw.githubusercontent.com/fenago'\
+ '/data-science/master/Lab01'\
+ '/Dataset/overall_topten_2012-2013.csv'
+ ```
+
+
+4. Load the CSV file into a DataFrame using the pandas
+ `.read_csv()` method. The first row of this CSV file
+ contains the name of the file, which you can see if you open the
+ file directly. You will need to exclude this row by using the
+ `skiprows=1` parameter. Save the result in a variable
+ called `csv_df` and print it:
+
+ ```
+ csv_df = pd.read_csv(csv_url, skiprows=1)
+ csv_df
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The DataFrame after loading the CSV file
+
+5. Create a new variable called `tsv_url` containing the URL
+ to the raw TSV file:
+
+ ```
+ tsv_url = 'https://raw.githubusercontent.com/fenago'\
+ '/data-science/master/Lab01'\
+ '/Dataset/overall_topten_2012-2013.tsv'
+ ```
+
+
+ Note
+
+ A TSV file is similar to a CSV file but instead of using the comma
+ character (`,`) as a separator, it uses the tab character
+ (`\t`).
+
+6. Load the TSV file into a DataFrame using the pandas
+    `.read_csv()` method and specify the
+ `skiprows=1` and `sep='\t'` parameters. Save the
+ result in a variable called `tsv_df` and print it:
+
+ ```
+ tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t')
+ tsv_df
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The DataFrame after loading the TSV file
+
+7. Create a new variable called `xlsx_url` containing the URL
+ to the raw Excel spreadsheet:
+ ```
+ xlsx_url = 'https://github.com/fenago'\
+ '/data-science/blob/master/'\
+ 'Lab01/Dataset'\
+ '/overall_topten_2012-2013.xlsx?raw=true'
+ ```
+
+
+8. Load the Excel spreadsheet into a DataFrame using the pandas
+ `.read_excel()` method. Save the result in a variable
+ called `xlsx_df` and print it:
+
+ ```
+ xlsx_df = pd.read_excel(xlsx_url)
+ xlsx_df
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+ By default, `.read_excel()` loads the first sheet of an
+ Excel spreadsheet. In this example, the data we\'re looking for is
+ actually stored in the second sheet.
+
+9. Load the Excel spreadsheet into a Dataframe using the pandas
+ `.read_excel()` method and specify the
+ `skiprows=1` and `sheet_name=1` parameters.
+ (Note that the `sheet_name` parameter is zero-indexed, so
+ `sheet_name=0` returns the first sheet, while
+ `sheet_name=1` returns the second sheet.) Save the result
+ in a variable called `xlsx_df1` and print it:
+
+ ```
+ xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1)
+ xlsx_df1
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+### The sklearn API
+
+
+`sklearn` groups algorithms by family. For instance,
+`RandomForest` and `GradientBoosting` are part of
+the `ensemble` module. In order to make use of an algorithm,
+you will need to import it first like this:
+
+```
+from sklearn.ensemble import RandomForestClassifier
+```
+
+
+It is recommended to at least set the `random_state`
+hyperparameter in order to get reproducible results every time that you
+have to run the same code:
+
+```
+rf_model = RandomForestClassifier(random_state=1)
+```
+
+The second step is to train the model with some data. In this example,
+we will use a simple dataset that classifies 178 instances of Italian
+wines into 3 categories based on 13 features. This dataset is part of
+the few examples that `sklearn` provides within its API. We
+need to load the data first:
+
+```
+from sklearn.datasets import load_wine
+features, target = load_wine(return_X_y=True)
+```
+
+Then using the `.fit()` method to train the model, you will
+provide the features and the target variable as input:
+
+```
+rf_model.fit(features, target)
+```
+You should get the following output:
+
+
+
+Caption: Logs of the trained Random Forest model
+
+In the preceding output, we can see a Random Forest model with the
+default hyperparameters. You will be introduced to some of them in
+*Lab 4*, *Multiclass Classification with RandomForest*.
+
+Once trained, we can use the `.predict()` method to predict
+the target for one or more observations. Here we will use the same data
+as for the training step:
+
+```
+preds = rf_model.predict(features)
+preds
+```
+You should get the following output:
+
+
+
+Caption: Predictions of the trained Random Forest model
+
+
+
+Finally, we want to assess the model\'s performance by comparing its
+predictions to the actual values of the target variable. There are a lot
+of different metrics that can be used for assessing model performance,
+and you will learn more about them later in this book. For now, though,
+we will just use a metric called **accuracy**. This metric calculates
+the ratio of correct predictions to the total number of observations:
+
+```
+from sklearn.metrics import accuracy_score
+accuracy_score(target, preds)
+```
+You should get the following output
+
+
+
+Caption: Accuracy of the trained Random Forest model
+
+
+
+Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
+--------------------------------------------------------------------
+
+In this exercise, we will build a machine learning classifier using
+`RandomForest` from `sklearn` to predict whether the
+breast cancer of a patient is malignant (harmful) or benign (not
+harmful).
+
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the `load_breast_cancer` function from
+ `sklearn.datasets`:
+ ```
+ from sklearn.datasets import load_breast_cancer
+ ```
+
+
+3. Load the dataset from the `load_breast_cancer` function
+ with the `return_X_y=True` parameter to return the
+ features and response variable only:
+ ```
+ features, target = load_breast_cancer(return_X_y=True)
+ ```
+
+
+4. Print the `features` variable:
+
+ ```
+ print(features)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of the variable features
+
+ The preceding output shows the values of the features. (You can
+ learn more about the features from the link given previously.)
+
+5. Print the `target` variable:
+
+ ```
+ print(target)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of the variable target
+
+ The preceding output shows the values of the target variable. There
+ are two classes shown for each instance in the dataset. These
+ classes are `0` and `1`, representing whether
+ the cancer is malignant or benign.
+
+6. Import the `RandomForestClassifier` class from
+ `sklearn.ensemble`:
+ ```
+ from sklearn.ensemble import RandomForestClassifier
+ ```
+
+
+7. Create a new variable called `seed`, which will take the
+ value `888` (chosen arbitrarily):
+ ```
+ seed = 888
+ ```
+
+
+8. Instantiate `RandomForestClassifier` with the
+ `random_state=seed` parameter and save it into a variable
+ called `rf_model`:
+ ```
+ rf_model = RandomForestClassifier(random_state=seed)
+ ```
+
+
+9. Train the model with the `.fit()` method with
+ `features` and `target` as parameters:
+
+ ```
+ rf_model.fit(features, target)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForestClassifier
+
+10. Make predictions with the trained model using the
+ `.predict()` method and `features` as a
+ parameter and save the results into a variable called
+ `preds`:
+ ```
+ preds = rf_model.predict(features)
+ ```
+
+
+11. Print the `preds` variable:
+
+ ```
+ print(preds)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Predictions of the Random Forest model
+
+ The preceding output shows the predictions for the training set. You
+ can compare this with the actual target variable values shown in
+ *Figure 1.48*.
+
+12. Import the `accuracy_score` method from
+ `sklearn.metrics`:
+ ```
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+13. Calculate `accuracy_score()` with `target` and
+ `preds` as parameters:
+
+ ```
+ accuracy_score(target, preds)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+Activity 1.01: Train a Spam Detector Algorithm
+----------------------------------------------
+
+You are working for an email service provider and have been tasked with
+training an algorithm that recognizes whether an email is spam or not
+from a given dataset and checking its performance.
+
+In this dataset, the authors have already created 57 different features
+based on some statistics for relevant keywords in order to classify
+whether an email is spam or not.
+
+
+The following steps will help you to complete this activity:
+
+1. Import the required libraries.
+
+2. Load the dataset using `pd.read_csv()`.
+
+3. Extract the response variable using `.pop()` from
+ `pandas`. This method will extract the column provided as
+ a parameter from the DataFrame. You can then assign it a variable
+ name, for example, `target = df.pop('class')`.
+
+4. Instantiate `RandomForestClassifier`.
+
+5. Train a Random Forest model to predict the outcome with
+    `.fit()`.
+
+6. Predict the outcomes from the input data using
+ `.predict()`.
+
+7. Calculate the accuracy score using `accuracy_score`.
+
+ The output will be similar to the following:
+
+
+
+
+
+
+Summary
+=======
+
+
+This lab provided you with an overview of what data science is in
+general. We also learned the different types of machine learning
+algorithms, including supervised and unsupervised, as well as regression
+and classification. We had a quick introduction to Python and how to
+manipulate the main data structures (lists and dictionaries) that will
+be used in this book.
+
+Then we walked you through what a DataFrame is and how to create one by
+loading data from different file formats using the famous pandas
+package. Finally, we learned how to use the sklearn package to train a
+machine learning model and make predictions with it.
+
+This was just a quick glimpse into the fascinating world of data
+science. In this book, you will learn much more and discover new
+techniques for handling data science projects from end to end.
+
+The next lab will show you how to perform a regression task on a
+real-world dataset.
diff --git a/lab_guides/Lab_10.md b/lab_guides/Lab_10.md
new file mode 100644
index 0000000..97c40c5
--- /dev/null
+++ b/lab_guides/Lab_10.md
@@ -0,0 +1,1641 @@
+
+10. Analyzing a Dataset
+=======================
+
+
+
+Overview
+
+By the end of this lab, you will be able to explain the key steps
+involved in performing exploratory data analysis; identify the types of
+data contained in the dataset; summarize the dataset at an overall and at
+a detailed level for each variable; visualize the data distribution in
+each column; find relationships between variables; and analyze missing
+values and outliers for each variable.
+
+This lab will introduce you to the art of performing exploratory
+data analysis and visualizing the data in order to identify quality
+issues, potential data transformations, and interesting patterns.
+
+
+
+Exploring Your Data
+===================
+
+
+If you are running your project by following the CRISP-DM methodology,
+the first step will be to discuss the project with the stakeholders and
+clearly define their requirements and expectations. Only once this is
+clear can you start having a look at the data and see whether you will
+be able to achieve these objectives.
+
+After receiving a dataset, you may want to make sure that the dataset
+contains the information you need for your project. For instance, if you
+are working on a supervised project, you will check whether this dataset
+contains the target variable you need and whether there are any missing
+or incorrect values for this field. You may also check how many
+observations (rows) and variables (columns) there are. These are the
+kind of questions you will have initially with a new dataset. This
+section will introduce you to some techniques you can use to get the
+answers to these questions.
+
+For the rest of this section, we will be working with a dataset
+containing transactions from an online retail store.
+
+
+
+Our dataset is an Excel spreadsheet. Luckily, the `pandas`
+package provides a method we can use to load this type of file:
+`read_excel()`.
+
+Let\'s read the data using the `.read_excel()` method and
+store it in a `pandas` DataFrame, as shown in the following
+code snippet:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+After loading the data into a DataFrame, we want to know the size of
+this dataset, that is, its number of rows and columns. To get this
+information, we just need to call the `.shape` attribute from
+`pandas`:
+
+```
+df.shape
+```
+You should get the following output:
+
+```
+(541909, 8)
+```
+This attribute returns a tuple containing the number of rows as the
+first element and the number of columns as the second element. The
+loaded dataset contains `541909` rows and `8`
+different columns.
+
+Since this attribute returns a tuple, we can access each of its elements
+independently by providing the relevant index. Let\'s extract the number
+of rows (index `0`):
+
+```
+df.shape[0]
+```
+You should get the following output:
+
+```
+541909
+```
+Similarly, we can get the number of columns with the second index:
+
+```
+df.shape[1]
+```
+You should get the following output:
+
+```
+8
+```
+Usually, the first row of a dataset is the header. It contains the name
+of each column. By default, the `read_excel()` method assumes
+that the first row of the file is the header. If the `header`
+is stored in a different row, you can specify a different index for the
+header with the `header` parameter of `read_excel()`, such as
+`pd.read_excel(header=1)` to specify that the header is in
+the second row.
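+
+For example, if the header were, hypothetically, stored in the second row
+of the file (this is not the case for our dataset), the call would look
+like this:
+
+```
+# Hypothetical: take the second row (index 1) as the column names
+df_alt = pd.read_excel(file_url, header=1)
+```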
+
+Once loaded into a `pandas` DataFrame, you can print out its
+content by calling it directly:
+
+```
+df
+```
+You should get the following output:
+
+
+
+Caption: First few rows of the loaded online retail DataFrame
+
+To access the names of the columns for this DataFrame, we can call the
+`.columns` attribute:
+
+```
+df.columns
+```
+You should get the following output:
+
+
+
+Caption: List of the column names for the online retail DataFrame
+
+The columns from this dataset are `InvoiceNo`,
+`StockCode`, `Description`, `Quantity`,
+`InvoiceDate`, `UnitPrice`, `CustomerID`,
+and `Country`. We can infer that a row from this dataset
+represents the sale of an article for a given quantity and price for a
+specific customer at a particular date.
+
+Looking at these names, we can potentially guess what types of
+information are contained in these columns, however, to be sure, we can
+use the `dtypes` attribute, as shown in the following code
+snippet:
+
+```
+df.dtypes
+```
+You should get the following output:
+
+
+
+Caption: Description of the data type for each column of the
+DataFrame
+
+From this output, we can see that the `InvoiceDate` column is
+a date type (`datetime64[ns]`), `Quantity` is an
+integer (`int64`), and that `UnitPrice` and
+`CustomerID` are decimal numbers (`float64`). The
+remaining columns are text (`object`).
+
+The `pandas` package provides a single method that can display
+all the information we have seen so far, that is, the `info()`
+method:
+
+```
+df.info()
+```
+You should get the following output:
+
+
+
+Caption: Output of the info() method
+
+In just a few lines of code, we learned some high-level information
+about this dataset, such as its size, the column names, and their types.
+
+In the next section, we will analyze the content of a dataset.
+
+
+Analyzing Your Dataset
+======================
+
+
+Previously, we learned about the overall structure of a dataset and the
+kind of information it contains. Now, it is time to really dig into it
+and look at the values of each column.
+
+First, we need to import the `pandas` package:
+
+```
+import pandas as pd
+```
+
+Then, we\'ll load the data into a `pandas` DataFrame:
+
+```
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+
+The `pandas` package provides several methods so that you can
+display a snapshot of your dataset. The most popular ones are
+`head()`, `tail()`, and `sample()`.
+
+The `head()` method will show the top rows of your dataset. By
+default, `pandas` will display the first five rows:
+
+```
+df.head()
+```
+You should get the following output:
+
+
+
+Caption: Displaying the first five rows using the head() method
+
+The output of the `head()` method shows that the
+`InvoiceNo`, `StockCode`, and `CustomerID`
+columns are unique identifier fields for each purchasing invoice, item
+sold, and customer. The `Description` field is text describing
+the item sold. `Quantity` and `UnitPrice` are the
+number of items sold and their unit price, respectively.
+`Country` is a text field that can be used for specifying
+where the customer or the item is located or from which country version
+of the online store the order has been made. In a real project, you may
+reach out to the team who provided this dataset and confirm what the
+meaning of the `Country` column is, or any other column
+details that you may need, for that matter.
+
+With `pandas`, you can specify the number of top rows to be
+displayed with the `head()` method by providing an integer as
+its parameter. Let\'s try this by displaying the first `10`
+rows:
+
+```
+df.head(10)
+```
+You should get the following output:
+
+
+
+Caption: Displaying the first 10 rows using the head() method
+
+Looking at this output, we can assume that the data is sorted by the
+`InvoiceDate` column and grouped by `CustomerID` and
+`InvoiceNo`. We can only see one value in the
+`Country` column: `United Kingdom`. Let\'s check
+whether this is really the case by looking at the last rows of the
+dataset. This can be achieved by calling the `tail()` method.
+Like `head()`, this method, by default, will display only five
+rows, but you can specify the number of rows you want as a parameter.
+Here, we will display the last eight rows:
+
+```
+df.tail(8)
+```
+You should get the following output:
+
+
+
+Caption: Displaying the last eight rows using the tail() method
+
+It seems that we were right in assuming that the data is sorted in
+ascending order by the `InvoiceDate` column. We can also
+confirm that there is actually more than one value in the
+`Country` column.
+
+We can also use the `sample()` method to randomly pick a given
+number of rows from the dataset with the `n` parameter. You
+can also specify a **seed** (which we covered in *Lab 5*,
+*Performing Your First Cluster Analysis*) in order to get reproducible
+results if you run the same code again with the `random_state`
+parameter:
+
+```
+df.sample(n=5, random_state=1)
+```
+You should get the following output:
+
+
+
+Caption: Displaying five random sampled rows using the sample()
+method
+
+In this output, we can see an additional value in the
+`Country` column: `Germany`. We can also notice a
+few interesting points:
+
+- `InvoiceNo` can also contain alphabetical letters (row
+    `94801` starts with a `C`, which may have a
+ special meaning).
+- `Quantity` can have negative values: `-2` (row
+ `94801`).
+- `CustomerID` contains missing values: `NaN` (row
+    `210111`); a quick way to quantify such missing values is
+    sketched after this list.
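+
+Here is the sketch mentioned above: a one-liner using standard
+`pandas` methods to count the rows with a missing `CustomerID`:
+
+```
+# Count how many rows have a missing CustomerID
+df['CustomerID'].isna().sum()
+```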
+
+
+
+Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics
+------------------------------------------------------------------------------
+
+In this exercise, we will explore the `Ames Housing dataset`
+in order to get a good understanding of it by analyzing its structure
+and looking at some of its rows.
+
+
+The following steps will help you to complete this exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the AMES dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab10/dataset/ames_iowa_housing.csv'
+ ```
+
+
+4. Use the `.read_csv()` method from the
+    `pandas` package and load the dataset into a new variable
+ called `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Print the number of rows and columns of the DataFrame using the
+ `shape` attribute from the `pandas` package:
+
+ ```
+ df.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (1460, 81)
+ ```
+
+
+ We can see that this dataset contains `1460` rows and
+ `81` different columns.
+
+6. Print the names of the variables contained in this DataFrame using
+ the `columns` attribute from the `pandas`
+ package:
+
+ ```
+ df.columns
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of columns in the housing dataset
+
+ We can infer the type of information contained in some of the
+ variables by looking at their names, such as `LotArea`
+ (property size), `YearBuilt` (year of construction), and
+ `SalePrice` (property sale price).
+
+7. Print out the type of each variable contained in this DataFrame
+ using the `dtypes` attribute from the `pandas`
+ package:
+
+ ```
+ df.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of columns and their type from the housing
+ dataset
+
+ We can see that the variables are either numerical or text types.
+ There is no date column in this dataset.
+
+8. Display the top rows of the DataFrame using the `head()`
+ method from `pandas`:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the housing dataset
+
+9. Display the last five rows of the DataFrame using the
+ `tail()` method from `pandas`:
+
+ ```
+ df.tail()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Last five rows of the housing dataset
+
+ It seems that the `Alley` column has a lot of missing
+ values, which are represented by the `NaN` value (which
+ stands for `Not a Number`). The `Street` and
+ `Utilities` columns seem to have only one value.
+
+10. Now, display `5` random sampled rows of the DataFrame
+ using the `sample()` method from `pandas` and
+ pass it a `'random_state'` of `8`:
+
+ ```
+ df.sample(n=5, random_state=8)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+We learned quite a lot about this dataset in just a few lines of code,
+such as the number of rows and columns, the data type of each variable,
+and their information. We also identified some issues with missing
+values.
+
+
+Analyzing the Content of a Categorical Variable
+===============================================
+
+
+Now that we\'ve got a good feel for the kind of information contained in
+the `online retail dataset`, we want to dig a little deeper
+into each of its columns:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob'\
+ '/master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+For instance, we would like to know how many different values are
+contained in each of the variables by calling the `nunique()`
+method. This is particularly useful for a categorical variable with a
+limited number of values, such as `Country`:
+
+```
+df['Country'].nunique()
+```
+You should get the following output:
+
+```
+38
+```
+We can see that there are 38 different countries in this dataset. It
+would be great if we could get a list of all the values in this column.
+Thankfully, the `pandas` package provides a method to get
+these results: `unique()`:
+
+```
+df['Country'].unique()
+```
+You should get the following output:
+
+
+
+Caption: List of unique values for the \'Country\' column
+
+We can see that there are multiple countries from different continents,
+but most of them come from Europe. We can also see that there is a value
+called `Unspecified` and another one for
+`European Community`, which may be for all the countries of
+the eurozone that are not listed separately.
+
+Another very useful method from `pandas` is
+`value_counts()`. This method lists all the values from a
+given column together with their number of occurrences. By providing the
+`dropna=False` and `normalize=True` parameters, this
+method will include missing values in the listing and express the
+number of occurrences as a ratio, respectively:
+
+```
+df['Country'].value_counts(dropna=False, normalize=True)
+```
+You should get the following output:
+
+
+
+
+From this output, we can see that the `United Kingdom` value
+completely dominates this column as it represents over 91% of the rows
+and that other values such as `Austria` and
+`Denmark` are quite rare as they represent less than 1% of
+this dataset.
+
+
+
+Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset
+---------------------------------------------------------------------------------
+
+In this exercise, we will continue our dataset exploration by analyzing
+the categorical variables of this dataset. To do so, we will implement
+our own `describe` functions.
+
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas `package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the following link to the AMES dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab10/dataset/ames_iowa_housing.csv'
+ ```
+
+
+4. Use the `.read_csv()` method from the `pandas`
+ package and load the dataset into a new variable called
+ `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Create a new DataFrame called `obj_df` with only the
+    columns that are of the `object` (text) type using the
+    `select_dtypes` method from the `pandas` package.
+    Then, pass in the `object` value to the
+    `include` parameter:
+ ```
+ obj_df = df.select_dtypes(include='object')
+ ```
+
+
+6. Using the `columns` attribute from `pandas`,
+ extract the list of columns of this DataFrame, `obj_df`,
+ assign it to a new variable called `obj_cols`, and print
+ its content:
+
+ ```
+ obj_cols = obj_df.columns
+ obj_cols
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of categorical variables
+
+7. Create a function called `describe_object` that takes a
+    `pandas` DataFrame and a column name as input parameters.
+ Then, inside the function, print out the name of the given column,
+ its number of unique values using the `nunique()` method,
+ and the list of values and their occurrence using the
+ `value_counts()` method, as shown in the following code
+ snippet:
+ ```
+ def describe_object(df, col_name):
+ print(f"\nCOLUMN: {col_name}")
+ print(f"{df[col_name].nunique()} different values")
+ print(f"List of values:")
+ print(df[col_name].value_counts\
+ (dropna=False, normalize=True))
+ ```
+
+
+8. Test this function by providing the `df` DataFrame and the
+ `'MSZoning'` column:
+
+ ```
+ describe_object(df, 'MSZoning')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Display of the created function for the MSZoning
+ column
+
+ For the `MSZoning` column, the `RL` value
+    represents almost `79%` of the values, while
+    `C (all)` is only present in less than `1%` of the
+ rows.
+
+9. Create a `for` loop that will call the created function
+ for every element from the `obj_cols` list:
+
+ ```
+ for col_name in obj_cols:
+ describe_object(df, col_name)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+Summarizing Numerical Variables
+===============================
+
+
+Now, let\'s have a look at a numerical column and get a good
+understanding of its content. We will use some statistical measures that
+summarize a variable. All of these measures are referred to as
+descriptive statistics. In this lab, we will introduce you to the
+most popular ones.
+
+With the `pandas` package, a lot of these measures have been
+implemented as methods. For instance, if we want to know what the
+highest value contained in the `'Quantity'` column is, we can
+use the `.max()` method:
+
+```
+df['Quantity'].max()
+```
+You should get the following output:
+
+```
+80995
+```
+We can see that the maximum quantity of an item sold in this dataset is
+`80995`, which seems extremely high for a retail business. In
+a real project, this kind of unexpected value will have to be discussed
+and confirmed with the data owner or key stakeholders to see whether
+this is a genuine or an incorrect value. Now, let\'s have a look at the
+lowest value for `'Quantity'` using the `.min()`
+method:
+
+```
+df['Quantity'].min()
+```
+You should get the following output:
+
+```
+-80995
+```
+
+The lowest value in this variable is extremely low. We can think that
+having negative values is possible for returned items, but here, the
+minimum (`-80995`) is very low. This, again, will be something
+to be confirmed with the relevant people in your organization.
+
+Now, we are going to have a look at the central tendency of this column.
+**Central tendency** is a statistical term referring to the central
+point where the data will cluster around. The most famous central
+tendency measure is the average (or mean). The average is calculated by
+summing all the values of a column and dividing them by the number of
+values.
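+
+If you want to double-check this definition by hand, you can reproduce
+the calculation yourself (a minimal sketch, assuming the same Online
+Retail DataFrame `df` used above is already loaded):
+
+```
+# Average = sum of all the values divided by the number of values
+manual_mean = df['Quantity'].sum() / df['Quantity'].count()
+print(manual_mean)  # should match df['Quantity'].mean()
+```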
+
+If we plot the `Quantity` column on a graph with its average,
+it would look as follows:
+
+
+
+Caption: Average value for the \'Quantity\' column
+
+We can see that the average for the `Quantity` column is very close
+to 0 and most of the data is between `-50` and
+`+50`.
+
+We can get the average value of a feature by using the
+`mean()` method from `pandas`:
+
+```
+df['Quantity'].mean()
+```
+You should get the following output:
+
+```
+9.55224954743324
+```
+
+In this dataset, the average quantity of items sold is around
+`9.55`. The average measure is very sensitive to outliers and,
+as we saw previously, the minimum and maximum values of the
+`Quantity` column are quite extreme
+(`-80995 to +80995`).
+
+We can use the median instead as another measure of central tendency.
+The median is calculated by splitting the sorted column into two groups
+of equal length and taking the value of the middle point that separates
+these two groups, as shown in the following example:
+
+
+
+Caption: Sample median example
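+
+To see this splitting logic on a small example, you can compare the
+middle value of a sorted series with what `pandas` returns (the values
+below are purely illustrative and not taken from the dataset):
+
+```
+import pandas as pd
+sample = pd.Series([2, 3, 5, 7, 11, 13, 17])
+# The sorted series has 7 values, so the middle point is the 4th one: 7
+print(sample.median())
+```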
+
+In `pandas`, you can call the `median()` method to
+get this value:
+
+```
+df['Quantity'].median()
+```
+You should get the following output:
+
+```
+3.0
+```
+
+The median value for this column is 3, which is quite different from the
+mean (`9.55`) we found earlier. This tells us that there are
+some outliers in this dataset and we will have to decide on how to
+handle them after we\'ve done more investigation (this will be covered
+in *Lab 11*, *Data Preparation*).
+
+We can also evaluate the spread of this column (how much the data points
+vary from the central point). A common measure of spread is the standard
+deviation. The smaller this measure is, the closer the data is to its
+mean. On the other hand, if the standard deviation is high, this means
+there are some observations that are far from the average. We will use
+the `std()` method from `pandas` to calculate this
+measure:
+
+```
+df['Quantity'].std()
+```
+You should get the following output:
+
+```
+218.08115784986612
+```
+As expected, the standard deviation for this column is quite high, so
+the data is quite spread from the average, which is `9.55` in
+this example.
+
+In the `pandas` package, there is a method that can display
+most of these descriptive statistics with one single line of code:
+`describe()`:
+
+```
+df.describe()
+```
+You should get the following output:
+
+
+
+Caption: Output of the describe() method
+
+We got the exact same values for the `Quantity` column as we
+saw previously. This method has calculated the descriptive statistics
+for the three numerical columns (`Quantity`,
+`UnitPrice`, and `CustomerID`).
+
+Even though the `CustomerID` column contains only numerical
+data, we know these values are used to identify each customer and have
+no mathematical meaning. For instance, it will not make sense to add
+customer IDs `12680` and `17850` together or calculate the
+mean of these identifiers. This column is not actually numerical but
+categorical.
+
+The `describe()` method doesn\'t know this information and
+just noticed there are numbers, so it assumed this is a numerical
+variable. This is the perfect example of why you should understand your
+dataset perfectly and identify the issues to be fixed before feeding the
+data to an algorithm. In this case, we will have to change the type of
+this column to categorical. In *Lab 11*, *Data Preparation*, we will
+see how we can handle this kind of issue, but for now, we will look at
+some graphical tools and techniques that will help us have an even
+better understanding of the data.
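+
+As a quick preview of that fix (it will be covered properly in *Lab
+11*, *Data Preparation*), converting the column is a single line with
+the `astype()` method; after this, `describe()` would no longer treat
+it as a numerical column:
+
+```
+# Sketch only: convert the identifier column to a categorical type
+df['CustomerID'] = df['CustomerID'].astype('category')
+```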
+
+
+
+Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset
+---------------------------------------------------------------------------
+
+In this exercise, we will continue our dataset exploration by analyzing
+the numerical variables of this dataset. To do so, we will implement our
+own `describe` functions.
+
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the AMES dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab10/dataset/ames_iowa_housing.csv'
+ ```
+
+
+4. Use the `.read_csv()` method from the
+    `pandas` package and load the dataset into a new variable
+ called `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Create a new DataFrame called `num_df` with only the
+ columns that are numerical using the `select_dtypes`
+    method from the `pandas` package and pass in the
+ `'number'` value to the `include` parameter:
+ ```
+ num_df = df.select_dtypes(include='number')
+ ```
+
+
+6. Using the `columns` attribute from `pandas`,
+ extract the list of columns of this DataFrame, `num_df`,
+ assign it to a new variable called `num_cols`, and print
+ its content:
+
+ ```
+ num_cols = num_df.columns
+ num_cols
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of numerical columns
+
+7. Create a function called `describe_numeric` that takes a
+    `pandas` DataFrame and a column name as input parameters.
+ Then, inside the function, print out the name of the given column,
+ its minimum value using `min()`, its maximum value using
+ `max()`, its average value using `mean()`, its
+ standard deviation using `std()`, and its
+ `median` using `median()`:
+ ```
+ def describe_numeric(df, col_name):
+ print(f"\nCOLUMN: {col_name}")
+ print(f"Minimum: {df[col_name].min()}")
+ print(f"Maximum: {df[col_name].max()}")
+ print(f"Average: {df[col_name].mean()}")
+ print(f"Standard Deviation: {df[col_name].std()}")
+ print(f"Median: {df[col_name].median()}")
+ ```
+
+
+8. Now, test this function by providing the `df` DataFrame
+ and the `SalePrice` column:
+
+ ```
+ describe_numeric(df, 'SalePrice')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The display of the created function for the
+ \'SalePrice\' column
+
+ The sale price ranges from `34,900` to
+    `755,000` with an average of `180,921`. The
+ median is slightly lower than the average, which tells us there are
+ some outliers with high sales prices.
+
+9. Create a `for` loop that will call the created function
+ for every element from the `num_cols` list:
+
+ ```
+ for col_name in num_cols:
+ describe_numeric(df, col_name)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+Visualizing Your Data
+=====================
+
+
+In the previous section, we saw how to explore a new dataset and
+calculate some simple descriptive statistics. These measures helped
+summarize the dataset into interpretable metrics, such as the average or
+maximum values. Now it is time to dive even deeper and get a more
+granular view of each column using data visualization.
+
+In a data science project, data visualization can be used either for
+data analysis or communicating gained insights. Presenting results in a
+visual way that stakeholders can easily understand and interpret
+is definitely a must-have skill for any good data scientist.
+
+However, in this lab, we will be focusing on using data
+visualization for analyzing data. Most people tend to interpret
+information more easily on a graph than reading written information. For
+example, when looking at the following descriptive statistics and the
+scatter plot for the same variable, which one do you think is easier to
+interpret? Let\'s take a look:
+
+
+
+Caption: Sample visual data analysis
+
+Even though the information shown with the descriptive statistics is
+more detailed, by looking at the graph, you have already seen that the
+data is stretched out and mainly concentrated around the value 0. It
+probably took you only a second or two to come to this conclusion:
+there is a cluster of points near the value 0 and it thins out as you
+move away from it. Coming to this
+conclusion would have taken you more time if you were interpreting the
+descriptive statistics. This is the reason why data visualization is a
+very powerful tool for effectively analyzing data.
+
+
+
+Using the Altair API
+--------------------
+
+We will be using a package called `altair` (if you recall, we
+already briefly used it in *Lab 5*, *Performing Your First Cluster
+Analysis*). There are quite a lot of Python packages for data
+visualization on the market, such as `matplotlib`,
+`seaborn`, or `Bokeh`, and compared to them,
+`altair` is relatively new, but its community of users is
+growing fast thanks to its simple API syntax.
+
+Let\'s see how we can display a bar chart step by step on the online
+retail dataset.
+
+First, import the `pandas` and `altair` packages:
+
+```
+import pandas as pd
+import altair as alt
+```
+
+Then, load the data into a `pandas` DataFrame:
+
+```
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+We will randomly sample 5,000 rows of this DataFrame using the
+`sample()` method (`altair` requires additional
+steps in order to display a larger dataset):
+
+```
+sample_df = df.sample(n=5000, random_state=8)
+```
+Now instantiate a `Chart` object from `altair` with
+the `pandas` DataFrame as its input parameter:
+
+```
+base = alt.Chart(sample_df)
+```
+Next, we call the `mark_circle()` method to specify the type
+of graph we want to plot: a scatter plot:
+
+```
+chart = base.mark_circle()
+```
+Finally, we specify the names of the columns that will be displayed on
+the *x* and *y* axes using the `encode()` method:
+
+```
+chart.encode(x='Quantity', y='UnitPrice')
+```
+We just plotted a scatter plot in just a few lines of code:
+
+
+
+Caption: Output of the scatter plot
+
+Altair also lets you chain all of these methods together into a single
+statement, like this:
+
+```
+alt.Chart(sample_df).mark_circle()\
+ .encode(x='Quantity', y='UnitPrice')
+```
+You should get the following output:
+
+
+
+Caption: Output of the scatter plot with combined altair methods
+
+We can see that we got the exact same output as before. This graph shows
+us that there are a lot of outliers (extreme values) for both variables:
+most of the values of `UnitPrice` are below 100, but there are
+some over 300, and `Quantity` ranges from -200 to 800, while
+most of the observations are between -50 and 150. We can also notice a
+pattern where items with a high unit price have lower quantity (items
+over 50 in terms of unit price have a quantity close to 0) and the
+opposite is also true (items with a quantity over 100 have a unit price
+close to 0).
+
+Now, let\'s say we want to visualize the same plot while adding the
+`Country` column\'s information. One easy way to do this is to
+use the `color` parameter from the `encode()`
+method. This will color all the data points according to their value in
+the `Country` column:
+
+```
+alt.Chart(sample_df).mark_circle()\
+ .encode(x='Quantity', y='UnitPrice', color='Country')
+```
+You should get the following output:
+
+
+
+Caption: Scatter plot with colors based on the \'Country\' column
+
+We added the information from the `Country` column into the
+graph, but as we can see, there are too many values and it is hard to
+differentiate between countries: there are a lot of blue points, but it
+is hard to tell which countries they are representing.
+
+With `altair`, we can easily add some interactions on the
+graph in order to display more information for each observation; we just
+need to use the `tooltip` parameter from the
+`encode()` method and specify the list of columns to be
+displayed and then call the `interactive()` method to make the
+whole thing interactive (as seen previously in *Lab 5*, *Performing
+Your First Cluster Analysis*):
+
+```
+alt.Chart(sample_df).mark_circle()\
+ .encode(x='Quantity', y='UnitPrice', color='Country', \
+ tooltip=['InvoiceNo','StockCode','Description',\
+ 'InvoiceDate','CustomerID']).interactive()
+```
+You should get the following output:
+
+
+
+Caption: Interactive scatter plot with tooltip
+
+Now, if we hover over the observation with the highest
+`UnitPrice` value (the one near 600), we can see the
+information displayed by the tooltip: this observation doesn\'t have any
+value for `StockCode` and its `Description` is
+`Manual`. So, it seems that this is not a normal transaction
+made on the website. It may be a special order that has been
+manually entered into the system. This is something you will have to
+discuss with your stakeholder and confirm.
+
+
+
+Histogram for Numerical Variables
+---------------------------------
+
+Now that we are familiar with the `altair` API, let\'s have a
+look at some specific type of charts that will help us analyze and
+understand each variable. First, let\'s focus on numerical variables
+such as `UnitPrice` or `Quantity` in the online
+retail dataset.
+
+For this type of variable, a histogram is usually used to show the
+distribution of a given variable. The x axis of a histogram will show
+the possible values in this column and the y axis will plot the number
+of observations that fall under each value. Since the number of possible
+values can be very high for a numerical variable (potentially an
+infinite number of values), it is better to group these values
+into chunks (also called bins). For instance, we can group prices into
+bins with a step size of 10 (that is, each bin covers a range of 10
+values), such as 0 to 10, 10 to 20, 20 to 30, and so on.
+
+Let\'s look at this by using a real example. We will plot a histogram
+for `'UnitPrice'` using the `mark_bar()` and
+`encode()` methods with the following parameters:
+
+- `alt.X("UnitPrice:Q", bin=True)`: This is another
+    `altair` API syntax that allows you to tune some of the
+ parameters for the x axis. Here, we are telling altair to use the
+ `'UnitPrice'` column as the axis. `':Q'`
+ specifies that this column is quantitative data (that is, numerical)
+ and `bin=True` forces the grouping of the possible values
+ into bins.
+- `y='count()'`: This is used for calculating the number of
+ observations and plotting them on the y axis, like so:
+
+```
+alt.Chart(sample_df).mark_bar()\
+ .encode(alt.X("UnitPrice:Q", bin=True), \
+ y='count()')
+```
+You should get the following output:
+
+
+
+Caption: Histogram for UnitPrice with the default bin step size
+
+By default, `altair` grouped the observations into bins with a step
+size of 100: 0 to 100, then 100 to 200, and so on. The step size that was
+chosen is not optimal as almost all the observations fell under the
+first bin (0 to 100) and we can\'t see any other bin. With
+`altair`, we can specify the bin step size ourselves; here, we will
+try a step size of 5, that is, `alt.Bin(step=5)`:
+
+```
+alt.Chart(sample_df).mark_bar()\
+ .encode(alt.X("UnitPrice:Q", bin=alt.Bin(step=5)), \
+ y='count()')
+```
+You should get the following output:
+
+
+
+Caption: Histogram for UnitPrice with a bin step size of 5
+
+This is much better. With this step size, we can see that most of the
+observations have a unit price under 5 (almost 4,200 observations). We
+can also see that a bit more than 500 data points have a unit price
+between 5 and 10. The count of records keeps decreasing as the unit
+price increases.
+
+Let\'s plot the histogram for the `Quantity` column with a bin
+step size of 10:
+
+```
+alt.Chart(sample_df).mark_bar()\
+ .encode(alt.X("Quantity:Q", bin=alt.Bin(step=10)), \
+ y='count()')
+```
+You should get the following output:
+
+
+
+Caption: Histogram for Quantity with a bin step size of 10
+
+In this histogram, most of the records have a positive quantity between
+0 and 30 (first three highest bins). There is also a bin with around 50
+observations that have a negative quantity from -10 to 0. As we
+mentioned earlier, these may be returned items from customers.
+
+
+
+Bar Chart for Categorical Variables
+-----------------------------------
+
+Now, we are going to have a look at categorical variables. For such
+variables, there is no need to group the values into bins as, by
+definition, they have a limited number of potential values. We can still
+plot the distribution of such columns using a simple bar chart. In
+`altair`, this is very simple -- it is similar to plotting a
+histogram but without the `bin` parameter. Let\'s try this on
+the `Country` column and look at the number of records for
+each of its values:
+
+```
+alt.Chart(sample_df).mark_bar()\
+ .encode(x='Country',y='count()')
+```
+You should get the following output:
+
+
+
+Caption: Bar chart of the Country column\'s occurrence
+
+We can confirm that `United Kingdom` is the most represented
+country in this dataset (and by far), followed by `Germany`,
+`France`, and `EIRE`. We clearly have imbalanced
+data that may affect the performance of a predictive model. In *Lab
+13*, *Imbalanced Datasets*, we will look at how we can handle this
+situation.
+
+Now, let\'s analyze the datetime column, that is,
+`InvoiceDate`. The `altair` package provides some
+functionality that we can use to group datetime information by period,
+such as day, day of week, month, and so on. For instance, if we want to
+have a monthly view of the distribution of a variable, we can use the
+`yearmonth` function to group datetimes. We also need to
+specify that the type of this variable is ordinal (there is an order
+between the values) by adding `:O` to the column name:
+
+```
+alt.Chart(sample_df).mark_bar()\
+ .encode(alt.X('yearmonth(InvoiceDate):O'),\
+ y='count()')
+```
+You should get the following output:
+
+
+
+Caption: Distribution of InvoiceDate by month
+
+This graph tells us that there was a huge spike of items sold in
+November 2011. Sales peaked at around 800 items in that month, while the
+average is around 300. Was there a promotion or an advertising campaign
+run at that time that can explain this increase? These are the questions
+you may want to ask your stakeholders so that they can confirm this
+sudden increase of sales.
+
+
+Boxplots
+========
+
+
+Now, we will have a look at another specific type of chart called a
+**boxplot**. This kind of graph is used to display the distribution of a
+variable based on its quartiles. Quartiles are the values that split a
+dataset into quarters. Each quarter contains exactly 25% of the
+observations. For example, in the following sample data, the quartiles
+will be as follows:
+
+
+
+Caption: Example of quartiles for the given data
+
+So, the first quartile (usually referred to as Q1) is 4; the second one
+(Q2), which is also the median, is 5; and the third quartile (Q3) is 8.
+
+A boxplot will show these quartiles but also additional information,
+such as the following:
+
+- The **interquartile range (or IQR)**, which corresponds to Q3 - Q1
+- The *lowest* value, which corresponds to Q1 - (1.5 \* IQR)
+- The *highest* value, which corresponds to Q3 + (1.5 \* IQR)
+- Outliers, that is, any point outside of the lowest and highest
+ points:
+
+
+
+
+Caption: Example of a boxplot
+
+With a boxplot, it is quite easy to see the central point (the median),
+where the middle 50% of the data falls (the IQR), and the outliers.
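+
+If you want to compute these values yourself rather than reading them
+off a chart, `pandas` provides the `quantile()` method (a small sketch,
+assuming the `sample_df` DataFrame created earlier in this section):
+
+```
+# Quartiles of the Quantity column
+q1 = sample_df['Quantity'].quantile(0.25)
+q3 = sample_df['Quantity'].quantile(0.75)
+iqr = q3 - q1
+# Whisker boundaries used by the boxplot
+lowest = q1 - 1.5 * iqr
+highest = q3 + 1.5 * iqr
+print(q1, q3, iqr, lowest, highest)
+```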
+
+Another benefit of using a boxplot is to plot the distribution of
+categorical variables against a numerical variable and compare them.
+Let\'s try it with the `Country` and `Quantity`
+columns using the `mark_boxplot()` method:
+
+```
+alt.Chart(sample_df).mark_boxplot()\
+ .encode(x='Country:O', y='Quantity:Q')
+```
+You should receive the following output:
+
+
+
+Caption: Boxplot of the \'Country\' and \'Quantity\' columns
+
+This chart shows us how the `Quantity` variable is distributed
+across the different countries for this dataset. We can see that
+`United Kingdom` has a lot of outliers, especially in the
+negative values. `EIRE` is another country that has negative
+outliers. Most of the countries have very low quantity values except
+for `Japan`, `Netherlands`, and `Sweden`,
+which sold more items.
+
+In this section, we saw how to use the `altair` package to
+generate graphs that helped us get additional insights about the dataset
+and identify some potential issues.
+
+
+
+Exercise 10.04: Visualizing the Ames Housing Dataset with Altair
+----------------------------------------------------------------
+
+In this exercise, we will learn how to get a better understanding of a
+dataset and the relationship between variables using data visualization
+features such as histograms, scatter plots, or boxplots.
+
+Note
+
+You will be using the same Ames housing dataset that was used in the
+previous exercises.
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` and `altair` packages:
+ ```
+ import pandas as pd
+ import altair as alt
+ ```
+
+
+3. Assign the link to the AMES dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab10/dataset/ames_iowa_housing.csv'
+ ```
+
+
+4. Using the `read_csv` method from the `pandas` package, load
+    the dataset into a new variable called `df`:
+
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+    Then, plot the histogram for the `SalePrice` variable using the
+    `mark_bar()` and `encode()` methods from the
+    `altair` package. Use the `alt.X` and
+    `alt.Bin` APIs to specify a bin step size of
+    `50000`:
+
+ ```
+ alt.Chart(df).mark_bar()\
+ .encode(alt.X("SalePrice:Q", bin=alt.Bin(step=50000)),\
+ y='count()')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Histogram of SalePrice
+
+ This chart shows that most of the properties have a sale price
+ centered around `100,000 – 150,000`. There are also a few
+ outliers with a high sale price over `500,000`.
+
+5. Now, let\'s plot the histogram for `LotArea` but this time
+ with a bin step size of `10000`:
+
+ ```
+ alt.Chart(df).mark_bar()\
+ .encode(alt.X("LotArea:Q", bin=alt.Bin(step=10000)),\
+ y='count()')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Histogram of LotArea
+
+ `LotArea` has a totally different distribution compared to
+ `SalePrice`. Most of the observations are between
+ `0` and `20,000`. The rest of the observations
+ represent a small portion of the dataset. We can also notice some
+ extreme outliers over `150,000`.
+
+6. Now, plot a scatter plot with `LotArea` as the *x* axis
+ and `SalePrice` as the *y* axis to understand the
+ interactions between these two variables:
+
+ ```
+ alt.Chart(df).mark_circle()\
+ .encode(x='LotArea:Q', y='SalePrice:Q')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Scatter plot of SalePrice and LotArea
+
+ There is clearly a correlation between the size of the property and
+ the sale price. If we look only at the properties with
+ `LotArea` under 50,000, we can see a linear relationship:
+ if we draw a straight line from the (`0,0`) coordinates to
+ the (`20000,800000`) coordinates, we can say that
+ `SalePrice` increases by 40,000 for each additional
+ increase of 1,000 for `LotArea`. The formula of this
+ straight line (or regression line) will be
+ `SalePrice = 40000 * LotArea / 1000`. We can also see
+ that, for some properties, although their size is quite high, their
+ price didn\'t follow this pattern. For instance, the property with a
+ size of 160,000 has been sold for less than 300,000.
+
+7. Now, let\'s plot the histogram for `OverallCond`, but this
+ time with the default bin step size, that is,
+ (`bin=True`):
+
+ ```
+ alt.Chart(df).mark_bar()\
+ .encode(alt.X("OverallCond", bin=True), \
+ y='count()')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Histogram of OverallCond
+
+ We can see that the values contained in this column are discrete:
+ they can only take a finite number of values (any integer between
+ `1` and `9`). This variable is not numerical,
+ but ordinal: the order matters, but you can\'t perform some
+ mathematical operations on it such as adding value `2` to
+    value `8`. This column is an arbitrary mapping used to assess
+    the overall condition of the property. In the next lab, we will
+ look at how we can change the type of such a column.
+
+8. Build a boxplot with `OverallCond:O` (`':O'` is
+ for specifying that this column is ordinal) on the *x* axis and
+ `SalePrice` on the *y* axis using the
+ `mark_boxplot()` method, as shown in the following code
+ snippet:
+
+ ```
+ alt.Chart(df).mark_boxplot()\
+ .encode(x='OverallCond:O', y='SalePrice:Q')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Boxplot of OverallCond
+
+    It seems that `SalePrice` tends to increase with
+    `OverallCond`: the sale price is higher when the condition
+    value is higher.
+    However, notice that `SalePrice` is quite high for
+ the value 5, which seems to represent a medium condition. There may
+ be other factors impacting the sales price for this category, such
+ as higher competition between buyers for such types of properties.
+
+9. Now, let\'s plot a bar chart for `YrSold` as its *x* axis
+ and `count()` as its *y* axis. Don\'t forget to specify
+ that `YrSold` is an ordinal variable and not numerical
+ using `':O'`:
+
+ ```
+ alt.Chart(df).mark_bar()\
+ .encode(alt.X('YrSold:O'), y='count()')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Bar chart of YrSold
+
+ We can see that, roughly, the same number of properties are sold
+    every year, except for 2010. From 2006 to 2009, there were, on
+    average, 310 properties sold per year, while there were only 170
+ in 2010.
+
+10. Plot a boxplot similar to the one shown in *Step 8* but for
+ `YrSold` as its *x* axis:
+
+ ```
+ alt.Chart(df).mark_boxplot()\
+ .encode(x='YrSold:O', y='SalePrice:Q')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Boxplot of YrSold and SalePrice
+
+ Overall, the median sale price is quite stable across the years,
+ with a slight decrease in 2010.
+
+11. Let\'s analyze the relationship between `SalePrice` and
+ `Neighborhood` by plotting a bar chart, similar to the one
+ shown in *Step 9*:
+
+ ```
+ alt.Chart(df).mark_bar()\
+ .encode(x='Neighborhood',y='count()')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Bar chart of Neighborhood
+
+ The number of sold properties differs, depending on their location.
+    The `'NAmes'` neighborhood has the highest number of
+ properties sold: over 220. On the other hand, neighborhoods such as
+ `'Blueste'` or `'NPkVill'` only had a few
+ properties sold.
+
+12. Let\'s analyze the relationship between `SalePrice` and
+ `Neighborhood` by plotting a boxplot chart similar to the
+ one in *Step 10*:
+
+ ```
+ alt.Chart(df).mark_boxplot()\
+ .encode(x='Neighborhood:O', y='SalePrice:Q')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Boxplot of Neighborhood and SalePrice
+
+
+
+Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques
+--------------------------------------------------------------------------
+
+You are working for a major telecommunications company. The marketing
+department has noticed a recent spike of customer churn (*customers that
+stopped using or canceled their service from the company*).
+
+
+The following steps will help you complete this activity:
+
+1. Download and load the dataset into Python using
+ `.read_csv()`.
+2. Explore the structure and content of the dataset by using
+ `.shape`, `.dtypes`, `.head()`,
+ `.tail()`, or `.sample()`.
+3. Calculate and interpret descriptive statistics with
+ `.describe()`.
+4. Analyze each variable using data visualization with bar charts,
+ histograms, or boxplots.
+5. Identify areas that need clarification from the marketing department
+ and potential data quality issues.
+
+**Expected Output**
+
+Here is the expected bar chart output:
+
+
+
+Caption: Expected bar chart output
+
+Here is the expected histogram output:
+
+
+
+Caption: Expected histogram output
+
+Here is the expected boxplot output:
+
+
+
+Caption: Expected boxplot output
+
+
+
+Summary
+=======
+
+
+You just learned a lot about how to analyze a dataset. This is a very
+critical step in any data science project. Getting a deep understanding
+of the dataset will help you to better assess the feasibility of
+achieving the requirements from the business.
+
+You learned how to use descriptive statistics to summarize key
+attributes of the dataset such as the average value of a numerical
+column, its spread with standard deviation or its range (minimum and
+maximum values), the unique values of a categorical variable, and its
+most frequent values. You also saw how to use data visualization to get
+valuable insights for each variable. Now, you know how to use scatter
+plots, bar charts, histograms, and boxplots to understand the
+distribution of a column.
+
diff --git a/lab_guides/Lab_11.md b/lab_guides/Lab_11.md
new file mode 100644
index 0000000..8889cb8
--- /dev/null
+++ b/lab_guides/Lab_11.md
@@ -0,0 +1,1794 @@
+
+11. Data Preparation
+====================
+
+
+
+Overview
+
+By the end of this lab, you will be able to filter DataFrames with
+specific conditions; remove duplicate or irrelevant records or columns;
+convert variables into different data types; replace values in a column;
+and handle missing values and outlier observations.
+
+This lab will introduce you to the main techniques you can use to
+handle data issues in order to achieve high quality for your dataset
+prior to modeling it.
+
+
+Introduction
+============
+
+
+In the previous lab, you saw how critical it was to get a very good
+understanding of your data and learned about different techniques and
+tools to achieve this goal. While performing **Exploratory Data
+Analysis** (**EDA**) on a given **dataset**, you may find some potential
+issues that need to be addressed before the modeling stage. This is
+exactly the topic that will be covered in this lab. You will learn
+how you can handle some of the most frequent data quality issues and
+prepare the dataset properly.
+
+This lab will introduce you to the issues that you will encounter
+frequently during your data scientist career (such as **duplicated**
+**rows**, incorrect data types, incorrect values, and missing values)
+and you will learn about the techniques you can use to easily fix them.
+But be careful -- some issues that you come across don\'t necessarily
+need to be fixed. Some of the suspicious or unexpected values you find
+may be genuine from a business point of view. This includes values that
+crop up very rarely but are totally genuine. Therefore, it is extremely
+important to get confirmation either from your stakeholder or the data
+engineering team before you alter the dataset. It is your responsibility
+to make sure you are making the right decisions for the business while
+preparing the dataset.
+
+For instance, in *Lab 10*, *Analyzing a Dataset*, you looked at the
+*Online Retail dataset*, which had some negative values in the
+`Quantity` column. Here, we expected only positive values. But
+before fixing this issue straight away (by either dropping the records
+or transforming them into positive values), it is preferable to get in
+touch with your stakeholders first and get confirmation that these
+values are not significant for the business. They may tell you that
+these values are extremely important as they represent returned items
+and cost the company a lot of money, so they want to analyze these cases
+in order to reduce these numbers. If you had moved to the data cleaning
+stage straight away, you would have missed this critical piece of
+information and potentially come up with incorrect results.
+
+
+Handling Row Duplication
+========================
+
+
+Most of the time, the datasets you will receive or have access to will
+not have been 100% cleaned. They usually have some issues that need to
+be fixed. One of these issues could be duplicated rows. Row duplication
+means that several observations contain the exact same information in
+the dataset. With the `pandas` package, it is extremely easy
+to find these cases.
+
+Let\'s use the example that we saw in *Lab 10*, *Analyzing a
+Dataset*.
+
+Start by **importing** the dataset into a DataFrame:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+
+The `duplicated()` method from `pandas` checks
+whether any of the rows are duplicates and returns a **boolean** value
+for each row, `True` if the row is a duplicate and
+`False` if not:
+
+```
+df.duplicated()
+```
+You should get the following output:
+
+
+
+Caption: Output of the duplicated() method
+
+Note
+
+The outputs in this lab have been truncated to effectively use the
+page area.
+
+In Python, the `True` and `False` binary values
+correspond to the numerical values 1 and 0, respectively. To find out
+how many rows have been identified as duplicates, you can use the
+`sum()` method on the output of `duplicated()`. This
+will add all the 1s (that is, `True` values) and gives us the
+count of duplicates:
+
+```
+df.duplicated().sum()
+```
+You should get the following output:
+
+```
+5268
+```
+Since the output of the `duplicated()` method is a
+`pandas` series of binary values for each row, you can also
+use it to subset the rows of a DataFrame. The `pandas` package
+provides different APIs for subsetting a DataFrame, as follows:
+
+- `df[<rows or columns>]`
+- `df.loc[<row labels>, <column names>]`
+- `df.iloc[<row indices>, <column indices>]`
+
+The first API subsets the DataFrame by **row** or **column**. To filter
+specific columns, you can provide a list that contains their names. For
+instance, if you want to keep only the variables, that is,
+`InvoiceNo`, `StockCode`, `InvoiceDate`,
+and `CustomerID`, you need to use the following code:
+
+```
+df[['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']]
+```
+You should get the following output:
+
+
+
+Caption: Subsetting columns
+
+If you only want to filter the rows that are considered duplicates, you
+can use the same API call with the output of the
+`duplicated()` method. It will only keep the rows with
+**True** as a value:
+
+```
+df[df.duplicated()]
+```
+You should get the following output:
+
+
+
+Caption: Subsetting the duplicated rows
+
+If you want to subset the rows and columns at the same time, you must
+use one of the other two available APIs: `.loc` or
+`.iloc`. These APIs do the exact same thing but
+`.loc` uses labels or names while `.iloc` only takes
+indices as input. You will use the `.loc` API to subset the
+duplicated rows and keep only the selected four columns, as shown in the
+previous example:
+
+```
+df.loc[df.duplicated(), ['InvoiceNo', 'StockCode', \
+ 'InvoiceDate', 'CustomerID']]
+```
+You should get the following output:
+
+
+
+Caption: Subsetting the duplicated rows and selected columns using
+.loc
+
+The preceding output shows that the first few duplicates are row
+numbers `517`, `527`, `537`, and so on. By
+default, `pandas` doesn\'t mark the first occurrence of
+duplicates as duplicates: all identical rows will have a value of
+`True` except for their first occurrence. You can change this
+behavior by specifying the `keep` parameter. If you want to
+keep the last duplicate, you need to specify `keep='last'`:
+
+```
+df.loc[df.duplicated(keep='last'), ['InvoiceNo', 'StockCode', \
+ 'InvoiceDate', 'CustomerID']]
+```
+You should get the following output:
+
+
+
+Caption: Subsetting the last duplicated rows
+
+As you can see from the previous outputs, row `485` has the
+same value as row `539`. As expected, row `539` is
+not marked as a duplicate anymore. If you want to mark all the duplicate
+records as duplicates, you will have to use `keep=False`:
+
+```
+df.loc[df.duplicated(keep=False), ['InvoiceNo', 'StockCode',\
+ 'InvoiceDate', 'CustomerID']]
+```
+You should get the following output:
+
+
+
+Caption: Subsetting all the duplicated rows
+
+This time, rows `485` and `539` have been listed as
+duplicates. Now that you know how to identify duplicate observations,
+you can decide whether you wish to remove them from the dataset. As we
+mentioned previously, you must be careful when changing the data. You
+may want to confirm with the business that they are comfortable with you
+doing so. You will have to explain the reason why you want to remove
+these rows. In the Online Retail dataset, if you take rows
+`485` and `539` as an example, these two
+observations are identical. From a business perspective, this means that
+a specific customer (`CustomerID 17908`) has bought the same
+item (`StockCode 22111`) at the exact same date and time
+(`InvoiceDate 2010-12-01 11:45:00`) on the same invoice
+(`InvoiceNo 536409`). This is highly suspicious. When you\'re
+talking with the business, they may tell you that new software was
+released at that time and there was a bug that was splitting all the
+purchased items into single-line items.
+
+In this case, you know that you shouldn\'t remove these rows. On the
+other hand, they may tell you that duplication shouldn\'t happen and
+that it may be due to human error as the data was entered or during the
+data extraction step. Let\'s assume this is the case; now, it is safe
+for you to remove these rows.
+
+To do so, you can use the `drop_duplicates()` method from
+`pandas`. It has the same `keep` parameter as
+`duplicated()`, which specifies which duplicated record you
+want to keep or if you want to remove all of them. In this case, we want
+to keep one row from each group of duplicates -- here, the first
+occurrence:
+
+```
+df.drop_duplicates(keep='first')
+```
+You should get the following output:
+
+
+
+Caption: Dropping duplicate rows with keep=\'first\'
+
+The output of this method is a new DataFrame that contains unique
+records where only the first occurrence of duplicates has been kept. If
+you want to replace the existing DataFrame rather than getting a new
+DataFrame, you need to use the `inplace=True` parameter.
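+
+For example, the following call (shown here only as a sketch) removes
+the duplicates directly from `df` instead of returning a new DataFrame:
+
+```
+# Drop duplicates in place, keeping the first occurrence of each group
+df.drop_duplicates(keep='first', inplace=True)
+```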
+
+The `drop_duplicates()` and `duplicated()` methods
+also have another very useful parameter: `subset`. This
+parameter allows you to specify the list of columns to consider while
+looking for duplicates. By default, all the columns of a DataFrame are
+used to find duplicate rows. Let\'s see how many duplicate rows there
+are while only looking at the `InvoiceNo`,
+`StockCode`, `InvoiceDate`, and
+`CustomerID` columns:
+
+```
+df.duplicated(subset=['InvoiceNo', 'StockCode', 'InvoiceDate',\
+ 'CustomerID'], keep='first').sum()
+```
+You should get the following output:
+
+```
+10677
+```
+
+By looking only at these four columns instead of all of them, we can see
+that the number of duplicate rows has increased from `5268` to
+`10677`. This means that there are rows that have the exact
+same values as these four columns but have different values in other
+columns, which means they may be different records. In this case, it is
+better to use all the columns to identify duplicate records.
+
+
+
+Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset
+--------------------------------------------------------------
+
+In this exercise, you will learn how to identify duplicate records and
+how to handle such issues so that the dataset only contains **unique**
+records. Let\'s get started:
+
+
+1. Open a new **Colab** notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the `Breast Cancer` dataset to a
+ variable called `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab11/dataset/'\
+ 'breast-cancer-wisconsin.data'
+ ```
+
+
+4. Using the `read_csv()` method from the `pandas`
+ package, load the dataset into a new variable called `df`
+ with the `header=None` parameter. We\'re doing this
+ because this file doesn\'t contain column names:
+ ```
+ df = pd.read_csv(file_url, header=None)
+ ```
+
+
+5. Create a variable called `col_names` that contains the
+ names of the columns:
+ `Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses`,
+ and `Class`:
+
+
+
+ ```
+ col_names = ['Sample code number','Clump Thickness',\
+ 'Uniformity of Cell Size',\
+ 'Uniformity of Cell Shape',\
+ 'Marginal Adhesion','Single Epithelial Cell Size',\
+ 'Bare Nuclei','Bland Chromatin',\
+ 'Normal Nucleoli','Mitoses','Class']
+ ```
+
+
+6. Assign the column names of the DataFrame using the
+ `columns` attribute:
+ ```
+ df.columns = col_names
+ ```
+
+
+7. Display the shape of the DataFrame using the `.shape`
+ attribute:
+
+ ```
+ df.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (699, 11)
+ ```
+
+
+ This DataFrame contains `699` rows and `11`
+ columns.
+
+8. Display the first five rows of the DataFrame using the
+ `head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the Breast Cancer dataset
+
+    All the variables are numerical. The `Sample code number` column is
+    an identifier for the measurement samples.
+
+9. Find the number of duplicate rows using the `duplicated()`
+ and `sum()` methods:
+
+ ```
+ df.duplicated().sum()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 8
+ ```
+
+
+ Looking at the 11 columns in this dataset, we can see that there are
+ `8` duplicate rows.
+
+10. Display the duplicate rows using `.loc` and the
+    `duplicated()` method:
+
+ ```
+ df.loc[df.duplicated()]
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Duplicate records
+
+ The following rows are duplicates: `208`, `253`,
+ `254`, `258`, `272`, `338`,
+ `561`, and `684`.
+
+11. Display the duplicate rows just like we did in *Step 10*, but with
+    the `keep='last'` parameter instead:
+
+ ```
+ df.loc[df.duplicated(keep='last')]
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Duplicate records with keep=\'last\'
+
+ By using the `keep='last'` parameter, the following rows
+ are considered duplicates: `42`, `62`,
+ `168`, `207`, `267`, `314`,
+ `560`, and `683`. By comparing this output to
+ the one from the previous step, we can see that rows 253 and 42 are
+ identical.
+
+12. Remove the duplicate rows using the `drop_duplicates()`
+ method along with the `keep='first'` parameter and save
+ this into a new DataFrame called `df_unique`:
+ ```
+ df_unique = df.drop_duplicates(keep='first')
+ ```
+
+
+13. Display the shape of `df_unique` with the
+ `.shape` attribute:
+
+ ```
+ df_unique.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (691, 11)
+ ```
+
+
+ Now that we have removed the eight duplicate records, only
+ `691` rows remain. Now, the dataset only contains unique
+ observations.
+
+
+
+In this exercise, you learned how to identify and remove duplicate
+records from a real-world dataset.
+
+
+Converting Data Types
+=====================
+
+
+Another problem you may face in a project is incorrect data types being
+inferred for some columns. As we saw in *Lab 10*, *Analyzing a
+Dataset*, the `pandas` package provides us with a very easy
+way to display the data type of each column using the
+`.dtypes` attribute. You may be wondering, when did
+`pandas` identify the type of each column? The types are
+detected when you load the dataset into a `pandas` DataFrame
+using methods such as `read_csv()`, `read_excel()`,
+and so on.
+
+During loading, `pandas` will try its best to
+automatically find the best type according to the values contained in
+each column. Let\'s see how this works on the `Online Retail`
+dataset.
+
+First, you must import `pandas`:
+
+```
+import pandas as pd
+```
+
+Then, you need to assign the URL to the dataset to a new variable:
+
+```
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+```
+Let\'s load the dataset into a `pandas` DataFrame using
+`read_excel()`:
+
+```
+df = pd.read_excel(file_url)
+```
+Finally, let\'s print the data type of each column:
+
+```
+df.dtypes
+```
+You should get the following output:
+
+
+
+Caption: The data type of each column of the Online Retail dataset
+
+The preceding output shows the data types that have been assigned to
+each column. `Quantity`, `UnitPrice`, and
+`CustomerID` have been identified as numerical variables
+(`int64`, `float64`), `InvoiceDate` is a
+`datetime` variable, and all the other columns are considered
+text (`object`). This is not too bad. `pandas` did a
+great job of recognizing non-text columns.
+
+But what if you want to change the types of some columns? You have two
+ways to achieve this.
+
+The first way is to reload the dataset, but this time, you will need to
+specify the data types of the columns of interest using the
+`dtype` parameter. This parameter takes a dictionary with the
+column names as keys and the correct data types as values, such as
+`{'col1': np.float64, 'col2': np.int32}`, as input. Let\'s try this on
+`CustomerID`. We know this isn\'t a numerical variable as it
+contains a unique **identifier** (code). Here, we are going to change
+its type to **category**:
+
+```
+df = pd.read_excel(file_url, dtype={'CustomerID': 'category'})
+df.dtypes
+```
+You should get the following output:
+
+
+
+Caption: The data types of each column after converting CustomerID
+
+As you can see, the data type for `CustomerID` has effectively
+changed to a `category` type.
+
+Now, let\'s look at the second way of converting a single column into a
+different type. In `pandas`, you can use the
+`astype()` method and specify the new data type that it will
+be converted into as its **parameter**. It will return a new column (a
+new `pandas` series, to be more precise), so you need to
+reassign it to the same column of the DataFrame. For instance, if you
+want to change the `InvoiceNo` column into a categorical
+variable, you would do the following:
+
+```
+df['InvoiceNo'] = df['InvoiceNo'].astype('category')
+df.dtypes
+```
+You should get the following output:
+
+
+
+Caption: The data types of each column after converting InvoiceNo
+
+As you can see, the data type for `InvoiceNo` has changed to a
+categorical variable. The difference between `object` and
+`category` is that the latter has a finite number of possible
+values (also called discrete variables). Once these have been changed
+into categorical variables, `pandas` will automatically list
+all the values. They can be accessed using the
+`.cat.categories` attribute:
+
+```
+df['InvoiceNo'].cat.categories
+```
+You should get the following output:
+
+
+
+Caption: List of categories (possible values) for the InvoiceNo
+categorical variable
+
+`pandas` has identified that there are 25,900 different values
+in this column and has listed all of them. Depending on the data type
+that\'s assigned to a variable, `pandas` provides different
+attributes and methods that are very handy for data transformation or
+feature engineering (this will be covered in *Lab 12*, *Feature
+Engineering*).
+
+As a final note, you may be wondering when you would use the first way
+of changing the types of certain columns (while loading the dataset). To
+find out the current type of each variable, you must load the data
+first, so why would you reload the data again with new data
+types when it would be easier to change the type with the
+`astype()` method after the first load? There are a few
+reasons why you would use it. One reason could be that you have already
+explored the dataset on a different tool, such as Excel, and already
+know what the correct data types are.
+
+The second reason could be that your dataset is big, and you cannot load
+it in its entirety. As you may have noticed, by default,
+`pandas` uses 64-bit encoding for numerical variables. This
+requires a lot of memory and may be overkill.
+
+For example, the `Quantity` column has an int64 data type,
+which means that the range of possible values is
+-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. However, in
+*Lab 10*, *Analyzing a Dataset* while analyzing the distribution of
+this column, you learned that the range of values for this column is
+only from -80,995 to 80,995. You don\'t need to use so much space. By
+reducing the data type of this variable to int32 (which ranges from
+-2,147,483,648 to 2,147,483,647), you may be able to reload the entire
+dataset.
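+
+A minimal sketch of what this downcasting could look like, assuming the
+Online Retail DataFrame and `file_url` from earlier (the exact memory
+figures will depend on your environment):
+
+```
+# Option 1: downcast the column after loading and compare memory usage
+print(df['Quantity'].memory_usage(deep=True))
+df['Quantity'] = df['Quantity'].astype('int32')
+print(df['Quantity'].memory_usage(deep=True))
+
+# Option 2: ask pandas for the smaller type directly at load time
+df_small = pd.read_excel(file_url, dtype={'Quantity': 'int32'})
+```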
+
+
+
+Exercise 11.02: Converting Data Types for the Ames Housing Dataset
+------------------------------------------------------------------
+
+In this exercise, you will prepare a dataset by converting its variables
+into the correct data types.
+
+You will use the Ames Housing dataset to do this, which we also used in
+*Lab 10*, *Analyzing a Dataset*. For more information about this
+dataset, refer to the following note. Let\'s get started:
+
+
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the Ames dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab10/dataset/ames_iowa_housing.csv'
+ ```
+
+
+4. Using the `read_csv` method from the `pandas`
+ package, load the dataset into a new variable called `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Print the data type of each column using the `dtypes`
+ attribute:
+
+ ```
+ df.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of columns and their assigned data types
+
+ Note
+
+ The preceding output has been truncated.
+
+ From *Lab 10*, *Analyzing a Dataset* you know that the
+ `Id`, `MSSubClass`, `OverallQual`, and
+ `OverallCond` columns have been incorrectly classified as
+ numerical variables. They have a finite number of unique values and
+ you can\'t perform any mathematical operations on them. For example,
+ it doesn\'t make sense to add, remove, multiply, or divide two
+ different values from the `Id` column. Therefore, you need
+ to convert them into categorical variables.
+
+6. Using the `astype()` method, convert the `'Id'`
+ column into a categorical variable, as shown in the following code
+ snippet:
+ ```
+ df['Id'] = df['Id'].astype('category')
+ ```
+
+
+7. Convert the `'MSSubClass'`, `'OverallQual'`, and
+ `'OverallCond'` columns into categorical variables, like
+ we did in the previous step:
+ ```
+ df['MSSubClass'] = df['MSSubClass'].astype('category')
+ df['OverallQual'] = df['OverallQual'].astype('category')
+ df['OverallCond'] = df['OverallCond'].astype('category')
+ ```
+
+
+8. Create a `for` loop that will iterate through the four categorical
+    columns (`'Id'`, `'MSSubClass'`, `'OverallQual'`, and
+    `'OverallCond'`)
+ and print their names and categories using the
+ `.cat.categories` attribute:
+
+ ```
+ for col_name in ['Id', 'MSSubClass', 'OverallQual', \
+ 'OverallCond']:
+ print(col_name)
+ print(df[col_name].cat.categories)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of categories for the four newly converted
+ variables
+
+ Now, these four columns have been converted into categorical
+ variables. From the output of *Step 5*, we can see that there are a
+ lot of variables of the `object` type. Let\'s have a look
+ at them and see if they need to be converted as well.
+
+9. Create a new DataFrame called `obj_df` that will only
+ contain variables of the `object` type using the
+ `select_dtypes` method along with the
+ `include='object'` parameter:
+ ```
+ obj_df = df.select_dtypes(include='object')
+ ```
+
+
+10. Create a new variable called `obj_cols` that contains a
+ list of column names from the `obj_df` DataFrame using the
+ `.columns` attribute and display its content:
+
+ ```
+ obj_cols = obj_df.columns
+ obj_cols
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of variables of the \'object\' type
+
+11. Like we did in *Step 8*, create a `for` loop that will
+ iterate through the column names contained in `obj_cols`
+ and print their names and unique values using the
+ `unique()` method:
+
+ ```
+ for col_name in obj_cols:
+ print(col_name)
+ print(df[col_name].unique())
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of unique values for each variable of the
+ \'object\' type
+
+ As you can see, all these columns have a finite number of unique
+ values that are composed of text, which shows us that they are
+ categorical variables.
+
+12. Now, create a `for` loop that will iterate through the
+ column names contained in `obj_cols` and convert each of
+ them into a categorical variable using the `astype()`
+ method:
+ ```
+ for col_name in obj_cols:
+ df[col_name] = df[col_name].astype('category')
+ ```
+
+
+13. Print the data type of each column using the `dtypes`
+ attribute:
+
+ ```
+ df.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: List of variables and their new data types
+
+
+You have successfully converted the columns that have incorrect data
+types (numerical or object) into categorical variables. Your dataset is
+now one step closer to being prepared for modeling.
+
+In the next section, we will look at handling incorrect values.
+
+
+Handling Incorrect Values
+=========================
+
+
+Let\'s learn how to detect such issues in real life by using the
+`Online Retail` dataset.
+
+First, you need to load the data into a `pandas` DataFrame:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+
+In this dataset, there are two variables that seem to be related to each
+other: `StockCode` and `Description`. The first one
+contains the identifier code of the items sold and the other one
+contains their descriptions. However, if you look at some of the
+examples, such as `StockCode 23131`, the
+`Description` column has different values:
+
+```
+df.loc[df['StockCode'] == 23131, 'Description'].unique()
+```
+You should get the following output:
+
+
+
+Caption: List of unique values for the Description column and
+StockCode 23131
+
+There are multiple issues in the preceding output. One issue is that the
+word `Mistletoe` has been misspelled so that it reads
+`Miseltoe`. The other errors are unexpected values and missing
+values, which will be covered in the next section. It seems that the
+`Description` column has been used to record comments such as
+`had been put aside`.
+
+Let\'s focus on the misspelling issue. What we need to do here is modify
+the incorrect spelling and replace it with the correct value. First,
+let\'s create a new column called `StockCodeDescription`,
+which is an exact copy of the `Description` column:
+
+```
+df['StockCodeDescription'] = df['Description']
+```
+You will use this new column to fix the misspelling issue. To do this,
+use the subsetting technique you learned about earlier in this lab.
+You need to use `.loc` to filter the rows and column you
+want, that is, all rows with `StockCode == 23131` and
+`StockCodeDescription == 'MISELTOE HEART WREATH CREAM'`, and the
+`StockCodeDescription` column:
+
+```
+df.loc[(df['StockCode'] == 23131) \
+ & (df['StockCodeDescription'] \
+ == 'MISELTOE HEART WREATH CREAM'), \
+ 'StockCodeDescription'] = 'MISTLETOE HEART WREATH CREAM'
+```
+If you reprint the value for this issue, you will see that the
+misspelling value has been fixed and is not present anymore:
+
+```
+df.loc[df['StockCode'] == 23131, 'StockCodeDescription'].unique()
+```
+You should get the following output:
+
+
+
+Caption: List of unique values for the Description column and
+StockCode 23131 after fixing the first misspelling issue
+
+As you can see, there are still five different values for this product,
+and in one of them, the word `MISTLETOE` has been spelled
+incorrectly as `MISELTOE`.
+
+This time, rather than looking at an exact match (a word must be the
+same as another one), we will look at performing a partial match (part
+of a word will be present in another word). In our case, instead of
+looking at the spelling of `MISELTOE`, we will only look at
+`MISEL`. The `pandas` package provides a method
+called `.str.contains()` that we can use to look for
+observations that partially match with a given expression.
+
+Let\'s use this to see if we have the same misspelling issue
+(`MISEL`) in the entire dataset. You will need to add one
+additional parameter since this method doesn\'t handle missing values.
+You will also have to subset the rows that don\'t have missing values
+for the `Description` column. This can be done by providing
+the `na=False` parameter to the `.str.contains()`
+method:
+
+```
+df.loc[df['StockCodeDescription']\
+ .str.contains('MISEL', na=False),]
+```
+You should get the following output:
+
+
+
+Caption: Displaying all the rows containing the misspelling
+\'MISELTOE\'
+
+This misspelling issue (`MISELTOE`) is not only related to
+`StockCode 23131`, but also to other items. You will need to
+fix all of these using the `str.replace()` method. This method
+takes the string of characters to be replaced and the replacement string
+as parameters:
+
+```
+df['StockCodeDescription'] = df['StockCodeDescription']\
+ .str.replace\
+ ('MISELTOE', 'MISTLETOE')
+```
+Now, if you print all the rows that contain the misspelling of
+`MISEL`, you will see that no such rows exist anymore:
+
+```
+df.loc[df['StockCodeDescription']\
+ .str.contains('MISEL', na=False),]
+```
+You should get the following output
+
+
+
+
+You just saw how easy it is to clean observations that have incorrect
+values using the `.str.contains` and
+`.str.replace()` methods that are provided by the
+`pandas` package. These methods can only be used for variables
+containing strings, but the same logic can be applied to numerical
+variables and can also be used to handle extreme values or outliers. You
+can use the ==, \>, \<, \>=, or \<= operator to subset the rows you want
+and then replace the observations with the correct values.
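+
+For example, here is a minimal sketch of capping extreme values in a
+numerical column, reusing the `df` DataFrame loaded above (this
+particular treatment of `UnitPrice` is purely illustrative and is not
+part of the original cleaning steps):
+
+```
+# Illustrative sketch: cap extreme UnitPrice values at the 99th percentile
+upper_limit = df['UnitPrice'].quantile(0.99)
+df.loc[df['UnitPrice'] > upper_limit, 'UnitPrice'] = upper_limit
+```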
+
+
+
+Exercise 11.03: Fixing Incorrect Values in the State Column
+-----------------------------------------------------------
+
+In this exercise, you will clean the `State` variable in a
+modified version of a dataset listing all the finance officers in the
+USA. We are doing this because the dataset contains some incorrect
+values. Let\'s get started:
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab11/dataset/officers.csv'
+ ```
+
+
+4. Using the `read_csv()` method from the `pandas`
+ package, load the dataset into a new variable called `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Print the first five rows of the DataFrame using the
+ `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the finance officers dataset
+
+6. Print out all the unique values of the `State` variable:
+
+ ```
+ df['State'].unique()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of unique values in the State column
+
+    All the states have been encoded as two capitalized characters.
+    As you can see, there are some incorrect values with
+ non-capitalized characters, such as `il` and
+ `iL` (they look like spelling errors for Illinois), and
+ unexpected values such as `8I`, `I`, and
+ `60`. In the next few steps, you are going to fix these
+ issues.
+
+7. Print out the rows that have the `il` value in the
+ `State` column using the `pandas`
+ `.str.contains()` method and the subsetting API, that is,
+ DataFrame \[condition\]. You will also have to set the
+ `na` parameter to `False` in
+ `str.contains()` in order to exclude observations with
+ missing values:
+
+ ```
+ df[df['State'].str.contains('il', na=False)]
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Observations with a value of il
+
+ As you can see, all the cities with the `il` value are
+ from the state of Illinois. So, the correct `State` value
+ should be `IL`. You may be thinking that the following
+ values are also referring to Illinois: `Il`,
+ `iL`, and `Il`. We\'ll have a look at them next.
+
+8. Now, create a `for` loop that will iterate through the
+ following values in the `State` column: `Il`,
+ `iL`, `Il`. Then, print out the values of the
+ City and State variables using the `pandas` method for
+ subsetting, that is, `.loc()`:
+    DataFrame.loc\[row\_condition, column\_condition\]. Do this for each
+ observation:
+
+ ```
+ for state in ['Il', 'iL', 'Il']:
+ print(df.loc[df['State'] == state, ['City', 'State']])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Observations with the il value
+
+ Note
+
+ The preceding output has been truncated.
+
+ As you can see, all these cities belong to the state of Illinois.
+ Let\'s replace them with the correct values.
+
+9. Create a condition mask (`il_mask`) to subset all the rows
+ that contain the four incorrect values (`il`,
+ `Il`, `iL`, and `Il`) by using the
+ `isin()` method and a list of these values as a parameter.
+ Then, save the result into a variable called `il_mask`:
+ ```
+ il_mask = df['State'].isin(['il', 'Il', 'iL', 'Il'])
+ ```
+
+
+10. Print the number of rows that match the condition we set in
+ `il_mask` using the `.sum()` method. This will
+ sum all the rows that have a value of `True` (they match
+ the condition):
+
+ ```
+ il_mask.sum()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 672
+ ```
+
+
+11. Using the `pandas` `.loc()` method, subset the
+ rows with the `il_mask` condition mask and replace the
+ value of the `State` column with `IL`:
+ ```
+ df.loc[il_mask, 'State'] = 'IL'
+ ```
+
+
+12. Print out all the unique values of the `State` variable
+ once more:
+
+ ```
+ df['State'].unique()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of unique values for the \'State\' column
+
+ As you can see, the four incorrect values are not present anymore.
+ Let\'s have a look at the other remaining incorrect values:
+ `II`, `I`, `8I`, and `60`.
+    We will look at dealing with `II` in the next step.
+
+    Print out the rows that have a value of `II` in the
+    `State` column using the `pandas` subsetting
+ API, that is, DataFrame.loc\[row\_condition, column\_condition\]:
+
+ ```
+ df.loc[df['State'] == 'II',]
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+    Caption: Subsetting the rows with a value of II in the State
+ column
+
+ There are only two cases where the `II` value has been
+ used for the `State` column and both have
+ `Bloomington` as the city, which is in Illinois. Here, the
+ correct `State` value should be `IL`.
+
+13. Now, create a `for` loop that iterates through the three
+ incorrect values (`I`, `8I`, and `60`)
+ and print out the subsetted rows using the same logic that we used
+ in *Step 12*. Only display the `City` and
+ `State` columns:
+
+ ```
+ for val in ['I', '8I', '60']:
+ print(df.loc[df['State'] == val, ['City', 'State']])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Observations with incorrect values (I, 8I, and 60)
+
+ All the observations that have incorrect values are cities in
+ Illinois. Let\'s fix them now.
+
+14. Create a `for` loop that iterates through the four
+ incorrect values (`II`, `I`, `8I`, and
+ `60`) and reuse the subsetting logic from *Step 12* to
+ replace the value in `State` with `IL`:
+ ```
+ for val in ['II', 'I', '8I', '60']:
+ df.loc[df['State'] == val, 'State'] = 'IL'
+ ```
+
+
+15. Print out all the unique values of the `State` variable:
+
+ ```
+ df['State'].unique()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of unique values for the State column
+
+ You fixed the issues for the state of Illinois. However, there are
+ two more incorrect values in this column: `In` and
+ `ng`.
+
+16. Repeat *Step 13*, but iterate through the `In` and
+ `ng` values instead:
+
+ ```
+ for val in ['In', 'ng']:
+ print(df.loc[df['State'] == val, ['City', 'State']])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Observations with incorrect values (In, ng)
+
+ The rows that have the `ng` value in `State` are
+ missing values. We will cover this topic in the next section. The
+ observation that has `In` as its `State` is a
+ city in Indiana, so the correct value should be `IN`.
+ Let\'s fix this.
+
+17. Subset the rows containing the `In` value in
+ `State` using the `.loc()` and
+ `.str.contains()` methods and replace the state value with
+ `IN`. Don\'t forget to specify the `na=False`
+    parameter to `.str.contains()`:
+
+ ```
+ df.loc[df['State']\
+ .str.contains('In', na=False), 'State'] = 'IN'
+ ```
+
+
+ Print out all the unique values of the `State` variable:
+
+ ```
+ df['State'].unique()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: List of unique values for the State column
+
+
+You just fixed all the incorrect values for the `State`
+variable using the methods provided by the `pandas` package.
+In the next section, we are going to look at handling missing values.
+
+
+Handling Missing Values
+=======================
+
+
+So far, you have looked at a variety of issues when it comes to
+datasets. Now it is time to discuss another issue that occurs quite
+frequently: missing values. As you may have guessed, this type of issue
+means that certain values are missing for certain variables.
+
+The `pandas` package provides a method that we can use to
+identify missing values in a DataFrame: `.isna()`. Let\'s see
+it in action on the `Online Retail` dataset. First, you need
+to import `pandas` and load the data into a DataFrame:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab10/dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+
+The `.isna()` method returns a `pandas` series with
+a binary value for each cell of a DataFrame and states whether it is
+missing a value (`True`) or not (`False`):
+
+```
+df.isna()
+```
+You should get the following output:
+
+
+
+Caption: Output of the .isna() method
+
+As we saw previously, we can give the output of a binary variable to the
+`.sum()` method, which will add all the `True`
+values together (cells that have missing values) and provide a summary
+for each column:
+
+```
+df.isna().sum()
+```
+You should get the following output:
+
+
+
+Caption: Summary of missing values for each variable
+
+As you can see, there are `1454` missing values in the
+`Description` column and `135080` in the
+`CustomerID` column. Let\'s have a look at the missing value
+observations for `Description`. You can use the output of the
+`.isna()` method to subset the rows with missing values:
+
+```
+df[df['Description'].isna()]
+```
+You should get the following output:
+
+
+
+Caption: Subsetting the rows with missing values for Description
+
+From the preceding output, you can see that all the rows with missing
+values have `0.0` as the unit price and are missing the
+`CustomerID` column. In a real project, you will have to
+discuss these cases with the business and check whether these
+transactions are genuine or not. If the business confirms that these
+observations are irrelevant, then you will need to remove them from the
+dataset.
+
+The `pandas` package provides a method that we can use to
+easily remove missing values: `.dropna()`. This method returns
+a new DataFrame without all the rows that have missing values. By
+default, it will look at all the columns. You can specify a list of
+columns for it to look for with the `subset` parameter:
+
+```
+df.dropna(subset=['Description'])
+```
+This method returns a new DataFrame with no missing values for the
+specified columns. If you want to replace the original dataset directly,
+you can use the `inplace=True` parameter:
+
+```
+df.dropna(subset=['Description'], inplace=True)
+```
+Now, look at the summary of the missing values for each variable:
+
+```
+df.isna().sum()
+```
+You should get the following output:
+
+
+
+Caption: Summary of missing values for each variable
+
+As you can see, there are no more missing values in the
+`Description` column. Let\'s have a look at the
+`CustomerID` column:
+
+```
+df[df['CustomerID'].isna()]
+```
+You should get the following output:
+
+
+
+Caption: Rows with missing values in CustomerID
+
+This time, all the transactions look normal, except that they are missing
+the value for the `CustomerID` column; all the other variables
+have been filled in with values that seem genuine. There is no other way
+to infer the missing values for the `CustomerID` column. These
+rows represent almost 25% of the dataset, so we can\'t remove them.
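+
+If you want to check this proportion yourself, a quick sketch (reusing
+the `df` DataFrame from above, after dropping the rows with a missing
+`Description`) is the following; `.isna()` returns a Boolean Series and
+`.mean()` gives the fraction of `True` values:
+
+```
+# Fraction of rows with a missing CustomerID (should be close to 0.25)
+df['CustomerID'].isna().mean()
+```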
+
+However, most algorithms require a value for each observation, so you
+need to provide one for these cases. We will use the
+`.fillna()` method from `pandas` to do this. Provide
+the value to be imputed as `Missing` and use
+`inplace=True` as a parameter:
+
+```
+df['CustomerID'].fillna('Missing', inplace=True)
+df[1443:1448]
+```
+You should get the following output:
+
+
+
+Caption: Examples of rows where missing values for CustomerID have
+been replaced with Missing
+
+Let\'s see if we have any missing values in the dataset:
+
+```
+df.isna().sum()
+```
+You should get the following output:
+
+
+
+Caption: Summary of missing values for each variable
+
+You have successfully fixed all the missing values in this dataset.
+These methods also work when we want to handle missing numerical
+variables. We will look at this in the following exercise. All you need
+to do is provide a numerical value when you want to impute a value with
+`.fillna()`.
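+
+For instance, a minimal sketch of imputing a numerical column with its
+median could look like the following (`UnitPrice` is used purely for
+illustration here; it has no missing values in this dataset, so the call
+is a no-op):
+
+```
+# Illustrative sketch: impute missing values in a numerical column
+# with the column's median
+unit_price_median = df['UnitPrice'].median()
+df['UnitPrice'].fillna(unit_price_median, inplace=True)
+```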
+
+
+
+Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
+-----------------------------------------------------------------
+
+In this exercise, you will be cleaning out all the missing values for
+all the numerical variables in the `Horse Colic` dataset.
+
+Colic is a painful condition that horses can suffer from, and this
+dataset contains various pieces of information related to specific cases
+of this condition. You can use the link provided in the Note section if
+you want to find out more about the dataset\'s attributes. Let\'s get
+started:
+
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'http://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab11/dataset/horse-colic.data'
+ ```
+
+
+4. Using the `.read_csv()` method from the `pandas`
+ package, load the dataset into a new variable called `df`
+    and specify the `header=None`, `sep='\s+'`,
+    and `prefix='X'` parameters:
+ ```
+ df = pd.read_csv(file_url, header=None, \
+ sep='\s+', prefix='X')
+ ```
+
+
+5. Print the first five rows of the DataFrame using the
+ `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the Horse Colic dataset
+
+ As you can see, the authors have used the `?` character
+ for missing values, but the `pandas` package thinks that
+ this is a normal value. You need to transform them into missing
+ values.
+
+6. Reload the dataset into a `pandas` DataFrame using the
+ `.read_csv()` method, but this time, add the
+ `na_values='?'` parameter in order to specify that this
+ value needs to be treated as a missing value:
+ ```
+ df = pd.read_csv(file_url, header=None, sep='\s+', \
+ prefix='X', na_values='?')
+ ```
+
+
+7. Print the first five rows of the DataFrame using the
+ `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the Horse Colic dataset
+
+    Now, you can see that `pandas` has converted all the
+ `?` values into missing values.
+
+8. Print the data type of each column using the `dtypes`
+ attribute:
+
+ ```
+ df.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Data type of each column
+
+9. Print the number of missing values for each column by combining the
+ `.isna()` and `.sum()` methods:
+
+ ```
+ df.isna().sum()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Number of missing values for each column
+
+10. Create a condition mask called `x0_mask` so that you can
+ find the missing values in the `X0` column using the
+ `.isna()` method:
+ ```
+ x0_mask = df['X0'].isna()
+ ```
+
+
+11. Display the number of missing values for this column by using the
+ `.sum()` method on `x0_mask`:
+
+ ```
+ x0_mask.sum()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 1
+ ```
+
+
+ Here, you got the exact same number of missing values for
+ `X0` that you did in *Step 9*.
+
+12. Extract the median of `X0` using the `.median()`
+ method and store it in a new variable called `x0_median`.
+ Print its value:
+
+ ```
+ x0_median = df['X0'].median()
+ print(x0_median)
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 1.0
+ ```
+
+
+ The median value for this column is `1`. You will replace
+ all the missing values with this value in the `X0` column.
+
+13. Replace all the missing values in the `X0` variable with
+ their median using the `.fillna()` method, along with the
+ `inplace=True` parameter:
+ ```
+ df['X0'].fillna(x0_median, inplace=True)
+ ```
+
+
+14. Print the number of missing values for `X0` by combining
+ the `.isna()` and `.sum()` methods:
+
+ ```
+ df['X0'].isna().sum()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 0
+ ```
+
+
+    There are no more missing values in this variable.
+
+15. Create a `for` loop that will iterate through all the
+ columns of the DataFrame. In the for loop, calculate the median for
+ each and save them into a variable called `col_median`.
+ Then, impute missing values with this median value using the
+ `.fillna()` method, along with the
+ `inplace=True` parameter, and print the name of the column
+ and its median value:
+
+ ```
+ for col_name in df.columns:
+ col_median = df[col_name].median()
+ df[col_name].fillna(col_median, inplace=True)
+ print(col_name)
+ print(col_median)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Median values for each column
+
+16. Print the number of missing values for each column by combining the
+ `.isna()` and `.sum()` methods:
+
+ ```
+ df.isna().sum()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Number of missing values for each column
+
+
+You have successfully fixed the missing values for all the numerical
+variables using the methods provided by the `pandas` package:
+`.isna()` and `.fillna()`.
+
+
+
+Activity 11.01: Preparing the Speed Dating Dataset
+--------------------------------------------------
+
+As an entrepreneur, you are planning to launch a new dating app into the
+market. The key feature that will differentiate your app from other
+competitors will be your high performing user-matching algorithm. Before
+building this model, you have partnered with a speed dating company to
+collect data from real events. You just received the dataset from your
+partner company but realized it is not as clean as you expected; there
+are missing and incorrect values. Your task is to fix the main data
+quality issues in this dataset.
+
+The following steps will help you complete this activity:
+
+1. Download and load the dataset into Python using
+ `.read_csv()`.
+
+2. Print out the dimensions of the DataFrame using `.shape`.
+
+3. Check for duplicate rows by using `.duplicated()` and
+ `.sum()` on all the columns.
+
+4. Check for duplicate rows by using `.duplicated() `and
+ `.sum()` for the identifier columns (`iid`,
+ `id`, `partner`, and `pid`).
+
+5. Check for unexpected values for the following numerical variables:
+ `'imprace', 'imprelig', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping',`
+ and `'yoga'`.
+
+6. Replace the identified incorrect values.
+
+7. Check the data type of the different columns using
+ `.dtypes`.
+
+8. Change the data types to categorical for the columns that don\'t
+ contain numerical values using `.astype()`.
+
+9. Check for any missing values using `.isna()` and
+ `.sum()` for each numerical variable.
+
+10. Replace the missing values for each numerical variable with their
+ corresponding mean or median values using `.fillna()`,
+ `.mean()`, and `.median()`.
+
+
+
+You should get the following output. The figure represents the number of
+rows with unexpected values for `imprace` and a list of
+unexpected values:
+
+
+
+
+The following figure illustrates the number of rows with unexpected
+values and a list of unexpected values for each column:
+
+
+
+The following figure illustrates a list of unique values for gaming:
+
+
+
+Caption: List of unique values for gaming
+
+The following figure displays the data types of each column:
+
+
+
+Caption: Data types of each column
+
+The following figure displays the updated data types of each column:
+
+
+
+Caption: Data types of each column
+
+The following figure displays the number of missing values for numerical
+variables:
+
+
+
+Caption: Number of missing values for numerical variables
+
+The following figure displays the list of unique values for
+`int_corr`:
+
+
+
+Caption: List of unique values for \'int\_corr\'
+
+The following figure displays the list of unique values for numerical
+variables:
+
+
+
+Caption: List of unique values for numerical variables
+
+The following figure displays the number of missing values for numerical
+variables:
+
+
+
+Caption: Number of missing values for numerical variables
+
+
+Summary
+=======
+
+
+In this lab, you learned how important it is to prepare any given
+dataset and fix the main quality issues it has. This is critical because
+the cleaner a dataset is, the easier it will be for any machine learning
+model to easily learn about the relevant patterns. On top of this, most
+algorithms can\'t handle issues such as missing values, so they must be
+handled prior to the modeling phase. In this lab, you covered the
+most frequent issues that are faced in data science projects: duplicate
+rows, incorrect data types, unexpected values, and missing values.
diff --git a/lab_guides/Lab_12.md b/lab_guides/Lab_12.md
new file mode 100644
index 0000000..4510e4d
--- /dev/null
+++ b/lab_guides/Lab_12.md
@@ -0,0 +1,1749 @@
+
+12. Feature Engineering
+=======================
+
+
+
+Overview
+
+By the end of this lab, you will be able to merge multiple datasets
+together; bin categorical and numerical variables; perform aggregation
+on data; and manipulate dates using `pandas`.
+
+This lab will introduce you to some of the key techniques for
+creating new variables on an existing dataset.
+
+
+Merging Datasets
+----------------
+
+
+First, we need to import the Online Retail dataset into a
+`pandas` DataFrame:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab12/Dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+df.head()
+```
+You should get the following output.
+
+
+
+Caption: First five rows of the Online Retail dataset
+
+Next, we are going to load all the public holidays in the UK into
+another `pandas` DataFrame. From *Lab 10*, *Analyzing a
+Dataset* we know the records of this dataset are only for the years 2010
+and 2011. So we are going to extract public holidays for those two
+years, but we need to do so in two separate steps, as the API provided
+by `date.nager` only returns one year at a time.
+
+Let\'s focus on 2010 first:
+
+```
+uk_holidays_2010 = pd.read_csv\
+ ('https://date.nager.at/PublicHoliday/'\
+ 'Country/GB/2010/CSV')
+```
+We can print its shape to see how many rows and columns it has:
+
+```
+uk_holidays_2010.shape
+```
+You should get the following output.
+
+```
+(13, 8)
+```
+We can see there were `13` public holidays in that year and
+there are `8` different columns.
+
+Let\'s print the first five rows of this DataFrame:
+
+```
+uk_holidays_2010.head()
+```
+You should get the following output:
+
+
+
+Caption: First five rows of the UK 2010 public holidays DataFrame
+
+Now that we have the list of public holidays for 2010, let\'s extract
+the ones for 2011:
+
+```
+uk_holidays_2011 = pd.read_csv\
+ ('https://date.nager.at/PublicHoliday/'\
+ 'Country/GB/2011/CSV')
+uk_holidays_2011.shape
+```
+You should get the following output.
+
+```
+(15, 8)
+```
+
+There were `15` public holidays in 2011. Now we need to
+combine the records of these two DataFrames. We will use the
+`.append()` method from `pandas` and assign the
+results into a new DataFrame:
+
+```
+uk_holidays = uk_holidays_2010.append(uk_holidays_2011)
+```
+Let\'s check we have the right number of rows after appending the two
+DataFrames:
+
+```
+uk_holidays.shape
+```
+You should get the following output:
+
+```
+(28, 8)
+```
+We got `28` records, which corresponds with the total number
+of public holidays in 2010 and 2011.
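+
+As a side note, if you ever need to cover more years, a compact
+(hypothetical) alternative is to loop over the years and stack the
+results with `pd.concat()`, which is also the recommended replacement
+for `.append()` in recent versions of `pandas`:
+
+```
+import pandas as pd
+
+# Sketch: download the UK public holidays for several years and stack them
+years = [2010, 2011]
+uk_holidays = pd.concat(
+    [pd.read_csv(f'https://date.nager.at/PublicHoliday/'
+                 f'Country/GB/{year}/CSV')
+     for year in years],
+    ignore_index=True)
+uk_holidays.shape
+```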
+
+In order to merge two DataFrames together, we need to have at least one
+common column between them, meaning the two DataFrames should have at
+least one column that contains the same type of information. In our
+example, we are going to merge this DataFrame using the `Date`
+column with the Online Retail DataFrame on the `InvoiceDate`
+column. We can see that the data format of these two columns is
+different: one is a date (`yyyy-mm-dd`) and the other is a
+datetime (`yyyy-mm-dd hh:mm:ss`).
+
+So, we need to transform the `InvoiceDate` column into date
+format (`yyyy-mm-dd`). One way to do it (we will see another
+one later in this lab) is to transform this column into text and
+then extract the first 10 characters for each cell using the
+`.str.slice()` method.
+
+For example, the date 2010-12-01 08:26:00 will first be converted into a
+string and then we will keep only the first 10 characters, which will be
+2010-12-01. We are going to save these results into a new column called
+`InvoiceDay`:
+
+```
+df['InvoiceDay'] = df['InvoiceDate'].astype(str)\
+ .str.slice(stop=10)
+df.head()
+```
+
+The output is as follows:
+
+
+
+Caption: First five rows after creating InvoiceDay
+
+Now `InvoiceDay` from the online retail DataFrame and
+`Date` from the UK public holidays DataFrame have similar
+information, so we can merge these two DataFrames together using
+`.merge()` from `pandas`.
+
+There are multiple ways to join two tables together:
+
+- The left join
+- The right join
+- The inner join
+- The outer join
+
+
+
+### The Left Join
+
+The left join will keep all the rows from the first DataFrame, which is
+the *Online Retail* dataset (the left-hand side) and join it to the
+matching rows from the second DataFrame, which is the *UK Public
+Holidays* dataset (the right-hand side), as shown in *Figure 12.04*:
+
+
+
+Caption: Venn diagram for left join
+
+To perform a left join, we need to pass the following parameters to the
+`.merge()` method:
+
+- `how = 'left'` for a left join
+- `left_on='InvoiceDay'` to specify the column used for
+  merging from the left-hand side (here, the `InvoiceDay`
+  column from the Online Retail DataFrame)
+- `right_on='Date'` to specify the column used for merging
+  from the right-hand side (here, the `Date` column from the
+  UK Public Holidays DataFrame)
+
+These parameters are clubbed together as shown in the following code
+snippet:
+
+```
+df_left = pd.merge(df, uk_holidays, left_on='InvoiceDay', \
+ right_on='Date', how='left')
+df_left.shape
+```
+You should get the following output:
+
+```
+(541909, 17)
+```
+We got the exact same number of rows as the original Online Retail
+DataFrame, which is expected for a left join. Let\'s have a look at the
+first five rows:
+
+```
+df_left.head()
+```
+You should get the following output:
+
+
+
+Caption: First five rows of the left-merged DataFrame
+
+We can see that the eight columns from the public holidays DataFrame
+have been merged to the original one. If no row has been matched from
+the second DataFrame (in this case, the public holidays one),
+`pandas` will fill all the cells with missing values
+(`NaT` or `NaN`), as shown in *Figure 12.05*.
+
+
+
+### The Right Join
+
+The right join is similar to the left join except it will keep all the
+rows from the second DataFrame (the right-hand side) and tries to match
+it with the first one (the left-hand side), as shown in *Figure 12.06*:
+
+
+
+Caption: Venn diagram for right join
+
+We just need to specify the parameters:
+
+- `how='right'` to the `.merge()`
+  method to perform this type of join.
+- We will use the exact same columns used for merging as the previous
+ example, which is `InvoiceDay` for the Online Retail
+ DataFrame and `Date` for the UK Public Holidays one.
+
+These parameters are clubbed together as shown in the following code
+snippet:
+
+```
+df_right = df.merge(uk_holidays, left_on='InvoiceDay', \
+ right_on='Date', how='right')
+df_right.shape
+```
+You should get the following output:
+
+```
+(9602, 17)
+```
+We can see there are fewer rows as a result of the right join, but the
+count doesn\'t match the number of rows in the Public Holidays DataFrame. This
+is because there are multiple rows from the Online Retail DataFrame that
+match one single date in the public holidays one.
+
+For instance, looking at the first rows of the merged DataFrame, we can
+see there were multiple purchases on January 4, 2011, so all of them
+have been matched with the corresponding public holiday. Have a look at
+the following code snippet:
+
+```
+df_right.head()
+```
+You should get the following output:
+
+
+
+Caption: First five rows of the right-merged DataFrame
+
+There are two other types of merging: inner and outer.
+
+An inner join will only keep the rows that match between the two tables:
+
+
+
+Caption: Venn diagram for inner join
+
+You just need to specify the `how = 'inner'` parameter in the
+`.merge()` method.
+
+These parameters are clubbed together as shown in the following code
+snippet:
+
+```
+df_inner = df.merge(uk_holidays, left_on='InvoiceDay', \
+ right_on='Date', how='inner')
+df_inner.shape
+```
+You should get the following output:
+
+```
+(9579, 17)
+```
+We can see there are only 9,579 observations that happened during a
+public holiday in the UK.
+
+The outer join will keep all rows from both tables (matched and
+unmatched), as shown in *Figure 12.09*:
+
+
+
+Caption: Venn diagram for outer join
+
+As you may have guessed, you just need to specify the
+`how='outer'` parameter in the `.merge()` method:
+
+```
+df_outer = df.merge(uk_holidays, left_on='InvoiceDay', \
+ right_on='Date', how='outer')
+df_outer.shape
+```
+You should get the following output:
+
+```
+(541932, 17)
+```
+Before merging two tables, it is extremely important for you to know
+what your focus is. If your objective is to expand the number of
+features from an original dataset by adding the columns from another
+one, then you will probably use a left or right join. But be aware you
+may end up with more observations due to potentially multiple matches
+between the two tables. On the other hand, if you are interested in
+knowing which observations matched or didn\'t match between the two
+tables, you will either use an inner or outer join.
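+
+A handy option for auditing matches is the `indicator=True` parameter of
+`.merge()`: it adds a `_merge` column flagging each row as
+`'both'`, `'left_only'`, or `'right_only'`. Here is a minimal sketch
+reusing the two DataFrames from above:
+
+```
+# Sketch: outer join with an indicator column to audit the matches
+df_audit = df.merge(uk_holidays, left_on='InvoiceDay', \
+                    right_on='Date', how='outer', indicator=True)
+df_audit['_merge'].value_counts()
+```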
+
+
+
+Exercise 12.01: Merging the ATO Dataset with the Postcode Data
+--------------------------------------------------------------
+
+In this exercise, we will merge the ATO dataset (28 columns) with the
+Postcode dataset (150 columns) to get a richer dataset with an increased
+number of columns.
+
+
+The following steps will help you complete the exercise:
+
+1. Open up a new Colab notebook.
+
+2. Now, begin with the `import` of the `pandas`
+ package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the ATO dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab12/Dataset/taxstats2015.csv'
+ ```
+
+
+4. Using the `.read_csv()` method from the `pandas`
+ package, load the dataset into a new DataFrame called
+ `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Display the dimensions of this DataFrame using the
+ `.shape` attribute:
+
+ ```
+ df.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (2473, 28)
+ ```
+
+
+    The ATO dataset contains `2473` rows and `28`
+ columns.
+
+6. Display the first five rows of the ATO DataFrame using the
+ `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the ATO dataset
+
+ Both DataFrames have a column called `Postcode` containing
+ postcodes, so we will use it to merge them together.
+
+ Note
+
+ Postcode is the name used in Australia for zip code. It is an
+ identifier for postal areas.
+
+ We are interested in learning more about each of these postcodes.
+ Let\'s make sure they are all unique in this dataset.
+
+7. Display the number of unique values for the `Postcode`
+ variable using the `.nunique()` method:
+
+ ```
+ df['Postcode'].nunique()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 2473
+ ```
+
+
+ There are `2473` unique values in this column and the
+ DataFrame has `2473` rows, so we are sure the
+ `Postcode` variable contains only unique values.
+
+8. Now, assign the link to the second Postcode dataset to a variable
+    called `postcode_url`:
+ ```
+ postcode_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab12/Dataset/'\
+ 'taxstats2016individual06taxablestatusstate'\
+ 'territorypostcodetaxableincome%20(2).xlsx?'\
+ 'raw=true'
+ ```
+
+
+9. Load the second Postcode dataset into a new DataFrame called
+ `postcode_df` using the `.read_excel()` method.
+
+ We will only load the *Individuals Table 6B* sheet as this is where
+ the data is located so we need to provide this name to the
+ `sheet_name` parameter. Also, the header row (containing
+ the name of the variables) in this spreadsheet is located on the
+ third row so we need to specify it to the header parameter.
+
+ Note
+
+ Don\'t forget the `index` starts with 0 in Python.
+
+ Have a look at the following code snippet:
+
+ ```
+ postcode_df = pd.read_excel(postcode_url, \
+ sheet_name='Individuals Table 6B', \
+ header=2)
+ ```
+
+
+10. Print the dimensions of `postcode_df` using the
+ `.shape` attribute:
+
+ ```
+ postcode_df.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (2567, 150)
+ ```
+
+
+    This DataFrame contains `2567` rows and `150`
+    columns. By merging it with the ATO dataset, we will get additional
+ information for each postcode.
+
+11. Print the first five rows of `postcode_df` using the
+ `.head()` method:
+
+ ```
+ postcode_df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the Postcode dataset
+
+ We can see that the second column contains the postcode value, and
+ this is the one we will use to merge on with the ATO dataset. Let\'s
+ check if they are unique.
+
+12. Print the number of unique values in this column using the
+ `.nunique()` method as shown in the following code
+ snippet:
+
+ ```
+ postcode_df['Postcode'].nunique()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 2567
+ ```
+
+
+ There are `2567` unique values, and this corresponds
+ exactly to the number of rows of this DataFrame, so we\'re
+ absolutely sure this column contains unique values. This also means
+ that after merging the two tables, there will be only one-to-one
+ matches. We won\'t have a case where we get multiple rows from one
+ of the datasets matching with only one row of the other one. For
+ instance, postcode `2029` from the ATO DataFrame will have
+ exactly one match in the second Postcode DataFrame.
+
+13. Perform a left join on the two DataFrames using the
+ `.merge()` method and save the results into a new
+ DataFrame called `merged_df`. Specify the
+ `how='left'` and `on='Postcode'` parameters:
+ ```
+ merged_df = pd.merge(df, postcode_df, \
+ how='left', on='Postcode')
+ ```
+
+
+14. Print the dimensions of the new merged DataFrame using the
+ `.shape` attribute:
+
+ ```
+ merged_df.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (2473, 177)
+ ```
+
+
+ We got exactly `2473` rows after merging, which is what we
+ expect as we used a left join and there was a one-to-one match on
+ the `Postcode` column from both original DataFrames. Also,
+ we now have `177` columns, which is the objective of this
+ exercise. But before concluding it, we want to see whether there are
+ any postcodes that didn\'t match between the two datasets. To do so,
+ we will be looking at one column from the right-hand side DataFrame
+ (the Postcode dataset) and see if there are any missing values.
+
+15. Print the total number of missing values from the
+    `'State/ Territory1'` column by combining the
+ `.isna()` and `.sum()` methods:
+
+ ```
+ merged_df['State/ Territory1'].isna().sum()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 4
+ ```
+
+
+    There are four postcodes from the ATO dataset that didn\'t match any
+    row in the Postcode dataset.
+
+ Let\'s see which ones they are.
+
+16. Print the missing postcodes using the `.loc()` method, as
+ shown in the following code snippet:
+
+ ```
+ merged_df.loc[merged_df['State/ Territory1'].isna(), \
+ 'Postcode']
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: List of unmatched postcodes
+
+The missing postcodes from the Postcode dataset are `3010`,
+`4462`, `6068`, and `6758`. In a real
+project, you would have to get in touch with your stakeholders or the
+data team to see if you are able to get this data.
+
+We have successfully merged the two datasets of interest and have
+expanded the number of features from `28` to `177`.
+We now have a much richer dataset and will be able to perform a more
+detailed analysis of it.
+
+
+In the next topic, you will be introduced to the binning variables.
+
+
+
+Binning Variables
+-----------------
+
+As mentioned earlier, feature engineering is not only about getting
+information not present in a dataset. Quite often, you will have to
+create new features from existing ones. One example of this is
+consolidating values from an existing column to a new list of values.
+
+For instance, you may have a very high number of unique values for some
+of the categorical columns in your dataset, let\'s say over 1,000 values
+for each variable. This is actually quite a lot of information that will
+require extra computation power for an algorithm to process and learn
+the patterns from. This can have a significant impact on the project
+cost if you are using cloud computing services or on the delivery time
+of the project.
+
+One possible solution is to not use these columns and drop them, but in
+that case, you may lose some very important and critical information for
+the business. Another solution is to create a more consolidated version
+of these columns by reducing the number of unique values to a smaller
+number, let\'s say 100. This would drastically speed up the training
+process for the algorithm without losing too much information. This kind
+of transformation is called binning and, traditionally, it refers to
+numerical variables, but the same logic can be applied to categorical
+variables as well.
+
+Let\'s see how we can achieve this on the Online Retail dataset. First,
+we need to load the data:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab12/Dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+```
+
+In *Lab 10*, *Analyzing a Dataset* we learned that the
+`Country` column contains `38` different unique
+values:
+
+```
+df['Country'].unique()
+```
+You should get the following output:
+
+
+
+Caption: List of unique values for the Country column
+
+We are going to group some of the countries together into regions such
+as Asia, the Middle East, and America. We will leave the European
+countries as is.
+
+First, let\'s create a new column called `Country_bin` by
+copying the `Country` column:
+
+```
+df['Country_bin'] = df['Country']
+```
+
+Then, we are going to create a list called `asian_countries`
+containing the name of Asian countries from the list of unique values
+for the `Country` column:
+
+```
+asian_countries = ['Japan', 'Hong Kong', 'Singapore']
+```
+And finally, using the `.loc()` and `.isin()`
+methods from `pandas`, we are going to change the value of
+`Country_bin` to `Asia` for all of the countries
+that are present in the `asian_countries` list:
+
+```
+df.loc[df['Country'].isin(asian_countries), \
+ 'Country_bin'] = 'Asia'
+```
+Now, if we print the list of unique values for this new column, we will
+see the three Asian countries (`Japan`, `Hong Kong`,
+and `Singapore`) have been replaced by the value
+`Asia`:
+
+```
+df['Country_bin'].unique()
+```
+You should get the following output:
+
+
+
+Caption: List of unique values for the Country\_bin column after
+binning Asian countries
+
+Let\'s perform the same process for Middle Eastern countries:
+
+```
+m_east_countries = ['Israel', 'Bahrain', 'Lebanon', \
+ 'United Arab Emirates', 'Saudi Arabia']
+df.loc[df['Country'].isin(m_east_countries), \
+ 'Country_bin'] = 'Middle East'
+df['Country_bin'].unique()
+```
+You should get the following output:
+
+
+
+
+
+Finally, let\'s group all countries from North and South America
+together:
+
+```
+american_countries = ['Canada', 'Brazil', 'USA']
+df.loc[df['Country'].isin(american_countries), \
+ 'Country_bin'] = 'America'
+df['Country_bin'].unique()
+```
+You should get the following output:
+
+
+
+Caption: List of unique values for the Country\_bin column after
+binning countries from North and South America
+
+```
+df['Country_bin'].nunique()
+```
+You should get the following output:
+
+```
+30
+```
+`30` is the number of unique values for the
+`Country_bin` column. So we reduced the number of unique
+values in this column from `38` to `30`.
+
+We just saw how to group categorical values together, but the same
+process can be applied to numerical values as well. For instance, it is
+quite common to group people\'s ages into bins such as 20s (20 to 29
+years old), 30s (30 to 39), and so on.
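+
+As a quick illustration, here is a minimal sketch of binning a
+hypothetical `Age` column into decades with the `pd.cut()` method (the
+data, bin edges, and labels are invented for this example; you will use
+`pd.cut()` again in the next exercise):
+
+```
+import pandas as pd
+
+# Sketch: bin a hypothetical Age column into decades
+ages = pd.DataFrame({'Age': [23, 35, 31, 47, 52, 68]})
+ages['AgeGroup'] = pd.cut(ages['Age'], \
+                          bins=[20, 30, 40, 50, 60, 70], \
+                          labels=['20s', '30s', '40s', '50s', '60s'])
+ages
+```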
+
+Have a look at *Exercise 12.02*, *Binning the YearBuilt variable from
+the AMES Housing dataset*.
+
+
+
+Exercise 12.02: Binning the YearBuilt Variable from the AMES Housing Dataset
+----------------------------------------------------------------------------
+
+In this exercise, we will create a new feature by binning an existing
+numerical column in order to reduce the number of unique values from
+`112` to `15`.
+
+Note
+
+The dataset we will be using in this exercise is the Ames Housing
+dataset.
+This dataset contains the list of residential home sales in the city of
+Ames, Iowa between 2010 and 2016.
+
+
+1. Open up a new Colab notebook.
+
+2. Import the `pandas` and `altair` packages:
+ ```
+ import pandas as pd
+ import altair as alt
+ ```
+
+
+3. Assign the link to the dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab12/Dataset/ames_iowa_housing.csv'
+ ```
+
+
+4. Using the `.read_csv()` method from the `pandas`
+ package, load the dataset into a new DataFrame called
+ `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Display the first five rows using the `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the AMES housing DataFrame
+
+6. Display the number of unique values on the column using
+ `.nunique()`:
+
+ ```
+ df['YearBuilt'].nunique()
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 112
+ ```
+
+
+ There are `112` different or unique values in the
+ `YearBuilt` column:
+
+7. Print a scatter plot using `altair` to visualize the
+ number of records built per year. Specify `YearBuilt:O` as
+ the x-axis and `count()` as the y-axis in the
+ `.encode()` method:
+
+ ```
+ alt.Chart(df).mark_circle().encode(alt.X('YearBuilt:O'),\
+ y='count()')
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+    Caption: Number of records for each YearBuilt value
+
+ Note
+
+ The output is not shown on GitHub due to its limitations. If you run
+ this on your Colab file, the graph will be displayed.
+
+    There weren\'t many records for some of the years. So, you
+    can group the years into decades (groups of 10 years).
+
+8. Create a list called `year_built` containing all the
+    unique values in the `YearBuilt` column:
+ ```
+ year_built = df['YearBuilt'].unique()
+ ```
+
+
+9. Create another list that will compute the decade for each year in
+ `year_built`. Use list comprehension to loop through each
+ year and apply the following formula:
+ `year - (year % 10)`.
+
+    For example, this formula applied to the year 2015 gives 2015 -
+    (2015 % 10), which is 2015 - 5 = 2010.
+
+ Note
+
+ \% corresponds to the modulo operator and will return the last digit
+ of each year.
+
+ Have a look at the following code snippet:
+
+ ```
+ decade_list = [year - (year % 10) for year in year_built]
+ ```
+
+
+10. Create a sorted list of unique values from `decade_list`
+ and save the result into a new variable called
+ `decade_built`. To do so, transform
+ `decade_list` into a set (this will exclude all
+ duplicates) and then use the `sorted()` function as shown
+ in the following code snippet:
+ ```
+ decade_built = sorted(set(decade_list))
+ ```
+
+
+11. Print the values of `decade_built`:
+
+ ```
+ decade_built
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: List of decades
+
+ Now we have the list of decades we are going to bin the
+ `YearBuilt` column with.
+
+12. Create a new column on the `df` DataFrame called
+ `DecadeBuilt` that will bin each value from
+ `YearBuilt` into a decade. You will use the
+ `.cut()` method from `pandas` and specify the
+ `bins=decade_built` parameter:
+ ```
+ df['DecadeBuilt'] = pd.cut(df['YearBuilt'], \
+ bins=decade_built)
+ ```
+
+
+13. Print the first five rows of the DataFrame but only for the
+ `'YearBuilt'` and `'DecadeBuilt'` columns:
+
+ ```
+ df[['YearBuilt', 'DecadeBuilt']].head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+Manipulating Dates
+------------------
+
+
+In *Lab 10*, *Analyzing a Dataset* you were introduced to the
+concept of data types in `pandas`. At that time, we mainly
+focused on numerical variables and categorical ones but there is another
+important one: `datetime`. Let\'s have a look again at the
+type of each column from the Online Retail dataset:
+
+```
+import pandas as pd
+file_url = 'https://github.com/fenago/'\
+ 'data-science/blob/'\
+ 'master/Lab12/Dataset/'\
+ 'Online%20Retail.xlsx?raw=true'
+df = pd.read_excel(file_url)
+df.dtypes
+```
+You should get the following output:
+
+
+
+Caption: Data types for the variables in the Online Retail dataset
+
+We can see that `pandas` automatically detected that
+`InvoiceDate` is of type `datetime`. But for some
+other datasets, it may not recognize dates properly. In this case, you
+will have to manually convert them using the `.to_datetime()`
+method:
+
+```
+df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
+```
+Once the column is converted to `datetime`, pandas provides a
+lot of attributes and methods for extracting time-related information.
+For instance, if you want to get the year of a date, you use the
+`.dt.year` attribute:
+
+```
+df['InvoiceDate'].dt.year
+```
+You should get the following output:
+
+
+
+Caption: Extracted year for each row for the InvoiceDate column
+
+As you may have guessed, there are attributes for extracting the month
+and day of a date: `.dt.month` and `.dt.day`
+respectively. You can get the day of the week from a date using the
+`.dt.dayofweek` attribute:
+
+```
+df['InvoiceDate'].dt.dayofweek
+```
+You should get the following output.
+
+
+
+Caption: Extracted day of the week for each row for the InvoiceDate column
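+
+Similarly, a quick sketch extracting the month and the day of the month
+from the same column (assuming `df` is still loaded) looks like this:
+
+```
+# Sketch: extract the month and the day of the month from InvoiceDate
+df['InvoiceDate'].dt.month
+df['InvoiceDate'].dt.day
+```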
+
+
+With datetime columns, you can also perform some mathematical
+operations. We can, for instance, add `3` days to each date by
+using pandas time-series offset object,
+`pd.tseries.offsets.Day(3)`:
+
+```
+df['InvoiceDate'] + pd.tseries.offsets.Day(3)
+```
+You should get the following output:
+
+
+
+Caption: InvoiceDate column offset by three days
+
+You can also offset days by business days using
+`pd.tseries.offsets.BusinessDay()`. For instance, if we want
+to get the previous business day for each date, we do:
+
+```
+df['InvoiceDate'] + pd.tseries.offsets.BusinessDay(-1)
+```
+You should get the following output:
+
+
+
+Caption: InvoiceDate column offset by -1 business day
+
+Another interesting date manipulation operation is to snap dates to the
+start of a given period. For instance, if you want to get the first day
+of the month for each date, you can convert the column to a monthly
+period and then back to a timestamp:
+
+```
+df['InvoiceDate'].dt.to_period('M').dt.to_timestamp()
+```
+You should get the following output:
+
+
+
+Caption: InvoiceDate column transformed to the start of the month
+
+As you have seen in this section, the `pandas` package
+provides a lot of different APIs for manipulating dates. You have
+learned how to use a few of the most popular ones. You can now explore
+the other ones on your own.
+
+
+
+Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints
+---------------------------------------------------------------------------
+
+In this exercise, we will learn how to extract time-related information
+from two existing date columns using `pandas` in order to
+create six new columns:
+
+Note
+
+The dataset we will be using in this exercise is the Financial Services
+Customer Complaints dataset
+
+
+1. Open up a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Assign the link to the dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab12/Dataset/Consumer_Complaints.csv'
+ ```
+
+
+4. Use the `.read_csv()` method from the `pandas`
+ package and load the dataset into a new DataFrame called
+ `df`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Display the first five rows using the `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the Customer Complaint DataFrame
+
+6. Print out the data types for each column using
+    the `.dtypes` attribute:
+
+ ```
+ df.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Data types for the Customer Complaint DataFrame
+
+ The `Date received` and `Date sent to company`
+ columns haven\'t been recognized as datetime, so we need to manually
+ convert them.
+
+7. Convert the `Date received` and
+ `Date sent to company` columns to datetime using the
+ `pd.to_datetime()` method:
+ ```
+ df['Date received'] = pd.to_datetime(df['Date received'])
+ df['Date sent to company'] = pd.to_datetime\
+ (df['Date sent to company'])
+ ```
+
+
+8. Print out the data types for each column using the
+ `.dtypes` attribute:
+
+ ```
+ df.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Data types for the Customer Complaint DataFrame after
+ conversion
+
+    Now these two columns have the right data types. Let\'s create
+ some new features from these two dates.
+
+9. Create a new column called `YearReceived`, which will
+    contain the year of each date from the `Date received`
+ column using the `.dt.year` attribute:
+ ```
+ df['YearReceived'] = df['Date received'].dt.year
+ ```
+
+
+10. Create a new column called `MonthReceived`, which will
+ contain the month of each date using the `.dt.month`
+ attribute:
+ ```
+ df['MonthReceived'] = df['Date received'].dt.month
+ ```
+
+
+11. Create a new column called `DayReceived`, which will
+ contain the day of the month for each date using the
+ `.dt.day` attribute:
+ ```
+    df['DayReceived'] = df['Date received'].dt.day
+ ```
+
+
+12. Create a new column called `DowReceived`, which will
+ contain the day of the week for each date using the
+ `.dt.dayofweek` attribute:
+ ```
+ df['DowReceived'] = df['Date received'].dt.dayofweek
+ ```
+
+
+13. Display the first five rows using the `.head()` method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the Customer Complaint DataFrame
+ after creating four new features
+
+ We can see we have successfully created four new features:
+ `YearReceived`, `MonthReceived`,
+ `DayReceived`, and `DowReceived`. Now let\'s
+ create another that will indicate whether the date was during a
+ weekend or not.
+
+14. Create a new column called `IsWeekendReceived`, which will
+ contain binary values indicating whether the `DowReceived`
+ column is over or equal to `5` (`0` corresponds
+ to Monday, `5` and `6` correspond to Saturday
+ and Sunday respectively):
+ ```
+ df['IsWeekendReceived'] = df['DowReceived'] >= 5
+ ```
+
+
+15. Display the first `5` rows using the `.head()`
+ method:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the Customer Complaint DataFrame
+ after creating the weekend feature
+
+ We have created a new feature stating whether each complaint was
+ received during a weekend or not. Now we will feature engineer a new
+    column with the number of days between
+ `Date sent to company` and `Date received`.
+
+16. Create a new column called `RoutingDays`, which will
+ contain the difference between `Date sent to company` and
+ `Date received`:
+ ```
+ df['RoutingDays'] = df['Date sent to company'] \
+ - df['Date received']
+ ```
+
+
+17. Print out the data type of the new `'RoutingDays'` column
+ using the `.dtype` attribute:
+
+ ```
+ df['RoutingDays'].dtype
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Data type of the RoutingDays column
+
+    The result of subtracting two datetime columns is a new timedelta
+    column (`dtype('timedelta64[ns]')`).
+
+    ```
+    bankData['balanceClass'] = 'Quant1'
+    bankData.loc[(bankData['balance'] > 72) \
+ & (bankData['balance'] < 448), \
+ 'balanceClass'] = 'Quant2'
+ bankData.loc[(bankData['balance'] > 448) \
+ & (bankData['balance'] < 1428), \
+ 'balanceClass'] = 'Quant3'
+ bankData.loc[bankData['balance'] > 1428, \
+ 'balanceClass'] = 'Quant4'
+ bankData.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: New features from bank balance data
+
+    We did this by looking at the quantile thresholds we took in
+    *Step 4* and categorizing the numerical data into the corresponding
+    quantile class. For example, all values lower than the
+    25th percentile value, 72, were classified as
+    `Quant1`, values between 72 and 448 were classified as
+    `Quant2`, and so on. To store the quantile categories, we
+    created a new feature in the bank dataset called
+    `balanceClass` and set its default value to
+    `Quant1`. After this, based on each value threshold, the
+    data points were classified into the respective quantile class.
+
+9. Next, we need to find the propensity of term deposit purchases based
+ on each quantile the customers fall into. This task is similar to
+ what we did in *Exercise 3.02*, *Business Hypothesis Testing for Age
+ versus Propensity for a Term Loan*:
+
+ ```
+ # Calculating the customers under each quantile
+ balanceTot = bankData.groupby(['balanceClass'])['y']\
+ .agg(balanceTot='count').reset_index()
+ balanceTot
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Classification based on quantiles
+
+10. Calculate the total number of customers categorized by quantile and
+ propensity classification, as mentioned in the following code
+ snippet:
+
+ ```
+ """
+ Calculating the total customers categorised as per quantile
+ and propensity classification
+ """
+ balanceProp = bankData.groupby(['balanceClass', 'y'])['y']\
+ .agg(balanceCat='count').reset_index()
+ balanceProp
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Total number of customers categorized by quantile and
+ propensity classification
+
+11. Now, `merge` both DataFrames:
+
+ ```
+ # Merging both the data frames
+ balanceComb = pd.merge(balanceProp, balanceTot, \
+ on = ['balanceClass'])
+ balanceComb['catProp'] = (balanceComb.balanceCat \
+ / balanceComb.balanceTot)*100
+ balanceComb
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Propensity versus balance category
+
+
+
+In the next exercise, we will use these intuitions to derive a new
+feature.
+
+
+
+Exercise 3.04: Feature Engineering -- Creating New Features from Existing Ones
+------------------------------------------------------------------------------
+
+In this exercise, we will combine the individual variables we analyzed
+in *Exercise 3.03*, *Feature Engineering -- Exploration of Individual
+Features* to derive a new feature called an asset index. One methodology
+to create an asset index is by assigning weights based on the asset or
+liability of the customer.
+
+For instance, a higher bank balance or home ownership will have a
+positive bearing on the overall asset index and, therefore, will be
+assigned a higher weight. In contrast, the presence of a loan is a
+liability and will therefore be assigned a lower weight. Let\'s give a
+weight of 5 if the customer owns a house and 1 in its absence.
+Similarly, we will give a weight of 1 if the customer has a loan and 5
+in the case of no loans:
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` and `numpy` packages:
+ ```
+ import pandas as pd
+ import numpy as np
+ ```
+
+
+3. Assign the link to the dataset to a variable called `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab03/bank-full.csv'
+ ```
+
+
+4. Read the banking dataset using the `.read_csv()` function:
+ ```
+ # Reading the banking data
+ bankData = pd.read_csv(file_url,sep=";")
+ ```
+
+
+5. The first step we will follow is to normalize the numerical
+ variables. This is implemented using the following code snippet:
+ ```
+ # Normalizing data
+ from sklearn import preprocessing
+ x = bankData[['balance']].values.astype(float)
+ ```
+
+
+6. As the bank balance dataset contains numerical values, we need to
+ first normalize the data. The purpose of normalization is to bring
+ all of the variables that we are using to create the new feature
+ into a common scale. One effective method we can use here is
+ `MinMaxScaler()`, which scales all of the numerical data to a range
+ between 0 and 1. The `MinMaxScaler` class is available in the
+ `preprocessing` module of `sklearn`:
+ ```
+ minmaxScaler = preprocessing.MinMaxScaler()
+ ```
+
+
+7. Transform the balance data by normalizing it with
+ `minmaxScaler`:
+
+ ```
+ bankData['balanceTran'] = minmaxScaler.fit_transform(x)
+ ```
+
+
+ In this step, we created a new feature called
+ `'balanceTran'` to store the normalized bank balance
+ values.
+
+8. Print the head of the data using the `.head()` function:
+
+ ```
+ bankData.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Normalizing the bank balance data
+
+9. After creating the normalized variable, add a small constant of
+ `0.00001` so as to eliminate the 0 values in the variable.
+ This is shown in the following code snippet:
+
+ ```
+ # Adding a small numerical constant to eliminate 0 values
+ bankData['balanceTran'] = bankData['balanceTran'] + 0.00001
+ ```
+
+
+ We add this small value because, in the subsequent steps, we will be
+ multiplying three transformed variables together to form a composite
+ index. The constant prevents the product from becoming 0 whenever
+ the normalized balance value is 0.
+
+10. Now, add two additional columns for introducing the transformed
+ variables for loans and housing, as per the weighting approach
+ discussed at the start of this exercise:
+
+ ```
+ # Let us transform values for loan data
+ bankData['loanTran'] = 1
+ # Giving a weight of 5 if there is no loan
+ bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5
+ bankData.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Additional columns with the transformed variables
+
+ We transformed values for the loan data as per the weighting
+ approach. When a customer has a loan, it is given a weight of
+ `1`, and when there\'s no loan, the weight assigned is
+ `5`. The values of `1` and `5` are intuitive weights we are
+ assigning; the exact values can vary based on the business context
+ you are provided with.
+
+11. Now, transform the values for the housing data, as mentioned
+ here:
+ ```
+ # Let us transform values for Housing data
+ bankData['houseTran'] = 5
+ ```
+
+
+12. Give a weight of `1` if the customer has a house and print
+ the results, as mentioned in the following code snippet:
+
+ ```
+ bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1
+ print(bankData.head())
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Transforming loan and housing data
+
+ Once all the transformed variables are created, we can multiply all
+ of the transformed variables together to create a new index called
+ `assetIndex`. This is a composite index that represents
+ the combined effect of all three variables.
+
+13. Now, create a new variable, which is the product of all of the
+ transformed variables:
+
+ ```
+ """
+ Let us now create the new variable which is a product of all
+ these
+ """
+ bankData['assetIndex'] = bankData['balanceTran'] \
+ * bankData['loanTran'] \
+ * bankData['houseTran']
+ bankData.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Creating a composite index
+
+14. Explore the propensity with respect to the composite index.
+
+ We observe the relationship between the asset index and the
+ propensity of term deposit purchases. We adopt a similar strategy of
+ converting the numerical values of the asset index into ordinal
+ values by taking the quantiles and then mapping the quantiles to the
+ propensity of term deposit purchases, as mentioned in *Exercise
+ 3.03*, *Feature Engineering -- Exploration of Individual Features*:
+
+ ```
+ # Finding the quantile
+ np.quantile(bankData['assetIndex'],[0.25,0.5,0.75])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Conversion of numerical values into ordinal values
+
+15. Next, create quantiles from the `assetindex` data, as
+ mentioned in the following code snippet:
+
+ ```
+ bankData['assetClass'] = 'Quant1'
+ bankData.loc[(bankData['assetIndex'] > 0.38) \
+ & (bankData['assetIndex'] < 0.57), \
+ 'assetClass'] = 'Quant2'
+ bankData.loc[(bankData['assetIndex'] > 0.57) \
+ & (bankData['assetIndex'] < 1.9), \
+ 'assetClass'] = 'Quant3'
+ bankData.loc[bankData['assetIndex'] > 1.9, \
+ 'assetClass'] = 'Quant4'
+ bankData.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Quantiles for the asset index
+
+16. Calculate the total of each asset class and the category-wise
+ counts, as mentioned in the following code snippet:
+ ```
+ # Calculating total of each asset class
+ assetTot = bankData.groupby('assetClass')['y']\
+ .agg(assetTot='count').reset_index()
+ # Calculating the category wise counts
+ assetProp = bankData.groupby(['assetClass', 'y'])['y']\
+ .agg(assetCat='count').reset_index()
+ ```
+
+
+17. Next, merge both DataFrames:
+
+ ```
+ # Merging both the data frames
+ assetComb = pd.merge(assetProp, assetTot, on = ['assetClass'])
+ assetComb['catProp'] = (assetComb.assetCat \
+ / assetComb.assetTot)*100
+ assetComb
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Composite index relationship mapping
+
+
+
+A Quick Peek at Data Types and a Descriptive Summary
+----------------------------------------------------
+
+Looking at the data types such as categorical or numeric and then
+deriving summary statistics is a good way to take a quick peek into data
+before you do some of the downstream feature engineering steps. Let\'s
+take a look at an example from our dataset:
+
+```
+# Looking at Data types
+print(bankData.dtypes)
+# Looking at descriptive statistics
+print(bankData.describe())
+```
+You should get the following output:
+
+
+
+Caption: Output showing the different data types in the dataset
+
+In the preceding output, you can see the different types of information
+in the dataset and their corresponding data types. For instance,
+`age` is an integer and so is `day`.
+
+The following output is that of a descriptive summary statistic, which
+displays some of the basic measures such as `mean`,
+`standard deviation`, `count`, and the
+`quantile values` of the respective features:
+
+
+
+Caption: Data types and a descriptive summary
+
+The purpose of a descriptive summary is to get a quick feel of the data
+with respect to the distribution and some basic statistics such as mean
+and standard deviation. Getting a perspective on the summary statistics
+is critical for thinking about what kind of transformations are required
+for each variable.
+
+For instance, in the earlier exercises, we converted the numerical data
+into categorical variables based on the quantile values. Intuitions for
+transforming variables would come from the quick summary statistics that
+we can derive from the dataset.
+
+In the following sections, we will be looking at the correlation matrix
+and visualization.
+
+
+Correlation Matrix and Visualization
+====================================
+
+
+Correlation, as you know, is a measure that indicates how two variables
+fluctuate together. A correlation value near 1 or -1 indicates that two
+variables are highly correlated. Highly correlated variables
+can sometimes be damaging for the veracity of models and, in many
+circumstances, we make the decision to eliminate such variables or to
+combine them to form composite or interactive variables.
+
+Let\'s look at how data correlation can be generated and then visualized
+in the following exercise.
+
+
+
+Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
+---------------------------------------------------------------------------------------------
+
+In this exercise, we will be creating a correlation plot and analyzing
+the results of the bank dataset.
+
+The following steps will help you to complete the exercise:
+
+1. Open a new Colab notebook, install the `pandas` packages
+ and load the banking data:
+ ```
+ import pandas as pd
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab03/bank-full.csv'
+ bankData = pd.read_csv(file_url, sep=";")
+ ```
+
+
+2. Now, import the `set_option` function from
+ `pandas`, as mentioned here:
+
+ ```
+ from pandas import set_option
+ ```
+
+
+ The `set_option` function is used to define the display
+ options for many operations.
+
+3. Next, create a variable to store the numerical variables
+ `'age','balance','day','duration','campaign','pdays','previous'`, as
+ mentioned in the following code snippet. A correlation plot can be
+ generated only from numerical data, which is why the numerical
+ columns have to be extracted separately:
+ ```
+ bankNumeric = bankData[['age','balance','day','duration',\
+ 'campaign','pdays','previous']]
+ ```
+
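+ If you prefer not to list the columns by hand, `pandas` can also
+ select every numeric column automatically; for this dataset, the
+ following one-liner should produce the same seven columns:
+
+ ```
+ # Select all numeric columns without listing them explicitly
+ bankNumeric = bankData.select_dtypes(include='number')
+ ```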
+
+4. Now, use the `.corr()` function to find the correlation
+ matrix for the dataset:
+
+ ```
+ set_option('display.width',150)
+ set_option('display.precision',3)
+ bankCorr = bankNumeric.corr(method = 'pearson')
+ bankCorr
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Correlation matrix
+
+ The method we use for correlation here is the **Pearson** correlation
+ coefficient. We can see from the correlation matrix that the
+ diagonal elements have a correlation of 1. This is because each
+ diagonal element is the correlation of a variable with itself, which
+ is always 1.
+
+5. Now, plot the data:
+
+ ```
+ from matplotlib import pyplot
+ corFig = pyplot.figure()
+ figAxis = corFig.add_subplot(111)
+ corAx = figAxis.matshow(bankCorr,vmin=-1,vmax=1)
+ corFig.colorbar(corAx)
+ pyplot.show()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Correlation plot
+
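+As a quick, programmatic complement to the plot, you can also list the
+variable pairs whose correlation exceeds a chosen threshold. The sketch
+below reuses the `bankCorr` matrix from the exercise; the threshold of
+0.3 is an arbitrary illustrative value, not a recommendation:
+
+```
+import numpy as np
+# Keep only the upper triangle (each pair appears once, diagonal excluded)
+mask = np.triu(np.ones(bankCorr.shape, dtype=bool), k=1)
+pairs = bankCorr.where(mask).stack()
+# Show the pairs with the strongest absolute correlation
+print(pairs[pairs.abs() > 0.3].sort_values(ascending=False))
+```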
+
+Skewness of Data
+----------------
+
+Another area for feature engineering is skewness. Skewed data means data
+that is shifted in one direction or the other. Skewness can cause
+machine learning models to underperform. Many machine learning models
+assume normally distributed data or data structures to follow the
+Gaussian structure. Any deviation from the assumed Gaussian structure,
+which is the popular bell curve, can affect model performance. A very
+effective area where we can apply feature engineering is by looking at
+the skewness of data and then correcting the skewness through
+normalization of the data. Skewness can be visualized by plotting the
+data using histograms and density plots. We will investigate each of
+these techniques.
+
+Let\'s take a look at the following example. Here, we use the
+`.skew()` function to find the skewness in data. For instance,
+to find the skewness of data in our `bank-full.csv` dataset,
+we perform the following:
+
+```
+# Skewness of numeric attributes
+bankNumeric.skew()
+```
+Note
+
+This code refers to the `bankNumeric` data, so you should
+ensure you are working in the same notebook as the previous exercise.
+
+You should get the following output:
+
+
+
+Caption: Degree of skewness
+
+The preceding output shows the skewness value for each of the numeric
+variables. Any value close to 0 indicates a low degree of skewness.
+Positive values indicate right skew and negative values indicate left
+skew. Variables that show a high degree of right or left skew are
+candidates for further feature engineering by normalization. Let\'s now
+visualize the skewness by plotting histograms and density plots.
+
+
+
+Histograms
+----------
+
+Histograms are an effective way to plot the distribution of data and to
+identify skewness in data, if any. The histogram outputs of two columns
+of `bankData` are listed here. The histogram is plotted with
+the `pyplot` package from `matplotlib` using the
+`.hist()` function. The number of subplots we want to include
+is controlled by the `.subplots()` function. `(1,2)`
+in subplots would mean one row and two columns. The titles are set by
+the `set_title()` function:
+
+```
+# Histograms
+from matplotlib import pyplot as plt
+fig, axs = plt.subplots(1,2)
+axs[0].hist(bankNumeric['age'])
+axs[0].set_title('Distribution of age')
+axs[1].hist(bankNumeric['balance'])
+axs[1].set_title('Distribution of Balance')
+# Ensure plots do not overlap
+plt.tight_layout()
+```
+You should get the following output:
+
+
+
+Caption: Code showing the generation of histograms
+
+From the histograms, we can see that the `age` variable has a
+distribution closer to the bell curve, with a lower degree of skewness.
+In contrast, the `balance` variable shows a relatively higher right
+skew, which makes it a more probable candidate for normalization.
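+One common way to reduce such right skew is to apply a log
+transformation. Here is a minimal sketch of that idea: it shifts
+`balance` so that all values are non-negative (the column contains
+negative balances) and then applies `numpy.log1p`, comparing the
+skewness before and after:
+
+```
+import numpy as np
+# Shift so the minimum becomes 0, because balance has negative values
+shifted = bankNumeric['balance'] - bankNumeric['balance'].min()
+balance_log = np.log1p(shifted)
+print('Skewness before:', bankNumeric['balance'].skew())
+print('Skewness after :', balance_log.skew())
+```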
+
+
+
+Density Plots
+-------------
+
+Density plots help in visualizing the distribution of data. A density
+plot can be created using the `kind = 'density'` parameter:
+
+```
+from matplotlib import pyplot as plt
+# Density plots
+bankNumeric['age'].plot(kind = 'density', subplots = False, \
+ layout = (1,1))
+plt.title('Age Distribution')
+plt.xlabel('Age')
+plt.ylabel('Normalised age distribution')
+plt.show()
+```
+You should get the following output:
+
+
+
+Caption: Code showing the generation of a density plot
+
+Density plots help in getting a smoother visualization of the
+distribution of the data. From the density plot of Age, we can see that
+it has a distribution similar to a bell curve.
+
+
+
+Other Feature Engineering Methods
+---------------------------------
+
+So far, we were looking at various descriptive statistics and
+visualizations that are precursors for applying many feature engineering
+techniques on data structures. We investigated one such feature
+engineering technique in *Exercise 3.02*, *Business Hypothesis Testing
+for Age versus Propensity for a Term Loan* where we applied the **min
+max** scaler for normalizing data.
+
+We will now look into two other similar data transformation techniques,
+namely, standard scaler and normalizer. Standard scaler standardizes
+data to a mean of 0 and standard deviation of 1. The mean is the average
+of the data and the standard deviation is a measure of the spread of
+data. By standardizing to the same mean and standard deviation,
+comparison across different distributions of data is enabled.
+
+The normalizer function normalizes the length of each row: every value
+in a row is divided by the norm of the row vector, so that each row
+ends up with a length of 1. The normalizer function is applied to the
+rows, while the standard scaler is applied column-wise. The normalizer
+and standard scaler functions are important feature engineering steps
+that are applied to the data before downstream modeling steps. Let\'s
+look at both of these techniques:
+
+```
+# Standardize data (0 mean, 1 stdev)
+from sklearn.preprocessing import StandardScaler
+from numpy import set_printoptions
+scaling = StandardScaler().fit(bankNumeric)
+rescaledNum = scaling.transform(bankNumeric)
+set_printoptions(precision = 3)
+print(rescaledNum)
+```
+You should get the following output:
+
+
+
+Caption: Output from standardizing the data
+
+The following code uses the normalizer data transformation technique:
+
+```
+# Normalizing Data (Length of 1)
+from sklearn.preprocessing import Normalizer
+normaliser = Normalizer().fit(bankNumeric)
+normalisedNum = normaliser.transform(bankNumeric)
+set_printoptions(precision = 3)
+print(normalisedNum)
+```
+You should get the following output:
+
+
+
+Caption: Output from the normalizer
+
+The output from the standard scaler is normalized along the columns.
+The output has seven columns, corresponding to the seven numeric
+columns (age, balance, day, duration, and so on). If we observe the
+output, we can see that each value along a column is normalized so as
+to have a mean of 0 and a standard deviation of 1. By transforming data
+in this way, we can easily compare across columns.
+
+For instance, in the `age` variable, we have data ranging from
+18 up to 95. In contrast, for the balance data, we have data ranging
+from -8,019 to 102,127. We can see that both of these variables have
+different ranges of data that cannot be compared. The standard scaler
+function converts these data points at very different scales into a
+common scale so as to compare the distribution of data. Normalizer
+rescales each row so as to have a vector with a length of 1.
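+You can verify both properties directly. In the following quick check
+(which assumes the `rescaledNum` and `normalisedNum` arrays from the
+snippets above are still in memory), every column of the standardized
+data should have a mean of roughly 0 and a standard deviation of
+roughly 1, and every row of the normalized data should have a Euclidean
+length of roughly 1:
+
+```
+import numpy as np
+import pandas as pd
+# Column-wise check for the standard scaler output
+scaledDf = pd.DataFrame(rescaledNum, columns=bankNumeric.columns)
+print(scaledDf.mean().round(3))   # approximately 0 for every column
+print(scaledDf.std().round(3))    # approximately 1 for every column
+# Row-wise check for the normalizer output
+print(np.linalg.norm(normalisedNum, axis=1)[:5])   # approximately 1
+```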
+
+The big question we have to think about is why do we have to standardize
+or normalize data? Many machine learning algorithms converge faster when
+the features are of a similar scale or are normally distributed.
+Standardizing is more useful in algorithms that assume input variables
+to have a Gaussian structure. Algorithms such as linear regression,
+logistic regression, and linear discriminant analysis fall under this
+genre. Normalization techniques would be more congenial for sparse
+datasets (datasets with lots of zeros) when using algorithms such as
+k-nearest neighbor or neural networks.
+
+
+
+Summarizing Feature Engineering
+-------------------------------
+
+In this section, we investigated the process of feature engineering from
+a business perspective and data structure perspective. Feature
+engineering is a very important step in the life cycle of a data science
+project and helps determine the veracity of the models that we build. As
+seen in *Exercise 3.02*, *Business Hypothesis Testing for Age versus
+Propensity for a Term Loan* we translated our understanding of the
+domain and our intuitions to build intelligent features. Let\'s
+summarize the processes that we followed:
+
+1. We obtained intuitions from a business perspective through EDA.
+2. Based on the business intuitions, we devised a new feature that is a
+ combination of three other variables.
+3. We verified the influence of the constituent variables of the new
+ feature and devised an approach for the weights to be applied.
+4. We converted ordinal data into corresponding weights.
+5. We transformed numerical data by normalizing it using an
+ appropriate normalizer.
+6. We combined all three variables into a new feature.
+7. We observed the relationship between the composite index and the
+ propensity to purchase term deposits and derived our intuitions.
+8. We explored techniques for visualizing and extracting summary
+ statistics from data.
+9. We identified techniques for transforming data into feature-engineered
+ data structures.
+
+Now that we have completed the feature engineering step, the next
+question is where do we go from here and what is the relevance of the
+new feature we created? As you will see in the subsequent sections, the
+new features that we created will be used for the modeling process. The
+preceding exercises are an example of one trail we can follow in
+creating new features. There will be multiple trails like these, which
+should be explored based on further domain knowledge and understanding. The
+veracity of the models that we build will be dependent on all such
+intelligent features we can build by translating business knowledge into
+data.
+
+
+
+Building a Binary Classification Model Using the Logistic Regression Function
+-----------------------------------------------------------------------------
+
+The essence of data science is about mapping a business problem into its
+data elements and then transforming those data elements to get our
+desired business outcomes. In the previous sections, we discussed how we
+do the necessary transformation on the data elements. The right
+transformation of the data elements can highly influence the generation
+of the right business outcomes by the downstream modeling process.
+
+Let\'s look at the business outcome generation process from the
+perspective of our use case. The desired business outcome, in our use
+case, is to identify those customers who are likely to buy a term
+deposit. To correctly identify which customers are likely to buy a term
+deposit, we first need to learn the traits or features that, when
+present in a customer, help in the identification process. This
+learning of traits is what is achieved through machine learning.
+
+By now, you may have realized that the goal of machine learning is to
+estimate a mapping function (*f*) between an output variable and input
+variables. In mathematical form, this can be written as follows:
+
+
+
+Caption: A mapping function in mathematical form
+
+Let\'s look at this equation from the perspective of our use case.
+
+*Y* is the dependent variable, which is our prediction as to whether a
+customer has the probability to buy a term deposit or not.
+
+*X* is the independent variable(s), which are those attributes such as
+age, education, and marital status and are part of the dataset.
+
+*f()* is a function that connects the various attributes of the data to
+the probability of whether a customer will buy a term deposit or not. This
+function is learned during the machine learning process. This function
+is a combination of different coefficients or parameters applied to each
+of the attributes to get the probability of term deposit purchases.
+Let\'s unravel this concept using a simple example of our bank data
+use case.
+
+For simplicity, let\'s assume that we have only two attributes, age and
+bank balance. Using these, we have to predict whether a customer is
+likely to buy a term deposit or not. Let the age be 40 years and the
+balance \$1,000. With all of these attribute values, let\'s assume that
+the mapping equation is as follows:
+
+
+
+Caption: Updated mapping equation
+
+Using the preceding equation, we get the following:
+
+*Y = 0.1 + 0.4 \* 40 + 0.002 \* 1000*
+
+*Y = 18.1*
+
+Now, you might be wondering: we got a real number, so how does this
+represent a decision of whether a customer will buy a term deposit or
+not? This is where the concept of a decision boundary comes in.
+Let\'s also assume that, on analyzing the data, we have also identified
+that if the value of *Y* goes above 15 (an assumed value in this case),
+then the customer is likely to buy the term deposit, otherwise they will
+not buy a term deposit. This means that, as per this example, the
+customer is likely to buy a term deposit.
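+The same arithmetic can be written as a few lines of Python. The
+coefficients (0.1, 0.4, and 0.002) and the cut-off of 15 are the assumed
+values from this example, not values learned from the bank dataset:
+
+```
+# Assumed coefficients and decision threshold from the worked example
+intercept, coef_age, coef_balance = 0.1, 0.4, 0.002
+age, balance = 40, 1000
+
+y = intercept + coef_age * age + coef_balance * balance
+print(y)   # 18.1
+print('likely to buy' if y > 15 else 'not likely to buy')
+```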
+
+Let\'s now look at the dynamics in this example and try to decipher the
+concepts. The values such as 0.1, 0.4, and 0.002, which are applied to
+each of the attributes, are the coefficients. These coefficients, along
+with the equation connecting the coefficients and the variables, are the
+functions that we are learning from the data. The essence of machine
+learning is to learn all of these from the provided data. All of these
+coefficients, along with the function, are collectively known by a
+common name: the **model**. A model is an approximation of the
+data generation process. During machine learning, we are trying to get
+as close to the real model that has generated the data we are analyzing.
+To learn or estimate the data generating models, we use different
+machine learning algorithms.
+
+Machine learning models can be broadly classified into two types,
+parametric models and non-parametric models. Parametric models are where
+we assume the form of the function we are trying to learn and then learn
+the coefficients from the training data. By assuming a form for the
+function, we simplify the learning process.
+
+To understand the concept better, let\'s take the example of a linear
+model. For a linear model, the mapping function takes the following
+form:
+
+
+
+Caption: Linear model mapping function
+
+The terms *C0*, *M1*, and *M2* are the coefficients of the line that
+influences the intercept and slope of the line. *X1* and *X2* are the
+input variables. What we are doing here is that we assume that the data
+generating model is a linear model and then, using the data, we estimate
+the coefficients, which will enable the generation of the predictions.
+By assuming the data generating model, we have simplified the whole
+learning process. However, these simple processes also come with their
+pitfalls. Only if the underlying function is linear or similar to linear
+will we get good results. If the assumptions about the form are wrong,
+we are bound to get bad results.
+
+Some examples of parametric models include:
+
+- Linear and logistic regression
+- Naïve Bayes
+- Linear support vector machines
+- Perceptron
+
+Machine learning models that do not make strong assumptions on the
+function are called non-parametric models. In the absence of an assumed
+form, non-parametric models are free to learn any functional form from
+the data. Non-parametric models generally require a lot of training data
+to estimate the underlying function. Some examples of non-parametric
+models include the following:
+
+- Decision trees
+- K-nearest neighbors
+- Neural networks
+- Support vector machines with Gaussian kernels
+
+
+
+Logistic Regression Demystified
+-------------------------------
+
+Logistic regression is a linear model similar to the linear regression
+that was covered in the previous lab. At the core of logistic
+regression is the sigmoid function, which squashes any real-valued
+number into a value between 0 and 1, making it ideal for
+predicting probabilities. The mathematical equation for a logistic
+regression function can be written as follows:
+
+
+
+Caption: Logistic regression function
+
+Here, *Y* is the probability of whether a customer is likely to buy a
+term deposit or not.
+
+The terms *C0 + M1 \* X1 + M2 \* X2* are very similar to the ones we
+have seen in the linear regression function, covered in an earlier
+lab. As you would have learned by now, a linear regression function
+gives a real-valued output. To transform the real-valued output into a
+probability, we use the logistic function, which has the following form:
+
+
+
+Caption: An expression to transform the real-valued output to a
+probability
+
+Here, *e* is the base of the natural logarithm. We will not dive deep into the math
+behind this; however, let\'s realize that, using the logistic function,
+we can transform the real-valued output into a probability function.
+
+Let\'s now look at the logistic regression function from the business
+problem that we are trying to solve. In the business problem, we are
+trying to predict the probability of whether a customer would buy a term
+deposit or not. To do that, let\'s return to the example we derived from
+the problem statement:
+
+
+
+Caption: The logistic regression function updated with the business
+problem statement
+
+Substituting the values from our earlier example, we get
+*Y = 0.1 + 0.4 \* 40 + 0.002 \* 1000 = 18.1*.
+
+To get the probability, we must transform this problem statement using
+the logistic function, as follows:
+
+
+
+Caption: Transformed problem statement to find the probability of
+using the logistic function
+
+On applying this, we get a value of *Y* that is very close to 1, which
+is effectively a 100% probability that the customer will buy the term
+deposit. As discussed in the
+previous example, the coefficients of the model such as 0.1, 0.4, and
+0.002 are what we learn using the logistic regression algorithm during
+the training process.
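+To make this transformation concrete, the following sketch applies the
+standard logistic (sigmoid) function, 1 / (1 + e^(-y)), to the
+real-valued output from the example. The coefficients are, again, the
+assumed values from the example rather than learned ones:
+
+```
+import math
+
+def sigmoid(z):
+    """Squash any real-valued number into the range (0, 1)."""
+    return 1 / (1 + math.exp(-z))
+
+# Real-valued output for age = 40 and balance = 1,000
+y = 0.1 + 0.4 * 40 + 0.002 * 1000   # 18.1
+print(sigmoid(y))   # ~0.99999998, effectively a probability of 1
+```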
+
+
+
+Metrics for Evaluating Model Performance
+----------------------------------------
+
+As a data scientist, you always have to make decisions on the models you
+build. These evaluations are done based on various metrics on the
+predictions. In this section, we introduce some of the important metrics
+that are used for evaluating the performance of models.
+
+Note
+
+Model performance will be covered in much more detail in *Lab 6*,
+*How to Assess Performance*. This section provides you with an
+introduction to work with classification models.
+
+
+
+Confusion Matrix
+----------------
+
+As you will have learned, we evaluate a model based on its performance
+on a test set. A test set will have its labels, which we call the ground
+truth, and, using the model, we also generate predictions for the test
+set. The evaluation of model performance is all about comparison of the
+ground truth and the predictions. Let\'s see this in action with a dummy
+test set:
+
+
+
+Caption: Confusion matrix generation
+
+The preceding table shows a dummy dataset with seven examples. The
+second column is the ground truth, which are the actual labels, and the
+third column contains the results of our predictions. From the data, we
+can see that four have been correctly classified and three were
+misclassified.
+
+A confusion matrix generates the resultant comparison between prediction
+and ground truth, as represented in the following table:
+
+
+
+Caption: Confusion matrix
+
+As you can see from the table, there are five examples whose label
+(ground truth) is `Yes`, and the remaining two examples have the
+label `No`.
+
+The first row of the confusion matrix is the evaluation of the label
+`Yes`. `True positive` shows those examples whose
+ground truth and predictions are `Yes` (examples 1, 3, and 5).
+`False negative` shows those examples whose ground truth is
+`Yes` and who have been wrongly predicted as `No`
+(examples 2 and 7).
+
+Similarly, the second row of the confusion matrix evaluates the
+performance of the label `No`. `False positive` are
+those examples whose ground truth is `No` and who have been
+wrongly classified as `Yes` (example 6).
+`True negative` examples are those examples whose ground truth
+and predictions are both `No` (example 4).
+
+The confusion matrix is used for calculating many metrics, such as
+accuracy and the more detailed metrics shown in the classification
+report, such as precision and recall, which are explained later. We
+generally pick the models for which these metrics are the highest.
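+The dummy comparison above can be reproduced in a couple of lines. The
+two lists below encode the seven ground-truth labels and predictions
+from the example table:
+
+```
+from sklearn.metrics import confusion_matrix
+
+# Ground truth and predictions for the seven dummy examples
+y_true = ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
+y_pred = ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No']
+
+print(confusion_matrix(y_true, y_pred, labels=['Yes', 'No']))
+# [[3 2]   3 true positives, 2 false negatives
+#  [1 1]]  1 false positive, 1 true negative
+```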
+
+
+
+Accuracy
+--------
+
+Accuracy is the first level of evaluation, which we will resort to in
+order to have a quick check on model performance. Referring to the
+preceding table, accuracy can be represented as follows:
+
+
+
+Caption: A function that represents accuracy
+
+Accuracy is the proportion of correct predictions out of all of the
+predictions.
+
+
+
+Classification Report
+---------------------
+
+A classification report outputs three key metrics: **precision**,
+**recall**, and the **F1 score**.
+
+Precision is the ratio of true positives to the sum of true positives
+and false positives:
+
+
+
+Caption: The precision ratio
+
+Precision is the indicator that tells you, out of all of the positives
+that were predicted, how many were true positives.
+
+Recall is the ratio of true positives to the sum of true positives and
+false negatives:
+
+
+
+Caption: The recall ratio
+
+Recall manifests the ability of the model to identify all true
+positives.
+
+The F1 score is the harmonic mean of precision and recall. An F1
+score of 1 indicates the best performance and 0 indicates the worst
+performance.
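+Continuing the dummy example from the confusion matrix section, the
+following sketch prints all three metrics with scikit-learn. For the
+`Yes` label it should report a precision of 3/4 = 0.75, a recall of
+3/5 = 0.60, and an F1 score of about 0.67:
+
+```
+from sklearn.metrics import classification_report
+
+y_true = ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
+y_pred = ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No']
+
+# Precision, recall, and F1 score for each label
+print(classification_report(y_true, y_pred))
+```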
+
+In the next section, let\'s take a look at data preprocessing, which is
+an important process to work with data and come to conclusions in data
+analysis.
+
+
+
+Data Preprocessing
+------------------
+
+Data preprocessing has an important role to play in the life cycle of
+data science projects. These processes are often the most time-consuming
+part of the data science life cycle. Careful implementation of the
+preprocessing steps is critical and will have a strong bearing on the
+results of the data science project.
+
+The various preprocessing steps include the following:
+
+- **Data loading**: This involves loading the data from different
+ sources into the notebook.
+
+- **Data cleaning**: The data cleaning process entails removing
+ anomalies, for instance, special characters and duplicate data, and
+ identifying missing data in the available dataset. Data cleaning is
+ one of the most time-consuming steps in the data science process.
+
+- **Data imputation**: Data imputation is the process of filling in
+ missing data with substituted data points.
+
+- **Converting data types**: Datasets will have different types of
+ data such as numerical data, categorical data, and character data.
+ Running models will necessitate the transformation of data types.
+
+ Note
+
+ Data processing will be covered in depth in the following labs
+ of this book.
+
+We will implement some of these preprocessing steps in the subsequent
+sections and in *Exercise 3.06*, *A Logistic Regression Model for
+Predicting the Propensity of Term Deposit Purchases in a Bank*.
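+As a small illustration of the last two steps, the sketch below imputes
+a missing value and converts the data type of a column. The tiny
+DataFrame and its column names are made up purely for illustration and
+are not part of the bank dataset:
+
+```
+import numpy as np
+import pandas as pd
+
+# A made-up DataFrame used only to illustrate the ideas
+raw = pd.DataFrame({'income': [50000, np.nan, 72000],
+                    'segment': ['A', 'B', 'A']})
+
+# Data imputation: fill the missing income with the column median
+raw['income'] = raw['income'].fillna(raw['income'].median())
+
+# Converting data types: turn the character column into a categorical type
+raw['segment'] = raw['segment'].astype('category')
+print(raw.dtypes)
+```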
+
+
+
+Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
+------------------------------------------------------------------------------------------------------------
+
+In this exercise, we will build a logistic regression model, which will
+be used for predicting the propensity of term deposit purchases. This
+exercise will have three parts. The first part will be the preprocessing
+of the data, the second part will deal with the training process, and
+the last part will be spent on prediction, analysis of metrics, and
+deriving strategies for further improvement of the model.
+
+You will begin with data preprocessing.
+
+In this part, we will first load the data, convert the ordinal data into
+dummy data, and then split the data into training and test sets for the
+subsequent training phase:
+
+1. Open a Colab notebook, mount the drives, install necessary packages,
+ and load the data, as in previous exercises:
+ ```
+ import pandas as pd
+ import altair as alt
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab03/bank-full.csv'
+ bankData = pd.read_csv(file_url, sep=";")
+ ```
+
+
+2. Now, load the library functions and data:
+ ```
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.model_selection import train_test_split
+ ```
+
+
+3. Now, find the data types:
+
+ ```
+ bankData.dtypes
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Data types
+
+4. Convert the ordinal data into dummy data.
+
+ As you can see in the dataset, we have two types of data: the
+ numerical data and the ordinal data. Machine learning algorithms
+ need numerical representation of data and, therefore, we must
+ convert the ordinal data into a numerical form by creating dummy
+ variables. The dummy variable will have values of either 1 or 0
+ corresponding to whether that category is present or not. The
+ function we use for converting ordinal data into numerical form is
+ `pd.get_dummies()`. This function expands the data
+ structure into a wide, horizontal form. So, if there are
+ three categories in a variable, three new dummy variables will be
+ created, one corresponding to each of the categories.
+
+ The value against each variable would be either 1 or 0, depending on
+ whether that category was present in the variable as an example.
+ Let\'s look at the code for doing that:
+
+ ```
+ """
+ Converting all the categorical variables to dummy variables
+ """
+ bankCat = pd.get_dummies\
+ (bankData[['job','marital',\
+ 'education','default','housing',\
+ 'loan','contact','month','poutcome']])
+ bankCat.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (45211, 44)
+ ```
+
+
+ We now have a new subset of the data corresponding to the
+ categorical data that was converted into numerical form. Also, we
+ had some numerical variables in the original dataset, which did not
+ need any transformation. The transformed categorical data and the
+ original numerical data have to be combined to get all of the
+ original features. To combine both, let\'s first extract the
+ numerical data from the original DataFrame.
+
+5. Now, separate the numerical variables:
+
+ ```
+ bankNum = bankData[['age','balance','day','duration',\
+ 'campaign','pdays','previous']]
+ bankNum.shape
+ ```
+
+
+ You should get the following output:
+
+ ```
+ (45211, 7)
+ ```
+
+
+6. Now, prepare the `X` and `Y` variables and print
+ the `Y` shape. The `X` variable is the
+ concatenation of the transformed categorical variable and the
+ separated numerical data:
+
+ ```
+ # Preparing the X variables
+ X = pd.concat([bankCat, bankNum], axis=1)
+ print(X.shape)
+ # Preparing the Y variable
+ Y = bankData['y']
+ print(Y.shape)
+ X.head()
+ ```
+
+
+ The output shown below is truncated:
+
+
+
+
+
+ Caption: Combining categorical and numerical DataFrames
+
+ Once the DataFrame is created, we can split the data into training
+ and test sets. We specify the proportion in which the DataFrame must
+ be split into training and test sets.
+
+7. Split the data into training and test sets:
+
+ ```
+ # Splitting the data into train and test sets
+ X_train, X_test, y_train, y_test = train_test_split\
+ (X, Y, test_size=0.3, \
+ random_state=123)
+ ```
+
+
+ Now, the data is all prepared for the modeling task. Next, we begin
+ with modeling.
+
+ In this part, we will train the model using the training set we
+ created in the earlier step. First, we instantiate the
+ `LogisticRegression()` function and then fit the model with
+ the training set data.
+
+8. Define the `LogisticRegression` function:
+
+ ```
+ bankModel = LogisticRegression()
+ bankModel.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Parameters of the model that fits
+
+9. Now that the model is created, use it to predict on the test
+ set and then get the accuracy of the predictions:
+
+ ```
+ pred = bankModel.predict(X_test)
+ print('Accuracy of Logistic regression model ' \
+ 'prediction on test set: {:.2f}'\
+ .format(bankModel.score(X_test, y_test)))
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Prediction with the model
+
+10. From an initial look, an accuracy metric of 90% gives us the
+ impression that the model has done a decent job of approximating the
+ data generating process. Or is it otherwise? Let\'s take a closer
+ look at the details of the prediction by generating the metrics for
+ the model. We will use two metric-generating functions, the
+ confusion matrix and classification report:
+
+ ```
+ # Confusion Matrix for the model
+ from sklearn.metrics import confusion_matrix
+ confusionMatrix = confusion_matrix(y_test, pred)
+ print(confusionMatrix)
+ ```
+
+
+ You should get the following output in the following format;
+ however, the values can vary as the modeling task will involve
+ variability:
+
+
+
+
+
+ Caption: Generation of the confusion matrix
+
+ Note
+
+ The end results that you get will be different from what you see
+ here as it depends on the system you are using. This is because the
+ modeling part is stochastic in nature and there will always be
+ differences.
+
+11. Next, let\'s generate a `classification_report`:
+
+ ```
+ from sklearn.metrics import classification_report
+ print(classification_report(y_test, pred))
+ ```
+
+
+ You should get a similar output; however, with different values due
+ to variability in the modeling process:
+
+
+
+
+
+
+From the metrics, we can see that, out of the total 11,998 examples of
+`no`, 11,754 were correctly classified as `no` and
+the balance, 244, were classified as `yes`. This gives a
+recall value of *11,754/11,998*, which is nearly 98%. From a precision
+perspective, out of the total 12,996 examples that were predicted as
+`no`, only 11,754 of them were really `no`, which
+takes our precision to 11,754/12,996 or 90%.
+
+However, the metrics for `yes` give a different picture. Out
+of the total 1,566 cases of `yes`, only 324 were correctly
+identified as `yes`. This gives us a recall of *324/1,566 =
+21%*. The precision is *324 / (324 + 244) = 57%*.
+
+From an overall accuracy level, this can be calculated as follows:
+correctly classified *examples / total examples = (11754 + 324) / 13564
+= 89%*.
+
+The metrics might seem good when you look only at the accuracy level.
+However, looking at the details, we can see that the classifier, in
+fact, is doing a poor job of classifying the `yes` cases. The
+classifier has been trained to predict mostly `no` values,
+which from a business perspective is useless. From a business
+perspective, we predominantly want the `yes` estimates, so
+that we can target those cases for focused marketing to try to sell term
+deposits. However, with the results we have, we don\'t seem to have done
+a good job in helping the business to increase revenue from term deposit
+sales.
+
+In this exercise, we preprocessed the data, performed the training
+process, and finally generated predictions, analyzed the metrics, and
+derived strategies for further improvement of the model.
+
+What we have now built is the first model or a benchmark model. The next
+step is to try to improve on the benchmark model through different
+strategies. One such strategy is to feature engineer variables and build
+new models with new features. Let\'s achieve that in the next activity.
+
+
+
+Activity 3.02: Model Iteration 2 -- Logistic Regression Model with Feature Engineered Variables
+-----------------------------------------------------------------------------------------------
+
+As the data scientist of the bank, you created a benchmark model to
+predict which customers are likely to buy a term deposit. However,
+management wants to improve the results you got in the benchmark model.
+In *Exercise 3.04*, *Feature Engineering -- Creating New Features from
+Existing Ones,* you discussed the business scenario with the marketing
+and operations teams and created a new variable, `assetIndex`,
+by feature engineering three raw variables. You are now fitting another
+logistic regression model on the feature engineered variables and are
+trying to improve the results.
+
+In this activity, you will be feature engineering some of the variables
+to verify their effects on the predictions.
+
+The steps are as follows:
+
+1. Open the Colab notebook used for the feature engineering in
+ *Exercise 3.04*, *Feature Engineering -- Creating New Features from
+ Existing Ones,* and execute all of the steps from that exercise.
+
+2. Create dummy variables for the categorical variables using the
+ `pd.get_dummies()` function. Exclude original raw
+ variables such as loan and housing, which were used to create the
+ new variable, `assetIndex`.
+
+3. Select the numerical variables including the new feature engineered
+ variable, `assetIndex`, that was created.
+
+4. Transform some of the numerical variables by normalizing them using
+ the `MinMaxScaler()` function.
+
+5. Concatenate the numerical variables and categorical variables using
+ the `pd.concat()` function and then create `X`
+ and `Y` variables.
+
+6. Split the dataset using the `train_test_split()` function
+ and then fit a new model using the `LogisticRegression()`
+ model on the new features.
+
+7. Analyze the results after generating the confusion matrix and
+ classification report.
+
+ You should get the following output:
+
+
+
+
+
+Caption: Expected output with the classification report
+
+
+Summary
+=======
+
+
+In this lab, we learned about binary classification using logistic
+regression from the perspective of solving a use case. Let\'s summarize
+our learnings in this lab. We were introduced to classification
+problems and specifically binary classification problems. We also looked
+at the classification problem from the perspective of predicting term
+deposit propensity through a business discovery process. In the business
+discovery process, we identified different business drivers that
+influence business outcomes.
\ No newline at end of file
diff --git a/lab_guides/Lab_4.md b/lab_guides/Lab_4.md
new file mode 100644
index 0000000..a0bc0bb
--- /dev/null
+++ b/lab_guides/Lab_4.md
@@ -0,0 +1,1767 @@
+
+4. Multiclass Classification with RandomForest
+==============================================
+
+
+
+Overview
+
+This lab will show you how to train a multiclass classifier using
+the Random Forest algorithm. You will also see how to evaluate the
+performance of multiclass models.
+
+By the end of the lab, you will be able to implement a Random Forest
+classifier, as well as tune hyperparameters in order to improve model
+performance.
+
+
+
+
+Training a Random Forest Classifier
+===================================
+
+
+
+Let\'s see how we can train a Random Forest classifier on this dataset.
+First, we need to load the data from the GitHub repository using
+`pandas` and then we will print its first five rows using the
+`head()` method.
+
+Note
+
+All the example code given outside of Exercises in this lab relates
+to this Activity Recognition dataset. It is recommended that all code
+from these examples is entered and run in a single Google Colab
+Notebook, and kept separate from your Exercise Notebooks.
+
+```
+import pandas as pd
+file_url = 'https://raw.githubusercontent.com/fenago'\
+ '/data-science/master/Lab04/'\
+ 'Dataset/activity.csv'
+df = pd.read_csv(file_url)
+df.head()
+```
+
+The output will be as follows:
+
+
+
+Caption: First five rows of the dataset
+
+Each row represents an activity that was performed by a person and the
+name of the activity is stored in the `Activity` column. There
+are seven different activities in this variable: `bending1`,
+`bending2`, `cycling`, `lying`,
+`sitting`, `standing`, and `Walking`. The
+other six columns are different measurements taken from sensor data.
+
+In this example, you will accurately predict the target variable
+(`'Activity'`) from the features (the six other columns) using
+Random Forest. For example, for the first row of the preceding example,
+the model will receive the following features as input and will predict
+the `'bending1'` class:
+
+
+
+Caption: Features for the first row of the dataset
+
+But before that, we need to do a bit of data preparation. The
+`sklearn` package (which we will use to train the Random Forest
+model) requires the target variable and the features to be separated.
+So, we need to extract the response variable using the
+`.pop()` method from `pandas`. The
+`.pop()` method extracts the specified column and removes it
+from the DataFrame:
+
+```
+target = df.pop('Activity')
+```
+Now the response variable is contained in the variable called
+`target` and all the features are in the DataFrame called
+`df`.
+
+Now we are going to split the dataset into training and testing sets.
+The model uses the training set to learn relevant parameters in
+predicting the response variable. The test set is used to check whether
+a model can accurately predict unseen data. We say the model is
+overfitting when it has learned the patterns relevant only to the
+training set and makes incorrect predictions about the testing set. In
+this case, the model performance will be much higher for the training
+set compared to the testing one. Ideally, we want to have a very similar
+level of performance for the training and testing sets. This topic will
+be covered in more depth in *Lab 7*, *The Generalization of Machine
+Learning Models*.
+
+The `sklearn` package provides a function called
+`train_test_split()` to randomly split the dataset into two
+different sets. We need to specify the following parameters for this
+function: the feature and target variables, the ratio of the testing set
+(`test_size`), and `random_state` in order to get
+reproducible results if we have to run the code again:
+
+```
+from sklearn.model_selection import train_test_split
+X_train, X_test, y_train, y_test = train_test_split\
+ (df, target, test_size=0.33, \
+ random_state=42)
+```
+
+There are four different outputs to the `train_test_split()`
+function: the features for the training set, the target variable for the
+training set, the features for the testing set, and its target variable.
+
+Now that we have got our training and testing sets, we are ready for
+modeling. Let\'s first import the `RandomForestClassifier`
+class from `sklearn.ensemble`:
+
+```
+from sklearn.ensemble import RandomForestClassifier
+```
+Now we can instantiate the Random Forest classifier with some
+hyperparameters. Remember from *Lab 1, Introduction to Data Science
+in Python*, a hyperparameter is a type of parameter the model can\'t
+learn but is set by data scientists to tune the model\'s learning
+process. This topic will be covered more in depth in *Lab 8,
+Hyperparameter Tuning*. For now, we will just specify the
+`random_state` and `n_estimators` values. We will walk you through some
+of the key hyperparameters in the following sections:
+
+```
+rf_model = RandomForestClassifier(random_state=1, \
+ n_estimators=10)
+```
+
+The next step is to train (also called fit) the model with the training
+data. During this step, the model will try to learn the relationship
+between the response variable and the independent variables and save the
+parameters learned. We need to specify the features and target variables
+as parameters:
+
+```
+rf_model.fit(X_train, y_train)
+```
+
+The output will be as follows:
+
+
+
+Caption: Logs of the trained RandomForest
+
+Now that the model has completed its training, we can use the parameters
+it learned to make predictions on the input data we will provide. In the
+following example, we are using the features from the training set:
+
+```
+preds = rf_model.predict(X_train)
+```
+Now we can print these predictions:
+
+```
+preds
+```
+
+The output will be as follows:
+
+
+
+Caption: Predictions of the RandomForest algorithm on the training
+set
+
+This output shows that the model predicted the values
+`lying`, `bending1`, and `cycling` for the
+first three observations and `cycling`, `bending1`,
+and `standing` for the last three observations, respectively. By
+default, the output for a long list of values is truncated, which is
+why only six values are shown here.
+
+These are basically the key steps required for training a Random Forest
+classifier. This was quite straightforward, right? Training a machine
+learning model is incredibly easy but getting meaningful and accurate
+results is where the challenges lie. In the next section, we will learn
+how to assess the performance of a trained model.
+
+
+Evaluating the Model\'s Performance
+===================================
+
+
+Now that we know how to train a Random Forest classifier, it is time to
+check whether we did a good job or not. What we want is to get a model
+that makes extremely accurate predictions, so we need to assess its
+performance using some kind of metric.
+
+For a classification problem, multiple metrics can be used to assess the
+model\'s predictive power, such as F1 score, precision, recall, or ROC
+AUC. Each of them has its own specificity and depending on the projects
+and datasets, you may use one or another.
+
+In this lab, we will use a metric called **accuracy score**. It
+calculates the ratio between the number of correct predictions and the
+total number of predictions made by the model:
+
+
+
+Caption: Formula for accuracy score
+
+For instance, if your model made 950 correct predictions out of 1,000
+cases, then the accuracy score would be 950/1000 = 0.95. This would mean
+that your model was 95% accurate on that dataset. The
+`sklearn` package provides a function to calculate this score
+automatically and it is called `accuracy_score()`. We need to
+import it first:
+
+```
+from sklearn.metrics import accuracy_score
+```
+
+Then, we just need to provide the list of predictions for some
+observations and the corresponding true value for the target variable.
+Using the previous example, we will use the `y_train` and
+`preds` variables, which respectively contain the response
+variable (also known as the target) for the training set and the
+corresponding predictions made by the Random Forest model. We will reuse
+the predictions from the previous section -- `preds`:
+
+```
+accuracy_score(y_train, preds)
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy score on the training set
+
+We achieved an accuracy score of 0.988 on our training data. This means
+we accurately predicted more than `98%` of these cases.
+Unfortunately, this doesn\'t mean you will be able to achieve such a
+high score for new, unseen data. Your model may have just learned the
+patterns that are only relevant to this training set, and in that case,
+the model will overfit.
+
+If we take the analogy of a student learning a subject for a semester,
+they could memorize by heart all the textbook exercises but when given a
+similar but unseen exercise, they wouldn\'t be able to solve it.
+Ideally, the student should understand the underlying concepts of the
+subject and be able to apply that learning to any similar exercises.
+This is exactly the same for our model: we want it to learn the generic
+patterns that will help it to make accurate predictions even on unseen
+data.
+
+But how can we assess the performance of a model for unseen data? Is
+there a way to get that kind of assessment? The answer to these
+questions is yes.
+
+Remember, in the last section, we split the dataset into training and
+testing sets. We used the training set to fit the model and assess its
+predictive power on it. But it hasn\'t seen the observations from the
+testing set at all, so we can use it to assess whether our model is
+capable of generalizing to unseen data. Let\'s calculate the accuracy score
+for the testing set:
+
+```
+test_preds = rf_model.predict(X_test)
+accuracy_score(y_test, test_preds)
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy score on the testing set
+
+OK. Now the accuracy has dropped drastically to `0.77`. The
+difference between the training and testing sets is quite big. This
+tells us our model is actually overfitting and learned only the patterns
+relevant to the training set. In an ideal case, the performance of your
+model should be equal or very close to equal for those two sets.
+
+In the next sections, we will look at tuning some Random Forest
+hyperparameters in order to reduce overfitting.
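+When you start comparing several models, a small helper function like
+the one sketched below can save some typing. It simply reuses
+`accuracy_score` on both sets for a fitted model; `print_accuracy` is
+just a convenience name introduced here, not a scikit-learn function:
+
+```
+from sklearn.metrics import accuracy_score
+
+def print_accuracy(model, X_train, y_train, X_test, y_test):
+    """Compare training and testing accuracy to spot overfitting."""
+    train_acc = accuracy_score(y_train, model.predict(X_train))
+    test_acc = accuracy_score(y_test, model.predict(X_test))
+    print(f'Training: {train_acc:.3f} - Testing: {test_acc:.3f}')
+
+print_accuracy(rf_model, X_train, y_train, X_test, y_test)
+```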
+
+
+
+Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance
+-----------------------------------------------------------------------------------------
+
+In this exercise, we will train a Random Forest classifier to predict
+the type of an animal based on its attributes and check its accuracy
+score:
+
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package:
+ ```
+ import pandas as pd
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ of the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab04/Dataset'\
+ '/openml_phpZNNasq.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame using the `.read_csv()`
+ method from pandas:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Print the first five rows of the DataFrame:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the DataFrame
+
+ We will be using the `type` column as our target variable.
+ We will need to remove the `animal` column from the
+ DataFrame and only use the remaining columns as features.
+
+6. Remove the `'animal'` column using the `.drop()`
+ method from `pandas` and specify the
+ `columns='animal'` and `inplace=True` parameters
+ (to directly update the original DataFrame):
+ ```
+ df.drop(columns='animal', inplace=True)
+ ```
+
+
+7. Extract the `'type'` column using the `.pop()`
+ method from `pandas`:
+ ```
+ y = df.pop('type')
+ ```
+
+
+8. Print the first five rows of the updated DataFrame:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the DataFrame
+
+9. Import the `train_test_split` function from
+ `sklearn.model_selection`:
+ ```
+ from sklearn.model_selection import train_test_split
+ ```
+
+
+10. Split the dataset into training and testing sets with the
+ `df`, `y`, `test_size=0.4`, and
+ `random_state=188` parameters:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.4, \
+ random_state=188)
+ ```
+
+
+11. Import `RandomForestClassifier` from
+ `sklearn.ensemble`:
+ ```
+ from sklearn.ensemble import RandomForestClassifier
+ ```
+
+
+12. Instantiate the `RandomForestClassifier` object with
+ `random_state` equal to `42`. Set the
+ `n-estimators` value to an initial default value of
+ `10`. We\'ll discuss later how changing this value affects
+ the result.
+ ```
+ rf_model = RandomForestClassifier(random_state=42, \
+ n_estimators=10)
+ ```
+
+
+13. Fit `RandomForestClassifier` with the training set:
+
+ ```
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForestClassifier
+
+14. Predict the outcome of the training set with the
+ `.predict()`method, save the results in a variable called
+ \'`train_preds`\', and print its value:
+
+ ```
+ train_preds = rf_model.predict(X_train)
+ train_preds
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Predictions on the training set
+
+15. Import the `accuracy_score` function from
+ `sklearn.metrics`:
+ ```
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+16. Calculate the accuracy score on the training set, save the result in
+ a variable called `train_acc`, and print its value:
+
+ ```
+ train_acc = accuracy_score(y_train, train_preds)
+ print(train_acc)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Accuracy score on the training set
+
+ Our model achieved an accuracy of `1` on the training set,
+ which means it perfectly predicted the target variable on all of
+ those observations. Let\'s check the performance on the testing set.
+
+17. Predict the outcome of the testing set with the
+ `.predict()` method and save the results into a variable
+ called `test_preds`:
+ ```
+ test_preds = rf_model.predict(X_test)
+ ```
+
+
+18. Calculate the accuracy score on the testing set, save the result in
+ a variable called `test_acc`, and print its value:
+
+ ```
+ test_acc = accuracy_score(y_test, test_preds)
+ print(test_acc)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+Number of Trees Estimator
+=========================
+
+Now that we know how to fit a Random Forest classifier and assess its
+performance, it is time to dig into the details. In the coming sections,
+we will learn how to tune some of the most important hyperparameters for
+this algorithm. As mentioned in *Lab 1, Introduction to Data Science
+in Python*, hyperparameters are parameters that are not learned
+automatically by machine learning algorithms. Their values have to be
+set by data scientists. These hyperparameters can have a huge impact on
+the performance of a model, its ability to generalize to unseen data,
+and the time taken to learn patterns from the data.
+
+The first hyperparameter you will look at in this section is called
+`n_estimators`. This hyperparameter is responsible for
+defining the number of trees that will be trained by the
+`RandomForest` algorithm.
+
+Before looking at how to tune this hyperparameter, we need to understand
+what a tree is and why it is so important for the
+`RandomForest` algorithm.
+
+A tree is a logical graph that maps a decision and its outcomes at each
+of its nodes. Simply speaking, it is a series of yes/no (or true/false)
+questions that lead to different outcomes.
+
+A leaf is a special type of node where the model will make a prediction.
+There will be no split after a leaf. A single node split of a tree may
+look like this:
+
+
+
+Caption: Example of a single tree node
+
+A tree node is composed of a question and two outcomes depending on
+whether the condition defined by the question is met or not. In the
+preceding example, the question is `is avg_rss12 > 41?` If the
+answer is yes, the outcome is the `bending_1` leaf and if not,
+it will be the `sitting` leaf.
+
+A tree is just a series of nodes and leaves combined together:
+
+
+
+Caption: Example of a tree
+
+In the preceding example, the tree is composed of three nodes with
+different questions. Now, for an observation to be predicted as
+`sitting`, it will need to meet the conditions:
+`avg_rss13 <= 41`, `var_rss > 0.7`, and
+`avg_rss13 <= 16.25`.
+
+The `RandomForest` algorithm will build this kind of tree
+based on the training data it sees. We will not go through the
+mathematical details about how it defines the split for each node but,
+basically, it will go through every column of the dataset and see which
+split value will best help to separate the data into two groups of
+similar classes. Taking the preceding example, the first node with the
+`avg_rss13 > 41` condition will help to get the group of data
+on the left-hand side with mostly the `bending_1` class. The
+`RandomForest` algorithm usually builds several such trees,
+which is why it is called a forest.
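+
+If you want to see what these learned questions look like in practice,
+here is a minimal, optional sketch (not part of the original code) that
+fits a single, very shallow decision tree on the same training data and
+prints its rules as text. It assumes the `X_train` and
+`y_train` variables from the earlier split:
+
+```
+# Optional sketch: inspect the yes/no questions learned by one shallow tree.
+# Assumes X_train and y_train already exist from the earlier train/test split.
+from sklearn.tree import DecisionTreeClassifier, export_text
+
+small_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
+small_tree.fit(X_train, y_train)
+print(export_text(small_tree, feature_names=list(X_train.columns)))
+```
+
+Each indented line of the output corresponds to one node\'s condition,
+and the deepest lines are the leaves where predictions are made.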
+
+As you may have guessed now, the `n_estimators` hyperparameter
+is used to specify the number of trees the `RandomForest`
+algorithm will build. For example (as in the previous exercise), say we
+ask it to build 10 trees. For a given observation, it will ask each tree
+to make a prediction. Then, it will combine those predictions (in
+scikit-learn, by averaging the predicted class probabilities across the
+trees) and use the result as the final prediction for this input. For instance, if, out
+of 10 trees, 8 of them predict the outcome `sitting`, then the
+`RandomForest` algorithm will use this outcome as the final
+prediction.
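+
+To make this combination step more tangible, here is a minimal sketch
+(not part of the original exercise) that mimics the idea by hand: it
+trains 10 individual decision trees on bootstrap samples of the training
+data and takes a majority vote across them. It assumes the
+`X_train`, `y_train`, and `X_test` variables
+from the earlier split:
+
+```
+# Optional sketch: hand-rolled bagging with a majority vote across 10 trees.
+# Assumes X_train, y_train, and X_test from the earlier train/test split.
+import numpy as np
+import pandas as pd
+from sklearn.tree import DecisionTreeClassifier
+
+n_trees = 10
+rng = np.random.RandomState(42)
+all_preds = []
+for i in range(n_trees):
+    # draw a bootstrap sample: rows sampled with replacement
+    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
+    tree = DecisionTreeClassifier(random_state=i)
+    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
+    all_preds.append(tree.predict(X_test))
+
+# majority vote: the most frequent prediction per observation wins
+votes = pd.DataFrame(all_preds)
+majority_vote = votes.mode(axis=0).iloc[0].to_numpy()
+print(majority_vote[:10])
+```
+
+The real `RandomForest` implementation adds more refinements
+(such as feature subsampling, covered later in this lab), but the voting
+intuition is the same.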
+
+Note
+
+If you don\'t pass in a specific `n_estimators`
+hyperparameter, it will use the default value. The default depends on
+the version of scikit-learn you\'re using. In early versions, the
+default value is 10. From version 0.22 onwards, the default is 100. You
+can find out which version you are using by executing the following
+code:
+
+`import sklearn`
+
+`sklearn.__version__`
+
+
+In general, the higher the number of trees is, the better the
+performance you will get. Let\'s see what happens with
+`n_estimators = 2` on the Activity Recognition dataset:
+
+```
+rf_model2 = RandomForestClassifier(random_state=1, \
+ n_estimators=2)
+rf_model2.fit(X_train, y_train)
+preds2 = rf_model2.predict(X_train)
+test_preds2 = rf_model2.predict(X_test)
+print(accuracy_score(y_train, preds2))
+print(accuracy_score(y_test, test_preds2))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy of RandomForest with n\_estimators = 2
+
+As expected, the accuracy is significantly lower than the previous
+example with `n_estimators = 10`. Let\'s now try with
+`50` trees:
+
+```
+rf_model3 = RandomForestClassifier(random_state=1, \
+ n_estimators=50)
+rf_model3.fit(X_train, y_train)
+preds3 = rf_model3.predict(X_train)
+test_preds3 = rf_model3.predict(X_test)
+print(accuracy_score(y_train, preds3))
+print(accuracy_score(y_test, test_preds3))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy of RandomForest with n\_estimators = 50
+
+With `n_estimators = 50`, we respectively gained
+`1%` and `2%` on the accuracy scores for the
+training and testing sets, which is great. But the main drawback of
+increasing the number of trees is that it requires more computational
+power. So, it will take more time to train a model. In a real project,
+you will need to find the right balance between performance and training
+duration.
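+
+One simple way to explore this trade-off, as a rough sketch rather than a
+formal benchmark, is to time the `.fit()` call for a few values
+of `n_estimators` and compare the resulting test accuracy. The
+snippet below assumes the `X_train`, `y_train`,
+`X_test`, and `y_test` variables from the earlier
+split:
+
+```
+# Optional sketch: compare accuracy and training time for several tree counts.
+import time
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.metrics import accuracy_score
+
+for n in [2, 10, 50, 100]:
+    start = time.perf_counter()
+    model = RandomForestClassifier(random_state=1, n_estimators=n)
+    model.fit(X_train, y_train)
+    elapsed = time.perf_counter() - start
+    test_acc = accuracy_score(y_test, model.predict(X_test))
+    print(f"n_estimators={n}: test accuracy={test_acc:.3f}, "
+          f"training time={elapsed:.2f}s")
+```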
+
+
+
+Exercise 4.02: Tuning n\_estimators to Reduce Overfitting
+---------------------------------------------------------
+
+In this exercise, we will train a Random Forest classifier to predict
+the type of an animal based on its attributes and will try two different
+values for the `n_estimators` hyperparameter:
+
+We will be using the same zoo dataset as in the previous exercise.
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas `package, `train_test_split`,
+ `RandomForestClassifier`, and `accuracy_score`
+ from `sklearn`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ to the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab04/Dataset'\
+ '/openml_phpZNNasq.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame using the `.read_csv()`
+ method from `pandas`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Remove the `animal` column using `.drop()` and
+ then extract the `type` target variable into a new
+ variable called `y` using `.pop()`:
+ ```
+ df.drop(columns='animal', inplace=True)
+ y = df.pop('type')
+ ```
+
+
+6. Split the data into training and testing sets with
+ `train_test_split()` and the `test_size=0.4` and
+ `random_state=188` parameters:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.4, \
+ random_state=188)
+ ```
+
+
+7. Instantiate `RandomForestClassifier` with
+ `random_state=42` and `n_estimators=1`, and then
+ fit the model with the training set:
+
+ ```
+ rf_model = RandomForestClassifier(random_state=42, \
+ n_estimators=1)
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForestClassifier
+
+8. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds` and `test_preds`:
+ ```
+ train_preds = rf_model.predict(X_train)
+ test_preds = rf_model.predict(X_test)
+ ```
+
+
+9. Calculate the accuracy score for the training and testing sets and
+ save the results in two new variables called `train_acc`
+ and `test_acc`:
+ ```
+ train_acc = accuracy_score(y_train, train_preds)
+ test_acc = accuracy_score(y_test, test_preds)
+ ```
+
+
+10. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc)
+ print(test_acc)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Accuracy scores for the training and testing sets
+
+ The accuracy score decreased for both the training and testing sets.
+ But now the difference is smaller compared to the results from
+ *Exercise 4.01*, *Building a Model for Classifying Animal Type and
+ Assessing Its Performance*.
+
+11. Instantiate another `RandomForestClassifier` with
+ `random_state=42` and `n_estimators=30`, and
+ then fit the model with the training set:
+
+ ```
+ rf_model2 = RandomForestClassifier(random_state=42, \
+ n_estimators=30)
+ rf_model2.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest with n\_estimators = 30
+
+12. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds2` and `test_preds2`:
+ ```
+ train_preds2 = rf_model2.predict(X_train)
+ test_preds2 = rf_model2.predict(X_test)
+ ```
+
+
+13. Calculate the accuracy score for the training and testing sets and
+ save the results in two new variables called `train_acc2`
+ and `test_acc2`:
+ ```
+ train_acc2 = accuracy_score(y_train, train_preds2)
+ test_acc2 = accuracy_score(y_test, test_preds2)
+ ```
+
+
+14. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc2)
+ print(test_acc2)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Accuracy scores for the training and testing sets
+
+
+
+Maximum Depth
+=============
+
+
+In the previous section, we learned how Random Forest builds multiple
+trees to make predictions. Increasing the number of trees does improve
+model performance but it usually doesn\'t help much to decrease the risk
+of overfitting. Our model in the previous example is still performing
+much better on the training set (data it has already seen) than on the
+testing set (unseen data).
+
+So, we are not confident enough yet to say the model will perform well
+in production. There are different hyperparameters that can help to
+lower the risk of overfitting for Random Forest and one of them is
+called `max_depth`.
+
+This hyperparameter defines the depth of the trees built by Random
+Forest. Basically, it tells the Random Forest model how many nodes
+(questions) it can create before making predictions. But how will that
+help to reduce overfitting, you may ask. Well, let\'s say you built a
+single tree and set the `max_depth` hyperparameter to
+`50`. This would mean that there would be some cases where you
+could ask 49 different questions (the depth of 50 includes the
+final leaf node) before making a prediction. So, the logic would be
+`IF X1 > value1 AND X2 > value2 AND X1 <= value3 AND … AND X3 > value49 THEN predict class A`.
+
+As you can imagine, this is a very specific rule. In the end, it may
+apply to only a few observations in the training set, with this case
+appearing very infrequently. Therefore, your model would be overfitting.
+By default, the value of this `max_depth` parameter is
+`None`, which means there is no limit set for the depth of the
+trees.
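+
+As a quick sanity check (this is an optional sketch, not part of the
+original code), you can inspect how deep the individual trees actually
+grow when no limit is set, assuming the `rf_model3` forest with
+50 trees fitted earlier:
+
+```
+# Optional sketch: inspect the depth of each tree in an unconstrained forest.
+# Assumes rf_model3 (n_estimators=50, default max_depth=None) was fitted earlier.
+depths = [tree.get_depth() for tree in rf_model3.estimators_]
+print(min(depths), max(depths), sum(depths) / len(depths))
+```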
+
+What you really want is to find some rules that are generic enough to be
+applied to bigger groups of observations. This is why it is recommended
+to not create deep trees with Random Forest. Let\'s try several values
+for this hyperparameter on the Activity Recognition dataset:
+`3`, `10`, and `50`:
+
+```
+rf_model4 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, max_depth=3)
+rf_model4.fit(X_train, y_train)
+preds4 = rf_model4.predict(X_train)
+test_preds4 = rf_model4.predict(X_test)
+print(accuracy_score(y_train, preds4))
+print(accuracy_score(y_test, test_preds4))
+```
+You should get the following output:
+
+
+
+Caption: Accuracy scores for the training and testing sets and a
+max\_depth of 3
+
+For a `max_depth` of `3`, we got extremely similar
+results for the training and testing sets but the overall performance
+decreased drastically to `0.61`. Our model is not overfitting
+anymore, but it is now underfitting; that is, it is not predicting the
+target variable very well (only in `61%` of cases). Let\'s
+increase `max_depth` to `10`:
+
+```
+rf_model5 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10)
+rf_model5.fit(X_train, y_train)
+preds5 = rf_model5.predict(X_train)
+test_preds5 = rf_model5.predict(X_test)
+print(accuracy_score(y_train, preds5))
+print(accuracy_score(y_test, test_preds5))
+```
+
+
+Caption: Accuracy scores for the training and testing sets and a
+max\_depth of 10
+
+The accuracy of the training set increased and is relatively close to
+the testing set. We are starting to get some good results, but the model
+is still slightly overfitting. Now we will see the results for
+`max_depth = 50`:
+
+```
+rf_model6 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=50)
+rf_model6.fit(X_train, y_train)
+preds6 = rf_model6.predict(X_train)
+test_preds6 = rf_model6.predict(X_test)
+print(accuracy_score(y_train, preds6))
+print(accuracy_score(y_test, test_preds6))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets and a
+max\_depth of 50
+
+The accuracy jumped to `0.99` for the training set but it
+didn\'t improve much for the testing set. So, the model is overfitting
+with `max_depth = 50`. It seems the sweet spot to get good
+predictions and not much overfitting is around `10` for the
+`max_depth` hyperparameter in this dataset.
+
+
+
+Exercise 4.03: Tuning max\_depth to Reduce Overfitting
+------------------------------------------------------
+
+In this exercise, we will keep tuning our RandomForest classifier that
+predicts animal type by trying two different values for the
+`max_depth` hyperparameter:
+
+We will be using the same zoo dataset as in the previous exercise.
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package, `train_test_split`,
+ `RandomForestClassifier`, and `accuracy_score`
+ from `sklearn`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ to the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ 'fenago/data-science'\
+ '/master/Lab04/Dataset'\
+ '/openml_phpZNNasq.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame using the `.read_csv()`
+ method from `pandas`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Remove the `animal` column using `.drop()` and
+ then extract the `type` target variable into a new
+ variable called `y` using `.pop()`:
+ ```
+ df.drop(columns='animal', inplace=True)
+ y = df.pop('type')
+ ```
+
+
+6. Split the data into training and testing sets with
+ `train_test_split()` and the parameters
+ `test_size=0.4` and `random_state=188`:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.4, \
+ random_state=188)
+ ```
+
+
+7. Instantiate `RandomForestClassifier` with
+ `random_state=42`, `n_estimators=30`, and
+ `max_depth=5`, and then fit the model with the training
+ set:
+
+ ```
+ rf_model = RandomForestClassifier(random_state=42, \
+ n_estimators=30, \
+ max_depth=5)
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest
+
+8. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds` and `test_preds`:
+ ```
+ train_preds = rf_model.predict(X_train)
+ test_preds = rf_model.predict(X_test)
+ ```
+
+
+9. Calculate the accuracy score for the training and testing sets and
+ save the results in two new variables called `train_acc`
+ and `test_acc`:
+ ```
+ train_acc = accuracy_score(y_train, train_preds)
+ test_acc = accuracy_score(y_test, test_preds)
+ ```
+
+
+10. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc)
+ print(test_acc)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Accuracy scores for the training and testing sets
+
+ We got the exact same accuracy scores as for the best result we
+ obtained in the previous exercise. This value for the
+ `max_depth` hyperparameter hasn\'t impacted the model\'s
+ performance.
+
+11. Instantiate another `RandomForestClassifier` with
+ `random_state=42`, `n_estimators=30`, and
+ `max_depth=2`, and then fit the model with the training
+ set:
+
+ ```
+ rf_model2 = RandomForestClassifier(random_state=42, \
+ n_estimators=30, \
+ max_depth=2)
+ rf_model2.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForestClassifier with max\_depth = 2
+
+12. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds2 `and `test_preds2`:
+ ```
+ train_preds2 = rf_model2.predict(X_train)
+ test_preds2 = rf_model2.predict(X_test)
+ ```
+
+
+13. Calculate the accuracy scores for the training and testing sets and
+ save the results in two new variables called `train_acc2`
+ and `test_acc2`:
+ ```
+ train_acc2 = accuracy_score(y_train, train_preds2)
+ test_acc2 = accuracy_score(y_test, test_preds2)
+ ```
+
+
+14. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc2)
+ print(test_acc2)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+Minimum Sample in Leaf
+======================
+
+
+It would be great if we could let the model know to not create such
+specific rules that happen quite infrequently. Luckily,
+`RandomForest` has such a hyperparameter and, you guessed it,
+it is `min_samples_leaf`. This hyperparameter will specify the
+minimum number of observations (or samples) that will have to fall under
+a leaf node to be considered in the tree. For instance, if we set
+`min_samples_leaf` to `3`, then
+`RandomForest` will only consider a split that leads to at
+least three observations on both the left and right leaf nodes. If this
+condition is not met for a split, the model will not consider it and
+will exclude it from the tree. The default value in `sklearn`
+for this hyperparameter is `1`. Let\'s try to find the optimal
+value for `min_samples_leaf` for the Activity Recognition
+dataset:
+
+```
+rf_model7 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10, \
+ min_samples_leaf=3)
+rf_model7.fit(X_train, y_train)
+preds7 = rf_model7.predict(X_train)
+test_preds7 = rf_model7.predict(X_test)
+print(accuracy_score(y_train, preds7))
+print(accuracy_score(y_test, test_preds7))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets for
+min\_samples\_leaf=3
+
+With `min_samples_leaf=3`, the accuracy for both the training
+and testing sets didn\'t change much compared to the best model we found
+in the previous section. Let\'s try increasing it to `10`:
+
+```
+rf_model8 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10, \
+ min_samples_leaf=10)
+rf_model8.fit(X_train, y_train)
+preds8 = rf_model8.predict(X_train)
+test_preds8 = rf_model8.predict(X_test)
+print(accuracy_score(y_train, preds8))
+print(accuracy_score(y_test, test_preds8))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets for
+min\_samples\_leaf=10
+
+Now the accuracy of the training set dropped a bit but increased for the
+testing set and their difference is smaller now. So, our model is
+overfitting less. Let\'s try another value for this hyperparameter --
+`25`:
+
+```
+rf_model9 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10, \
+ min_samples_leaf=25)
+rf_model9.fit(X_train, y_train)
+preds9 = rf_model9.predict(X_train)
+test_preds9 = rf_model9.predict(X_test)
+print(accuracy_score(y_train, preds9))
+print(accuracy_score(y_test, test_preds9))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets for
+min\_samples\_leaf=25
+
+Both accuracies for the training and testing sets decreased but they are
+quite close to each other now. So, we will keep this value
+(`25`) as the optimal one for this dataset as the performance
+is still OK and we are not overfitting too much.
+
+When choosing the optimal value for this hyperparameter, you need to be
+careful: a value that\'s too low will increase the chance of the model
+overfitting, but on the other hand, setting a very high value will lead
+to underfitting (the model will not accurately predict the right
+outcome).
+
+For instance, if you have a dataset of `1000` rows and you set
+`min_samples_leaf` to `400`, then the model will not
+be able to find good splits to predict `5` different classes.
+In this case, the model can only create a single split and
+will only be able to predict two different classes instead of
+`5`. It is good practice to start with low values first and
+then progressively increase them until you reach satisfactory
+performance.
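+
+A simple way to run this progressive search is with a short loop. The
+following is only a sketch, assuming the `X_train`,
+`y_train`, `X_test`, and `y_test`
+variables from the earlier split:
+
+```
+# Optional sketch: try increasing values of min_samples_leaf and compare scores.
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.metrics import accuracy_score
+
+for leaf in [1, 3, 10, 25, 50]:
+    model = RandomForestClassifier(random_state=1, n_estimators=50,
+                                   max_depth=10, min_samples_leaf=leaf)
+    model.fit(X_train, y_train)
+    train_acc = accuracy_score(y_train, model.predict(X_train))
+    test_acc = accuracy_score(y_test, model.predict(X_test))
+    print(f"min_samples_leaf={leaf}: train={train_acc:.3f}, test={test_acc:.3f}")
+```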
+
+
+
+Exercise 4.04: Tuning min\_samples\_leaf
+----------------------------------------
+
+In this exercise, we will keep tuning our Random Forest classifier that
+predicts animal type by trying two different values for the
+`min_samples_leaf` hyperparameter:
+
+We will be using the same zoo dataset as in the previous exercise.
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package, `train_test_split`,
+ `RandomForestClassifier`, and `accuracy_score`
+ from `sklearn`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ to the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab04/Dataset/openml_phpZNNasq.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame using the `.read_csv()`
+ method from `pandas`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Remove the `animal` column using `.drop()` and
+ then extract the `type` target variable into a new
+ variable called `y` using `.pop()`:
+ ```
+ df.drop(columns='animal', inplace=True)
+ y = df.pop('type')
+ ```
+
+
+6. Split the data into training and testing sets with
+ `train_test_split()` and the parameters
+ `test_size=0.4` and `random_state=188`:
+ ```
+ X_train, X_test, \
+ y_train, y_test = train_test_split(df, y, test_size=0.4, \
+ random_state=188)
+ ```
+
+
+7. Instantiate `RandomForestClassifier` with
+ `random_state=42`, `n_estimators=30`,
+ `max_depth=2`, and `min_samples_leaf=3`, and
+ then fit the model with the training set:
+
+ ```
+ rf_model = RandomForestClassifier(random_state=42, \
+ n_estimators=30, \
+ max_depth=2, \
+ min_samples_leaf=3)
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest
+
+8. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds` and `test_preds`:
+ ```
+ train_preds = rf_model.predict(X_train)
+ test_preds = rf_model.predict(X_test)
+ ```
+
+
+9. Calculate the accuracy score for the training and testing sets and
+ save the results in two new variables called `train_acc`
+ and `test_acc`:
+ ```
+ train_acc = accuracy_score(y_train, train_preds)
+ test_acc = accuracy_score(y_test, test_preds)
+ ```
+
+
+10. Print the accuracy score -- `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc)
+ print(test_acc)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Accuracy scores for the training and testing sets
+
+ The accuracy score decreased for both the training and testing sets
+ compared to the best result we got in the previous exercise. Now the
+ difference between the training and testing sets\' accuracy scores
+ is much smaller so our model is overfitting less.
+
+11. Instantiate another `RandomForestClassifier` with
+ `random_state=42`, `n_estimators=30`,
+ `max_depth=2`, and `min_samples_leaf=7`, and
+ then fit the model with the training set:
+
+ ```
+ rf_model2 = RandomForestClassifier(random_state=42, \
+ n_estimators=30, \
+ max_depth=2, \
+ min_samples_leaf=7)
+ rf_model2.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest with max\_depth=2
+
+12. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds2` and `test_preds2`:
+ ```
+ train_preds2 = rf_model2.predict(X_train)
+ test_preds2 = rf_model2.predict(X_test)
+ ```
+
+
+13. Calculate the accuracy score for the training and testing sets and
+ save the results in two new variables called `train_acc2`
+ and `test_acc2`:
+ ```
+ train_acc2 = accuracy_score(y_train, train_preds2)
+ test_acc2 = accuracy_score(y_test, test_preds2)
+ ```
+
+
+14. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc2)
+ print(test_acc2)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+Maximum Features
+================
+
+
+We are getting close to the end of this lab. You have already
+learned how to tune several of the most important hyperparameters for
+RandomForest. In this section, we will present you with another
+extremely important one: `max_features`.
+
+Earlier, we learned that `RandomForest` builds multiple trees
+and takes the average to make predictions. This is why it is called a
+forest, but we haven\'t really discussed the \"random\" part yet. Going
+through this lab, you may have asked yourself: how does building
+multiple trees help to get better predictions, and won\'t all the trees
+look the same given that the input data is the same?
+
+Before answering these questions, let\'s use the analogy of a court
+trial. In some countries, the final decision of a trial is either made
+by a judge or a jury. A judge is a person who knows the law in detail
+and can decide whether a person has broken the law or not. On the other
+hand, a jury is composed of people from different backgrounds who don\'t
+know each other or any of the parties involved in the trial and have
+limited knowledge of the legal system. In this case, we are asking
+random people who are not experts in the law to decide the outcome of a
+case. This sounds very risky at first. The risk of one person making the
+wrong decision is very high. But in fact, the risk of 10 or 20 people
+all making the wrong decision is relatively low.
+
+But there is one condition that needs to be met for this to work:
+randomness. If all the people in the jury come from the same background,
+work in the same industry, or live in the same area, they may share the
+same way of thinking and make similar decisions. For instance, if a
+group of people were raised in a community where everyone only drinks hot
+chocolate at breakfast, and one day you ask them whether it is OK to drink
+coffee at breakfast, they would all say no.
+
+On the other hand, say you got another group of people from different
+backgrounds with different habits: some drink coffee, others tea, a few
+drink orange juice, and so on. If you asked them the same question, you
+would end up with the majority of them saying yes. Because we randomly
+picked these people, they have less bias as a group, and this therefore
+lowers the risk of them making a wrong decision.
+
+RandomForest actually applies the same logic: it builds a number of
+trees independently of each other by randomly sampling the data. A tree
+may see `60%` of the training data, another one
+`70%`, and so on. By doing so, there is a high chance that the
+trees are absolutely different from each other and don\'t share the same
+bias. This is the secret of RandomForest: building multiple random trees
+leads to higher accuracy.
+
+But it is not the only way RandomForest creates randomness. It does so
+also by randomly sampling columns. Each tree will only see a subset of
+the features rather than all of them. And this is exactly what the
+`max_features` hyperparameter is for: it will set the maximum
+number of features a tree is allowed to see.
+
+In `sklearn`, you can specify the value of this hyperparameter
+as:
+
+- The maximum number of features, as an integer.
+- A ratio (a float between 0 and 1), as the fraction of features allowed.
+- The `sqrt` function (the default value in
+ `sklearn`, which stands for square root), which will use
+ the square root of the number of features as the maximum value. If,
+ for a dataset, there are `25` features, its square root
+ will be `5` and this will be the value for
+ `max_features`.
+- The `log2` function, which will use the log base,
+ `2`, of the number of features as the maximum value. If,
+ for a dataset, there are eight features, its `log2` will
+ be `3` and this will be the value for
+ `max_features`.
+- The `None` value, which means Random Forest will use all
+ the features available.
+
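+To make these options more concrete, here is a small illustrative
+calculation (not part of the original code) of how many features each
+option would allow for a dataset with 6 features, like the Activity
+Recognition one:
+
+```
+# Illustrative only: what each max_features option resolves to for 6 features.
+import math
+
+n_features = 6
+print(int(math.sqrt(n_features)))  # 'sqrt' -> 2
+print(int(math.log2(n_features)))  # 'log2' -> 2
+print(int(0.7 * n_features))       # ratio 0.7 -> 4
+print(n_features)                  # None -> all 6 features
+```
+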
+Let\'s try three different values on the activity dataset. First, we
+will specify the maximum number of features as two:
+
+```
+rf_model10 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10, \
+ min_samples_leaf=25, \
+ max_features=2)
+rf_model10.fit(X_train, y_train)
+preds10 = rf_model10.predict(X_train)
+test_preds10 = rf_model10.predict(X_test)
+print(accuracy_score(y_train, preds10))
+print(accuracy_score(y_test, test_preds10))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets for
+max\_features=2
+
+We got results similar to those of the best model we trained in the
+previous section. This is not really surprising as we were using the
+default value of `max_features` at that time, which is
+`sqrt`. The square root of `6` (the number of features in
+this dataset) equals roughly `2.45`, which is quite close to `2`. This time,
+let\'s try with the ratio `0.7`:
+
+```
+rf_model11 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10, \
+ min_samples_leaf=25, \
+ max_features=0.7)
+rf_model11.fit(X_train, y_train)
+preds11 = rf_model11.predict(X_train)
+test_preds11 = rf_model11.predict(X_test)
+print(accuracy_score(y_train, preds11))
+print(accuracy_score(y_test, test_preds11))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets for
+max\_features=0.7
+
+With this ratio, both accuracy scores increased for the training and
+testing sets and the difference between them is less. Our model is
+overfitting less now and has slightly improved its predictive power.
+Let\'s give it a shot with the `log2` option:
+
+```
+rf_model12 = RandomForestClassifier(random_state=1, \
+ n_estimators=50, \
+ max_depth=10, \
+ min_samples_leaf=25, \
+ max_features='log2')
+rf_model12.fit(X_train, y_train)
+preds12 = rf_model12.predict(X_train)
+test_preds12 = rf_model12.predict(X_test)
+print(accuracy_score(y_train, preds12))
+print(accuracy_score(y_test, test_preds12))
+```
+
+The output will be as follows:
+
+
+
+Caption: Accuracy scores for the training and testing sets for
+max\_features=\'log2\'
+
+We got similar results as for the default value (`sqrt`) and
+`2`. Again, this is normal as the `log2` of
+`6` equals `2.58`. So, the optimal value we found
+for the `max_features` hyperparameter is `0.7` for
+this dataset.
+
+
+
+Exercise 4.05: Tuning max\_features
+-----------------------------------
+
+In this exercise, we will keep tuning our RandomForest classifier that
+predicts animal type by trying two different values for the
+`max_features` hyperparameter:
+
+We will be using the same zoo dataset as in the previous exercise.
+
+1. Open a new Colab notebook.
+
+2. Import the `pandas` package, `train_test_split`,
+ `RandomForestClassifier`, and `accuracy_score`
+ from `sklearn`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ to the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab04/Dataset/openml_phpZNNasq.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame using the `.read_csv()`
+ method from `pandas`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Remove the `animal` column using `.drop()` and
+ then extract the `type` target variable into a new
+ variable called `y` using `.pop()`:
+ ```
+ df.drop(columns='animal', inplace=True)
+ y = df.pop('type')
+ ```
+
+
+6. Split the data into training and testing sets with
+ `train_test_split()` and the parameters
+ `test_size=0.4` and `random_state=188`:
+ ```
+ X_train, X_test, \
+ y_train, y_test = train_test_split(df, y, test_size=0.4, \
+ random_state=188)
+ ```
+
+
+7. Instantiate `RandomForestClassifier` with
+ `random_state=42`, `n_estimators=30`,
+ `max_depth=2`, `min_samples_leaf=7`, and
+ `max_features=10`, and then fit the model with the
+ training set:
+
+ ```
+ rf_model = RandomForestClassifier(random_state=42, \
+ n_estimators=30, \
+ max_depth=2, \
+ min_samples_leaf=7, \
+ max_features=10)
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest
+
+8. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds` and `test_preds`:
+ ```
+ train_preds = rf_model.predict(X_train)
+ test_preds = rf_model.predict(X_test)
+ ```
+
+
+9. Calculate the accuracy scores for the training and testing sets and
+ save the results in two new variables called `train_acc`
+ and `test_acc`:
+ ```
+ train_acc = accuracy_score(y_train, train_preds)
+ test_acc = accuracy_score(y_test, test_preds)
+ ```
+
+
+10. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc)
+ print(test_acc)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Accuracy scores for the training and testing sets
+
+11. Instantiate another `RandomForestClassifier` with
+ `random_state=42`, `n_estimators=30`,
+ `max_depth=2`, `min_samples_leaf=7`, and
+ `max_features=0.2`, and then fit the model with the
+ training set:
+
+ ```
+ rf_model2 = RandomForestClassifier(random_state=42, \
+ n_estimators=30, \
+ max_depth=2, \
+ min_samples_leaf=7, \
+ max_features=0.2)
+ rf_model2.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest with max\_features = 0.2
+
+12. Make predictions on the training and testing sets with
+ `.predict()` and save the results into two new variables
+ called `train_preds2` and `test_preds2`:
+ ```
+ train_preds2 = rf_model2.predict(X_train)
+ test_preds2 = rf_model2.predict(X_test)
+ ```
+
+
+13. Calculate the accuracy score for the training and testing sets and
+ save the results in two new variables called `train_acc2`
+ and `test_acc2`:
+ ```
+ train_acc2 = accuracy_score(y_train, train_preds2)
+ test_acc2 = accuracy_score(y_test, test_preds2)
+ ```
+
+
+14. Print the accuracy scores: `train_acc` and
+ `test_acc`:
+
+ ```
+ print(train_acc2)
+ print(test_acc2)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+
+Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
+---------------------------------------------------------------------
+
+You are working for a technology company and they are planning to launch
+a new voice assistant product. You have been tasked with building a
+classification model that will recognize the letters spelled out by a
+user based on the signal frequencies captured. Each sound can be
+captured and represented as a signal composed of multiple frequencies.
+
+
+The following steps will help you to complete this activity:
+
+1. Download and load the dataset using `.read_csv()` from
+ `pandas`.
+2. Extract the response variable using `.pop()` from
+ `pandas`.
+3. Split the dataset into training and test sets using
+ `train_test_split()` from
+ `sklearn.model_selection`.
+4. Create a function that will instantiate and fit a
+ `RandomForestClassifier` using `.fit()` from
+ `sklearn.ensemble`.
+5. Create a function that will predict the outcome for the training and
+ testing sets using `.predict()`.
+6. Create a function that will print the accuracy score for the
+ training and testing sets using `accuracy_score()` from
+ `sklearn.metrics`.
+7. Train and get the accuracy score for a range of different
+ hyperparameters. Here are some options you can try:
+ - `n_estimators = 20` and `50`
+ - `max_depth = 5` and `10`
+ - `min_samples_leaf = 10` and `50`
+ - `max_features = 0.5` and `0.3`
+8. Select the best hyperparameter value.
+
+These are the accuracy scores for the best model we trained:
+
+
+
+
+
+
+Summary
+=======
+
+
+We have finally reached the end of this lab on multiclass
+classification with Random Forest. We learned that multiclass
+classification is an extension of binary classification: instead of
+predicting only two classes, target variables can have many more values.
+We saw how we can train a Random Forest model in just a few lines of
+code and assess its performance by calculating the accuracy score for
+the training and testing sets. Finally, we learned how to tune some of
+its most important hyperparameters: `n_estimators`,
+`max_depth`, `min_samples_leaf`, and
+`max_features`. We also saw how their values can have a
+significant impact on the predictive power of a model but also on its
+ability to generalize to unseen data.
diff --git a/lab_guides/Lab_5.md b/lab_guides/Lab_5.md
new file mode 100644
index 0000000..25bba0e
--- /dev/null
+++ b/lab_guides/Lab_5.md
@@ -0,0 +1,2228 @@
+
+5. Performing Your First Cluster Analysis
+=========================================
+
+
+
+Overview
+
+This lab will introduce you to unsupervised learning tasks, where
+algorithms have to automatically learn patterns from data by themselves
+as no target variables are defined beforehand. We will focus
+specifically on the k-means algorithm, and see how to standardize and
+process data for use in cluster analysis.
+
+By the end of this lab, you will be able to load and visualize data
+and clusters with scatter plots; prepare data for cluster analysis;
+perform centroid clustering with k-means; interpret clustering results
+and determine the optimal number of clusters for a given dataset.
+
+
+Clustering with k-means
+=======================
+
+
+We will perform cluster analysis on this dataset for two specific
+variables (or columns): `Average net tax` and
+`Average total deductions`. Our objective is to find groups
+(or clusters) of postcodes sharing similar patterns in terms of tax
+received and money deducted. Here is a scatter plot of these two
+variables:
+
+
+
+Caption: Scatter plot of the ATO dataset
+
+
+
+Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset
+---------------------------------------------------------------------------
+
+In this exercise, we will be using k-means clustering on the ATO dataset
+and observing the different clusters that the dataset divides itself
+into, after which we will conclude by analyzing the output:
+
+1. Open a new Colab notebook.
+
+2. Next, load the required Python packages: `pandas` and
+ `KMeans` from `sklearn.cluster`.
+
+    We will be using Python\'s `import` statement:
+
+ Note
+
+    You can create short aliases for the packages you will be calling
+    quite often in your script, as shown in the following code snippet.
+
+ ```
+ import pandas as pd
+ from sklearn.cluster import KMeans
+ ```
+
+
+ Note
+
+ We will be looking into `KMeans` (from
+ `sklearn.cluster`), which you have used in the code here,
+ later in the lab for a more detailed explanation of it.
+
+3. Next, create a variable containing the link to the file. We will
+ call this variable `file_url`:
+
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab05/DataSet/taxstats2015.csv'
+ ```
+
+
+ In the next step, we will use the `pandas` package to load
+ our data into a DataFrame (think of it as a table, like on an Excel
+ spreadsheet, with a row index and column names).
+
+ Our input file is in `CSV` format, and `pandas`
+ has a method that can directly read this format, which is
+ `.read_csv()`.
+
+4. Use the `usecols` parameter to subset only the columns we
+ need rather than loading the entire dataset. We just need to provide
+ a list of the column names we are interested in, which are mentioned
+ in the following code snippet:
+
+ ```
+ df = pd.read_csv(file_url, \
+ usecols=['Postcode', \
+ 'Average net tax', \
+ 'Average total deductions'])
+ ```
+
+
+ Now we have loaded the data into a `pandas` DataFrame.
+
+5. Next, let\'s display the first 5 rows of this DataFrame , using the
+ method `.head()`:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the ATO DataFrame
+
+6. Now, to output the last 5 rows, we use `.tail()`:
+
+ ```
+ df.tail()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The last five rows of the ATO DataFrame
+
+ Now that we have our data, let\'s jump straight to what we want to
+ do: find clusters.
+
+    As you saw in the previous labs, `sklearn` provides
+    the exact same API for training different machine learning
+    algorithms. The usual steps are as follows:
+
+ - Instantiate an algorithm with the specified hyperparameters
+ (here it will be KMeans(hyperparameters)).
+
+ - Fit the model with the training data with the method
+ `.fit()`.
+
+ - Predict the result with the given input data with the method
+ `.predict()`.
+
+ Note
+
+ Here, we will use all the default values for the k-means
+ hyperparameters except for the `random_state` one.
+ Specifying a fixed random state (also called a **seed**) will
+ help us to get reproducible results every time we have to rerun
+ our code.
+
+7. Instantiate k-means with a random state of `42` and save
+ it into a variable called `kmeans`:
+ ```
+ kmeans = KMeans(random_state=42)
+ ```
+
+
+8. Now feed k-means with our training data. To do so, we need to get
+ only the variables (or columns) used for fitting the model. In our
+ case, the variables are `'Average net tax'` and
+ `'Average total deductions'`, and they are saved in a new
+ variable called `X`:
+ ```
+ X = df[['Average net tax', 'Average total deductions']]
+ ```
+
+
+9. Now fit `kmeans` with this training data:
+
+ ```
+ kmeans.fit(X)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Summary of the fitted kmeans and its hyperparameters
+
+ We just ran our first clustering algorithm in just a few lines of
+ code.
+
+10. See which cluster each data point belongs to by using the
+ `.predict()` method:
+
+ ```
+ y_preds = kmeans.predict(X)
+ y_preds
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of the k-means predictions
+
+ Note
+
+ Although we set a `random_state` value, you may still get
+ an output with different cluster numbers than the one shown above.
+ This will depend on the version of scikit-learn you are using. The
+ output above was generated using version 0.22.2. You can find out
+ which version you are using by executing the following code:
+
+ `import sklearn`
+
+ `sklearn.__version__`
+
+11. Now, add these predictions into the original DataFrame and take a
+ look at the first five postcodes:
+
+ ```
+ df['cluster'] = y_preds
+ df.head()
+ ```
+
+
+ Note
+
+ The predictions from the sklearn `predict()` method are in
+ the exact same order as the input data. So, the first prediction
+ will correspond to the first row of your DataFrame.
+
+ You should get the following output:
+
+
+
+
+
+Caption: Cluster number assigned to the first five postcodes
+
+
+Interpreting k-means Results
+============================
+
+
+After training our k-means algorithm, we will likely be interested in
+analyzing its results in more detail. Remember, the objective of cluster
+analysis is to group observations with similar patterns together. But
+how can we see whether the groupings found by the algorithm are
+meaningful? We will be looking at this in this section by using the
+dataset results we just generated.
+
+One way of investigating this is to analyze the dataset row by row with
+the assigned cluster for each observation. This can be quite tedious,
+especially if the size of your dataset is quite big, so it would be
+better to have a kind of summary of the cluster results.
+
+If you are familiar with Excel spreadsheets, you are probably thinking
+about using a pivot table to get the average of the variables for each
+cluster. In SQL, you would have probably used a `GROUP BY`
+statement. If you are not familiar with either of these, you may think
+of grouping each cluster together and then calculating the average for
+each of them. The good news is that this can be easily achieved with the
+`pandas` package in Python. Let\'s see how this can be done
+with an example.
+
+To create a pivot table similar to an Excel one, we will be using the
+`pivot_table()` method from `pandas`. We need to
+specify the following parameters for this method:
+
+- `values`: This parameter corresponds to the numerical
+ columns you want to calculate summaries for (or aggregations), such
+ as getting averages or counts. In an Excel pivot table, it is also
+ called `values`. In our dataset, we will use the
+ `Average net tax` and `Average total deductions`
+ variables.
+
+- `index`: This parameter is used to specify the columns you
+ want to see summaries for. In our case, it will be the
+ `cluster` column. In a pivot table in Excel, this
+ corresponds with the `Rows` field.
+
+- `aggfunc`: This is where you will specify the aggregation
+ functions you want to summarize the data with, such as getting
+ averages or counts. In Excel, this is the `Summarize by`
+ option in the `values` field. An example of how to use the
+ `aggfunc` method is shown below.
+
+ Note
+
+ Run the code below in the same notebook as you used for the previous
+ exercise.
+
+```
+import numpy as np
+df.pivot_table(values=['Average net tax', \
+ 'Average total deductions'], \
+ index='cluster', aggfunc=np.mean)
+```
+Note
+
+We will be using the `numpy` implementation of
+`mean()` as it is more optimized for pandas DataFrames.
+
+
+
+Caption: Output of the pivot\_table function
+
+In this summary, we can see that the algorithm has grouped the data into
+eight clusters (clusters 0 to 7). Cluster 0 has the lowest average net
+tax and total deductions amounts among all the clusters, while cluster 4
+has the highest values. With this pivot table, we are able to compare
+the clusters with each other using their summarized values.
+
+Using an aggregated view of clusters is a good way of seeing the
+difference between them, but it is not the only way. Another possibility
+is to visualize clusters in a graph. This is exactly what we are going
+to do now.
+
+You may have heard of different visualization packages, such as
+`matplotlib`, `seaborn`, and `bokeh`, but
+in this lab, we will be using the `altair` package because
+it is quite simple to use (its API is very similar to
+`sklearn`). Let\'s import it first:
+
+```
+import altair as alt
+```
+
+Then, we will instantiate a `Chart()` object with our
+DataFrame and save it into a variable called `chart`:
+
+```
+chart = alt.Chart(df)
+```
+Now we will specify the type of graph we want, a scatter plot, with the
+`.mark_circle()` method and will save it into a new variable
+called `scatter_plot`:
+
+```
+scatter_plot = chart.mark_circle()
+```
+Finally, we need to configure our scatter plot by specifying the names
+of the columns that will be our `x`- and `y`-axes on
+the graph. We also tell the scatter plot to color each point according
+to its cluster value with the `color` option:
+
+```
+scatter_plot.encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster:N')
+```
+Note
+
+You may have noticed that we added `:N` at the end of the
+`cluster` column name. This extra parameter is used in
+`altair` to specify the type of value for this column.
+`:N` means the information contained in this column is
+categorical. `altair` automatically defines the color scheme
+to be used depending on the type of a column.
+
+You should get the following output:
+
+
+
+Caption: Scatter plot of the clusters
+
+
+
+Let\'s say we want to add a tooltip that will display the values for the
+two columns of interest: the postcode and the assigned cluster. With
+`altair`, we just need to add a parameter called
+`tooltip` in the `encode()` method with a list of
+corresponding column names and call the `interactive()` method
+just after, as seen in the following code snippet:
+
+```
+scatter_plot.encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster:N', \
+ tooltip=['Postcode', \
+ 'cluster', 'Average net tax', \
+ 'Average total deductions'])\
+ .interactive()
+```
+You should get the following output:
+
+
+
+Caption: Interactive scatter plot of the clusters with tooltip
+
+Now we can easily hover over and inspect the data points near the
+cluster boundaries and find out that the threshold used to differentiate
+the purple cluster (6) from the red one (2) is close to 32,000 in
+`'Average Net Tax'`.
+
+
+
+Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses
+------------------------------------------------------------------------------
+
+In this exercise, we will learn how to perform clustering analysis with
+k-means and visualize its results based on postcode values sorted by
+business income and expenses. The following steps will help you complete
+this exercise:
+
+1. Open a new Colab notebook for this exercise.
+
+2. Now `import` the required packages (`pandas`,
+ `sklearn`, `altair`, and `numpy`):
+ ```
+ import pandas as pd
+ from sklearn.cluster import KMeans
+ import altair as alt
+ import numpy as np
+ ```
+
+
+3. Assign the link to the ATO dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab05/DataSet/taxstats2015.csv'
+ ```
+
+
+4. Using the `read_csv` method from the pandas package, load
+ the dataset with only the following columns with the
+    `usecols` parameter: `'Postcode'`,
+ `'Average total business income'`, and
+ `'Average total business expenses'`:
+ ```
+ df = pd.read_csv(file_url, \
+ usecols=['Postcode', \
+ 'Average total business income', \
+ 'Average total business expenses'])
+ ```
+
+
+5. Display the last 10 rows from the ATO dataset using the
+ `.tail()` method from pandas:
+
+ ```
+ df.tail(10)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The last 10 rows of the ATO dataset
+
+6. Extract the `'Average total business income'` and
+ `'Average total business expenses'` columns using the
+ following pandas column subsetting syntax:
+    `dataframe_name[['column_1', 'column_2']]`. Then, save them into
+ a new variable called `X`:
+ ```
+ X = df[['Average total business income', \
+ 'Average total business expenses']]
+ ```
+
+
+7. Now fit `kmeans` with this new variable using a value of
+ `8` for the `random_state` hyperparameter:
+
+ ```
+ kmeans = KMeans(random_state=8)
+ kmeans.fit(X)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Summary of the fitted kmeans and its hyperparameters
+
+8. Using the `predict` method from the `sklearn`
+ package, predict the clustering assignment from the input variable,
+ `(X)`, save the results into a new variable called
+ `y_preds`, and display the last `10`
+ predictions:
+
+ ```
+ y_preds = kmeans.predict(X)
+ y_preds[-10:]
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Results of the clusters assigned to the last 10
+ observations
+
+9. Save the predicted clusters back to the DataFrame by creating a new
+ column called `'cluster'` and print the last
+ `10` rows of the DataFrame using the `.tail()`
+ method from the `pandas` package:
+
+ ```
+ df['cluster'] = y_preds
+ df.tail(10)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The last 10 rows of the ATO dataset with the added
+ cluster column
+
+10. Generate a pivot table with the averages of the two columns for each
+ cluster value using the `pivot_table` method from the
+ `pandas` package with the following parameters:
+
+ Provide the names of the columns to be aggregated,
+ `'Average total business income'`
+ and` 'Average total business expenses'`, to the parameter
+ values.
+
+ Provide the name of the column to be grouped, `'cluster'`,
+ to the parameter index.
+
+ Use the `.mean` method from NumPy (`np`) as the
+ aggregation function for the `aggfunc` parameter:
+
+ ```
+ df.pivot_table(values=['Average total business income', \
+ 'Average total business expenses'], \
+ index='cluster', aggfunc=np.mean)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Output of the pivot\_table function
+
+11. Now let\'s plot the clusters using an interactive scatter plot.
+ First, use `Chart()` and `mark_circle()` from
+ the `altair` package to instantiate a scatter plot graph:
+ ```
+ scatter_plot = alt.Chart(df).mark_circle()
+ ```
+
+
+12. Use the `encode` and `interactive` methods from
+ `altair` to specify the display of the scatter plot and
+ its interactivity options with the following parameters:
+
+ Provide the name of the `'Average total business income'`
+ column to the `x` parameter (the x-axis).
+
+ Provide the name of the
+ `'Average total business expenses'` column to the
+ `y` parameter (the y-axis).
+
+ Provide the name of the `cluster:N` column to the
+ `color` parameter (providing a different color for each
+ group).
+
+ Provide these column names -- `'Postcode'`,
+ `'cluster'`, `'Average total business income'`,
+ and `'Average total business expenses'` -- to the
+ `'tooltip'` parameter (this being the information
+ displayed by the tooltip):
+
+ ```
+ scatter_plot.encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster:N', tooltip = ['Postcode', \
+ 'cluster', \
+ 'Average total business income', \
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Interactive scatter plot of the clusters
+
+
+
+Choosing the Number of Clusters
+===============================
+
+
+In the previous sections, we saw how easy it is to fit the k-means
+algorithm on a given dataset. In our ATO dataset, we found 8 different
+clusters that were mainly defined by the values of the
+`Average net tax` variable.
+
+But you may have asked yourself: \"*Why 8 clusters? Why not 3 or 15
+clusters?*\" These are indeed excellent questions. The short answer is
+that we used the default value of the k-means hyperparameter
+`n_clusters`, which defines the number of clusters to be found, and
+that default is 8.
+
+As you will recall from *Lab 2*, *Regression*, and *Lab 4*,
+*Multiclass Classification with RandomForest*, the value of a
+hyperparameter isn\'t learned by the algorithm but has to be set
+arbitrarily by you prior to training. For k-means, `n_clusters`
+is one of the most important hyperparameters you will have to tune.
+Choosing a low value will lead k-means to group many data points
+together, even though they are very different from each other. On the
+other hand, choosing a high value may force the algorithm to split
+similar observations into multiple clusters.
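+
+As a minimal sketch of setting this hyperparameter explicitly (assuming
+the `KMeans` class and the `X` DataFrame used earlier in this lab), you
+would write something like this:
+
+```
+from sklearn.cluster import KMeans
+
+# n_clusters is chosen by us, not learned: here we ask for 3 groups
+kmeans = KMeans(n_clusters=3, random_state=8)
+kmeans.fit(X)  # X must already be defined, as in Exercise 5.01
+```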
+
+Looking at the scatter plot from the ATO dataset, eight clusters seems
+to be a lot. On the graph, some of the clusters look very close to each
+other and have similar values. Intuitively, just by looking at the plot,
+you could have said that there were between two and four different
+clusters. As you can see, this is quite subjective, and it would be
+great if there were a method that could help us to define the right
+number of clusters for a dataset. Such a method does indeed exist, and
+it is called the **Elbow** method.
+
+This method assesses the compactness of clusters, the objective being to
+minimize a value known as **inertia**. More details and an explanation
+about this will be provided later in this lab. For now, think of
+inertia as a value that says, for a group of data points, how far from
+each other or how close to each other they are.
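+
+In scikit-learn, this value is exposed on a fitted model as the
+`inertia_` attribute. For example (a quick sketch assuming a fitted
+`kmeans` model such as the one from the previous section):
+
+```
+# Sum of squared distances of each point to its assigned centroid
+print(kmeans.inertia_)
+```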
+
+Let\'s apply this method to our ATO dataset. First, we will define the
+range of cluster numbers we want to evaluate (from 1 to 9, using
+`range(1, 10)`) and save them in a DataFrame called `clusters`. We
+will also create an
+empty list called `inertia`, where we will store our
+calculated values.
+
+Note
+
+Open the notebook you were using for *Exercise 5.01*, *Performing Your
+First Clustering Analysis on the ATO Dataset*, execute the code you
+already entered, and then continue at the end of the notebook with the
+following code.
+
+```
+clusters = pd.DataFrame()
+clusters['cluster_range'] = range(1, 10)
+inertia = []
+```
+Next, we will create a `for` loop that will iterate over the
+range, fit a k-means model with the specified number of
+`clusters`, extract the `inertia` value, and store
+it in our list, as in the following code snippet:
+
+```
+for k in clusters['cluster_range']:
+ kmeans = KMeans(n_clusters=k, random_state=8).fit(X)
+ inertia.append(kmeans.inertia_)
+```
+Now we can use our list of `inertia` values in the
+`clusters` DataFrame:
+
+```
+clusters['inertia'] = inertia
+clusters
+```
+You should get the following output:
+
+
+
+Caption: Dataframe containing inertia values for our clusters
+
+Then, we need to plot a line chart using `altair` with the
+`mark_line()` method. We will specify the
+`'cluster_range'` column as our x-axis and
+`'inertia'` as our y-axis, as in the following code snippet:
+
+```
+alt.Chart(clusters).mark_line()\
+ .encode(x='cluster_range', y='inertia')
+```
+You should get the following output:
+
+
+
+Caption: Plotting the Elbow method
+
+Note
+
+You don\'t have to save each of the `altair` objects in a
+separate variable; you can just chain the methods one after the other
+using "`.`".
+
+Now that we have plotted the inertia value against the number of
+clusters, we need to find the optimal number of clusters. What we need
+to do is to find the inflection point in the graph, where the inertia
+value starts to decrease more slowly (that is, where the slope of the
+line almost reaches a 45-degree angle). Finding the right **inflection
+point** can be a bit tricky. If you picture this line chart as an arm,
+what we want is to find the center of the Elbow (now you know where the
+name for this method comes from). So, looking at our example, we will
+say that the optimal number of clusters is three. If we kept adding more
+clusters, the inertia would not decrease much further, so the extra
+clusters would add little value.
+This is the reason why we want to find the middle of the Elbow as the
+inflection point.
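+
+If you prefer a rough programmatic check rather than eyeballing the
+chart, one simple (and admittedly crude) heuristic is to look for the
+point where the drop in inertia slows down the most, using second
+differences. This is only a sketch and assumes the `clusters` DataFrame
+built above:
+
+```
+import numpy as np
+
+# How much inertia drops each time we add one more cluster
+drops = np.diff(clusters['inertia'])
+# The elbow is roughly where the slope changes the most
+elbow_index = np.argmax(np.diff(drops)) + 1
+print('Suggested number of clusters:', \
+      clusters['cluster_range'].iloc[elbow_index])
+```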
+
+Now let\'s retrain our `KMeans` model with this hyperparameter and
+plot the clusters as shown in the following code snippet:
+
+```
+kmeans = KMeans(random_state=42, n_clusters=3)
+kmeans.fit(X)
+df['cluster2'] = kmeans.predict(X)
+scatter_plot.encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster2:N', \
+ tooltip=['Postcode', 'cluster', \
+ 'Average net tax', \
+ 'Average total deductions'])\
+ .interactive()
+```
+You should get the following output:
+
+
+
+Caption: Scatter plot of the three clusters
+
+This is very different compared to our initial results. Looking at the
+three clusters, we can see that:
+
+- The first cluster (red) represents postcodes with low values for
+ both average net tax and total deductions.
+
+- The second cluster (blue) is for medium average net tax and low
+ average total deductions.
+
+- The third cluster (orange) is grouping all postcodes with average
+ net tax values above 35,000.
+
+ Note
+
+ It is worth noticing that the data points are more spread in the
+ third cluster; this may indicate that there are some outliers in
+ this group.
+
+This example showed us how important it is to define the right number of
+clusters before training a k-means algorithm if we want to get
+meaningful groups from data. We used a method called the Elbow method to
+find this optimal number.
+
+
+
+Exercise 5.03: Finding the Optimal Number of Clusters
+-----------------------------------------------------
+
+In this exercise, we will apply the Elbow method to the same data as in
+*Exercise 5.02*, *Clustering Australian Postcodes by Business Income and
+Expenses*, to find the optimal number of clusters, before fitting a
+k-means model:
+
+1. Open a new Colab notebook for this exercise.
+
+2. Now `import` the required packages (`pandas`,
+ `sklearn`, and `altair`):
+
+ ```
+ import pandas as pd
+ from sklearn.cluster import KMeans
+ import altair as alt
+ ```
+
+
+ Next, we will load the dataset and select the same columns as in
+ *Exercise 5.02*, *Clustering Australian Postcodes by Business Income
+ and Expenses*, and print the first five rows.
+
+3. Assign the link to the ATO dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab05/DataSet/taxstats2015.csv'
+ ```
+
+
+4. Using the `.read_csv()` method from the pandas package,
+    load the dataset with only the following columns using the
+    `usecols` parameter: `'Postcode'`,
+ `'Average total business income'`, and
+ `'Average total business expenses'`:
+ ```
+ df = pd.read_csv(file_url, \
+ usecols=['Postcode', \
+ 'Average total business income', \
+ 'Average total business expenses'])
+ ```
+
+
+5. Display the first five rows of the DataFrame with the
+ `.head()` method from the pandas package:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the ATO DataFrame
+
+6. Assign the `'Average total business income'` and
+ `'Average total business expenses'` columns to a new
+ variable called `X`:
+ ```
+ X = df[['Average total business income', \
+ 'Average total business expenses']]
+ ```
+
+
+7. Create an empty pandas DataFrame called `clusters` and an
+ empty list called `inertia`:
+
+ ```
+ clusters = pd.DataFrame()
+ inertia = []
+ ```
+
+
+    Now, use the `range` function to generate the range of
+    cluster numbers to evaluate, from `1` to
+    `14` (that is, `range(1, 15)`), and assign it to a new column
+    called `'cluster_range'` in the `clusters`
+    DataFrame:
+
+ ```
+ clusters['cluster_range'] = range(1, 15)
+ ```
+
+
+8. Create a `for` loop to go through each cluster number and
+ fit a k-means model accordingly, then append the `inertia`
+ values using the `'inertia_'` parameter with the
+ `'inertia'` list:
+ ```
+ for k in clusters['cluster_range']:
+ kmeans = KMeans(n_clusters=k).fit(X)
+ inertia.append(kmeans.inertia_)
+ ```
+
+
+9. Assign the `inertia` list to a new column called
+ `'inertia'` from the `clusters` DataFrame and
+ display its content:
+
+ ```
+ clusters['inertia'] = inertia
+ clusters
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+    Caption: DataFrame containing the inertia values for our clusters
+
+10. Now use `mark_line()` and `encode()` from the
+ `altair` package to plot the Elbow graph with
+ `'cluster_range'` as the x-axis and `'inertia'`
+ as the y-axis:
+
+ ```
+ alt.Chart(clusters).mark_line()\
+ .encode(alt.X('cluster_range'), alt.Y('inertia'))
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Plotting the Elbow method
+
+11. Looking at the Elbow plot, identify the optimal number of clusters,
+ and assign this value to a variable called
+ `optim_cluster`:
+ ```
+ optim_cluster = 4
+ ```
+
+
+12. Train a k-means model with this number of clusters and a
+ `random_state` value of `42` using the
+ `fit` method from `sklearn`:
+ ```
+ kmeans = KMeans(random_state=42, n_clusters=optim_cluster)
+ kmeans.fit(X)
+ ```
+
+
+13. Now, using the `predict` method from `sklearn`,
+ get the predicted assigned cluster for each data point contained in
+ the `X` variable and save the results into a new column
+ called `'cluster2'` from the `df` DataFrame:
+ ```
+ df['cluster2'] = kmeans.predict(X)
+ ```
+
+
+14. Display the first five rows of the `df` DataFrame using
+ the `head` method from the `pandas` package:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows with the cluster predictions
+
+15. Now plot the scatter plot using the `mark_circle()` and
+ `encode()` methods from the `altair` package.
+    Also, to add interactivity, use the `tooltip` parameter
+    and the `interactive()` method from the `altair`
+    package as shown in the following code snippet:
+
+ ```
+ alt.Chart(df).mark_circle()\
+ .encode\
+ (x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster2:N', \
+ tooltip=['Postcode', 'cluster2', \
+ 'Average total business income',\
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+Initializing Clusters
+=====================
+
+
+Since the beginning of this lab, we\'ve been referring to k-means
+every time we\'ve fitted our clustering algorithms. But you may have
+noticed in each model summary that there was a hyperparameter called
+`init` with the default value as k-means++. We were, in fact,
+using k-means++ all this time.
+
+The difference between k-means and k-means++ is in how they initialize
+clusters at the start of the training. k-means randomly chooses the
+center of each cluster (called the **centroid**) and then assigns each
+data point to its nearest cluster. If this cluster initialization is
+chosen incorrectly, this may lead to non-optimal grouping at the end of
+the training process. For example, in the following graph, we can
+clearly see the three natural groupings of the data, but the algorithm
+didn\'t succeed in identifying them properly:
+
+
+
+Caption: Example of non-optimal clusters being found
+
+k-means++ is an attempt to find better clusters at initialization time.
+The idea behind it is to choose the first centroid randomly and then
+pick each subsequent centroid from the remaining data points, using a
+probability distribution that favors points farther away from the
+centroids already chosen. Even though k-means++ tends to get better
+results compared to the original k-means, in some cases, it can still
+lead to non-optimal clustering.
+
+Another hyperparameter data scientists can use to lower the risk of
+incorrect clusters is `n_init`. This corresponds to the number
+of times k-means is run with different initializations, the final model
+being the best run. So, if you have a high number for this
+hyperparameter, you will have a higher chance of finding the optimal
+clusters, but the downside is that the training time will be longer. So,
+you have to choose this value carefully, especially if you have a large
+dataset.
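+
+As a quick illustration of this trade-off (a sketch only, reusing the
+`X` variable from the running example), you could compare the final
+inertia obtained with different `n_init` values:
+
+```
+from sklearn.cluster import KMeans
+
+for n in [1, 5, 10]:
+    km = KMeans(n_clusters=3, init='random', n_init=n, \
+                random_state=14).fit(X)
+    # A lower inertia means a more compact clustering solution
+    print(f'n_init={n} -> inertia={km.inertia_:.0f}')
+```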
+
+Let\'s try this out on our ATO dataset by having a look at the following
+example.
+
+Note
+
+Open the notebook you were using for *Exercise 5.01*, *Performing Your
+First Clustering Analysis on the ATO Dataset,* and earlier examples.
+Execute the code you already entered, and then continue at the end of
+the notebook with the following code.
+
+First, let\'s run only one iteration using random initialization:
+
+```
+kmeans = KMeans(random_state=14, n_clusters=3, \
+ init='random', n_init=1)
+kmeans.fit(X)
+```
+As usual, we want to visualize our clusters with a scatter plot, as
+defined in the following code snippet:
+
+```
+df['cluster3'] = kmeans.predict(X)
+alt.Chart(df).mark_circle()\
+ .encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster3:N', \
+ tooltip=['Postcode', 'cluster', \
+ 'Average net tax', \
+ 'Average total deductions']) \
+ .interactive()
+```
+You should get the following output:
+
+
+
+Caption: Clustering results with n\_init as 1 and init as random
+
+Overall, the result is very close to that of our previous run. It is
+worth noticing that the boundaries between the clusters are slightly
+different.
+
+Now let\'s try with five iterations (using the `n_init`
+hyperparameter) and k-means++ initialization (using the `init`
+hyperparameter):
+
+```
+kmeans = KMeans(random_state=14, n_clusters=3, \
+ init='k-means++', n_init=5)
+kmeans.fit(X)
+df['cluster4'] = kmeans.predict(X)
+alt.Chart(df).mark_circle()\
+ .encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster4:N', \
+ tooltip=['Postcode', 'cluster', \
+ 'Average net tax', \
+ 'Average total deductions'])\
+ .interactive()
+```
+You should get the following output:
+
+
+
+Caption: Clustering results with n\_init as 5 and init as k-means++
+
+Here, the results are very close to the original run with 10 iterations.
+This means that we didn\'t have to run so many iterations for k-means to
+converge and could have saved some time with a lower number.
+
+
+
+Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome
+--------------------------------------------------------------------------------------
+
+In this exercise, we will use the same data as in *Exercise 5.02*,
+*Clustering Australian Postcodes by Business Income and Expenses*, and
+try different values for the `init` and `n_init`
+hyperparameters and see how they affect the final clustering result:
+
+1. Open a new Colab notebook.
+
+2. Import the required packages, which are `pandas`,
+ `sklearn`, and `altair`:
+ ```
+ import pandas as pd
+ from sklearn.cluster import KMeans
+ import altair as alt
+ ```
+
+
+3. Assign the link to the ATO dataset to a variable called
+ `file_url`:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab05/DataSet/taxstats2015.csv'
+ ```
+
+
+4. Load the dataset and select the same columns as in *Exercise 5.02*,
+ *Clustering Australian Postcodes by Business Income and Expenses*,
+ and *Exercise 5.03*, *Finding the Optimal Number of Clusters*, using
+ the `read_csv()` method from the `pandas`
+ package:
+ ```
+ df = pd.read_csv(file_url, \
+ usecols=['Postcode', \
+ 'Average total business income', \
+ 'Average total business expenses'])
+ ```
+
+
+5. Assign the `'Average total business income'` and
+ `'Average total business expenses'` columns to a new
+ variable called `X`:
+ ```
+ X = df[['Average total business income', \
+ 'Average total business expenses']]
+ ```
+
+
+6. Fit a k-means model with `n_init` equal to `1`
+ and a random `init`:
+ ```
+ kmeans = KMeans(random_state=1, n_clusters=4, \
+ init='random', n_init=1)
+ kmeans.fit(X)
+ ```
+
+
+7. Using the `predict` method from the `sklearn`
+    package, predict the cluster assignment for the input variable
+    `X`, and save the results into a new column called
+    `'cluster3'` in the DataFrame:
+ ```
+ df['cluster3'] = kmeans.predict(X)
+ ```
+
+
+8. Plot the clusters using an interactive scatter plot. First, use
+ `Chart()` and `mark_circle()` from the
+ `altair` package to instantiate a scatter plot graph, as
+ shown in the following code snippet:
+ ```
+ scatter_plot = alt.Chart(df).mark_circle()
+ ```
+
+
+9. Use the `encode` and `interactive` methods from
+ `altair` to specify the display of the scatter plot and
+ its interactivity options with the following parameters:
+
+ Provide the name of the `'Average total business income'`
+ column to the `x` parameter (x-axis).
+
+ Provide the name of the
+ `'Average total business expenses'` column to the
+ `y` parameter (y-axis).
+
+ Provide the name of the `'cluster3:N'` column to the
+ `color` parameter (which defines the different colors for
+ each group).
+
+ Provide these column names -- `'Postcode'`,
+ `'cluster3'`, `'Average total business income'`,
+ and `'Average total business expenses'` -- to the
+ `tooltip` parameter:
+
+ ```
+ scatter_plot.encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster3:N', \
+ tooltip=['Postcode', 'cluster3', \
+ 'Average total business income', \
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Clustering results with n\_init as 1 and init as random
+
+10. Repeat *Steps 6* to *9* but with different k-means hyperparameters,
+ `n_init=10` and random `init`, as shown in the
+ following code snippet:
+
+ ```
+ kmeans = KMeans(random_state=1, n_clusters=4, \
+ init='random', n_init=10)
+ kmeans.fit(X)
+ df['cluster4'] = kmeans.predict(X)
+ scatter_plot = alt.Chart(df).mark_circle()
+ scatter_plot.encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster4:N',
+ tooltip=['Postcode', 'cluster4', \
+ 'Average total business income', \
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Clustering results with n\_init as 10 and init as
+ random
+
+11. Again, repeat *Steps 6* to *9* but with different k-means
+ hyperparameters -- `n_init=100` and random
+ `init`:
+
+ ```
+ kmeans = KMeans(random_state=1, n_clusters=4, \
+ init='random', n_init=100)
+ kmeans.fit(X)
+ df['cluster5'] = kmeans.predict(X)
+ scatter_plot = alt.Chart(df).mark_circle()
+ scatter_plot.encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster5:N', \
+ tooltip=['Postcode', 'cluster5', \
+ 'Average total business income', \
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+Caption: Clustering results with n\_init as 100 and init as random
+
+
+
+Calculating the Distance to the Centroid
+========================================
+
+
+We\'ve talked a lot about similarities between data points in the
+previous sections, but we haven\'t really defined what this means. You
+have probably guessed that it has something to do with how close or how
+far observations are from each other. You are heading in the right
+direction. It has to do with some sort of distance measure between two
+points. The one used by k-means is called **squared Euclidean distance**
+and its formula is:
+
+
+
+Caption: The squared Euclidean distance formula
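+
+Written out in LaTeX notation (where x_i and y_i denote the *i*-th
+coordinate of each point), the formula is:
+
+```
+d(x, y)^2 = \sum_{i} (x_i - y_i)^2
+```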
+
+If you don\'t have a statistical background, this formula may look
+intimidating, but it is actually very simple. It is the sum of the
+squared differences between the data coordinates. Here, *x* and *y* are
+two data points and the index, *i*, runs over the coordinates. If the
+data has two dimensions, *i* takes the values 1 and 2; similarly, if
+there are three dimensions, *i* goes from 1 to 3.
+
+Let\'s apply this formula to the ATO dataset.
+
+First, we will grab the values needed -- that is, the coordinates from
+the first two observations -- and print them:
+
+Note
+
+Open the notebook you were using for *Exercise 5.01*, *Performing Your
+First Clustering Analysis on the ATO Dataset*, and earlier examples.
+Execute the code you already entered, and then continue at the end of
+the notebook with the following code.
+
+```
+x = X.iloc[0,].values
+y = X.iloc[1,].values
+print(x)
+print(y)
+```
+You should get the following output:
+
+
+
+Caption: Extracting the first two observations from the ATO dataset
+
+Note
+
+In pandas, the `iloc` method is used to subset the rows or
+columns of a DataFrame by index. For instance, if we wanted to grab row
+number 888 and column number 6, we would use the following syntax:
+`dataframe.iloc[888, 6]`.
+
+The coordinates for `x` are `(27555, 2071)` and the
+coordinates for `y` are `(28142, 3804)`. Here, the
+formula is telling us to calculate the squared difference between each
+axis of the two data points and sum them:
+
+```
+squared_euclidean = (x[0] - y[0])**2 + (x[1] - y[1])**2
+print(squared_euclidean)
+```
+You should get the following output:
+
+```
+3347858
+```
+k-means uses this metric to calculate the distance between each data
+point and the center of its assigned cluster (also called the centroid).
+Here is the basic logic behind this algorithm:
+
+1. Choose the centers of the clusters (the centroids) randomly.
+2. Assign each data point to the nearest centroid using the squared
+ Euclidean distance.
+3. Update each centroid\'s coordinates to the newly calculated center
+ of the data points assigned to it.
+4. Repeat *Steps 2* and *3* until the clusters converge (that is, until
+ the cluster assignment doesn\'t change anymore) or until the maximum
+ number of iterations has been reached.
+
+That\'s it. The k-means algorithm is as simple as that. We can extract
+the centroids after fitting a k-means model with
+`cluster_centers_`.
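+
+To make the four steps above concrete, here is a minimal from-scratch
+sketch of that loop written with NumPy. It is for illustration only (the
+function name and variables are made up, and it does not handle edge
+cases such as empty clusters); in practice, you would rely on
+scikit-learn\'s optimized implementation:
+
+```
+import numpy as np
+
+def simple_kmeans(points, k, n_iterations=10, seed=42):
+    rng = np.random.RandomState(seed)
+    # Step 1: pick k random observations as the initial centroids
+    centroids = points[rng.choice(len(points), k, replace=False)]
+    for _ in range(n_iterations):
+        # Step 2: assign each point to its nearest centroid
+        # using the squared Euclidean distance
+        distances = ((points[:, None, :] \
+                      - centroids[None, :, :]) ** 2).sum(axis=2)
+        labels = distances.argmin(axis=1)
+        # Step 3: move each centroid to the mean of its assigned points
+        centroids = np.array([points[labels == i].mean(axis=0) \
+                              for i in range(k)])
+    # Step 4 is simplified here to a fixed number of iterations
+    return labels, centroids
+
+labels, centroids = simple_kmeans(X.values, k=3)
+```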
+
+Let\'s see how we can plot the centroids in an example.
+
+First, we fit a k-means model as shown in the following code snippet:
+
+```
+kmeans = KMeans(random_state=42, n_clusters=3, \
+ init='k-means++', n_init=5)
+kmeans.fit(X)
+df['cluster6'] = kmeans.predict(X)
+```
+Now extract the `centroids` into a DataFrame and print them:
+
+```
+centroids = kmeans.cluster_centers_
+centroids = pd.DataFrame(centroids, \
+ columns=['Average net tax', \
+ 'Average total deductions'])
+print(centroids)
+```
+You should get the following output:
+
+
+
+Caption: Coordinates of the three centroids
+
+We will plot the usual scatter plot but will assign it to a variable
+called `chart1`:
+
+```
+chart1 = alt.Chart(df).mark_circle()\
+ .encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster6:N', \
+ tooltip=['Postcode', 'cluster6', \
+ 'Average net tax', \
+ 'Average total deductions'])\
+ .interactive()
+chart1
+```
+You should get the following output:
+
+
+
+Caption: Scatter plot of the clusters
+
+Now, create a second scatter plot, called `chart2`, just for the
+centroids:
+
+```
+chart2 = alt.Chart(centroids).mark_circle(size=100)\
+ .encode(x='Average net tax', \
+ y='Average total deductions', \
+ color=alt.value('black'), \
+ tooltip=['Average net tax', \
+ 'Average total deductions'])\
+ .interactive()
+chart2
+```
+You should get the following output:
+
+
+
+Caption: Scatter plot of the centroids
+
+And now we combine the two charts, which is extremely easy with
+`altair`:
+
+```
+chart1 + chart2
+```
+You should get the following output:
+
+
+
+Caption: Scatter plot of the clusters and their centroids
+
+Now we can easily see which centroids the observations are closest to.
+
+
+
+Exercise 5.05: Finding the Closest Centroids in Our Dataset
+-----------------------------------------------------------
+
+In this exercise, we will be coding the first iteration of k-means in
+order to assign data points to their closest cluster centroids. The
+following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Now `import` the required packages, which are
+ `pandas`, `sklearn`, and `altair`:
+ ```
+ import pandas as pd
+ from sklearn.cluster import KMeans
+ import altair as alt
+ ```
+
+
+3. Load the dataset and select the same columns as in *Exercise 5.02*,
+ *Clustering Australian Postcodes by Business Income and Expenses*,
+ using the `read_csv()` method from the `pandas`
+ package:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab05/DataSet/taxstats2015.csv'
+ df = pd.read_csv(file_url, \
+ usecols=['Postcode', \
+ 'Average total business income', \
+ 'Average total business expenses'])
+ ```
+
+
+4. Assign the `'Average total business income'` and
+ `'Average total business expenses'` columns to a new
+ variable called `X`:
+ ```
+ X = df[['Average total business income', \
+ 'Average total business expenses']]
+ ```
+
+
+5. Now, calculate the minimum and maximum values of the
+    `'Average total business income'` and
+    `'Average total business expenses'` columns using the
+    `min()` and `max()` methods, as shown in
+    the following code snippet:
+ ```
+ business_income_min = df['Average total business income'].min()
+ business_income_max = df['Average total business income'].max()
+ business_expenses_min = df['Average total business expenses']\
+ .min()
+ business_expenses_max = df['Average total business expenses']\
+ .max()
+ ```
+
+
+6. Print the values of these four variables, which are the minimum and
+ maximum values of the two variables:
+
+ ```
+ print(business_income_min)
+ print(business_income_max)
+ print(business_expenses_min)
+ print(business_expenses_max)
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 0
+ 876324
+ 0
+ 884659
+ ```
+
+
+7. Now import the `random` package and use the
+ `seed()` method to set a seed of `42`, as shown
+ in the following code snippet:
+ ```
+ import random
+ random.seed(42)
+ ```
+
+
+8. Create an empty pandas DataFrame and assign it to a variable called
+ `centroids`:
+ ```
+ centroids = pd.DataFrame()
+ ```
+
+
+9. Generate four random values using the `sample()` method
+ from the `random` package with possible values between the
+ minimum and maximum values of the
+ `'Average total business expenses'` column using
+ `range()` and store the results in a new column called
+ `'Average total business income'` from the
+ `centroids` DataFrame:
+ ```
+ centroids\
+ ['Average total business income'] = random.sample\
+ (range\
+ (business_income_min, \
+ business_income_max), 4)
+ ```
+
+
+10. Repeat the same process to generate `4` random values for
+ `'Average total business expenses'`:
+ ```
+ centroids\
+ ['Average total business expenses'] = random.sample\
+ (range\
+ (business_expenses_min,\
+ business_expenses_max), 4)
+ ```
+
+
+11. Create a new column called `'cluster'` in the
+    `centroids` DataFrame using the
+    `.index` attribute from the pandas package and print this
+    DataFrame:
+
+ ```
+ centroids['cluster'] = centroids.index
+ centroids
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Coordinates of the four random centroids
+
+12. Create a scatter plot with the `altair` package to display
+ the data contained in the `df` DataFrame and save it in a
+ variable called `'chart1'`:
+ ```
+ chart1 = alt.Chart(df.head()).mark_circle()\
+ .encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color=alt.value('orange'), \
+ tooltip=['Postcode', \
+ 'Average total business income', \
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+13. Now create a second scatter plot using the `altair`
+ package to display the centroids and save it in a variable called
+ `'chart2'`:
+ ```
+ chart2 = alt.Chart(centroids).mark_circle(size=100)\
+ .encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color=alt.value('black'), \
+ tooltip=['cluster', \
+ 'Average total business income',\
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+14. Display the two charts together using the `altair` syntax
+    `chart1 + chart2`:
+
+ ```
+ chart1 + chart2
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Scatter plot of the random centroids and the first five
+ observations
+
+15. Define a function that will calculate the
+ `squared_euclidean` distance and return its value. This
+ function will take the `x` and `y` coordinates
+ of a data point and a centroid:
+ ```
+    def squared_euclidean(data_x, data_y, \
+                          centroid_x, centroid_y):
+        return (data_x - centroid_x)**2 + (data_y - centroid_y)**2
+ ```
+
+
+16. Using the `.at` method from the pandas package, extract
+ the first row\'s `x` and `y` coordinates and
+ save them in two variables called `data_x` and
+ `data_y`:
+ ```
+ data_x = df.at[0, 'Average total business income']
+ data_y = df.at[0, 'Average total business expenses']
+ ```
+
+
+17. Using a `for` loop or list comprehension, calculate the
+ `squared_euclidean` distance of the first observation
+ (using its `data_x` and `data_y` coordinates)
+ against the `4` different centroids contained in
+    `centroids`, save the results in a variable called
+    `distances`, and display them:
+
+ ```
+ distances = [squared_euclidean\
+ (data_x, data_y, centroids.at\
+ [i, 'Average total business income'], \
+ centroids.at[i, \
+ 'Average total business expenses']) \
+ for i in range(4)]
+ distances
+ ```
+
+
+ You should get the following output:
+
+ ```
+ [215601466600, 10063365460, 34245932020, 326873037866]
+ ```
+
+
+18. Use the `index` method from the list containing the
+ `squared_euclidean` distances to find the cluster with the
+ shortest distance, as shown in the following code snippet:
+ ```
+ cluster_index = distances.index(min(distances))
+ ```
+
+
+19. Save the `cluster` index in a column called
+ `'cluster'` from the `df` DataFrame for the
+ first observation using the `.at` method from the pandas
+ package:
+ ```
+ df.at[0, 'cluster'] = cluster_index
+ ```
+
+
+20. Display the first five rows of `df` using the
+ `head()` method from the `pandas` package:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the ATO DataFrame with the
+ assigned cluster number for the first row
+
+21. Repeat *Steps 16* to *19* for the next `4` rows to
+ calculate their distances from the centroids and find the cluster
+ with the smallest distance value:
+
+ ```
+ distances = [squared_euclidean\
+ (df.at[1, 'Average total business income'], \
+ df.at[1, 'Average total business expenses'], \
+ centroids.at[i, 'Average total business income'],\
+ centroids.at[i, \
+ 'Average total business expenses'])\
+ for i in range(4)]
+ df.at[1, 'cluster'] = distances.index(min(distances))
+ distances = [squared_euclidean\
+ (df.at[2, 'Average total business income'], \
+ df.at[2, 'Average total business expenses'], \
+ centroids.at[i, 'Average total business income'],\
+ centroids.at[i, \
+ 'Average total business expenses'])\
+ for i in range(4)]
+ df.at[2, 'cluster'] = distances.index(min(distances))
+ distances = [squared_euclidean\
+ (df.at[3, 'Average total business income'], \
+ df.at[3, 'Average total business expenses'], \
+ centroids.at[i, 'Average total business income'],\
+ centroids.at[i, \
+ 'Average total business expenses'])\
+ for i in range(4)]
+ df.at[3, 'cluster'] = distances.index(min(distances))
+ distances = [squared_euclidean\
+ (df.at[4, 'Average total business income'], \
+ df.at[4, 'Average total business expenses'], \
+ centroids.at[i, \
+ 'Average total business income'], \
+ centroids.at[i, \
+ 'Average total business expenses']) \
+ for i in range(4)]
+ df.at[4, 'cluster'] = distances.index(min(distances))
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of the ATO DataFrame and their
+ assigned clusters
+
+22. Finally, plot the centroids and the first `5` rows of the
+ dataset using the `altair` package as in *Steps 12* to
+ *13*:
+
+ ```
+ chart1 = alt.Chart(df.head()).mark_circle()\
+ .encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster:N', \
+ tooltip=['Postcode', 'cluster', \
+ 'Average total business income', \
+ 'Average total business expenses'])\
+ .interactive()
+ chart2 = alt.Chart(centroids).mark_circle(size=100)\
+ .encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color=alt.value('black'), \
+ tooltip=['cluster', \
+ 'Average total business income',\
+ 'Average total business expenses'])\
+ .interactive()
+ chart1 + chart2
+ ```
+
+
+ You should get the following output:
+
+
+
+Caption: Scatter plot of the random centroids and the first five observations
+
+
+Standardizing Data
+==================
+
+
+You\'ve already learned a lot about the k-means algorithm, and we are
+close to the end of this lab. In this final section, we will not
+talk about another hyperparameter (you\'ve already been through the main
+ones) but a very important topic: **data processing**.
+
+Fitting a k-means algorithm is extremely easy. The trickiest part is
+making sure the resulting clusters are meaningful for your project, and
+we have seen how we can tune some hyperparameters to ensure this. But
+handling input data is as important as all the steps you have learned
+about so far. If your dataset is not well prepared, even if you find the
+best hyperparameters, you will still get some bad results.
+
+Let\'s have another look at our ATO dataset. In the previous section,
+*Calculating the Distance to the Centroid*, we found three different
+clusters, and they were mainly defined by the
+`'Average net tax'` variable. It was as if k-means didn\'t
+take into account the second variable,
+`'Average total deductions'`, at all. This is in fact due to
+these two variables having very different ranges of values and the way
+that squared Euclidean distance is calculated.
+
+Squared Euclidean distance is weighted more toward high-value variables.
+Let\'s take an example to illustrate this point with two data points
+called A and B with respective x and y coordinates of (1, 50000) and
+(100, 100000). The squared Euclidean distance between A and B will be
+(100000 - 50000)\^2 + (100 - 1)\^2. We can clearly see that the result
+will be mainly driven by the difference between 100,000 and 50,000:
+50,000\^2. The difference of 100 minus 1 (99\^2) will account for very
+little in the final result.
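+
+You can check this with a quick throwaway calculation using the made-up
+points A and B from above:
+
+```
+a = (1, 50000)
+b = (100, 100000)
+
+y_term = (b[1] - a[1]) ** 2  # (100000 - 50000)^2 = 2,500,000,000
+x_term = (b[0] - a[0]) ** 2  # (100 - 1)^2 = 9,801
+# The x-axis contributes only ~0.0004% of the total distance
+print(x_term / (x_term + y_term))
+```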
+
+But if you look at the ratio between 100,000 and 50,000, it is a factor
+of 2 (100,000 / 50,000 = 2), while the ratio between 100 and 1 is a
+factor of 100 (100 / 1 = 100). Does it make sense for the higher-value
+variable to \"dominate\" the clustering result? It really depends on
+your project, and this situation may be intended. But if you want things
+to be fair between the different axes, it\'s preferable to bring them
+all into a similar range of values before fitting a k-means model. This
+is the reason why you should always consider standardizing your data
+before running your k-means algorithm.
+
+There are multiple ways to standardize data, and we will have a look at
+the two most popular ones: **min-max scaling** and **z-score**. Luckily
+for us, the `sklearn` package has an implementation for both
+methods.
+
+The formula for min-max scaling is very simple: on each axis, you
+subtract the axis minimum from each data point and divide the result by
+the difference between the maximum and minimum values. The scaled data
+will have values ranging between 0 and 1:
+
+
+
+Caption: Min-max scaling formula
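+
+The same transformation can also be written by hand with pandas (just a
+sketch to show what `MinMaxScaler` will do in the next example;
+`X` is the DataFrame defined earlier):
+
+```
+# Manual min-max scaling: subtract the column minimum, divide by the range
+X_manual_min_max = (X - X.min()) / (X.max() - X.min())
+X_manual_min_max.describe()
+```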
+
+Let\'s look at min-max scaling with `sklearn` in the following
+example.
+
+Note
+
+Open the notebook you were using for *Exercise 5.01*, *Performing Your
+First Clustering Analysis on the ATO Dataset*, and earlier examples.
+Execute the code you already entered, and then continue at the end of
+the notebook with the following code.
+
+First, we import the relevant class and instantiate an object:
+
+```
+from sklearn.preprocessing import MinMaxScaler
+min_max_scaler = MinMaxScaler()
+```
+
+Then, we fit it to our dataset:
+
+```
+min_max_scaler.fit(X)
+```
+You should get the following output:
+
+
+
+Caption: Min-max scaling summary
+
+And finally, call the `transform()` method to standardize the
+data:
+
+```
+X_min_max = min_max_scaler.transform(X)
+X_min_max
+```
+You should get the following output:
+
+
+
+Caption: Min-max-scaled data
+
+Now we print the minimum and maximum values of the min-max-scaled data
+for both axes:
+
+```
+X_min_max[:,0].min(), X_min_max[:,0].max(), \
+X_min_max[:,1].min(), X_min_max[:,1].max()
+```
+You should get the following output:
+
+
+
+Caption: Minimum and maximum values of the min-max-scaled data
+
+We can see that both axes now have their values sitting between 0 and 1.
+
+The **z-score** is calculated by subtracting the mean from each
+data point and dividing the result by the standard deviation for each
+axis. The distribution of the standardized data will have a mean of 0
+and a standard deviation of 1:
+
+
+
+Caption: Z-score formula
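+
+Again, a hand-rolled equivalent can help make the formula concrete (a
+sketch only; note that `StandardScaler` uses the population standard
+deviation, so `ddof=0` is passed to match it):
+
+```
+# Manual z-score standardization: subtract the mean, divide by the std
+X_manual_scaled = (X - X.mean()) / X.std(ddof=0)
+X_manual_scaled.mean(), X_manual_scaled.std(ddof=0)
+```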
+
+To apply it with `sklearn`, first, we have to import the
+relevant `StandardScaler` class and instantiate an object:
+
+```
+from sklearn.preprocessing import StandardScaler
+standard_scaler = StandardScaler()
+```
+This time, instead of calling `fit()` and then
+`transform()`, we use the `fit_transform()` method:
+
+```
+X_scaled = standard_scaler.fit_transform(X)
+X_scaled
+```
+You should get the following output:
+
+
+
+Caption: Z-score-standardized data
+
+Now we\'ll look at the minimum and maximum values for each axis:
+
+```
+X_scaled[:,0].min(), X_scaled[:,0].max(), \
+X_scaled[:,1].min(), X_scaled[:,1].max()
+```
+You should get the following output:
+
+
+
+Caption: Minimum and maximum values of the z-score-standardized data
+
+The value ranges for both axes are much lower now and we can see that
+their maximum values are around 9 and 18, which indicates that there are
+some extreme outliers in the data.
+
+Now, fit a k-means model on the z-score-standardized data and plot a
+scatter plot with the following code snippet:
+
+```
+kmeans = KMeans(random_state=42, n_clusters=3, \
+ init='k-means++', n_init=5)
+kmeans.fit(X_scaled)
+df['cluster7'] = kmeans.predict(X_scaled)
+alt.Chart(df).mark_circle()\
+ .encode(x='Average net tax', \
+ y='Average total deductions', \
+ color='cluster7:N', \
+ tooltip=['Postcode', 'cluster7', \
+ 'Average net tax', \
+ 'Average total deductions'])\
+ .interactive()
+```
+You should get the following output:
+
+
+
+Caption: Scatter plot of the standardized data
+
+The k-means results on the standardized data are very different. Now we
+can see that there are two main clusters (blue and red) and their
+boundaries are not straight vertical lines anymore but diagonal. So,
+k-means is actually taking into consideration both axes now. The orange
+cluster contains much fewer data points compared to previous iterations,
+and it seems it is grouping all the extreme outliers with high values
+together. If your project was about detecting anomalies, you would have
+found a way here to easily separate outliers from \"normal\"
+observations.
+
+
+
+Exercise 5.06: Standardizing the Data from Our Dataset
+------------------------------------------------------
+
+In this final exercise, we will standardize the data using min-max
+scaling and the z-score and fit a k-means model for each method and see
+their impact on k-means:
+
+1. Open a new Colab notebook.
+
+2. Now import the required `pandas`, `sklearn`, and
+ `altair` packages:
+ ```
+ import pandas as pd
+ from sklearn.cluster import KMeans
+ import altair as alt
+ ```
+
+
+3. Load the dataset and select the same columns as in *Exercise 5.02*,
+ *Clustering Australian Postcodes by Business Income and Expenses*,
+ using the `read_csv()` method from the `pandas`
+ package:
+ ```
+ file_url = 'https://raw.githubusercontent.com'\
+ '/fenago/data-science'\
+ '/master/Lab05/DataSet/taxstats2015.csv'
+ df = pd.read_csv(file_url, \
+ usecols=['Postcode', \
+ 'Average total business income', \
+ 'Average total business expenses'])
+ ```
+
+
+4. Assign the `'Average total business income'` and
+ `'Average total business expenses'` columns to a new
+ variable called `X`:
+ ```
+ X = df[['Average total business income', \
+ 'Average total business expenses']]
+ ```
+
+
+5. Import the `MinMaxScaler` and `StandardScaler`
+ classes from `sklearn`:
+ ```
+ from sklearn.preprocessing import MinMaxScaler
+ from sklearn.preprocessing import StandardScaler
+ ```
+
+
+6. Instantiate and fit `MinMaxScaler` with the data:
+
+ ```
+ min_max_scaler = MinMaxScaler()
+ min_max_scaler.fit(X)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Summary of the min-max scaler
+
+7. Perform the min-max scaling transformation and save the data into a
+ new variable called `X_min_max`:
+
+ ```
+ X_min_max = min_max_scaler.transform(X)
+ X_min_max
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Min-max-scaled data
+
+8. Fit a k-means model on the scaled data with the following
+ hyperparameters: `random_state=1`,
+ `n_clusters=4, init='k-means++', n_init=5`, as shown in
+ the following code snippet:
+ ```
+ kmeans = KMeans(random_state=1, n_clusters=4, \
+ init='k-means++', n_init=5)
+ kmeans.fit(X_min_max)
+ ```
+
+
+9. Assign the k-means predictions for `X_min_max` to a new
+    column called `'cluster8'` in the `df`
+    DataFrame:
+ ```
+ df['cluster8'] = kmeans.predict(X_min_max)
+ ```
+
+
+10. Plot the k-means results into a scatter plot using the
+ `altair` package:
+
+ ```
+ scatter_plot = alt.Chart(df).mark_circle()
+ scatter_plot.encode(x='Average total business income', \
+ y='Average total business expenses',\
+ color='cluster8:N',\
+ tooltip=['Postcode', 'cluster8', \
+ 'Average total business income',\
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Scatter plot of k-means results using the
+ min-max-scaled data
+
+11. Re-train the k-means model but on the z-score-standardized data with
+ the same hyperparameter values,
+ `random_state=1, n_clusters=4, init='k-means++', n_init=5`:
+ ```
+ standard_scaler = StandardScaler()
+ X_scaled = standard_scaler.fit_transform(X)
+ kmeans = KMeans(random_state=1, n_clusters=4, \
+ init='k-means++', n_init=5)
+ kmeans.fit(X_scaled)
+ ```
+
+
+12. Assign the k-means predictions of each value of `X_scaled`
+    to a new column called `'cluster9'` in the `df`
+    DataFrame:
+ ```
+ df['cluster9'] = kmeans.predict(X_scaled)
+ ```
+
+
+13. Plot the k-means results in a scatter plot using the
+ `altair` package:
+
+ ```
+ scatter_plot = alt.Chart(df).mark_circle()
+ scatter_plot.encode(x='Average total business income', \
+ y='Average total business expenses', \
+ color='cluster9:N', \
+ tooltip=['Postcode', 'cluster9', \
+ 'Average total business income',\
+ 'Average total business expenses'])\
+ .interactive()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+
+Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means
+-----------------------------------------------------------------------------
+
+You are working for an international bank. The credit department is
+reviewing its offerings and wants to get a better understanding of its
+current customers. You have been tasked with performing customer
+segmentation analysis. You will perform cluster analysis with k-means to
+identify groups of similar customers.
+
+The following steps will help you complete this activity:
+
+1. Download the dataset and load it into Python.
+
+2. Read the CSV file using the `read_csv()` method.
+
+ Note
+
+    This dataset is in the `.dat` file format. You can still
+    load the file using `read_csv()`, but you will need to
+    specify the following parameters:
+    `header=None`, `sep='\s\s+'`, and `prefix='X'`.
+
+3. You will be using the fourth and tenth columns (`X3` and
+ `X9`). Extract these.
+
+4. Perform data standardization by instantiating a
+ `StandardScaler` object.
+
+5. Analyze and define the optimal number of clusters.
+
+6. Fit a k-means algorithm with the number of clusters you\'ve defined.
+
+7. Create a scatter plot of the clusters.
+
+ Note
+
+    This is the German Credit Dataset from the UCI Machine Learning
+    Repository. Even though all the columns in this
+ dataset are integers, most of them are actually categorical
+ variables. The data in these columns is not continuous. Only two
+ variables are really numeric. Those are the ones you will use for
+ your clustering.
+
+You should get something similar to the following output:
+
+
+
+Caption: Scatter plot of the four clusters found
+
+
+Summary
+=======
+
+
+You are now ready to perform cluster analysis with the k-means algorithm
+on your own dataset. This type of analysis is very popular in the
+industry for segmenting customer profiles as well as detecting
+suspicious transactions or anomalies.
+
+We learned about a lot of different concepts, such as centroids and
+squared Euclidean distance. We went through the main k-means
+hyperparameters: `init` (initialization method),
+`n_init` (number of initialization runs),
+`n_clusters` (number of clusters), and
+`random_state` (specified seed). We also discussed the
+importance of choosing the optimal number of clusters, initializing
+centroids properly, and standardizing data. You have learned how to use
+the following Python packages: `pandas`, `altair`, and
+`sklearn` (in particular, its `KMeans` class).
+
+In this lab, we only looked at k-means, but it is not the only
+clustering algorithm. There are quite a lot of algorithms that use
+different approaches, such as hierarchical clustering and Gaussian
+mixture models, to name a few. If
+you are interested in this field, you now have all the basic knowledge
+you need to explore these other algorithms on your own.
+
+Next, you will see how we can assess the performance of these models and
+what tools can be used to make them even better.
diff --git a/lab_guides/Lab_6.md b/lab_guides/Lab_6.md
new file mode 100644
index 0000000..00e5436
--- /dev/null
+++ b/lab_guides/Lab_6.md
@@ -0,0 +1,2357 @@
+
+6. How to Assess Performance
+============================
+
+
+
+Overview
+
+This lab will introduce you to model evaluation, where you evaluate
+or assess the performance of each model that you train before you decide
+to put it into production. By the end of this lab, you will be able
+to create an evaluation dataset. You will be equipped to assess the
+performance of linear regression models using **mean absolute error**
+(**MAE**) and **mean squared error** (**MSE**). You will also be able to
+evaluate the performance of logistic regression models using accuracy,
+precision, recall, and F1 score.
+
+
+Introduction
+============
+
+
+When you assess the performance of a model, you look at certain
+measurements or values that tell you how well the model is performing
+under certain conditions, and that helps you make an informed decision
+about whether or not to make use of the model that you have trained in
+the real world. Some of the measurements you will encounter in this
+lab are MAE, precision, recall, and the R² score.
+
+You learned how to train a regression model in *Lab 2, Regression*,
+and how to train classification models in *Lab 3, Binary
+Classification*. Consider the task of predicting whether or not a
+customer is likely to purchase a term deposit, which you addressed in
+*Lab 3, Binary Classification*. You have learned how to train a
+model to perform this sort of classification. You are now concerned with
+how useful this model might be. You might start by training one model,
+and then evaluating how often the predictions from that model are
+correct. You might then proceed to train more models and evaluate
+whether they perform better than previous models you have trained.
+
+You have already seen an example of splitting data using
+`train_test_split` in *Exercise 3.06*, *A Logistic Regression
+Model for Predicting the Propensity of Term Deposit Purchases in a
+Bank*. You will go further into the necessity and application of
+splitting data in *Lab 7, The Generalization of Machine Learning
+Models*, but for now, you should note that it is important to split your
+data into one set that is used for training a model, and a second set
+that is used for validating the model. It is this validation step that
+helps you decide whether or not to put a model into production.
+
+
+Splitting Data
+==============
+
+
+You will learn more about splitting data in *Lab 7, The
+Generalization of Machine Learning Models*, where we will cover the
+following:
+
+- Simple data splits using `train_test_split`
+- Multiple data splits using cross-validation
+
+For now, you will learn how to split data using a function from
+`sklearn` called `train_test_split`.
+
+It is very important that you do not use all of your data to train a
+model. You must set aside some data for validation, and this data must
+not have been used previously for training. When you train a model, it
+tries to generate an equation that fits your data. The longer you train,
+the more complex the equation becomes so that it passes through as many
+of the data points as possible.
+
+When you shuffle the data and set some aside for validation, it ensures
+that the model learns to not overfit the hypotheses you are trying to
+generate.
+
+
+
+Exercise 6.01: Importing and Splitting Data
+-------------------------------------------
+
+In this exercise, you will import data from a repository and split it
+into a training and an evaluation set to train a model. Splitting your
+data is required so that you can evaluate the model later. This exercise
+will get you familiar with the process of splitting data; this is
+something you will be doing frequently.
+
+Note
+
+The Car dataset that you will be using in this lab was taken from the UCI Machine Learning Repository.
+
+This dataset is about cars. A text file is provided with the following
+information:
+
+- `buying` -- the cost of purchasing this vehicle
+- `maint` -- the maintenance cost of the vehicle
+- `doors` -- the number of doors the vehicle has
+- `persons` -- the number of persons the vehicle is capable
+ of transporting
+- `lug_boot` -- the cargo capacity of the vehicle
+- `safety` -- the safety rating of the vehicle
+- `car` -- this is the category that the model attempts to
+ predict
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the required libraries:
+
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ ```
+
+
+ You started by importing a library called `pandas` in the
+ first line. This library is useful for reading files into a data
+ structure that is called a `DataFrame`, which you have
+ used in previous labs. This structure is like a spreadsheet or a
+ table with rows and columns that we can manipulate. Because you
+ might need to reference the library lots of times, we have created
+ an alias for it, `pd`.
+
+ In the second line, you import a function called
+ `train_test_split` from a module called
+ `model_selection`, which is within `sklearn`.
+ This function is what you will make use of to split the data that
+ you read in using `pandas`.
+
+3. Create a Python list:
+
+ ```
+ # data doesn't have headers, so let's create headers
+ _headers = ['buying', 'maint', 'doors', 'persons', \
+ 'lug_boot', 'safety', 'car']
+ ```
+
+
+ The data that you are reading in is stored as a CSV file.
+
+ The browser will download the file to your computer. You can open
+ the file using a text editor. If you do, you will see something
+ similar to the following:
+
+
+
+
+
+ Caption: The car dataset without headers
+
+ Note
+
+ Alternatively, you can enter the dataset URL in the browser to view
+ the dataset.
+
+ `CSV` files normally have the name of each column written
+ in the first row of the data. For instance, have a look at this
+ dataset\'s CSV file, which you used in *Lab 3, Binary
+ Classification*:
+
+
+
+
+
+    Caption: A CSV file with headers
+
+ But, in this case, the column name is missing. That is not a
+ problem, however. The code in this step creates a Python list called
+ `_headers` that contains the name of each column. You will
+ supply this list when you read in the data in the next step.
+
+4. Read the data:
+
+ ```
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/car.data', \
+ names=_headers, index_col=None)
+ ```
+
+
+ In this step, the code reads in the file using a function called
+ `read_csv`. The first parameter,
+ `'https://raw.githubusercontent.com/fenago/data-science/master/Lab06/Dataset/car.data'`,
+ is mandatory and is the location of the file. In our case, the file
+ is on the internet. It can also be optionally downloaded, and we can
+ then point to the local file\'s location.
+
+ The second parameter (`names=_headers`) asks the function
+ to add the row headers to the data after reading it in. The third
+ parameter (`index_col=None`) asks the function to generate
+ a new index for the table because the data doesn\'t contain an
+ index. The function will produce a DataFrame, which we assign to a
+ variable called `df`.
+
+5. Print out the top five records:
+
+ ```
+ df.head()
+ ```
+
+
+ The code in this step is used to print the top five rows of the
+ DataFrame. The output from that operation is shown in the following
+ screenshot:
+
+
+
+
+
+ Caption: The top five rows of the DataFrame
+
+6. Create a training and an evaluation DataFrame:
+
+ ```
+ training, evaluation = train_test_split(df, test_size=0.3, \
+ random_state=0)
+ ```
+
+
+ The preceding code will split the DataFrame containing your data
+ into two new DataFrames. The first is called `training`
+ and is used for training the model. The second is called
+ `evaluation` and will be further split into two in the
+ next step. We mentioned earlier that you must separate your dataset
+ into a training and an evaluation dataset, the former for training
+ your model and the latter for evaluating your model.
+
+ At this point, the `train_test_split` function takes two
+ parameters. The first parameter is the data we want to split. The
+ second is the ratio we would like to split it by. What we have done
+ is specified that we want our evaluation data to be 30% of our data.
+
+ Note
+
+ The third parameter random\_state is set to 0 to ensure
+ reproducibility of results.
+
+7. Create a validation and test dataset:
+
+ ```
+ validation, test = train_test_split(evaluation, test_size=0.5, \
+ random_state=0)
+ ```
+
+
+ This code is similar to the code in *Step 6*. In this step, the code
+ splits our evaluation data into two equal parts because we specified
+ `0.5`, which means `50%`.
+
+
+Assessing Model Performance for Regression Models
+=================================================
+
+
+When you create a regression model, you create a model that predicts a
+continuous numerical variable, as you learned in *Lab 2,
+Regression*. When you set aside your evaluation dataset, you have
+something that you can use to compare the quality of your model.
+
+What you need to do to assess your model quality is compare the quality
+of your prediction to what is called the ground truth, which is the
+actual observed value that you are trying to predict. Take a look at
+*Figure 6.4*, in which the first column contains the ground truth
+(called actuals) and the second column contains the predicted values:
+
+
+
+Caption: Actual versus predicted values
+
+Line `0` in the output compares the actual value in our
+evaluation dataset to what our model predicted. The actual value from
+our evaluation dataset is `4.891`. The value that the model
+predicted is `4.132270`.
+
+Line `1` compares the actual value of `4.194` to
+what the model predicted, which is `4.364320`.
+
+In practice, the evaluation dataset will contain a lot of records, so
+you will not be making this comparison visually. Instead, you will make
+use of some equations.
+
+You would carry out this comparison by computing the loss. The loss is
+the difference between the actuals and the predicted values in the
+preceding screenshot. In data mining, it is called a **distance
+measure**. There are various approaches to computing distance measures
+that give rise to different loss functions. Two of these are:
+
+- Manhattan distance
+- Euclidean distance
+
+There are various loss functions for regression, but in this book, we
+will be looking at two of the commonly used loss functions for
+regression, which are:
+
+- Mean absolute error (MAE) -- this is based on Manhattan distance
+- Mean squared error (MSE) -- this is based on Euclidean distance
+
+The goal of these functions is to measure the usefulness of your models
+by giving you a numerical value that shows how much deviation there is
+between the ground truths and the predicted values from your models.
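+
+As a minimal, hypothetical sketch (the arrays below are made-up values,
+not taken from any of this lab\'s datasets), you can compute both losses
+with functions from `sklearn.metrics`:
+
+```
+import numpy as np
+from sklearn.metrics import mean_absolute_error, mean_squared_error
+
+# hypothetical ground truths and predictions
+actuals = np.array([4.891, 4.194, 5.101, 3.800])
+predicted = np.array([4.132, 4.364, 5.020, 4.100])
+
+# MAE averages the absolute (Manhattan-style) distances
+mae = mean_absolute_error(actuals, predicted)
+# MSE averages the squared (Euclidean-style) distances
+mse = mean_squared_error(actuals, predicted)
+print('MAE: {}, MSE: {}'.format(mae, mse))
+```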
+
+Your mission is to train new models with consistently lower errors.
+Before we do that, let\'s have a quick introduction to some data
+structures.
+
+
+
+Data Structures -- Vectors and Matrices
+---------------------------------------
+
+In this section, we will look at different data structures, as follows.
+
+
+
+### Scalars
+
+A scalar variable is a simple number, such as 23. Whenever you make use
+of numbers on their own, they are scalars. You assign them to variables,
+such as in the following expression:
+
+```
+temperature = 23
+```
+If you had to store the temperature for 5 days, you would need to store
+the values in 5 different variables, such as in the following code snippet:
+
+```
+temp_1 = 23
+temp_2 = 24
+temp_3 = 23
+temp_4 = 22
+temp_5 = 22
+```
+
+In data science, you will frequently work with a large number of data
+points, such as hourly temperature measurements for an entire year. A
+more efficient way of storing lots of values is called a vector. Let\'s
+look at vectors in the next topic.
+
+
+
+### Vectors
+
+A vector is a collection of scalars. Consider the five temperatures in
+the previous code snippet. A vector is a data type that lets you collect
+all of the previous temperatures in one variable that supports
+arithmetic operations. Vectors look similar to Python lists and can be
+created from Python lists. Consider the following code snippet for
+creating a Python list:
+
+```
+temps_list = [23, 24, 23, 22, 22]
+```
+You can create a vector from the list using the `array()`
+function from `numpy` by first importing `numpy` and
+then using the following snippet:
+
+```
+import numpy as np
+temps_ndarray = np.array(temps_list)
+```
+You can proceed to verify the data type using the following code
+snippet:
+
+```
+print(type(temps_ndarray))
+```
+
+The code snippet will cause the interpreter to print out the following:
+
+
+
+Caption: The temps\_ndarray vector data type
+
+You may inspect the contents of the vector using the following code
+snippet:
+
+```
+print(temps_ndarray)
+```
+This generates the following output:
+
+
+
+Caption: The temps\_ndarray vector
+
+Note that the output contains single square brackets, `[` and
+`]`, and the numbers are separated by spaces. This is
+different from the output from a Python list, which you can obtain using
+the following code snippet:
+
+```
+print(temps_list)
+```
+
+The code snippet yields the following output:
+
+
+
+Caption: List of elements in temps\_list
+
+Note that the output contains single square brackets, `[` and
+`]`, and the numbers are separated by commas.
+
+Vectors have a shape and a dimension. Both of these can be determined by
+using the following code snippet:
+
+```
+print(temps_ndarray.shape)
+```
+
+The output is a Python data structure called a **tuple** and looks like
+this:
+
+
+
+Caption: Shape of the temps\_ndarray vector
+
+Notice that the output consists of brackets, `(` and
+`)`, with a number and a comma. The single number followed by
+a comma implies that this object has only one dimension. The value of
+the number is the number of elements. The output is read as \"a vector
+with five elements.\" This is very important because it is very
+different from a matrix, which we will discuss next.
+
+
+
+### Matrices
+
+A matrix is also made up of scalars but is different from a scalar in
+the sense that a matrix has both rows and columns.
+
+There are times when you need to convert between vectors and matrices.
+Let\'s revisit `temps_ndarray`. You may recall that it has
+five elements because the shape was `(5,)`. To convert it into
+a matrix with five rows and one column, you would use the following
+snippet:
+
+```
+temps_matrix = temps_ndarray.reshape(-1, 1)
+```
+
+The code snippet makes use of the `.reshape()` method. The
+first parameter, `-1`, instructs NumPy to infer the size of that
+dimension from the data (five rows, in this case). The second parameter,
+`1`, adds a new dimension of size one. This new dimension is the
+column. To see the new shape, use the following snippet:
+
+```
+print(temps_matrix.shape)
+```
+You will get the following output:
+
+
+
+Caption: Shape of the matrix
+
+Notice that the tuple now has two numbers, `5` and
+`1`. The first number, `5`, represents the rows, and
+the second number, `1`, represents the columns. You can print
+out the value of the matrix using the following snippet:
+
+```
+print(temps_matrix)
+```
+
+The output of the code is as follows:
+
+
+
+Caption: Elements of the matrix
+
+Notice that the output is different from that of the vector. First, we
+have an outer set of square brackets. Then, each row has its element
+enclosed in square brackets. Each row contains only one number because
+the matrix has only one column.
+
+You may reshape the matrix to contain `1` row and
+`5` columns and print out the value using the following code
+snippet:
+
+```
+print(temps_matrix.reshape(1,5))
+```
+
+The output will be as follows:
+
+
+
+Caption: Reshaping the matrix
+
+Notice that you now have all the numbers on one row because this matrix
+has one row and five columns. The outer square brackets represent the
+matrix, while the inner square brackets represent the row.
+
+Finally, you can convert the matrix back into a vector by dropping the
+column using the following snippet:
+
+```
+vector = temps_matrix.reshape(-1)
+```
+You can print out the value of the vector to confirm that you get the
+following:
+
+
+
+Caption: The value of the vector
+
+Notice that you now have only one set of square brackets. You still have
+the same number of elements.
+
+
+
+
+Exercise 6.02: Computing the R² Score of a Linear Regression Model
+-------------------------------------------------------------------
+
+As mentioned in the preceding sections, the R² score is an
+important factor in evaluating the performance of a model. Thus, in this
+exercise, we will be creating a linear regression model and then
+calculating the R² score for it.
+
+
+
+The following attributes are useful for our task:
+
+- CIC0: information indices
+- SM1\_Dz(Z): 2D matrix-based descriptors
+- GATS1i: 2D autocorrelations
+- NdsCH: atom-type counts
+- NdssC: atom-type counts
+- MLOGP: molecular properties
+- Quantitative response, LC50 \[-LOG(mol/L)\]: This attribute
+  represents the concentration that causes death in 50% of the test fish
+  (Pimephales promelas) over a test duration of 96 hours.
+
+The following steps will help you to complete the exercise:
+
+1. Open a new Colab notebook to write and execute your code.
+
+2. Next, import the libraries mentioned in the following code snippet:
+
+ ```
+ # import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression
+ ```
+
+
+ In this step, you import `pandas`, which you will use to
+ read your data. You also import `train_test_split()`,
+ which you will use to split your data into training and validation
+ sets, and you import `LinearRegression`, which you will
+ use to train your model.
+
+3. Now, read the data from the dataset:
+
+ ```
+ # column headers
+ _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
+ 'MLOGP', 'response']
+ # read in data
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/'\
+ 'qsar_fish_toxicity.csv', \
+ names=_headers, sep=';')
+ ```
+
+
+ In this step, you create a Python list to hold the names of the
+ columns in your data. You do this because the CSV file containing
+ the data does not have a first row that contains the column headers.
+ You proceed to read in the file and store it in a variable called
+ `df` using the `read_csv()` method in pandas.
+ You specify the list containing column headers by passing it into
+ the `names` parameter. This CSV uses semi-colons as column
+ separators, so you specify that using the `sep` parameter.
+ You can use `df.head()` to see what the DataFrame looks
+ like:
+
+
+
+
+
+ Caption: The first five rows of the DataFrame
+
+4. Split the data into features and labels and into training and
+ evaluation datasets:
+
+ ```
+ # Let's split our data
+ features = df.drop('response', axis=1).values
+ labels = df[['response']].values
+ X_train, X_eval, y_train, y_eval = train_test_split\
+ (features, labels, \
+ test_size=0.2, \
+ random_state=0)
+ X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
+ random_state=0)
+ ```
+
+
+ In this step, you create two `numpy` arrays called
+ `features` and `labels`. You then proceed to
+ split them twice. The first split produces a `training`
+ set and an `evaluation` set. The second split creates a
+ `validation` set and a `test` set.
+
+5. Create a linear regression model:
+
+ ```
+ model = LinearRegression()
+ ```
+
+
+ In this step, you create an instance of `LinearRegression`
+ and store it in a variable called `model`. You will make
+ use of this to train on the training dataset.
+
+6. Train the model:
+
+ ```
+ model.fit(X_train, y_train)
+ ```
+
+
+ In this step, you train the model using the `fit()` method
+ and the training dataset that you made in *Step 4*. The first
+ parameter is the `features` NumPy array, and the second
+ parameter is `labels`.
+
+ You should get an output similar to the following:
+
+
+
+
+
+ Caption: Training the model
+
+7. Make a prediction, as shown in the following code snippet:
+
+ ```
+ y_pred = model.predict(X_val)
+ ```
+
+
+ In this step, you make use of the validation dataset to make a
+ prediction. This is stored in `y_pred`.
+
+8. Compute the R² score:
+
+ ```
+ r2 = model.score(X_val, y_val)
+ print('R^2 score: {}'.format(r2))
+ ```
+
+
+    In this step, you compute `r2`, which is the
+    R² score of the model. The R² score
+    is computed using the `score()` method of the model. The
+    next line causes the interpreter to print out the R²
+    score.
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: R2 score
+
+ Note
+
+    The MAE and R² score may vary depending on the
+ distribution of the datasets.
+
+9. You see that the R² score we achieved is
+    `0.56238`, which is not close to 1. In the next step, we
+ will be making comparisons.
+
+10. Compare the predictions to the actual ground truth:
+
+ ```
+ _ys = pd.DataFrame(dict(actuals=y_val.reshape(-1), \
+ predicted=y_pred.reshape(-1)))
+ _ys.head()
+ ```
+
+
+
+ The output looks similar to the following:
+
+
+
+
+
+
+
+
+Mean Absolute Error
+-------------------
+
+The **mean absolute error** (**MAE**) is an evaluation metric for
+regression models that measures the absolute distance between your
+predictions and the ground truth. The absolute distance is the distance
+regardless of the sign, whether positive or negative. For example, if
+the ground truth is 6 and you predict 5, the distance is 1. However, if
+you predict 7, the distance becomes -1. The absolute distance, without
+taking the signs into consideration, is 1 in both cases. This is called
+the **magnitude**. The MAE is computed by summing all of the magnitudes
+and dividing by the number of observations.
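+
+As a quick, hypothetical sketch of that calculation (with made-up values),
+the manual computation matches `mean_absolute_error` from
+`sklearn.metrics`:
+
+```
+import numpy as np
+from sklearn.metrics import mean_absolute_error
+
+# hypothetical ground truths and predictions
+y_true = np.array([6, 6, 6])
+y_hat = np.array([5, 7, 6])
+
+# magnitudes: absolute differences, ignoring the sign
+magnitudes = np.abs(y_true - y_hat)            # array([1, 1, 0])
+manual_mae = magnitudes.sum() / len(y_true)    # 0.666...
+print(manual_mae, mean_absolute_error(y_true, y_hat))
+```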
+
+
+
+Exercise 6.03: Computing the MAE of a Model
+-------------------------------------------
+
+The goal of this exercise is to find the score and loss of a model using
+the same dataset as *Exercise 6.02*, *Computing the R2 Score of a Linear
+Regression Model*.
+
+In this exercise, we will be calculating the MAE of a model.
+
+The following steps will help you with this exercise:
+
+1. Open a new Colab notebook file.
+
+2. Import the necessary libraries:
+
+ ```
+ # Import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression
+ from sklearn.metrics import mean_absolute_error
+ ```
+
+
+ In this step, you import the function called
+ `mean_absolute_error` from `sklearn.metrics`.
+
+3. Import the data:
+
+ ```
+ # column headers
+ _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
+ 'MLOGP', 'response']
+ # read in data
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/'\
+ 'qsar_fish_toxicity.csv', \
+ names=_headers, sep=';')
+ ```
+
+
+ In the preceding code, you read in your data. This data is hosted
+ online and contains some information about fish toxicity. The data
+ is stored as a CSV but does not contain any headers. Also, the
+ columns in this file are not separated by a comma, but rather by a
+ semi-colon. The Python list called `_headers` contains the
+ names of the column headers.
+
+ In the next line, you make use of the function called
+ `read_csv`, which is contained in the `pandas`
+ library, to load the data. The first parameter specifies the file
+ location. The second parameter specifies the Python list that
+ contains the names of the columns in the data. The third parameter
+ specifies the character that is used to separate the columns in the
+ data.
+
+4. Split the data into `features` and `labels` and
+ into training and evaluation sets:
+
+ ```
+ # Let's split our data
+ features = df.drop('response', axis=1).values
+ labels = df[['response']].values
+ X_train, X_eval, y_train, y_eval = train_test_split\
+ (features, labels, \
+ test_size=0.2, \
+ random_state=0)
+ X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
+ random_state=0)
+ ```
+
+
+ In this step, you split your data into training, validation, and
+ test datasets. In the first line, you create a `numpy`
+ array in two steps. In the first step, the `drop` method
+ takes a parameter with the name of the column to drop from the
+ DataFrame. In the second step, you use `values` to convert
+ the DataFrame into a two-dimensional `numpy` array that is
+ a tabular structure with rows and columns. This array is stored in a
+ variable called `features`.
+
+ In the second line, you convert the column into a `numpy`
+ array that contains the label that you would like to predict. You do
+ this by picking out the column from the DataFrame and then using
+ `values` to convert it into a `numpy` array.
+
+ In the third line, you split the `features` and
+ `labels` using `train_test_split` and a ratio of
+ 80:20. The training data is contained in `X_train` for the
+ features and `y_train` for the labels. The evaluation
+ dataset is contained in `X_eval` and `y_eval`.
+
+ In the fourth line, you split the evaluation dataset into validation
+ and testing using `train_test_split`. Because you don\'t
+ specify the `test_size`, a value of `25%` is
+    used. The validation data is stored in `X_val` and
+ `y_val`, while the test data is stored in
+ `X_test` and `y_test`.
+
+5. Create a simple linear regression model and train it:
+
+ ```
+ # create a simple Linear Regression model
+ model = LinearRegression()
+ # train the model
+ model.fit(X_train, y_train)
+ ```
+
+
+ In this step, you make use of your training data to train a model.
+ In the first line, you create an instance of
+ `LinearRegression`, which you call `model`. In
+ the second line, you train the model using `X_train` and
+ `y_train`. `X_train` contains the
+ `features`, while `y_train` contains the
+ `labels`.
+
+6. Now predict the values of our validation dataset:
+
+ ```
+ # let's use our model to predict on our validation dataset
+ y_pred = model.predict(X_val)
+ ```
+
+
+ At this point, your model is ready to use. You make use of the
+ `predict` method to predict on your data. In this case,
+ you are passing `X_val` as a parameter to the function.
+    Recall that `X_val` is your validation dataset. The result
+ is assigned to a variable called `y_pred` and will be used
+ in the next step to compute the MAE of the model.
+
+7. Compute the MAE:
+
+ ```
+ # Let's compute our MEAN ABSOLUTE ERROR
+ mae = mean_absolute_error(y_val, y_pred)
+ print('MAE: {}'.format(mae))
+ ```
+
+
+    In this step, you compute the MAE of the model by using the
+    `mean_absolute_error` function and passing in
+    `y_val` and `y_pred`. `y_val` contains the
+    ground-truth labels from your validation dataset, and
+    `y_pred` contains the predictions from the model. The preceding
+    code should give you an MAE value of approximately 0.72434:
+
+
+
+
+
+    Caption: MAE score
+
+
+8. Compute the R² score of the model:
+
+ ```
+ # Let's get the R2 score
+ r2 = model.score(X_val, y_val)
+ print('R^2 score: {}'.format(r2))
+ ```
+
+
+ You should get an output similar to the following:
+
+
+
+
+
+In this exercise, we have calculated the MAE, which is a significant
+parameter when it comes to evaluating models.
+
+You will now train a second model and compare its R² score and MAE
+with those of the first model to evaluate which model performs better.
+
+
+
+Exercise 6.04: Computing the Mean Absolute Error of a Second Model
+------------------------------------------------------------------
+
+In this exercise, we will be engineering new features and finding the
+score and loss of a new model.
+
+The following steps will help you with this exercise:
+
+1. Open a new Colab notebook file.
+
+2. Import the required libraries:
+
+ ```
+ # Import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression
+ from sklearn.metrics import mean_absolute_error
+ # pipeline
+ from sklearn.pipeline import Pipeline
+ # preprocessing
+ from sklearn.preprocessing import MinMaxScaler
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.preprocessing import PolynomialFeatures
+ ```
+
+
+ In the first step, you will import libraries such as
+ `train_test_split`, `LinearRegression`, and
+ `mean_absolute_error`. We make use of a pipeline to
+ quickly transform our features and engineer new features using
+ `MinMaxScaler` and `PolynomialFeatures`.
+    `MinMaxScaler` rescales your data by
+    adjusting all values to a range between 0 and 1. It does this by
+    subtracting the minimum value of each column and dividing by the range,
+    which is the maximum value minus the minimum value.
+ `PolynomialFeatures` will engineer new features by raising
+ the values in a column up to a certain power and creating new
+ columns in your DataFrame to accommodate them.
+
+3. Read in the data from the dataset:
+
+ ```
+ # column headers
+ _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
+ 'MLOGP', 'response']
+ # read in data
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/'\
+ 'qsar_fish_toxicity.csv', \
+ names=_headers, sep=';')
+ ```
+
+
+ In this step, you will read in your data. While the data is stored
+ in a CSV, it doesn\'t have a first row that lists the names of the
+ columns. The Python list called `_headers` will hold the
+ column names that you will supply to the `pandas` method
+ called `read_csv`.
+
+ In the next line, you call the `read_csv`
+ `pandas` method and supply the location and name of the
+ file to be read in, along with the header names and the file
+ separator. Columns in the file are separated with a semi-colon.
+
+4. Split the data into training and evaluation sets:
+
+ ```
+ # Let's split our data
+ features = df.drop('response', axis=1).values
+ labels = df[['response']].values
+ X_train, X_eval, y_train, y_eval = train_test_split\
+ (features, labels, \
+ test_size=0.2, \
+ random_state=0)
+ X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
+ random_state=0)
+ ```
+
+
+ In this step, you begin by splitting the DataFrame called
+ `df` into two. The first DataFrame is called
+ `features` and contains all of the independent variables
+ that you will use to make your predictions. The second is called
+ `labels` and contains the values that you are trying to
+ predict.
+
+ In the third line, you split `features` and
+ `labels` into four sets using
+ `train_test_split`. `X_train` and
+ `y_train` contain 80% of the data and are used for
+ training your model. `X_eval` and `y_eval`
+ contain the remaining 20%.
+
+ In the fourth line, you split `X_eval` and
+ `y_eval` into two additional sets. `X_val` and
+ `y_val` contain 75% of the data because you did not
+ specify a ratio or size. `X_test` and `y_test`
+ contain the remaining 25%.
+
+5. Define the pipeline steps:
+
+ ```
+ # create a pipeline and engineer quadratic features
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(2)),\
+ ('model', LinearRegression())]
+ ```
+
+
+ In this step, you begin by creating a Python list called
+ `steps`. The list contains three tuples, each one
+ representing a transformation of a model. The first tuple represents
+ a scaling operation. The first item in the tuple is the name of the
+ step, which you call `scaler`. This uses
+ `MinMaxScaler` to transform the data. The second, called
+ `poly`, creates additional features by crossing the
+ columns of data up to the degree that you specify. In this case, you
+ specify `2`, so it crosses these columns up to a power
+ of 2. Next comes your `LinearRegression` model.
+
+6. Create a pipeline:
+
+ ```
+ # create a simple Linear Regression model with a pipeline
+ model = Pipeline(steps)
+ ```
+
+
+ In this step, you create an instance of `Pipeline` and
+ store it in a variable called `model`.
+ `Pipeline` performs a series of transformations, which are
+ specified in the steps you defined in the previous step. This
+    operation works because the transformers (`MinMaxScaler`
+    and `PolynomialFeatures`) implement the methods
+    `fit()` and `transform()`. You may recall
+ from previous examples that models are trained using the
+ `fit()` method that `LinearRegression`
+ implements.
+
+7. Train the model:
+
+ ```
+ # train the model
+ model.fit(X_train, y_train)
+ ```
+
+
+ On the next line, you call the `fit` method and provide
+ `X_train` and `y_train` as parameters. Because
+ the model is a pipeline, three operations will happen. First,
+ `X_train` will be scaled. Next, additional features will
+ be engineered. Finally, training will happen using the
+ `LinearRegression` model. The output from this step is
+ similar to the following:
+
+
+
+
+
+ Caption: Training the model
+
+8. Predict using the validation dataset:
+ ```
+ # let's use our model to predict on our validation dataset
+ y_pred = model.predict(X_val)
+ ```
+
+
+9. Compute the MAE of the model:
+
+ ```
+ # Let's compute our MEAN ABSOLUTE ERROR
+ mae = mean_absolute_error(y_val, y_pred)
+ print('MAE: {}'.format(mae))
+ ```
+
+
+ In the first line, you make use of `mean_absolute_error`
+ to compute the mean absolute error. You supply `y_val` and
+ `y_pred`, and the result is stored in the `mae`
+ variable. In the following line, you print out `mae`:
+
+
+
+
+
+ Caption: MAE score
+
+ The loss that you compute at this step is called a validation loss
+ because you make use of the validation dataset. This is different
+ from a training loss that is computed using the training dataset.
+ This distinction is important to note as you study other
+ documentation or books, which might refer to both.
+
+10. Compute the R² score:
+
+ ```
+ # Let's get the R2 score
+ r2 = model.score(X_val, y_val)
+ print('R^2 score: {}'.format(r2))
+ ```
+
+
+    In the final two lines, you compute the R² score and
+    also display it, as shown in the following screenshot:
+
+
+
+
+
+
+Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics
+-------------------------------------------------------------------------------
+
+In this exercise, you will create a classification model that you will
+make use of later on for model assessment.
+
+You will make use of the cars dataset from the UCI Machine Learning
+Repository. You will use this dataset to classify cars as either
+acceptable or unacceptable based on the following categorical features:
+
+- `buying`: the purchase price of the car
+
+- `maint`: the maintenance cost of the car
+
+- `doors`: the number of doors on the car
+
+- `persons`: the carrying capacity of the vehicle
+
+- `lug_boot`: the size of the luggage boot
+
+- `safety`: the estimated safety of the car
+
+
+
+The following steps will help you achieve the task:
+
+1. Open a new Colab notebook.
+
+2. Import the libraries you will need:
+
+ ```
+ # import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LogisticRegression
+ ```
+
+
+ In this step, you import `pandas` and alias it as
+ `pd`. `pandas` is needed for reading data into a
+ DataFrame. You also import `train_test_split`, which is
+ needed for splitting your data into training and evaluation
+ datasets. Finally, you also import the
+ `LogisticRegression` class.
+
+3. Import your data:
+
+ ```
+ # data doesn't have headers, so let's create headers
+ _headers = ['buying', 'maint', 'doors', 'persons', \
+ 'lug_boot', 'safety', 'car']
+ # read in cars dataset
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/car.data', \
+ names=_headers, index_col=None)
+ df.head()
+ ```
+
+
+ In this step, you create a Python list called `_headers`
+ to hold the names of the columns in the file you will be importing
+ because the file doesn\'t have a header. You then proceed to read
+ the file into a DataFrame named `df` by using
+ `pd.read_csv` and specifying the file location as well as
+ the list containing the file headers. Finally, you display the first
+ five rows using `df.head()`.
+
+ You should get an output similar to the following:
+
+
+
+
+
+ Caption: Inspecting the DataFrame
+
+4. Encode categorical variables as shown in the following code snippet:
+
+ ```
+ # encode categorical variables
+ _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\
+ 'persons', 'lug_boot', \
+ 'safety'])
+ _df.head()
+ ```
+
+
+ In this step, you convert categorical columns into numeric columns
+ using a technique called one-hot encoding. You saw an example of
+ this in *Step 13* of *Exercise 3.04*, *Feature Engineering --
+ Creating New Features from Existing Ones*. You need to do this
+ because the inputs to your model must be numeric. You get numeric
+ variables from categorical variables using `get_dummies`
+ from the `pandas` library. You provide your DataFrame as
+ input and specify the columns to be encoded. You assign the result
+ to a new DataFrame called `_df`, and then inspect the
+ result using `head()`.
+
+ The output should now resemble the following screenshot:
+
+
+
+
+
+ Caption: Encoding categorical variables
+
+
+5. Split the data into training and validation sets:
+
+ ```
+ # split data into training and evaluation datasets
+ features = _df.drop('car', axis=1).values
+ labels = _df['car'].values
+ X_train, X_eval, y_train, y_eval = train_test_split\
+ (features, labels, \
+ test_size=0.3, \
+ random_state=0)
+ X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
+ test_size=0.5, \
+ random_state=0)
+ ```
+
+
+ In this step, you begin by extracting your feature columns and your
+ labels into two NumPy arrays called `features` and
+ `labels`. You then proceed to extract 70% into
+ `X_train` and `y_train`, with the remaining 30%
+ going into `X_eval` and `y_eval`. You then
+ further split `X_eval` and `y_eval` into two
+ equal parts and assign those to `X_val` and
+ `y_val` for validation, and `X_test` and
+ `y_test` for testing much later.
+
+6. Train a logistic regression model:
+
+ ```
+ # train a Logistic Regression model
+ model = LogisticRegression()
+ model.fit(X_train, y_train)
+ ```
+
+
+ In this step, you create an instance of
+ `LogisticRegression` and train the model on your training
+ data by passing in `X_train` and `y_train` to
+ the `fit` method.
+
+ You should get an output that looks similar to the following:
+
+
+
+
+
+ Caption: Training a logistic regression model
+
+7. Make a prediction:
+
+ ```
+ # make predictions for the validation set
+ y_pred = model.predict(X_val)
+ ```
+
+
+ In this step, you make a prediction on the validation dataset,
+ `X_val`, and store the result in `y_pred`. A
+ look at the first 10 predictions (by executing
+ `y_pred[0:9]`) should provide an output similar to the
+ following:
+
+
+
+
+
+Caption: Prediction for the validation set
+
+
+
+The Confusion Matrix
+====================
+
+
+You encountered the confusion matrix in *Lab 3, Binary
+Classification*. You may recall that the confusion matrix compares the
+classes that the model predicted against the actual occurrences of those
+classes in the validation dataset. The output is a square matrix whose
+number of rows and columns is equal to the number of classes you are
+predicting. In the matrix that scikit-learn produces, the rows represent
+the actual values, while the columns represent the predictions. You get a
+confusion matrix by using `confusion_matrix` from
+`sklearn.metrics`.
+
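+As a minimal sketch with made-up labels (not the car data from the
+exercise), you can see the shape of this output directly:
+
+```
+from sklearn.metrics import confusion_matrix
+
+# hypothetical ground truths and predictions for a three-class problem
+y_true = ['unacc', 'acc', 'acc', 'unacc', 'good', 'acc']
+y_hat = ['unacc', 'acc', 'unacc', 'unacc', 'good', 'acc']
+
+# rows follow the actual classes, columns follow the predicted classes
+print(confusion_matrix(y_true, y_hat, labels=['acc', 'good', 'unacc']))
+```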
+
+
+Exercise 6.06: Generating a Confusion Matrix for the Classification Model
+-------------------------------------------------------------------------
+
+The goal of this exercise is to create a confusion matrix for the
+classification model you trained in *Exercise 6.05*, *Creating a
+Classification Model for Computing Evaluation Metrics*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.05, Creating a Classification Model for Computing Evaluation
+Metrics.* If you wish to use a new notebook, make sure you copy and run
+the entire code from *Exercise 6.05*, *Creating a Classification Model
+for Computing Evaluation Metrics*, and then begin with the execution of
+the code of this exercise.
+
+The following steps will help you achieve the task:
+
+1. Open a new Colab notebook file.
+
+2. Import `confusion_matrix`:
+
+ ```
+ from sklearn.metrics import confusion_matrix
+ ```
+
+
+ In this step, you import `confusion_matrix` from
+ `sklearn.metrics`. This function will let you generate a
+ confusion matrix.
+
+3. Generate a confusion matrix:
+
+ ```
+ confusion_matrix(y_val, y_pred)
+ ```
+
+
+ In this step, you generate a confusion matrix by supplying
+ `y_val`, the actual classes, and `y_pred`, the
+ predicted classes.
+
+ The output should look similar to the following:
+
+
+
+
+
+
+
+More on the Confusion Matrix
+----------------------------
+
+The confusion matrix helps you analyze the impact of the choices you
+would have to make if you put the model into production. Let\'s consider
+the example of predicting the presence of a disease based on the inputs
+to the model. This is a binary classification problem, where 1 implies
+that the disease is present and 0 implies the disease is absent. The
+confusion matrix for this model would have two columns and two rows.
+
+The first row would show the items that actually belong to class **0**. The
+first column of that row would show the items that were correctly classified
+as class **0** and are called `true negatives`. The second column would
+show the items that were wrongly classified as **1** but should have
+been **0**. These are `false positives`.
+
+The second row would show the items that actually belong to class **1**. The
+first column of that row would show the items that were wrongly classified as
+class **0** when they should have been **1** and are
+called `false negatives`. Finally, the second column shows the items
+that were correctly classified as class **1** and are called
+`true positives`.
+
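+A minimal sketch with made-up disease labels shows how to unpack these
+four cells for a binary problem:
+
+```
+from sklearn.metrics import confusion_matrix
+
+# hypothetical labels: 1 means the disease is present, 0 means it is absent
+y_true = [0, 0, 0, 1, 1, 1, 0, 1]
+y_hat = [0, 1, 0, 1, 0, 1, 0, 1]
+
+# for a binary problem, ravel() unpacks the cells in this order
+tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
+print('TN: {}, FP: {}, FN: {}, TP: {}'.format(tn, fp, fn, tp))
+```
+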
+False positives are the cases in which the samples were wrongly
+predicted to be infected when they are actually healthy. The implication
+of this is that these cases would be treated for a disease that they do
+not have.
+
+False negatives are the cases that were wrongly predicted to be healthy
+when they actually have the disease. The implication of this is that
+these cases would not be treated for a disease that they actually have.
+
+The question you need to ask about this model depends on the nature of
+the disease and requires domain expertise about the disease. For
+example, if the disease is contagious, then the untreated cases will be
+released into the general population and could infect others. What would
+be the implication of this versus placing cases into quarantine and
+observing them for symptoms?
+
+On the other hand, if the disease is not contagious, the question
+becomes that of the implications of treating people for a disease they
+do not have versus the implications of not treating cases of a disease.
+
+It should be clear that there isn\'t a definite answer to these
+questions. The model would need to be tuned to provide performance that
+is acceptable to the users.
+
+
+
+Precision
+---------
+
+Precision was introduced in *Lab 3, Binary Classification*; however,
+we will be looking at it in more detail in this lab. The precision
+is the total number of cases that were correctly classified as positive
+(called **true positive** and abbreviated as **TP**) divided by the
+total number of cases predicted as that class (that is, the total number of
+entries in that prediction\'s column, both correctly classified (TP) and
+wrongly classified (FP), from the confusion matrix). Suppose 10 entries were
+classified as positive. If 7 of the entries were actually positive, then
+TP would be 7 and FP would be 3. The precision would, therefore, be 0.7.
+The equation is given as follows:
+
+
+
+Caption: Equation for precision
+
+In the preceding equation:
+
+- `tp` is true positive -- the number of predictions that
+ were correctly classified as belonging to that class.
+- `fp` is false positive -- the number of predictions that
+ were wrongly classified as belonging to that class.
+- The function in `sklearn.metrics` to compute precision is
+ called `precision_score`. Go ahead and give it a try.
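+
+Before you do, here is a small, hypothetical sketch that reproduces the
+worked example above (the label arrays are made up to match the 7 TP /
+3 FP scenario):
+
+```
+from sklearn.metrics import precision_score
+
+# 10 samples are predicted positive; 7 of them are actually positive
+y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
+y_hat = [1] * 10
+
+tp, fp = 7, 3
+print(tp / (tp + fp))                  # 0.7, computed by hand
+print(precision_score(y_true, y_hat))  # 0.7, from scikit-learn
+```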
+
+
+
+Exercise 6.07: Computing Precision for the Classification Model
+---------------------------------------------------------------
+
+In this exercise, you will be computing the precision for the
+classification model you trained in *Exercise 6.05*, *Creating a
+Classification Model for Computing Evaluation Metrics*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.05, Creating a Classification Model for Computing Evaluation
+Metrics.* If you wish to use a new notebook, make sure you copy and run
+the entire code from *Exercise 6.05*, *Creating a Classification Model
+for Computing Evaluation Metrics*, and then begin with the execution of
+the code of this exercise.
+
+The following steps will help you achieve the task:
+
+1. Import the required libraries:
+
+ ```
+ from sklearn.metrics import precision_score
+ ```
+
+
+ In this step, you import `precision_score` from
+ `sklearn.metrics`.
+
+2. Next, compute the precision score as shown in the following code
+ snippet:
+
+ ```
+ precision_score(y_val, y_pred, average='macro')
+ ```
+
+
+ In this step, you compute the precision score using
+ `precision_score`.
+
+ The output is a floating-point number between 0 and 1. It might look
+ like this:
+
+
+
+
+
+
+Recall
+------
+
+Recall is the number of true positives divided by the total number of
+actual occurrences of the class, that is, the true positives plus the
+false negatives. Think of it as
+the true positives divided by the sum of entries in the class\'s row. The
+equation is given as follows:
+
+
+
+Caption: Equation for recall
+
+The function for this is `recall_score`, which is available
+from `sklearn.metrics`.
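+
+As a small, hypothetical sketch (with made-up binary labels), the manual
+calculation agrees with `recall_score`:
+
+```
+from sklearn.metrics import recall_score
+
+# 8 samples are actually positive; the model finds 6 of them
+y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
+y_hat = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
+
+tp, fn = 6, 2
+print(tp / (tp + fn))               # 0.75, computed by hand
+print(recall_score(y_true, y_hat))  # 0.75, from scikit-learn
+```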
+
+
+
+Exercise 6.08: Computing Recall for the Classification Model
+------------------------------------------------------------
+
+The goal of this exercise is to compute the recall for the
+classification model you trained in *Exercise 6.05*, *Creating a
+Classification Model for Computing Evaluation Metrics*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.05, Creating a Classification Model for Computing Evaluation
+Metrics.* If you wish to use a new notebook, make sure you copy and run
+the entire code from *Exercise 6.05*, *Creating a Classification Model
+for Computing Evaluation Metrics*, and then begin with the execution of
+the code of this exercise.
+
+The following steps will help you accomplish the task:
+
+1. Open a new Colab notebook file.
+
+2. Now, import the required libraries:
+
+ ```
+ from sklearn.metrics import recall_score
+ ```
+
+
+ In this step, you import `recall_score` from
+ `sklearn.metrics`. This is the function that you will make
+ use of in the second step.
+
+3. Compute the recall:
+
+ ```
+ recall_score(y_val, y_pred, average='macro')
+ ```
+
+
+ In this step, you compute the recall by using
+ `recall_score`. You need to specify `y_val` and
+ `y_pred` as parameters to the function. The documentation
+ for `recall_score` explains the values that you can supply
+ to `average`. If your model does binary prediction and the
+ labels are `0` and `1`, you can set
+ `average` to `binary`. Other options are
+ `micro`, `macro`, `weighted`, and
+ `samples`. You should read the documentation to see what
+ they do.
+
+ You should get an output that looks like the following:
+
+
+
+
+
+Caption: Recall score
+
+Note
+
+The recall score can vary, depending on the data.
+
+As you can see, we have calculated the recall score in the exercise,
+which is `0.622`. This means that, averaged across the classes,
+about `62%` of the actual occurrences of each class were correctly
+identified. On its own, this value might not mean much until it is
+compared to the recall score from another model.
+
+
+
+Let\'s now move on to calculating the F1 score, which is another
+useful measure of model performance and, in turn, helps you make
+better decisions when choosing between models.
+
+
+
+F1 Score
+--------
+
+The F1 score is another important parameter that helps us to evaluate
+the model performance. It considers the contribution of both precision
+and recall using the following equation:
+
+
+
+Caption: F1 score
+
+The F1 score ranges from 0 to 1, with 1 being the best possible score.
+You compute the F1 score using `f1_score` from
+`sklearn.metrics`.
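+
+A short, hypothetical sketch (made-up values and labels) shows the
+harmonic-mean calculation next to `f1_score`:
+
+```
+from sklearn.metrics import f1_score
+
+# hypothetical precision and recall values
+precision, recall = 0.7, 0.75
+print(2 * (precision * recall) / (precision + recall))  # ~0.7241
+
+# the same metric computed directly from made-up binary labels
+y_true = [1, 1, 1, 1, 0, 0, 0, 0]
+y_hat = [1, 1, 1, 0, 1, 0, 0, 0]
+print(f1_score(y_true, y_hat))  # 0.75 (precision = recall = 0.75 here)
+```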
+
+
+
+Exercise 6.09: Computing the F1 Score for the Classification Model
+------------------------------------------------------------------
+
+In this exercise, you will compute the F1 score for the classification
+model you trained in *Exercise 6.05*, *Creating a Classification Model
+for Computing Evaluation Metrics*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.05, Creating a Classification Model for Computing Evaluation
+Metrics.* If you wish to use a new notebook, make sure you copy and run
+the entire code from *Exercise 6.05*, *Creating a Classification Model
+for Computing Evaluation Metrics*, and then begin with the execution of
+the code of this exercise.
+
+The following steps will help you accomplish the task:
+
+1. Open a new Colab notebook file.
+
+2. Import the necessary modules:
+
+ ```
+ from sklearn.metrics import f1_score
+ ```
+
+
+ In this step, you import the `f1_score` method from
+ `sklearn.metrics`. This score will let you compute
+ evaluation metrics.
+
+3. Compute the F1 score:
+
+ ```
+ f1_score(y_val, y_pred, average='macro')
+ ```
+
+
+ In this step, you compute the F1 score by passing in
+ `y_val` and `y_pred`. You also specify
+ `average='macro'` because this is not binary
+ classification.
+
+ You should get an output similar to the following:
+
+
+
+
+
+Caption: F1 score
+
+
+By the end of this exercise, you will see that the `F1` score
+we achieved is `0.6746`. There is a lot of room for
+improvement, and you would engineer new features and train a new model
+to try and get a better F1 score.
+
+
+
+Accuracy
+--------
+
+Accuracy is an evaluation metric that is applied to classification
+models. It is computed as the fraction of labels that were
+correctly predicted, meaning that the predicted label is exactly the
+same as the ground truth. The `accuracy_score()` function
+exists in `sklearn.metrics` to provide this value.
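+
+As a minimal sketch with made-up labels, the manual fraction matches
+`accuracy_score()`:
+
+```
+import numpy as np
+from sklearn.metrics import accuracy_score
+
+# hypothetical labels: 4 of the 5 predictions match the ground truth
+y_true = np.array(['acc', 'unacc', 'good', 'unacc', 'acc'])
+y_hat = np.array(['acc', 'unacc', 'good', 'acc', 'acc'])
+
+print((y_true == y_hat).mean())       # 0.8, computed by hand
+print(accuracy_score(y_true, y_hat))  # 0.8, from scikit-learn
+```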
+
+
+
+Exercise 6.10: Computing Model Accuracy for the Classification Model
+--------------------------------------------------------------------
+
+The goal of this exercise is to compute the accuracy score of the model
+trained in *Exercise 6.05*, *Creating a Classification Model for
+Computing Evaluation Metrics*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.05, Creating a Classification Model for Computing Evaluation
+Metrics.* If you wish to use a new notebook, make sure you copy and run
+the entire code from *Exercise 6.05*, *Creating a Classification Model
+for Computing Evaluation Metrics*, and then begin with the execution of
+the code of this exercise.
+
+The following steps will help you accomplish the task:
+
+1. Continue from where the code for *Exercise 6.05*, *Creating a
+ Classification Model for Computing Evaluation Metrics*, ends in your
+ notebook.
+
+2. Import `accuracy_score()`:
+
+ ```
+ from sklearn.metrics import accuracy_score
+ ```
+
+
+ In this step, you import `accuracy_score()`, which you
+ will use to compute the model accuracy.
+
+3. Compute the accuracy:
+
+ ```
+ _accuracy = accuracy_score(y_val, y_pred)
+ print(_accuracy)
+ ```
+
+
+    In this step, you compute the model accuracy by passing in
+    `y_val` and `y_pred` as parameters to
+    `accuracy_score()`. The interpreter assigns the result to
+    a variable called `_accuracy`. The `print()` function
+    causes the interpreter to render the value of `_accuracy`.
+
+ The result is similar to the following:
+
+
+
+
+
+
+Thus, we have successfully calculated the accuracy of the model as being
+`0.876`. The goal of this exercise is to show you how to
+compute the accuracy of a model and to compare this accuracy value to
+that of another model that you will train in the future.
+
+
+
+Logarithmic Loss
+----------------
+
+The logarithmic loss (or log loss) is the loss function for categorical
+models. It is also called categorical cross-entropy. It seeks to
+penalize incorrect predictions. The `sklearn` documentation
+defines it as \"the negative log-likelihood of the true values given
+your model predictions.\"
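+
+A minimal sketch with made-up probabilities illustrates how confident but
+wrong predictions are penalized:
+
+```
+from sklearn.metrics import log_loss
+
+# hypothetical ground truths and predicted probabilities;
+# each inner list holds the probabilities of class 0 and class 1
+y_true = [0, 1, 1, 0]
+y_proba = [[0.9, 0.1],
+           [0.2, 0.8],
+           [0.6, 0.4],
+           [0.7, 0.3]]
+
+# confident, correct predictions give a low loss;
+# confident, wrong predictions increase it sharply
+print(log_loss(y_true, y_proba))
+```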
+
+
+
+Exercise 6.11: Computing the Log Loss for the Classification Model
+------------------------------------------------------------------
+
+The goal of this exercise is to predict the log loss of the model
+trained in *Exercise 6.05*, *Creating a Classification Model for
+Computing Evaluation Metrics*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.05, Creating a Classification Model for Computing Evaluation
+Metrics.* If you wish to use a new notebook, make sure you copy and run
+the entire code from *Exercise 6.05* and then begin with the execution
+of the code of this exercise.
+
+The following steps will help you accomplish the task:
+
+1. Open your Colab notebook and continue from where *Exercise 6.05*,
+ *Creating a Classification Model for Computing Evaluation Metrics*,
+ stopped.
+
+2. Import the required libraries:
+
+ ```
+ from sklearn.metrics import log_loss
+ ```
+
+
+ In this step, you import `log_loss()` from
+ `sklearn.metrics`.
+
+3. Compute the log loss:
+ ```
+ _loss = log_loss(y_val, model.predict_proba(X_val))
+ print(_loss)
+ ```
+
+
+In this step, you compute the log loss and store it in a variable called
+`_loss`. You need to observe something very important:
+previously, you made use of `y_val`, the ground truths, and
+`y_pred`, the predictions.
+
+In this step, you do not make use of predictions. Instead, you make use
+of predicted probabilities. You see that in the code where you specify
+`model.predict_proba()`. You specify the validation dataset
+and it returns the predicted probabilities.
+
+The `print()` function causes the interpreter to render the
+log loss.
+
+This should look like the following:
+
+
+
+
+
+
+Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem
+-----------------------------------------------------------------------------------
+
+The goal of this exercise is to plot the ROC curve for a binary
+classification problem. The data for this problem is used to predict
+whether or not a mother will require a caesarian section to give birth.
+
+
+
+From the UCI Machine Learning Repository, the abstract for this dataset
+follows: \"This dataset contains information about caesarian section
+results of 80 pregnant women with the most important characteristics of
+delivery problems in the medical field.\" The attributes of interest are
+age, delivery number, delivery time, blood pressure, and heart status.
+
+The following steps will help you accomplish this task:
+
+1. Open a Colab notebook file.
+
+2. Import the required libraries:
+
+ ```
+ # import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.metrics import roc_curve
+ from sklearn.metrics import auc
+ ```
+
+
+    In this step, you import `pandas`, which you will use to
+    read in data. You also import `train_test_split` for
+    creating training and validation datasets,
+    `LogisticRegression` for creating a model, and
+    `roc_curve` and `auc` for evaluating it.
+
+3. Read in the data:
+
+ ```
+ # data doesn't have headers, so let's create headers
+ _headers = ['Age', 'Delivery_Nbr', 'Delivery_Time', \
+ 'Blood_Pressure', 'Heart_Problem', 'Caesarian']
+    # read in the caesarian dataset
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/caesarian.csv.arff',\
+ names=_headers, index_col=None, skiprows=15)
+ df.head()
+ # target column is 'Caesarian'
+ ```
+
+
+
+
+
+
+ Caption: Reading the dataset
+
+    You will need to do a few things to work with this file: skip the
+    first 15 rows, specify the column headers, and read the file without
+    an index column.
+
+ The code shows how you do that by creating a Python list to hold
+ your column headers and then read in the file using
+ `read_csv()`. The parameters that you pass in are the
+ file\'s location, the column headers as a Python list, the name of
+ the index column (in this case, it is None), and the number of rows
+ to skip.
+
+ The `head()` method will print out the top five rows and
+ should look similar to the following:
+
+
+
+
+
+ Caption: The top five rows of the DataFrame
+
+4. Split the data:
+
+ ```
+ # target column is 'Caesarian'
+ features = df.drop(['Caesarian'], axis=1).values
+ labels = df[['Caesarian']].values
+ # split 80% for training and 20% into an evaluation set
+ X_train, X_eval, y_train, y_eval = train_test_split\
+ (features, labels, \
+ test_size=0.2, \
+ random_state=0)
+ """
+ further split the evaluation set into validation and test sets
+ of 10% each
+ """
+ X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
+ test_size=0.5, \
+ random_state=0)
+ ```
+
+
+ In this step, you begin by creating two `numpy` arrays,
+ which you call `features` and `labels`. You then
+ split these arrays into a `training` and an
+ `evaluation` dataset. You further split the
+ `evaluation` dataset into `validation` and
+ `test` datasets.
+
+5. Now, train and fit a logistic regression model:
+
+ ```
+ model = LogisticRegression()
+ model.fit(X_train, y_train)
+ ```
+
+
+ In this step, you begin by creating an instance of a logistic
+ regression model. You then proceed to train or fit the model on the
+ training dataset.
+
+ The output should be similar to the following:
+
+
+
+
+
+ Caption: Training a logistic regression model
+
+6. Predict the probabilities, as shown in the following code snippet:
+
+ ```
+ y_proba = model.predict_proba(X_val)
+ ```
+
+
+ In this step, the model predicts the probabilities for each entry in
+ the validation dataset. It stores the results in
+ `y_proba`.
+
+7. Compute the true positive rate, the false positive rate, and the
+ thresholds:
+
+ ```
+ _false_positive, _true_positive, _thresholds = roc_curve\
+ (y_val, \
+ y_proba[:, 0])
+ ```
+
+
+    In this step, you make a call to `roc_curve()` and specify
+    the ground truth and the first column of the predicted
+    probabilities, which holds the probabilities for class 0. (Note that
+    you would normally pass the probabilities of the positive class,
+    `y_proba[:, 1]`; using the other column mirrors the curve.) The
+    result is a tuple of the false positive rates, the true
+    positive rates, and the thresholds.
+
+8. Explore the false positive rates:
+
+ ```
+ print(_false_positive)
+ ```
+
+
+ In this step, you instruct the interpreter to print out the false
+ positive rate. The output should be similar to the following:
+
+
+
+
+
+ Caption: False positive rates
+
+ Note
+
+ The false positive rates can vary, depending on the data.
+
+9. Explore the true positive rates:
+
+ ```
+ print(_true_positive)
+ ```
+
+
+ In this step, you instruct the interpreter to print out the true
+ positive rates. This should be similar to the following:
+
+
+
+
+
+ Caption: True positive rates
+
+10. Explore the thresholds:
+
+ ```
+ print(_thresholds)
+ ```
+
+
+ In this step, you instruct the interpreter to display the
+ thresholds. The output should be similar to the following:
+
+
+
+
+
+ Caption: Thresholds
+
+11. Now, plot the ROC curve:
+
+ ```
+ # Plot the RoC
+ import matplotlib.pyplot as plt
+ %matplotlib inline
+ plt.plot(_false_positive, _true_positive, lw=2, \
+ label='Receiver Operating Characteristic')
+ plt.xlim(0.0, 1.2)
+ plt.ylim(0.0, 1.2)
+ plt.xlabel('False Positive Rate')
+ plt.ylabel('True Positive Rate')
+ plt.title('Receiver Operating Characteristic')
+ plt.show()
+ ```
+
+ The output should look similar to the following:
+
+
+
+
+
+Caption: ROC curve
+
+
+
+Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset
+--------------------------------------------------------------
+
+The goal of this exercise is to compute the ROC AUC for the binary
+classification model that you trained in *Exercise 6.12*, *Computing and
+Plotting ROC Curve for a Binary Classification Problem*.
+
+Note
+
+You should continue this exercise in the same notebook as that used in
+*Exercise 6.12, Computing and Plotting ROC Curve for a Binary
+Classification Problem.* If you wish to use a new notebook, make sure
+you copy and run the entire code from *Exercise 6.12* and then begin
+with the execution of the code of this exercise.
+
+The following steps will help you accomplish the task:
+
+1. Open a Colab notebook to the code for *Exercise 6.12*, *Computing
+ and Plotting ROC Curve for a Binary Classification Problem,* and
+ continue writing your code.
+
+2. Predict the probabilities:
+
+ ```
+ y_proba = model.predict_proba(X_val)
+ ```
+
+
+ In this step, you compute the probabilities of the classes in the
+ validation dataset. You store the result in `y_proba`.
+
+3. Compute the ROC AUC:
+
+ ```
+ from sklearn.metrics import roc_auc_score
+ _auc = roc_auc_score(y_val, y_proba[:, 0])
+ print(_auc)
+ ```
+
+
+ In this step, you compute the ROC AUC and store the result in
+ `_auc`. You then proceed to print this value out. The
+ result should look similar to the following:
+
+
+
+
+
+Caption: Computing the ROC AUC
+
+Note
+
+The AUC can be different, depending on the data.
+
+
+
+Saving and Loading Models
+=========================
+
+
+You will eventually need to transfer some of the models you have trained
+to a different computer so they can be put into production. There are
+various utilities for doing this, but the one we will discuss is called
+`joblib`.
+
+`joblib` supports saving and loading models. It serializes Python
+objects to disk and is particularly efficient for objects that hold large
+NumPy arrays, such as fitted scikit-learn models. Note that a
+`joblib` file is a Python-specific format; if you need a model format
+that other machine learning runtimes can consume, you would export to an
+interchange format such as `ONNX` instead.
+
+In older versions of scikit-learn, `joblib` was exposed through the
+`sklearn.externals` module; in recent versions, you install it separately
+and import it directly with `import joblib`.
+
+
+
+Exercise 6.14: Saving and Loading a Model
+-----------------------------------------
+
+In this exercise, you will train a simple model and use it for
+prediction. You will then proceed to save the model and then load it
+back in. You will use the loaded model for a second prediction, and then
+compare the predictions from the first model to those from the second
+model. You will make use of the car dataset for this exercise.
+
+The following steps will guide you toward the goal:
+
+1. Open a Colab notebook.
+
+2. Import the required libraries:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression
+ ```
+
+
+3. Read in the data:
+ ```
+ _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
+ 'MLOGP', 'response']
+ # read in data
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab06/Dataset/'\
+ 'qsar_fish_toxicity.csv', \
+ names=_headers, sep=';')
+ ```
+
+
+4. Inspect the data:
+
+ ```
+ df.head()
+ ```
+
+
+ The output should be similar to the following:
+
+
+
+
+
+ Caption: Inspecting the first five rows of the DataFrame
+
+5. Split the data into `features` and `labels`, and
+ into training and validation sets:
+ ```
+ features = df.drop('response', axis=1).values
+ labels = df[['response']].values
+ X_train, X_eval, y_train, y_eval = train_test_split\
+ (features, labels, \
+ test_size=0.2, \
+ random_state=0)
+ X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
+ random_state=0)
+ ```
+
+
+6. Create a linear regression model:
+
+ ```
+ model = LinearRegression()
+ print(model)
+ ```
+
+
+ The output will be as follows:
+
+
+
+
+
+ Caption: Training a linear regression model
+
+7. Fit the training data to the model:
+ ```
+ model.fit(X_train, y_train)
+ ```
+
+
+8. Use the model for prediction:
+ ```
+ y_pred = model.predict(X_val)
+ ```
+
+
+9. Import `joblib`:
+    ```
+    # in older scikit-learn versions: from sklearn.externals import joblib
+    import joblib
+    ```
+
+
+10. Save the model:
+
+ ```
+ joblib.dump(model, './model.joblib')
+ ```
+
+
+ The output should be similar to the following:
+
+
+
+
+
+ Caption: Saving the model
+
+11. Load it as a new model:
+ ```
+ m2 = joblib.load('./model.joblib')
+ ```
+
+
+12. Use the new model for predictions:
+ ```
+ m2_preds = m2.predict(X_val)
+ ```
+
+
+13. Compare the predictions:
+
+ ```
+ ys = pd.DataFrame(dict(predicted=y_pred.reshape(-1), \
+ m2=m2_preds.reshape(-1)))
+ ys.head()
+ ```
+
+
+ The output should be similar to the following:
+
+
+
+
+
+Caption: Comparing predictions
+
+
+
+Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model
+--------------------------------------------------------------------------------------------------------
+
+You work as a data scientist at a bank. The bank would like to implement
+a model that predicts the likelihood of a customer purchasing a term
+deposit. The bank provides you with a dataset, which is the same as the
+one in *Lab 3*, *Binary Classification*. You have previously learned
+how to train a logistic regression model for binary classification.
+You have also heard about other non-parametric modeling techniques and
+would like to try out a decision tree as well as a random forest to see
+how well they perform against the logistic regression models you have
+been training.
+
+In this activity, you will train a logistic regression model and compute
+a classification report. You will then proceed to train a decision tree
+classifier and compute a classification report. You will compare the
+models using the classification reports. Finally, you will train a
+random forest classifier and generate the classification report. You
+will then compare the logistic regression model with the random forest
+using the classification reports to determine which model you should put
+into production.
+
+The steps to accomplish this task are:
+
+1. Open a Colab notebook.
+
+2. Load the necessary libraries.
+
+3. Read in the data.
+
+4. Explore the data.
+
+5. Convert categorical variables using
+ `pandas.get_dummies()`.
+
+6. Prepare the `X` and `y` variables.
+
+7. Split the data into training and evaluation sets.
+
+8. Create an instance of `LogisticRegression`.
+
+9. Fit the training data to the `LogisticRegression` model.
+
+10. Use the evaluation set to make a prediction.
+
+11. Use the prediction from the `LogisticRegression` model to
+ compute the classification report.
+
+12. Create an instance of `DecisionTreeClassifier`:
+ ```
+ dt_model = DecisionTreeClassifier(max_depth= 6)
+ ```
+
+
+13. Fit the training data to the `DecisionTreeClassifier`
+ model:
+ ```
+ dt_model.fit(train_X, train_y)
+ ```
+
+
+14. Using the `DecisionTreeClassifier` model, make a
+ prediction on the evaluation dataset:
+ ```
+ dt_preds = dt_model.predict(val_X)
+ ```
+
+
+15. Use the prediction from the `DecisionTreeClassifier` model
+ to compute the classification report:
+
+ ```
+ dt_report = classification_report(val_y, dt_preds)
+ print(dt_report)
+ ```
+
+
+ Note
+
+ We will be studying decision trees in detail in *Lab 7, The
+ Generalization of Machine Learning Models*.
+
+16. Compare the classification report from the logistic regression model
+    and the classification report from the decision tree classifier to
+    determine which is the better model.
+
+17. Create an instance of `RandomForestClassifier` (a minimal sketch
+    covering steps 17 to 20 is shown after this list).
+
+18. Fit the training data to the `RandomForestClassifier`
+ model.
+
+19. Using the `RandomForestClassifier` model, make a
+ prediction on the evaluation dataset.
+
+20. Using the prediction from the random forest classifier, compute the
+ classification report.
+
+21. Compare the classification report from the logistic regression model
+    with the classification report from the random forest classifier to
+    decide which model to keep or improve upon.
+
+22. Compare the R2 scores of all three models. The
+    output should be similar to the following:
+
+
+
+
+
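+The following is a minimal sketch of steps 17 to 20 only, assuming the
+same variable names (`train_X`, `train_y`, `val_X`, and
+`val_y`) used in the decision tree steps above; the
+`n_estimators` value shown is just one reasonable choice:
+
+```
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.metrics import classification_report
+
+# Step 17: create an instance of RandomForestClassifier
+rf_model = RandomForestClassifier(n_estimators=1000)
+# Step 18: fit the training data
+rf_model.fit(train_X, train_y)
+# Step 19: make a prediction on the evaluation dataset
+rf_preds = rf_model.predict(val_X)
+# Step 20: compute and print the classification report
+print(classification_report(val_y, rf_preds))
+```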
+
+Summary
+=======
+
+In this lab we observed that some of the evaluation metrics for
+classification models require a binary classification model. We saw that
+when we worked with more than two classes, we were required to use the
+one-versus-all approach. The one-versus-all approach builds one model
+for each class and tries to predict the probability that the input
+belongs to a specific class. We saw that once this was done, we then
+predicted that the input belongs to the class where the model has the
+highest prediction probability. We also split our evaluation dataset
+into two because `X_test` and `y_test` are used only
+once, for a final evaluation of the model\'s performance. You can make
+use of them just before putting your model into production to see how
+the model would perform in a production environment.
diff --git a/lab_guides/Lab_7.md b/lab_guides/Lab_7.md
new file mode 100644
index 0000000..1a89366
--- /dev/null
+++ b/lab_guides/Lab_7.md
@@ -0,0 +1,2919 @@
+
+7. The Generalization of Machine Learning Models
+================================================
+
+
+
+Overview
+
+This lab will teach you how to make use of the data you have to
+train better models by either splitting your data if it is sufficient or
+making use of cross-validation if it is not. By the end of this lab,
+you will know how to split your data into training, validation, and test
+datasets. You will be able to identify the ratio in which data has to be
+split and also consider certain features while splitting. You will also
+be able to implement cross-validation to use limited data for testing
+and use regularization to reduce overfitting in models.
+
+
+Introduction
+============
+
+
+In the previous lab, you learned about model assessment using
+various metrics such as R2 score, MAE, and accuracy. These metrics help
+you decide which models to keep and which ones to discard. In this
+lab, you will learn some more techniques for training better models.
+
+Generalization deals with getting your models to perform well enough on
+data points that they have not encountered in the past (that is, during
+training). We will address two specific areas:
+
+- How to make use of as much of your data as possible to train a model
+- How to reduce overfitting in a model
+
+
+Overfitting
+===========
+
+
+A model is said to overfit the training data when it generates a
+hypothesis that accounts for every example. What this means is that it
+correctly predicts the outcome of every example. The problem with this
+scenario is that the model equation becomes extremely complex, and such
+models have been observed to be incapable of correctly predicting new
+observations.
+
+Overfitting occurs when a model has been over-engineered. Two of the
+ways in which this could occur are:
+
+- The model is trained on too many features.
+- The model is trained for too long.
+
+We\'ll discuss each of these two points in the following sections.
+
+
+
+Training on Too Many Features
+-----------------------------
+
+When a model trains on too many features, the hypothesis becomes
+extremely complicated. Consider a case in which you have one column of
+features and you need to generate a hypothesis. This would be a simple
+linear equation, as shown here:
+
+
+
+Caption: Equation for a hypothesis for a line
+
+Now, consider a case in which you have two columns, and in which you
+cross the columns by multiplying them. The hypothesis becomes the
+following:
+
+
+
+Caption: Equation for a hypothesis for a curve
+
+While the first equation yields a line, the second equation yields a
+curve, because it is now a quadratic equation. But the same two features
+could become even more complicated depending on how you engineer your
+features. Consider the following equation:
+
+
+
+Caption: Cubic equation for a hypothesis
+
+The same set of features has now given rise to a cubic equation. This
+equation will have the property of having a large number of weights, for
+example:
+
+- The simple linear equation has one weight and one bias.
+- The quadratic equation has three weights and one bias.
+- The cubic equation has five weights and one bias.
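+
+As an illustration only (the exact terms depend on how you cross and
+engineer the two features), the three hypotheses described above could
+be written as follows, with the w values as weights and b as the bias:
+
+```
+h(x) = w_1 x_1 + b
+h(x) = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + b
+h(x) = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 x_2 + w_5 x_1 x_2^2 + b
+```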
+
+One solution to overfitting as a result of too many features is to
+eliminate certain features. The technique for this is called lasso
+regression.
+
+A second solution to overfitting as a result of too many features is to
+provide more data to the model. This might not always be a feasible
+option, but where possible, it is always a good idea to do so.
+
+
+
+Training for Too Long
+---------------------
+
+The model starts training by initializing the vector of weights such
+that all values are equal to zero. During training, the weights are
+updated according to the gradient update rule. This systematically adds
+or subtracts a small value to each weight. As training progresses, the
+magnitude of the weights increases. If the model trains for too long,
+these model weights become too large.
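+
+For reference, the standard form of this update rule for a single
+weight w_j is the following, where alpha is the learning rate and L is
+the differentiable loss being minimized:
+
+```
+w_j := w_j - \alpha \frac{\partial L}{\partial w_j}
+```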
+
+The solution to overfitting as a result of large weights is to reduce
+the magnitude of the weights to as close to zero as possible. The
+technique for this is called ridge regression.
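+
+As a quick sketch of what this looks like in scikit-learn (assuming you
+already have `train_X` and `train_y` arrays from a split such as the
+ones created later in this lab):
+
+```
+# a minimal sketch: ridge regression penalizes large weights
+from sklearn.linear_model import Ridge
+
+# alpha is the regularization strength; larger values shrink the
+# weights more aggressively towards zero
+ridge_model = Ridge(alpha=1.0)
+ridge_model.fit(train_X, train_y)  # train_X and train_y are assumed to exist
+print(ridge_model.coef_)           # inspect the shrunken coefficients
+```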
+
+
+Underfitting
+============
+
+
+Consider an alternative situation in which the data has 10 features, but
+you only make use of 1 feature. Your model hypothesis would still be the
+following:
+
+
+
+Caption: Equation for a hypothesis for a line
+
+However, that is just the equation of a straight line, so your model is
+probably ignoring a lot of information. The model is over-simplified and
+is said to underfit the data.
+
+The solution to underfitting is to provide the model with more features
+or, conversely, with less data to train on; however, providing more
+features is the better approach.
+
+
+Data
+====
+
+
+In the world of machine learning, the data that you have is not used in
+its entirety to train your model. Instead, you need to separate your
+data into three sets, as mentioned here:
+
+- A training dataset, which is used to train your model and measure
+ the training loss.
+- An evaluation or validation dataset, which you use to measure the
+ validation loss of the model to see whether the validation loss
+ continues to reduce as well as the training loss.
+- A test dataset for final testing to see how well the model performs
+ before you put it into production.
+
+
+
+The Ratio for Dataset Splits
+----------------------------
+
+The evaluation dataset is set aside from your entire training data and
+is never used for training. There are various schools of thought around
+the particular ratio that is set aside for evaluation, but it generally
+ranges from a high of 30% to a low of 10%. This evaluation dataset is
+normally further split into a validation dataset that is used during
+training and a test dataset that is used at the end for a sanity check.
+If you are using 10% for evaluation, you might set 5% aside for
+validation and the remaining 5% for testing. If using 30%, you might set
+20% aside for validation and 10% for testing.
+
+To summarize, you might split your data into 70% for training, 20% for
+validation, and 10% for testing, or you could split your data into 80%
+for training, 15% for validation, and 5% for test. Or, finally, you
+could split your data into 90% for training, 5% for validation, and 5%
+for testing.
+
+The choice of what ratio to use is dependent on the amount of data that
+you have. If you are working with 100,000 records, for example, then 20%
+validation would give you 20,000 records. However, if you were working
+with 100,000,000 records, then 5% would give you 5 million records for
+validation, which would be more than sufficient.
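+
+As an example, a minimal sketch of a 70%/20%/10% split using
+scikit-learn\'s `train_test_split` function (introduced in the next
+section), assuming your data is in a DataFrame called `df`:
+
+```
+from sklearn.model_selection import train_test_split
+
+# first split: 70% for training, 30% held out for evaluation
+train_df, eval_df = train_test_split(df, train_size=0.7, random_state=0)
+# second split: two thirds of the held-out 30% (20% overall) for
+# validation, the remaining third (10% overall) for testing
+val_df, test_df = train_test_split(eval_df, train_size=2/3, random_state=0)
+```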
+
+
+
+Creating Dataset Splits
+-----------------------
+
+At a very basic level, splitting your data involves random sampling.
+Let\'s say you have 10 items in a bowl. To get 30% of the items, you
+would reach in and take any 3 items at random.
+
+In the same way, because you are writing code, you could do the
+following:
+
+1. Create a Python list.
+2. Place 10 numbers in the list.
+3. Generate 3 non-repeating random whole numbers from 0 to 9.
+4. Pick items whose indices correspond to the random numbers
+ previously generated.
+
+
+
+
+Caption: Visualization of data splitting
+
+This is something you will only do once for a particular dataset. You
+might write a function for it. If it is something that you need to do
+repeatedly and you also need to handle advanced functionality, you might
+want to write a class for it.
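+
+A minimal sketch of the four steps above, using only the standard
+library (the 3 out of 10 here is just the example ratio from the bowl
+analogy):
+
+```
+import random
+
+items = list(range(10))                   # steps 1 and 2: a list of 10 numbers
+val_idx = random.sample(range(10), 3)     # step 3: 3 non-repeating indices
+validation = [items[i] for i in val_idx]  # step 4: pick the matching items
+training = [items[i] for i in range(10) if i not in val_idx]
+```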
+
+`sklearn` has a function called `train_test_split`,
+which provides the functionality for splitting data. It is available as
+`sklearn.model_selection.train_test_split`. This function will
+let you split a DataFrame into two parts.
+
+Have a look at the following exercise on importing and splitting data.
+
+
+
+Exercise 7.01: Importing and Splitting Data
+-------------------------------------------
+
+The goal of this exercise is to import data from a repository and to
+split it into a training and an evaluation set.
+We will be using the Cars dataset from the UCI Machine Learning
+Repository.
+
+This dataset is about the cost of owning cars with certain attributes.
+The abstract from the website states: \"*Derived from simple
+hierarchical decision model, this database may be useful for testing
+constructive induction and structure discovery methods*.\" Here are some
+of the key attributes of this dataset:
+
+```
+CAR car acceptability
+. PRICE overall price
+. . buying buying price
+. . maint price of the maintenance
+. TECH technical characteristics
+. . COMFORT comfort
+. . . doors number of doors
+. . . persons capacity in terms of persons to carry
+. . . lug_boot the size of luggage boot
+. . safety estimated safety of the car
+```
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook file.
+
+2. Import the necessary libraries:
+
+ ```
+ # import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ ```
+
+
+ In this step, you have imported `pandas` and aliased it as
+ `pd`. As you know, `pandas` is required to read
+ in the file. You also import `train_test_split` from
+ `sklearn.model_selection` to split the data into two
+ parts.
+
+3. Before reading the file into your notebook, open and inspect the
+ file (`car.data`) with an editor. You should see an output
+ similar to the following:
+
+
+
+
+
+ Caption: Car data
+
+ You will notice from the preceding screenshot that the file doesn\'t
+ have a first row containing the headers.
+
+4. Create a Python list to hold the headers for the data:
+ ```
+ # data doesn't have headers, so let's create headers
+ _headers = ['buying', 'maint', 'doors', 'persons', \
+ 'lug_boot', 'safety', 'car']
+ ```
+
+
+5. Now, import the data as shown in the following code snippet:
+
+ ```
+ # read in cars dataset
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab07/Dataset/car.data', \
+ names=_headers, index_col=None)
+ ```
+
+
+ You then proceed to import the data into a variable called
+ `df` by using `pd.read_csv`. You specify the
+ location of the data file, as well as the list of column headers.
+ You also specify that the data does not have a column index.
+
+6. Inspect the DataFrame:
+
+ ```
+ df.info()
+ ```
+
+
+ In order to get information about the columns in the data as well as
+ the number of records, you make use of the `info()`
+ method. You should get an output similar to the following:
+
+
+
+
+
+ Caption: The top five records of the DataFrame
+
+ The `RangeIndex` value shows the number of records, which
+ is `1728`.
+
+7. Now, you need to split the data contained in `df` into a
+ training dataset and an evaluation dataset:
+
+ ```
+ #split the data into 80% for training and 20% for evaluation
+ training_df, eval_df = train_test_split(df, train_size=0.8, \
+ random_state=0)
+ ```
+
+
+ In this step, you make use of `train_test_split` to create
+ two new DataFrames called `training_df` and
+ `eval_df`.
+
+ You specify a value of `0.8` for `train_size` so
+ that `80%` of the data is assigned to
+ `training_df`.
+
+ `random_state` ensures that your experiments are
+ reproducible. Without `random_state`, the data is split
+ differently every time using a different random number. With
+ `random_state`, the data is split the same way every time.
+ We will be studying `random_state` in depth in the next
+ lab.
+
+8. Check the information of `training_df`:
+
+ ```
+ training_df.info()
+ ```
+
+
+ In this step, you make use of `.info()` to get the details
+ of `training_df`. This will print out the column names as
+ well as the number of records.
+
+ You should get an output similar to the following:
+
+
+
+
+
+ Caption: Information on training\_df
+
+ You should observe that the column names match those in
+ `df`, but you should have `80%` of the records
+ that you did in `df`, which is `1382` out of
+ `1728`.
+
+9. Check the information on `eval_df`:
+
+ ```
+ eval_df.info()
+ ```
+
+
+ In this step, you print out the information about
+ `eval_df`. This will give you the column names and the
+ number of records. The output should be similar to the following:
+
+
+
+
+
+Caption: Information on eval\_df
+
+
+
+**Random State**
+
+
+
+Caption: Numbers generated using random state
+
+
+
+Exercise 7.02: Setting a Random State When Splitting Data
+---------------------------------------------------------
+
+The goal of this exercise is to have a reproducible way of splitting the
+data that you imported in *Exercise 7.01*, *Importing and Splitting
+Data*.
+
+Note
+
+We are going to refactor the code from the previous exercise. Hence, if
+you are using a new Colab notebook, make sure you copy the code from the
+previous exercise. Alternatively, you can make a copy of the notebook
+used in *Exercise 7.01* and revise the code as suggested in the
+following steps.
+
+The following steps will help you complete the exercise:
+
+1. Continue from the previous *Exercise 7.01* notebook.
+
+2. Set the random state as `1` and split the data:
+
+ ```
+ """
+ split the data into 80% for training and 20% for evaluation
+ using a random state
+ """
+ training_df, eval_df = train_test_split(df, train_size=0.8, \
+ random_state=1)
+ ```
+
+
+ In this step, you specify a `random_state` value of 1 to
+ the `train_test_split` function.
+
+3. Now, view the top five records in `training_df`:
+
+ ```
+ #view the head of training_eval
+ training_df.head()
+ ```
+
+
+ In this step, you print out the first five records in
+ `training_df`.
+
+ The output should be similar to the following:
+
+
+
+
+
+ Caption: The top five rows for the training evaluation set
+
+4. View the top five records in `eval_df`:
+
+ ```
+ #view the top of eval_df
+ eval_df.head()
+ ```
+
+
+ In this step, you print out the first five records in
+ `eval_df`.
+
+ The output should be similar to the following:
+
+
+
+
+
+
+
+Cross-Validation
+================
+
+
+Consider an example where you split your data into five parts of 20%
+each. You would then make use of four parts for training and one part
+for evaluation. Because you have five parts, you can make use of the
+data five times, each time using one part for validation and the
+remaining data for training.
+
+
+
+Caption: Cross-validation
+
+
+Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset
+------------------------------------------------------------
+
+The goal of this exercise is to create a five-fold cross-validation
+dataset from the data that you imported in *Exercise 7.01*, *Importing
+and Splitting Data*.
+
+Note
+
+If you are using a new Colab notebook then make sure you copy the code
+from *Exercise 7.01*, *Importing and Splitting Data*. Alternatively, you
+can make a copy of the notebook used in *Exercise 7.01* and then use the
+code as suggested in the following steps.
+
+The following steps will help you complete the exercise:
+
+1. Continue from the notebook file of *Exercise 7.01.*
+
+2. Import all the necessary libraries:
+
+ ```
+ from sklearn.model_selection import KFold
+ ```
+
+
+ In this step, you import `KFold` from
+ `sklearn.model_selection`.
+
+3. Now create an instance of the class:
+
+ ```
+ _kf = KFold(n_splits=5)
+ ```
+
+
+ In this step, you create an instance of `KFold` and assign
+ it to a variable called `_kf`. You specify a value of
+ `5` for the `n_splits` parameter so that it
+ splits the dataset into five parts.
+
+4. Now split the data as shown in the following code snippet:
+
+ ```
+ indices = _kf.split(df)
+ ```
+
+
+ In this step, you call the `split` method, which is
+ `.split()` on `_kf`. The result is stored in a
+ variable called `indices`.
+
+5. Find out what data type `indices` has:
+
+ ```
+ print(type(indices))
+ ```
+
+
+    In this step, you inspect the type of the object returned by the
+    call to `split()`.
+
+ The output should be a `generator`, as seen in the
+ following output:
+
+
+
+
+
+ Caption: Data type for indices
+
+6. Get the first set of indices:
+
+ ```
+ #first set
+ train_indices, val_indices = next(indices)
+ ```
+
+
+ In this step, you make use of the `next()` Python function
+ on the generator function. Using `next()` is the way that
+ you get a generator to return results to you. You asked for five
+ splits, so you can call `next()` five times on this
+ particular generator. Calling `next()` a sixth time will
+ cause the Python runtime to raise an exception.
+
+ The call to `next()` yields a tuple. In this case, it is a
+ pair of indices. The first one contains your training indices and
+ the second one contains your validation indices. You assign these to
+ `train_indices` and `val_indices`.
+
+7. Create a training dataset as shown in the following code snippet:
+
+ ```
+ train_df = df.drop(val_indices)
+ train_df.info()
+ ```
+
+
+ In this step, you create a new DataFrame called `train_df`
+ by dropping the validation indices from `df`, the
+ DataFrame that contains all of the data. This is a subtractive
+ operation similar to what is done in set theory. The `df`
+ set is a union of `train` and `val`. Once you
+ know what `val` is, you can work backward to determine
+ `train` by subtracting `val` from
+ `df`. If you consider `df` to be a set called
+ `A`, `val` to be a set called `B`, and
+ train to be a set called `C`, then the following holds
+ true:
+
+
+
+
+
+ Caption: Dataframe A
+
+ Similarly, set `C` can be the difference between set
+ `A` and set `B`, as depicted in the following:
+
+
+
+
+
+ Caption: Dataframe C
+
+ The way to accomplish this with a pandas DataFrame is to drop the
+ rows with the indices of the elements of `B` from
+ `A`, which is what you see in the preceding code snippet.
+
+ You can see the result of this by calling the `info()`
+ method on the new DataFrame.
+
+ The result of that call should be similar to the following
+ screenshot:
+
+
+
+
+
+ Caption: Information on the new dataframe
+
+8. Create a validation dataset:
+
+ ```
+ val_df = df.drop(train_indices)
+ val_df.info()
+ ```
+
+
+ In this step, you create the `val_df` validation dataset
+ by dropping the training indices from the `df` DataFrame.
+ Again, you can see the details of this new DataFrame by calling the
+ `info()` method.
+
+ The output should be similar to the following:
+
+
+
+
+
+Caption: Information for the validation dataset
+
+
+Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls
+-----------------------------------------------------------------------------------
+
+The goal of this exercise is to create a five-fold cross-validation
+dataset from the data that you imported in *Exercise 7.01*, *Importing
+and Splitting Data*. You will make use of a loop for calls to the
+generator function.
+
+
+The following steps will help you complete this exercise:
+
+1. Open a new Colab notebook and repeat the steps you used to import
+ data in *Exercise 7.01*, *Importing and Splitting Data*.
+
+2. Define the number of splits you would like:
+
+ ```
+ from sklearn.model_selection import KFold
+ #define number of splits
+ n_splits = 5
+ ```
+
+
+ In this step, you set the number of splits to `5`. You
+ store this in a variable called `n_splits`.
+
+3. Create an instance of `Kfold`:
+
+ ```
+ #create an instance of KFold
+ _kf = KFold(n_splits=n_splits)
+ ```
+
+
+ In this step, you create an instance of `Kfold`. You
+ assign this instance to a variable called `_kf`.
+
+4. Generate the split indices:
+
+ ```
+ #create splits as _indices
+ _indices = _kf.split(df)
+ ```
+
+
+ In this step, you call the `split()` method on
+ `_kf`, which is the instance of `KFold` that you
+ defined earlier. You provide `df` as a parameter so that
+ the splits are performed on the data contained in the DataFrame
+ called `df`. The resulting generator is stored as
+ `_indices`.
+
+5. Create two Python lists:
+
+ ```
+ _t, _v = [], []
+ ```
+
+
+ In this step, you create two Python lists. The first is called
+ `_t` and holds the training DataFrames, and the second is
+ called `_v` and holds the validation DataFrames.
+
+6. Iterate over the generator and create DataFrames called
+ `train_idx`, `val_idx`, `_train_df`
+ and `_val_df`:
+
+ ```
+ #iterate over _indices
+ for i in range(n_splits):
+ train_idx, val_idx = next(_indices)
+ _train_df = df.drop(val_idx)
+ _t.append(_train_df)
+ _val_df = df.drop(train_idx)
+ _v.append(_val_df)
+ ```
+
+
+ In this step, you create a loop using `range` to determine
+ the number of iterations. You specify the number of iterations by
+ providing `n_splits` as a parameter to
+ `range()`. On every iteration, you execute
+ `next()` on the `_indices` generator and store
+ the results in `train_idx` and `val_idx`. You
+ then proceed to create `_train_df` by dropping the
+ validation indices, `val_idx`, from `df`. You
+ also create `_val_df` by dropping the training indices
+ from `df`.
+
+7. Iterate over the training list:
+
+ ```
+ for d in _t:
+ print(d.info())
+ ```
+
+
+    In this step, you verify that the DataFrames were created. You do
+    this by iterating over the list and using the `.info()` method to
+    print out the details of each element.
+ The output is similar to the following screenshot, which is
+ incomplete due to the size of the output. Each element in the list
+ is a DataFrame with 1,382 entries:
+
+
+
+
+
+ Caption: Iterating over the training list
+
+ Note
+
+ The preceding output is a truncated version of the actual output.
+
+8. Iterate over the validation list:
+
+ ```
+ for d in _v:
+ print(d.info())
+ ```
+
+
+ In this step, you iterate over the validation list and make use of
+ `.info()` to print out the details of each element. The
+ output is similar to the following screenshot, which is incomplete
+ due to the size. Each element is a DataFrame with 346 entries:
+
+
+
+
+
+
+
+Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation
+-----------------------------------------------------------------
+
+The goal of this exercise is to create a five-fold cross-validation
+dataset from the data that you imported in *Exercise 7.01*, *Importing
+and Splitting Data*. You will then use `cross_val_score` to
+get the scores of models trained on those datasets.
+
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook and repeat *steps 1-6* that you took to
+ import data in *Exercise 7.01*, *Importing and Splitting Data*.
+
+2. Encode the categorical variables in the dataset:
+
+ ```
+ # encode categorical variables
+ _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors', \
+ 'persons', 'lug_boot', \
+ 'safety'])
+ _df.head()
+ ```
+
+
+ In this step, you make use of `pd.get_dummies()` to
+ convert categorical variables into an encoding. You store the result
+ in a new DataFrame variable called `_df`. You then proceed
+ to take a look at the first five records.
+
+ The result should look similar to the following:
+
+
+
+
+
+ Caption: Encoding categorical variables
+
+3. Split the data into features and labels:
+
+ ```
+ # separate features and labels DataFrames
+ features = _df.drop(['car'], axis=1).values
+ labels = _df[['car']].values
+ ```
+
+
+    In this step, you create `features` by dropping
+    `car` from `_df`, and you create
+    `labels` by selecting only the `car` column.
+    Because of the call to `.values`, both are NumPy arrays
+    rather than DataFrames.
+
+4. Create an instance of the `LogisticRegression` class to be
+ used later:
+
+ ```
+ from sklearn.linear_model import LogisticRegression
+ # create an instance of LogisticRegression
+ _lr = LogisticRegression()
+ ```
+
+
+ In this step, you import `LogisticRegression` from
+ `sklearn.linear_model`. We use
+ `LogisticRegression` because it lets us create a
+ classification model, as you learned in *Lab 3, Binary
+ Classification*. You then proceed to create an instance and store it
+ as `_lr`.
+
+5. Import the `cross_val_score` function:
+
+ ```
+ from sklearn.model_selection import cross_val_score
+ ```
+
+
+ In this step now, you import `cross_val_score`, which you
+ will make use of to compute the scores of the models.
+
+6. Compute the cross-validation scores:
+
+ ```
+ _scores = cross_val_score(_lr, features, labels, cv=5)
+ ```
+
+
+    In this step, you compute the cross-validation scores and store the
+    result in a variable called `_scores`. You do
+    this using `cross_val_score`. The function requires the
+    following four parameters: the model to make use of (in our case,
+    it\'s called `_lr`); the features of the dataset; the
+    labels of the dataset; and the number of cross-validation splits to
+    create (five, in our case).
+
+7. Now, display the scores as shown in the following code snippet:
+
+ ```
+ print(_scores)
+ ```
+
+
+ In this step, you display the scores using `print()`.
+
+ The output should look similar to the following:
+
+
+
+
+
+Caption: Printing the cross-validation scores
+
+
+
+LogisticRegressionCV
+====================
+
+
+`LogisticRegressionCV` is a class that implements
+cross-validation inside it. This class will train multiple
+`LogisticRegression` models and return the best one.
+
+
+
+Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation
+--------------------------------------------------------------------------
+
+The goal of this exercise is to train a logistic regression model using
+cross-validation and get the optimal R2 result. We will be making use of
+the Cars dataset that you worked with previously.
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the necessary libraries:
+
+ ```
+ # import libraries
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ ```
+
+
+ In this step, you import `pandas` and alias it as
+ `pd`. You will make use of pandas to read in the file you
+ will be working with.
+
+3. Create headers for the data:
+
+ ```
+ # data doesn't have headers, so let's create headers
+ _headers = ['buying', 'maint', 'doors', 'persons', \
+ 'lug_boot', 'safety', 'car']
+ ```
+
+
+ In this step, you start by creating a Python list to hold the
+ `headers` column for the file you will be working with.
+ You store this list as `_headers`.
+
+4. Read the data:
+
+ ```
+ # read in cars dataset
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab07/Dataset/car.data', \
+ names=_headers, index_col=None)
+ ```
+
+
+ You then proceed to read in the file and store it as `df`.
+ This is a DataFrame.
+
+5. Inspect the DataFrame:
+
+ ```
+ df.info()
+ ```
+
+
+ Finally, you look at the summary of the DataFrame using
+ `.info()`.
+
+ The output looks similar to the following:
+
+
+
+
+
+ Caption: The top five records of the dataframe
+
+6. Encode the categorical variables as shown in the following code
+ snippet:
+
+ ```
+ # encode categorical variables
+ _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors', \
+ 'persons', 'lug_boot', \
+ 'safety'])
+ _df.head()
+ ```
+
+
+ In this step, you convert categorical variables into encodings using
+ the `get_dummies()` method from pandas. You supply the
+ original DataFrame as a parameter and also specify the columns you
+ would like to encode.
+
+ Finally, you take a peek at the top five rows. The output looks
+ similar to the following:
+
+
+
+
+
+ Caption: Encoding categorical variables
+
+7. Split the DataFrame into features and labels:
+
+ ```
+ # separate features and labels DataFrames
+ features = _df.drop(['car'], axis=1).values
+ labels = _df[['car']].values
+ ```
+
+
+ In this step, you create two NumPy arrays. The first, called
+ `features`, contains the independent variables. The
+ second, called `labels`, contains the values that the
+ model learns to predict. These are also called `targets`.
+
+8. Import logistic regression with cross-validation:
+
+ ```
+ from sklearn.linear_model import LogisticRegressionCV
+ ```
+
+
+ In this step, you import the `LogisticRegressionCV` class.
+
+9. Instantiate `LogisticRegressionCV` as shown in the
+ following code snippet:
+
+ ```
+ model = LogisticRegressionCV(max_iter=2000, multi_class='auto',\
+ cv=5)
+ ```
+
+
+ In this step, you create an instance of
+ `LogisticRegressionCV`. You specify the following
+ parameters:
+
+    `max_iter`: You set this to `2000` so that the
+    solver can run for up to `2000` iterations while searching
+    for better weights.
+
+ `multi_class`: You set this to `auto` so that
+ the model automatically detects that your data has more than two
+ classes.
+
+ `cv`: You set this to `5`, which is the number
+ of cross-validation sets you would like to train on.
+
+10. Now fit the model:
+
+ ```
+ model.fit(features, labels.ravel())
+ ```
+
+
+ In this step, you train the model. You pass in `features`
+ and `labels`. Because `labels` is a 2D array,
+ you make use of `ravel()` to convert it into a 1D array
+ or vector.
+
+ The interpreter produces an output similar to the following:
+
+
+
+
+
+ Caption: Fitting the model
+
+ In the preceding output, you see that the model fits the training
+ data. The output shows you the parameters that were used in
+ training, so you are not taken by surprise. Notice, for example,
+ that `max_iter` is `2000`, which is the value
+ that you set. Other parameters you didn\'t set make use of default
+ values, which you can find out more about from the documentation.
+
+11. Evaluate the training R2:
+
+ ```
+ print(model.score(features, labels.ravel()))
+ ```
+
+
+ In this step, we make use of the training dataset to compute the R2
+ score. While we didn\'t set aside a specific validation dataset, it
+ is important to note that the model only saw 80% of our training
+ data, so it still has new data to work with for this evaluation.
+
+ The output looks similar to the following:
+
+
+
+
+
+Caption: Computing the R2 score
+
+
+
+Hyperparameter Tuning with GridSearchCV
+=======================================
+
+
+`GridSearchCV` will take a model and parameters and train one
+model for each permutation of the parameters. At the end of the
+training, it will provide access to the parameters and the model scores.
+This is called hyperparameter tuning and you will be looking at this in
+much more depth in *Lab 8, Hyperparameter Tuning*.
+
+The usual practice is to make use of a small training set to find the
+optimal parameters using hyperparameter tuning and then to train a final
+model with all of the data.
+
+Before the next exercise, let\'s take a brief look at decision trees,
+which are a type of model or estimator.
+
+
+
+Decision Trees
+--------------
+
+A decision tree works by generating a separating hyperplane or a
+threshold for the features in data. It does this by considering every
+feature and finding the correlation between the spread of the values in
+that feature and the label that you are trying to predict.
+
+Consider the following data about balloons. The label you need to
+predict is called `inflated`. This dataset is used for
+predicting whether the balloon is inflated or deflated given the
+features. The features are:
+
+- `color`
+- `size`
+- `act`
+- `age`
+
+The following table displays the distribution of features:
+
+
+
+Caption: Tabular data for balloon features
+
+Now consider the following charts, which are visualized depending on the
+spread of the features against the label:
+
+- If you consider the `Color` feature, the values are
+ `PURPLE` and `YELLOW`, but the number of
+ observations is the same, so you can\'t infer whether the balloon is
+ inflated or not based on the color, as you can see in the following
+ figure:
+
+
+
+
+Caption: Barplot for the color feature
+
+- The `Size` feature has two values: `LARGE` and
+ `SMALL`. These are equally spread, so we can\'t infer
+ whether the balloon is inflated or not based on the color, as you
+ can see in the following figure:
+
+
+
+
+Caption: Barplot for the size feature
+
+- The `Act` feature has two values: `DIP` and
+ `STRETCH`. You can see from the chart that the majority of
+ the `STRETCH` values are inflated. If you had to make a
+ guess, you could easily say that if `Act` is
+ `STRETCH`, then the balloon is inflated. Consider the
+ following figure:
+
+
+
+
+Caption: Barplot for the act feature
+
+- Finally, the `Age` feature also has two values:
+ `ADULT` and `CHILD`. It\'s also visible from the
+ chart that the `ADULT` value constitutes the majority of
+ inflated balloons:
+
+
+
+
+Caption: Barplot for the age feature
+
+The two features that are useful to the decision tree are
+`Act` and `Age`. The tree could start by considering
+whether `Act` is `STRETCH`. If it is, the prediction
+will be true. This tree would look like the following figure:
+
+
+
+Caption: Decision tree with depth=1
+
+The left side evaluates to the condition being false, and the right side
+evaluates to the condition being true. This tree has a depth of 1. F
+means that the prediction is false, and T means that the prediction is
+true.
+
+To get better results, the decision tree could introduce a second level.
+The second level would utilize the `Age` feature and evaluate
+whether the value is `ADULT`. It would look like the following
+figure:
+
+
+
+Caption: Decision tree with depth=2
+
+This tree has a depth of 2. At the first level, it predicts true if
+`Act` is `STRETCH`. If `Act` is not
+`STRETCH`, it checks whether `Age` is
+`ADULT`. If it is, it predicts true, otherwise, it predicts
+false.
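+
+Expressed as plain Python (purely illustrative, using the feature
+values from the balloons example), this depth-2 rule is simply:
+
+```
+def predict_inflated(act, age):
+    # first level of the tree: STRETCH predicts an inflated balloon
+    if act == 'STRETCH':
+        return True
+    # second level: otherwise, ADULT predicts inflated, CHILD does not
+    return age == 'ADULT'
+```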
+
+The decision tree can have as many levels as you like but starts to
+overfit at a certain point. As with everything in data science, the
+optimal depth depends on the data and is a hyperparameter, meaning you
+need to try different values to find the optimal one.
+
+In the following exercise, we will be making use of grid search with
+cross-validation to find the best parameters for a decision tree
+estimator.
+
+
+
+Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model
+----------------------------------------------------------------------------------------------
+
+The goal of this exercise is to make use of grid search to find the best
+parameters for a `DecisionTree` classifier. We will be making
+use of the Cars dataset that you worked with previously.
+
+The following steps will help you complete the exercise:
+
+1. Open a Colab notebook file.
+
+2. Import `pandas`:
+
+ ```
+ import pandas as pd
+ ```
+
+
+ In this step, you import `pandas`. You alias it as
+ `pd`. `Pandas` is used to read in the data you
+ will work with subsequently.
+
+3. Create `headers`:
+ ```
+ _headers = ['buying', 'maint', 'doors', 'persons', \
+ 'lug_boot', 'safety', 'car']
+ ```
+
+
+4. Read in the data:
+ ```
+ # read in cars dataset
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab07/Dataset/car.data', \
+ names=_headers, index_col=None)
+ ```
+
+
+5. Inspect the DataFrame:
+
+ ```
+ df.info()
+ ```
+
+
+ The output looks similar to the following:
+
+
+
+
+
+ Caption: The top five records of the dataframe
+
+6. Encode the categorical variables:
+
+ ```
+ _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\
+ 'persons', 'lug_boot', \
+ 'safety'])
+ _df.head()
+ ```
+
+
+ In this step, you utilize `.get_dummies()` to convert the
+ categorical variables into encodings. The `.head()` method
+ instructs the Python interpreter to output the top five columns.
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: Encoding categorical variables
+
+7. Separate `features` and `labels`:
+
+ ```
+ features = _df.drop(['car'], axis=1).values
+ labels = _df[['car']].values
+ ```
+
+
+ In this step, you create two `numpy` arrays,
+ `features` and `labels`, the first containing
+ independent variables or predictors, and the second containing
+ dependent variables or targets.
+
+8. Import more libraries -- `numpy`,
+ `DecisionTreeClassifier`, and `GridSearchCV`:
+
+ ```
+ import numpy as np
+ from sklearn.tree import DecisionTreeClassifier
+ from sklearn.model_selection import GridSearchCV
+ ```
+
+
+ In this step, you import `numpy`. NumPy is a numerical
+ computation library. You alias it as `np`. You also import
+ `DecisionTreeClassifier`, which you use to create decision
+ trees. Finally, you import `GridSearchCV`, which will use
+ cross-validation to train multiple models.
+
+9. Instantiate the decision tree:
+
+ ```
+ clf = DecisionTreeClassifier()
+ ```
+
+
+ In this step, you create an instance of
+ `DecisionTreeClassifier` as `clf`. This instance
+ will be used repeatedly by the grid search.
+
+10. Create parameters -- `max_depth`:
+
+ ```
+ params = {'max_depth': np.arange(1, 8)}
+ ```
+
+
+ In this step, you create a dictionary of parameters. There are two
+ parts to this dictionary:
+
+ The key of the dictionary is a parameter that is passed into the
+ model. In this case, `max_depth` is a parameter that
+ `DecisionTreeClassifier` takes.
+
+ The value is a Python list that grid search iterates over and passes
+ to the model. In this case, we create an array that starts at 1 and
+ ends at 7, inclusive.
+
+11. Instantiate the grid search as shown in the following code snippet:
+
+ ```
+ clf_cv = GridSearchCV(clf, param_grid=params, cv=5)
+ ```
+
+
+ In this step, you create an instance of `GridSearchCV`.
+ The first parameter is the model to train. The second parameter is
+ the parameters to search over. The third parameter is the number of
+ cross-validation splits to create.
+
+12. Now train the models:
+
+ ```
+ clf_cv.fit(features, labels)
+ ```
+
+
+ In this step, you train the models using the features and labels.
+ Depending on the type of model, this could take a while. Because we
+ are using a decision tree, it trains quickly.
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: Training the model
+
+ You can learn a lot by reading the output, such as the number of
+ cross-validation datasets created (called `cv` and equal
+ to `5`), the estimator used
+ (`DecisionTreeClassifier`), and the parameter search space
+ (called `param_grid`).
+
+13. Print the best parameter:
+
+ ```
+ print("Tuned Decision Tree Parameters: {}"\
+ .format(clf_cv.best_params_))
+ ```
+
+
+ In this step, you print out what the best parameter is. In this
+ case, what we were looking for was the best `max_depth`.
+ The output looks like the following:
+
+
+
+
+
+ Caption: Printing the best parameter
+
+ In the preceding output, you see that the best performing model is
+ one with a `max_depth` of `2`.
+
+ Accessing `best_params_` lets you train another model with
+ the best-known parameters using a larger training dataset.
+
+14. Print the best `R2`:
+
+ ```
+ print("Best score is {}".format(clf_cv.best_score_))
+ ```
+
+
+ In this step, you print out the `R2` score of the best
+ performing model.
+
+ The output is similar to the following:
+
+ ```
+ Best score is 0.7777777777777778
+ ```
+
+
+ In the preceding output, you see that the best performing model has
+ an `R2` score of `0.778`.
+
+15. Access the best model:
+
+ ```
+ model = clf_cv.best_estimator_
+ model
+ ```
+
+
+ In this step, you access the best model (or estimator) using
+ `best_estimator_`. This will let you analyze the model, or
+ optionally use it to make predictions and find other metrics.
+ Instructing the Python interpreter to print the best estimator will
+ yield an output similar to the following:
+
+
+
+
+
+Caption: Accessing the model
+
+In the preceding output, you see that the best model is
+`DecisionTreeClassifier` with a `max_depth` of
+`2`.
+
+
+
+Hyperparameter Tuning with RandomizedSearchCV
+=============================================
+
+
+Grid search goes over the entire search space and trains a model or
+estimator for every combination of parameters. Randomized search goes
+over only some of the combinations. This is a more optimal use of
+resources and still provides the benefits of hyperparameter tuning and
+cross-validation. You will be looking at this in depth in *Lab 8,
+Hyperparameter Tuning*.
+
+Have a look at the following exercise.
+
+
+
+Exercise 7.08: Using Randomized Search for Hyperparameter Tuning
+----------------------------------------------------------------
+
+The goal of this exercise is to perform hyperparameter tuning using
+randomized search and cross-validation.
+
+The following steps will help you complete this exercise:
+
+1. Open a new Colab notebook file.
+
+2. Import `pandas`:
+
+ ```
+ import pandas as pd
+ ```
+
+
+ In this step, you import `pandas`. You will make use of it
+ in the next step.
+
+3. Create `headers`:
+ ```
+ _headers = ['buying', 'maint', 'doors', 'persons', \
+ 'lug_boot', 'safety', 'car']
+ ```
+
+
+4. Read in the data:
+ ```
+ # read in cars dataset
+ df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab07/Dataset/car.data', \
+ names=_headers, index_col=None)
+ ```
+
+
+5. Inspect the DataFrame:
+
+ ```
+ df.info()
+ ```
+
+
+ You need to provide a Python list of column headers because the data
+ does not contain column headers. You also inspect the DataFrame that
+ you created.
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: The top five rows of the DataFrame
+
+6. Encode categorical variables as shown in the following code snippet:
+
+ ```
+ _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\
+ 'persons', 'lug_boot', \
+ 'safety'])
+ _df.head()
+ ```
+
+
+ In this step, you find a numerical representation of text data using
+ one-hot encoding. The operation results in a new DataFrame. You will
+ see that the resulting data structure looks similar to the
+ following:
+
+
+
+
+
+ Caption: Encoding categorical variables
+
+7. Separate the data into independent and dependent variables, which
+ are the `features` and `labels`:
+
+ ```
+ features = _df.drop(['car'], axis=1).values
+ labels = _df[['car']].values
+ ```
+
+
+ In this step, you separate the DataFrame into two `numpy`
+ arrays called `features` and `labels`.
+ `Features` contains the independent variables, while
+ `labels` contains the target or dependent variables.
+
+8. Import additional libraries -- `numpy`,
+ `RandomForestClassifier`, and
+ `RandomizedSearchCV`:
+
+ ```
+ import numpy as np
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.model_selection import RandomizedSearchCV
+ ```
+
+
+ In this step, you import `numpy` for numerical
+ computations, `RandomForestClassifier` to create an
+ ensemble of estimators, and `RandomizedSearchCV` to
+ perform a randomized search with cross-validation.
+
+9. Create an instance of `RandomForestClassifier`:
+
+ ```
+ clf = RandomForestClassifier()
+ ```
+
+
+ In this step, you instantiate `RandomForestClassifier`. A
+ random forest classifier is a voting classifier. It makes use of
+ multiple decision trees, which are trained on different subsets of
+ the data. The results from the trees contribute to the output of the
+ random forest by using a voting mechanism.
+
+10. Specify the parameters:
+
+ ```
+ params = {'n_estimators':[500, 1000, 2000], \
+ 'max_depth': np.arange(1, 8)}
+ ```
+
+
+ `RandomForestClassifier` accepts many parameters, but we
+ specify two: the number of trees in the forest, called
+ `n_estimators`, and the depth of the nodes in each tree,
+ called `max_depth`.
+
+11. Instantiate a randomized search:
+
+ ```
+ clf_cv = RandomizedSearchCV(clf, param_distributions=params, \
+ cv=5)
+ ```
+
+
+    In this step, you instantiate `RandomizedSearchCV` with
+    three parameters: the estimator (or model) to use, which is the
+    random forest classifier `clf`; `param_distributions`,
+    the parameter search space; and `cv`, the number of
+    cross-validation datasets to create.
+
+12. Perform the search:
+
+ ```
+ clf_cv.fit(features, labels.ravel())
+ ```
+
+
+ In this step, you perform the search by calling `fit()`.
+ This operation trains different models using the cross-validation
+ datasets and various combinations of the hyperparameters. The output
+ from this operation is similar to the following:
+
+
+
+
+
+ Caption: Output of the search operation
+
+ In the preceding output, you see that the randomized search will be
+ carried out using cross-validation with five splits
+ (`cv=5`). The estimator to be used is
+ `RandomForestClassifier`.
+
+13. Print the best parameter combination:
+
+ ```
+ print("Tuned Random Forest Parameters: {}"\
+ .format(clf_cv.best_params_))
+ ```
+
+
+ In this step, you print out the best hyperparameters.
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: Printing the best parameter combination
+
+ In the preceding output, you see that the best estimator is a Random
+ Forest classifier with 1,000 trees (`n_estimators=1000`)
+ and `max_depth=5`. You can print the best score by
+ executing
+ `print("Best score is {}".format(clf_cv.best_score_))`.
+ For this exercise, this value is \~ `0.76`.
+
+14. Inspect the best model:
+
+ ```
+ model = clf_cv.best_estimator_
+ model
+ ```
+
+
+ In this step, you find the best performing estimator (or model) and
+ print out its details. The output is similar to the following:
+
+
+
+
+
+Caption: Inspecting the model
+
+In the preceding output, you see that the best estimator is
+`RandomForestClassifier` with `n_estimators=1000`
+and `max_depth=5`.
+
+
+In this exercise, you learned to make use of cross-validation and random
+search to find the best model using a combination of hyperparameters.
+This process is called hyperparameter tuning, in which you find the best
+combination of hyperparameters to use to train the model that you will
+put into production.
+
+
+Model Regularization with Lasso Regression
+==========================================
+
+
+As mentioned at the beginning of this lab, models can overfit
+training data. One reason for this is having too many features with
+large coefficients (also called weights). The key to solving this type
+of overfitting problem is reducing the magnitude of the coefficients.
+
+You may recall that weights are optimized during model training. One
+method for optimizing weights is called gradient descent. The gradient
+update rule makes use of a differentiable loss function. Examples of
+differentiable loss functions are:
+
+- Mean Absolute Error (MAE)
+- Mean Squared Error (MSE)
+
+For lasso regression, a penalty is introduced in the loss function. The
+technicalities of this implementation are hidden by the class. The
+penalty is also called a regularization parameter.
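+
+Up to a constant scaling of the first term, the objective that lasso
+regression minimizes can be written as the base loss (here MSE) plus an
+L1 penalty on the weights, where alpha is the regularization parameter
+exposed as the `alpha` argument of scikit-learn\'s `Lasso` class:
+
+```
+L_{lasso} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} |w_j|
+```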
+
+Consider the following exercise in which you over-engineer a model to
+introduce overfitting, and then use lasso regression to get better
+results.
+
+
+
+Exercise 7.09: Fixing Model Overfitting Using Lasso Regression
+--------------------------------------------------------------
+
+The goal of this exercise is to teach you how to identify when your
+model starts overfitting, and to use lasso regression to fix overfitting
+in your model.
+
+
+You will be working with the Combined Cycle Power Plant dataset
+(`ccpp.csv`). The attribute information states: \"Features consist of
+hourly average ambient variables:
+
+- Temperature (T) in the range 1.81°C and 37.11°C,
+- Ambient Pressure (AP) in the range 992.89-1033.30 millibar,
+- Relative Humidity (RH) in the range 25.56% to 100.16%
+- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
+- Net hourly electrical energy output (EP) 420.26-495.76 MW
+
+The averages are taken from various sensors located around the plant
+that record the ambient variables every second. The variables are given
+without normalization.\"
+
+The following steps will help you complete the exercise:
+
+1. Open a Colab notebook.
+
+2. Import the required libraries:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression, Lasso
+ from sklearn.metrics import mean_squared_error
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import MinMaxScaler, \
+ PolynomialFeatures
+ ```
+
+
+3. Read in the data:
+ ```
+ _df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab07/Dataset/ccpp.csv')
+ ```
+
+
+4. Inspect the DataFrame:
+
+ ```
+ _df.info()
+ ```
+
+
+ The `.info()` method prints out a summary of the
+ DataFrame, including the names of the columns and the number of
+ records. The output might be similar to the following:
+
+
+
+
+
+ Caption: Inspecting the dataframe
+
+ You can see from the preceding figure that the DataFrame has 5
+ columns and 9,568 records. You can see that all columns contain
+ numeric data and that the columns have the following names:
+ `AT`, `V`, `AP`, `RH`, and
+ `PE`.
+
+5. Extract features into a column called `X`:
+ ```
+ X = _df.drop(['PE'], axis=1).values
+ ```
+
+
+6. Extract labels into a column called `y`:
+ ```
+ y = _df['PE'].values
+ ```
+
+
+7. Split the data into training and evaluation sets:
+ ```
+ train_X, eval_X, train_y, eval_y = train_test_split\
+ (X, y, train_size=0.8, \
+ random_state=0)
+ ```
+
+
+8. Create an instance of a `LinearRegression` model:
+ ```
+ lr_model_1 = LinearRegression()
+ ```
+
+
+9. Fit the model on the training data:
+
+ ```
+ lr_model_1.fit(train_X, train_y)
+ ```
+
+
+ The output from this step should look similar to the following:
+
+
+
+
+
+ Caption: Fitting the model on training data
+
+10. Use the model to make predictions on the evaluation dataset:
+ ```
+ lr_model_1_preds = lr_model_1.predict(eval_X)
+ ```
+
+
+11. Print out the `R2` score of the model:
+
+ ```
+ print('lr_model_1 R2 Score: {}'\
+ .format(lr_model_1.score(eval_X, eval_y)))
+ ```
+
+
+ The output of this step should look similar to the following:
+
+
+
+
+
+ Caption: Printing the R2 score
+
+ You will notice that the `R2` score for this model is
+ `0.926`. You will make use of this figure to compare with
+ the next model you train. Recall that this is an evaluation metric.
+
+12. Print out the Mean Squared Error (MSE) of this model:
+
+ ```
+ print('lr_model_1 MSE: {}'\
+ .format(mean_squared_error(eval_y, lr_model_1_preds)))
+ ```
+
+
+ The output of this step should look similar to the following:
+
+
+
+
+
+ Caption: Printing the MSE
+
+ You will notice that the MSE is `21.675`. This is an
+ evaluation metric that you will use to compare this model to
+ subsequent models.
+
+    The first model was trained on the four original features. You will
+    now train a new model on polynomial features of degree 3 generated
+    from those same four features.
+
+13. Create a list of tuples to serve as a pipeline:
+
+ ```
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(degree=3)),\
+ ('lr', LinearRegression())]
+ ```
+
+
+ In this step, you create a list with three tuples. The first tuple
+ represents a scaling operation that makes use of
+ `MinMaxScaler`. The second tuple represents a feature
+ engineering step and makes use of `PolynomialFeatures`.
+ The third tuple represents a `LinearRegression` model.
+
+ The first element of the tuple represents the name of the step,
+ while the second element represents the class that performs a
+ transformation or an estimator.
+
+14. Create an instance of a pipeline:
+ ```
+ lr_model_2 = Pipeline(steps)
+ ```
+
+
+15. Train the instance of the pipeline:
+
+ ```
+ lr_model_2.fit(train_X, train_y)
+ ```
+
+
+ The pipeline implements a `.fit()` method, which is also
+ implemented in all instances of transformers and estimators. The
+ `.fit()` method causes `.fit_transform()` to be
+ called on transformers, and causes `.fit()` to be called
+ on estimators. The output of this step is similar to the following:
+
+
+
+
+
+ Caption: Training the instance of the pipeline
+
+ You can see from the output that a pipeline was trained. You can see
+ that the steps are made up of `MinMaxScaler` and
+ `PolynomialFeatures`, and that the final step is made up
+ of `LinearRegression`.
+
+16. Print out the `R2` score of the model:
+
+ ```
+ print('lr_model_2 R2 Score: {}'\
+ .format(lr_model_2.score(eval_X, eval_y)))
+ ```
+
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: The R2 score of the model
+
+    You can see from the preceding output that the `R2` score is
+    `0.944`, which is better than the `R2` score of
+    the first model, which was `0.926`. You can start to
+    observe that the metrics suggest that this model is better than the
+    first one.
+
+17. Use the model to predict on the evaluation data:
+ ```
+ lr_model_2_preds = lr_model_2.predict(eval_X)
+ ```
+
+
+18. Print the MSE of the second model:
+
+ ```
+ print('lr_model_2 MSE: {}'\
+ .format(mean_squared_error(eval_y, lr_model_2_preds)))
+ ```
+
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: The MSE of the second model
+
+    You can see from the output that the MSE of the second model is
+    `16.27`. This is less than the MSE of the first model,
+    which was `21.675`. You can safely conclude that the second
+    model is better than the first.
+
+19. Inspect the model coefficients (also called weights):
+
+ ```
+ print(lr_model_2[-1].coef_)
+ ```
+
+
+ In this step, you will note that `lr_model_2` is a
+ pipeline. The final object in this pipeline is the model, so you
+ make use of list addressing to access this by setting the index of
+ the list element to `-1`.
+
+ Once you have the model, which is the final element in the pipeline,
+ you make use of `.coef_` to get the model coefficients.
+ The output is similar to the following:
+
+
+
+
+
+ Caption: Print the model coefficients
+
+ You will note from the preceding output that the majority of the
+ values are in the tens, some values are in the hundreds, and one
+ value has a really small magnitude.
+
+20. Check for the number of coefficients in this model:
+
+ ```
+ print(len(lr_model_2[-1].coef_))
+ ```
+
+
+ The output for this step is similar to the following:
+
+ ```
+ 35
+ ```
+
+
+    You can see from the preceding output that the second model has
+    `35` coefficients.
+
+21. Create a `steps` list with `PolynomialFeatures`
+ of degree `10`:
+ ```
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(degree=10)),\
+ ('lr', LinearRegression())]
+ ```
+
+
+22. Create a third model from the preceding steps:
+ ```
+ lr_model_3 = Pipeline(steps)
+ ```
+
+
+23. Fit the third model on the training data:
+
+ ```
+ lr_model_3.fit(train_X, train_y)
+ ```
+
+
+ The output from this step is similar to the following:
+
+
+
+
+
+ Caption: Fitting the third model on the data
+
+ You can see from the output that the pipeline makes use of
+ `PolynomialFeatures` of degree `10`. You are
+ doing this in the hope of getting a better model.
+
+24. Print out the `R2` score of this model:
+
+ ```
+ print('lr_model_3 R2 Score: {}'\
+ .format(lr_model_3.score(eval_X, eval_y)))
+ ```
+
+
+ The output of this model is similar to the following:
+
+
+
+
+
+ Caption: R2 score of the model
+
+ You can see from the preceding figure that the R2 score is now
+ `0.56`. The previous model had an `R2` score of
+ `0.944`. This model has an R2 score that is considerably
+    worse than that of the previous model, `lr_model_2`.
+ This happens when your model is overfitting.
+
+25. Use `lr_model_3` to predict on evaluation data:
+ ```
+ lr_model_3_preds = lr_model_3.predict(eval_X)
+ ```
+
+
+26. Print out the MSE for `lr_model_3`:
+
+ ```
+ print('lr_model_3 MSE: {}'\
+ .format(mean_squared_error(eval_y, lr_model_3_preds)))
+ ```
+
+
+ The output for this step might be similar to the following:
+
+
+
+
+
+ Caption: The MSE of the model
+
+ You can see from the preceding figure that the MSE is also
+ considerably worse. The MSE is `126.25`, as compared to
+ `16.27` for the previous model.
+
+27. Print out the number of coefficients (also called weights) in this
+ model:
+
+ ```
+ print(len(lr_model_3[-1].coef_))
+ ```
+
+
+ The output might resemble the following:
+
+
+
+
+
+ Caption: Printing the number of coefficients
+
+ You can see that the model has 1,001 coefficients.
+
+28. Inspect the first 35 coefficients to get a sense of the individual
+ magnitudes:
+
+ ```
+ print(lr_model_3[-1].coef_[:35])
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+ Caption: Inspecting the first 35 coefficients
+
+ You can see from the output that the coefficients have significantly
+ larger magnitudes than the coefficients from `lr_model_2`.
+
+ In the next steps, you will train a lasso regression model on the
+ same set of features to reduce overfitting.
+
+29. Create a list of steps for the pipeline you will create later on:
+
+ ```
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(degree=10)),\
+ ('lr', Lasso(alpha=0.01))]
+ ```
+
+
+ You create a list of steps for the pipeline you will create. Note
+ that the third step in this list is an instance of lasso. The
+ parameter called `alpha` in the call to
+ `Lasso()` is the regularization parameter. You can play
+ around with any values from 0 to 1 to see how it affects the
+ performance of the model that you train.
+
+30. Create an instance of a pipeline:
+ ```
+ lasso_model = Pipeline(steps)
+ ```
+
+
+31. Fit the pipeline on the training data:
+
+ ```
+ lasso_model.fit(train_X, train_y)
+ ```
+
+
+ The output from this operation might be similar to the following:
+
+
+
+
+
+ Caption: Fitting the pipeline on the training data
+
+ You can see from the output that the pipeline trained a lasso model
+ in the final step. The regularization parameter was `0.01`
+ and the model trained for a maximum of 1,000 iterations.
+
+32. Print the `R2` score of `lasso_model`:
+
+ ```
+ print('lasso_model R2 Score: {}'\
+ .format(lasso_model.score(eval_X, eval_y)))
+ ```
+
+
+ The output of this step might be similar to the following:
+
+
+
+
+
+ Caption: R2 score
+
+ You can see that the `R2` score has climbed back up to
+ `0.94`, which is considerably better than the score of
+ `0.56` that `lr_model_3` had. This is already
+ looking like a better model.
+
+33. Use `lasso_model` to predict on the evaluation data:
+ ```
+ lasso_preds = lasso_model.predict(eval_X)
+ ```
+
+
+34. Print the MSE of `lasso_model`:
+
+ ```
+ print('lasso_model MSE: {}'\
+ .format(mean_squared_error(eval_y, lasso_preds)))
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+ Caption: MSE of lasso model
+
+ You can see from the output that the MSE is `17.01`, which
+ is way lower than the MSE value of `126.25` that
+ `lr_model_3` had. You can safely conclude that this is a
+ much better model.
+
+35. Print out the number of coefficients in `lasso_model`:
+
+ ```
+ print(len(lasso_model[-1].coef_))
+ ```
+
+
+ The output might be similar to the following:
+
+ ```
+ 1001
+ ```
+
+
+ You can see that this model has 1,001 coefficients, which is the
+ same number of coefficients that `lr_model_3` had.
+
+36. Print out the values of the first 35 coefficients:
+
+ ```
+ print(lasso_model[-1].coef_[:35])
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+Caption: Printing the values of 35 coefficients
+
+You can see from the preceding output that some of the coefficients are
+set to `0`. This has the effect of ignoring the corresponding
+column of data in the input. You can also see that the remaining
+coefficients have magnitudes of less than 100. This goes to show that
+the model is no longer overfitting.
+
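+If you want to quantify that effect, a quick check (a minimal sketch,
+assuming the `lasso_model` pipeline trained in the preceding steps is
+still in memory) is to count how many coefficients lasso set to exactly
+zero:
+
+```
+import numpy as np
+# count the coefficients that the lasso penalty zeroed out entirely
+print(np.sum(lasso_model[-1].coef_ == 0))
+```
+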
+This exercise taught you how to fix overfitting by using lasso
+regression (scikit-learn\'s `Lasso` estimator) to train a new model.
+
+In the next section, you will learn about using ridge regression to
+solve overfitting in a model.
+
+
+Ridge Regression
+================
+
+
+You just learned about lasso regression, which introduces a penalty and
+tries to eliminate certain features from the data. Ridge regression
+takes an alternative approach by introducing a penalty that penalizes
+large weights. As a result, the optimization process tries to reduce the
+magnitude of the coefficients without completely eliminating them.
+
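+As a quick illustration of that idea (a minimal sketch on synthetic data,
+not part of the exercise that follows), you can fit `Ridge` with a few
+different `alpha` values and watch the coefficient magnitudes shrink as
+the penalty grows, without any of them being set to exactly zero:
+
+```
+from sklearn.linear_model import Ridge
+from sklearn.datasets import make_regression
+# small synthetic regression problem
+X_demo, y_demo = make_regression(n_samples=200, n_features=5, \
+                                 noise=10, random_state=0)
+for alpha in [0.01, 1.0, 100.0]:
+    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
+    # larger alpha -> smaller, but still non-zero, coefficients
+    print(alpha, model.coef_.round(2))
+```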
+
+
+Exercise 7.10: Fixing Model Overfitting Using Ridge Regression
+--------------------------------------------------------------
+
+The goal of this exercise is to teach you how to identify when your
+model starts overfitting, and to use ridge regression to fix overfitting
+in your model.
+
+Note
+
+You will be using the same dataset as in *Exercise 7.09*, *Fixing Model
+Overfitting Using Lasso Regression.*
+
+The following steps will help you complete the exercise:
+
+1. Open a Colab notebook.
+
+2. Import the required libraries:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression, Ridge
+ from sklearn.metrics import mean_squared_error
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import MinMaxScaler, \
+ PolynomialFeatures
+ ```
+
+
+3. Read in the data:
+ ```
+ _df = pd.read_csv('https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab07/Dataset/ccpp.csv')
+ ```
+
+
+4. Inspect the DataFrame:
+
+ ```
+ _df.info()
+ ```
+
+
+ The `.info()` method prints out a summary of the
+ DataFrame, including the names of the columns and the number of
+ records. The output might be similar to the following:
+
+
+
+
+
+ Caption: Inspecting the dataframe
+
+ You can see from the preceding figure that the DataFrame has 5
+ columns and 9,568 records. You can see that all columns contain
+ numeric data and that the columns have the names: `AT`,
+ `V`, `AP`, `RH`, and `PE`.
+
+5. Extract the features into a variable called `X`:
+ ```
+ X = _df.drop(['PE'], axis=1).values
+ ```
+
+
+6. Extract the labels into a variable called `y`:
+ ```
+ y = _df['PE'].values
+ ```
+
+
+7. Split the data into training and evaluation sets:
+ ```
+ train_X, eval_X, train_y, eval_y = train_test_split\
+ (X, y, train_size=0.8, \
+ random_state=0)
+ ```
+
+
+8. Create an instance of a `LinearRegression` model:
+ ```
+ lr_model_1 = LinearRegression()
+ ```
+
+
+9. Fit the model on the training data:
+
+ ```
+ lr_model_1.fit(train_X, train_y)
+ ```
+
+
+ The output from this step should look similar to the following:
+
+
+
+
+
+ Caption: Fitting the model on data
+
+10. Use the model to make predictions on the evaluation dataset:
+ ```
+ lr_model_1_preds = lr_model_1.predict(eval_X)
+ ```
+
+
+11. Print out the `R2` score of the model:
+
+ ```
+ print('lr_model_1 R2 Score: {}'\
+ .format(lr_model_1.score(eval_X, eval_y)))
+ ```
+
+
+ The output of this step should look similar to the following:
+
+
+
+
+
+ Caption: R2 score
+
+ You will notice that the R2 score for this model is
+ `0.933`. You will make use of this figure to compare it
+ with the next model you train. Recall that this is an evaluation
+ metric.
+
+12. Print out the MSE of this model:
+
+ ```
+ print('lr_model_1 MSE: {}'\
+ .format(mean_squared_error(eval_y, lr_model_1_preds)))
+ ```
+
+
+ The output of this step should look similar to the following:
+
+
+
+
+
+ Caption: The MSE of the model
+
+ You will notice that the MSE is `19.734`. This is an
+ evaluation metric that you will use to compare this model to
+ subsequent models.
+
+    The first model was trained on the four original features. You will
+    now engineer polynomial features of degree 3 from them and train a
+    new model on the expanded feature set.
+
+13. Create a list of tuples to serve as a pipeline:
+
+ ```
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(degree=3)),\
+ ('lr', LinearRegression())]
+ ```
+
+
+ In this step, you create a list with three tuples. The first tuple
+ represents a scaling operation that makes use of
+ `MinMaxScaler`. The second tuple represents a feature
+ engineering step and makes use of `PolynomialFeatures`.
+ The third tuple represents a `LinearRegression` model.
+
+ The first element of the tuple represents the name of the step,
+ while the second element represents the class that performs a
+ transformation or an estimation.
+
+14. Create an instance of a pipeline:
+ ```
+ lr_model_2 = Pipeline(steps)
+ ```
+
+
+15. Train the instance of the pipeline:
+
+ ```
+ lr_model_2.fit(train_X, train_y)
+ ```
+
+
+ The pipeline implements a `.fit()` method, which is also
+ implemented in all instances of transformers and estimators. The
+ `.fit()` method causes `.fit_transform()` to be
+ called on transformers, and causes `.fit()` to be called
+ on estimators. The output of this step is similar to the following:
+
+
+
+
+
+ Caption: Training the instance of a pipeline
+
+ You can see from the output that a pipeline was trained. You can see
+ that the steps are made up of `MinMaxScaler` and
+ `PolynomialFeatures`, and that the final step is made up
+ of `LinearRegression`.
+
+16. Print out the `R2` score of the model:
+
+ ```
+ print('lr_model_2 R2 Score: {}'\
+ .format(lr_model_2.score(eval_X, eval_y)))
+ ```
+
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: R2 score
+
+ You can see from the preceding that the R2 score is
+ `0.944`, which is better than the R2 score of the first
+ model, which was `0.933`. You can start to observe that
+ the metrics suggest that this model is better than the first one.
+
+17. Use the model to predict on the evaluation data:
+ ```
+ lr_model_2_preds = lr_model_2.predict(eval_X)
+ ```
+
+
+18. Print the MSE of the second model:
+
+ ```
+ print('lr_model_2 MSE: {}'\
+ .format(mean_squared_error(eval_y, lr_model_2_preds)))
+ ```
+
+
+ The output is similar to the following:
+
+
+
+
+
+ Caption: The MSE of the model
+
+ You can see from the output that the MSE of the second model is
+ `16.272`. This is less than the MSE of the first model,
+ which is `19.734`. You can safely conclude that the second
+ model is better than the first.
+
+19. Inspect the model coefficients (also called weights):
+
+ ```
+ print(lr_model_2[-1].coef_)
+ ```
+
+
+ In this step, you will note that `lr_model_2` is a
+ pipeline. The final object in this pipeline is the model, so you
+ make use of list addressing to access this by setting the index of
+ the list element to `-1`.
+
+ Once you have the model, which is the final element in the pipeline,
+ you make use of `.coef_` to get the model coefficients.
+ The output is similar to the following:
+
+
+
+
+
+ Caption: Printing model coefficients
+
+ You will note from the preceding output that the majority of the
+ values are in the tens, some values are in the hundreds, and one
+ value has a really small magnitude.
+
+20. Check the number of coefficients in this model:
+
+ ```
+ print(len(lr_model_2[-1].coef_))
+ ```
+
+
+ The output of this step is similar to the following:
+
+
+
+
+
+ Caption: Checking the number of coefficients
+
+ You will see from the preceding that the second model has 35
+ coefficients.
+
+21. Create a `steps` list with `PolynomialFeatures`
+ of degree `10`:
+ ```
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(degree=10)),\
+ ('lr', LinearRegression())]
+ ```
+
+
+22. Create a third model from the preceding steps:
+ ```
+ lr_model_3 = Pipeline(steps)
+ ```
+
+
+23. Fit the third model on the training data:
+
+ ```
+ lr_model_3.fit(train_X, train_y)
+ ```
+
+
+ The output from this step is similar to the following:
+
+
+
+
+
+ Caption: Fitting lr\_model\_3 on the training data
+
+ You can see from the output that the pipeline makes use of
+ `PolynomialFeatures` of degree `10`. You are
+ doing this in the hope of getting a better model.
+
+24. Print out the `R2` score of this model:
+
+ ```
+ print('lr_model_3 R2 Score: {}'\
+ .format(lr_model_3.score(eval_X, eval_y)))
+ ```
+
+
+ The output of this model is similar to the following:
+
+
+
+
+
+ Caption: R2 score
+
+ You can see from the preceding figure that the `R2` score
+    is now `0.568`. The previous model had an `R2`
+    score of `0.944`. This model has an `R2` score
+    that is worse than that of the previous model,
+ `lr_model_2`. This happens when your model is overfitting.
+
+25. Use `lr_model_3` to predict on evaluation data:
+ ```
+ lr_model_3_preds = lr_model_3.predict(eval_X)
+ ```
+
+
+26. Print out the MSE for `lr_model_3`:
+
+ ```
+ print('lr_model_3 MSE: {}'\
+ .format(mean_squared_error(eval_y, lr_model_3_preds)))
+ ```
+
+
+ The output of this step might be similar to the following:
+
+
+
+
+
+ Caption: The MSE of lr\_model\_3
+
+ You can see from the preceding figure that the MSE is also worse.
+ The MSE is `126.254`, as compared to `16.271`
+ for the previous model.
+
+27. Print out the number of coefficients (also called weights) in this
+ model:
+
+ ```
+ print(len(lr_model_3[-1].coef_))
+ ```
+
+
+ The output might resemble the following:
+
+ ```
+ 1001
+ ```
+
+
+ You can see that the model has `1,001` coefficients.
+
+28. Inspect the first `35` coefficients to get a sense of the
+ individual magnitudes:
+
+ ```
+ print(lr_model_3[-1].coef_[:35])
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+ Caption: Inspecting 35 coefficients
+
+ You can see from the output that the coefficients have significantly
+ larger magnitudes than the coefficients from `lr_model_2`.
+
+ In the next steps, you will train a ridge regression model on the
+ same set of features to reduce overfitting.
+
+29. Create a list of steps for the pipeline you will create later on:
+
+ ```
+ steps = [('scaler', MinMaxScaler()),\
+ ('poly', PolynomialFeatures(degree=10)),\
+ ('lr', Ridge(alpha=0.9))]
+ ```
+
+
+ You create a list of steps for the pipeline you will create. Note
+ that the third step in this list is an instance of
+ `Ridge`. The parameter called `alpha` in the
+ call to `Ridge()` is the regularization parameter. You can
+ play around with any values from 0 to 1 to see how it affects the
+ performance of the model that you train.
+
+30. Create an instance of a pipeline:
+ ```
+ ridge_model = Pipeline(steps)
+ ```
+
+
+31. Fit the pipeline on the training data:
+
+ ```
+ ridge_model.fit(train_X, train_y)
+ ```
+
+
+ The output of this operation might be similar to the following:
+
+
+
+
+
+ Caption: Fitting the pipeline on training data
+
+ You can see from the output that the pipeline trained a ridge model
+    in the final step. The regularization parameter (`alpha`) was `0.9`.
+
+32. Print the R2 score of `ridge_model`:
+
+ ```
+ print('ridge_model R2 Score: {}'\
+ .format(ridge_model.score(eval_X, eval_y)))
+ ```
+
+
+ The output of this step might be similar to the following:
+
+
+
+
+
+ Caption: R2 score
+
+ You can see that the R2 score has climbed back up to
+ `0.945`, which is way better than the score of
+ `0.568` that `lr_model_3` had. This is already
+ looking like a better model.
+
+33. Use `ridge_model` to predict on the evaluation data:
+ ```
+ ridge_model_preds = ridge_model.predict(eval_X)
+ ```
+
+
+34. Print the MSE of `ridge_model`:
+
+ ```
+ print('ridge_model MSE: {}'\
+ .format(mean_squared_error(eval_y, ridge_model_preds)))
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+ Caption: The MSE of ridge\_model
+
+ You can see from the output that the MSE is `16.030`,
+ which is lower than the MSE value of `126.254` that
+ `lr_model_3` had. You can safely conclude that this is a
+ much better model.
+
+35. Print out the number of coefficients in `ridge_model`:
+
+ ```
+ print(len(ridge_model[-1].coef_))
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+ Caption: The number of coefficients in the ridge model
+
+ You can see that this model has `1001` coefficients, which
+ is the same number of coefficients that `lr_model_3` had.
+
+36. Print out the values of the first 35 coefficients:
+
+ ```
+ print(ridge_model[-1].coef_[:35])
+ ```
+
+
+ The output might be similar to the following:
+
+
+
+
+
+Caption: The values of the first 35 coefficients
+
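+To confirm the shrinkage numerically, you could also compare the largest
+coefficient magnitudes of the overfit model and the ridge model (a small
+sketch that assumes both pipelines from this exercise are still in
+memory):
+
+```
+import numpy as np
+# ridge keeps every coefficient but pulls the extreme values towards zero
+print(np.abs(lr_model_3[-1].coef_).max())
+print(np.abs(ridge_model[-1].coef_).max())
+```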
+
+This exercise taught you how to fix overfitting by using ridge
+regression (scikit-learn\'s `Ridge` estimator) to train a new model.
+
+
+
+Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors
+------------------------------------------------------------------------------------------------
+
+You work as a data scientist for a cable manufacturer. Management has
+decided to start shipping low-resistance cables to clients around the
+world. To ensure that the right cables are shipped to the right
+countries, they would like to predict the critical temperatures of
+various cables based on certain observed readings.
+
+In this activity, you will train a linear regression model and compute
+the R2 score and the MSE. You will proceed to engineer new features
+using polynomial features of degree 3. You will compare the R2 score and
+MSE of this new model to those of the first model to determine
+overfitting. You will then use regularization to train a model that
+generalizes to previously unseen data.
+
+
+
+The steps to accomplish this task are:
+
+1. Open a Colab notebook.
+
+2. Load the necessary libraries.
+
+3. Read in the data from the `superconduct` folder.
+
+4. Prepare the `X` and `y` variables.
+
+5. Split the data into training and evaluation sets.
+
+6. Create a baseline linear regression model.
+
+7. Print out the R2 score and MSE of the model.
+
+8. Create a pipeline to engineer polynomial features and train a linear
+ regression model.
+
+9. Print out the R2 score and MSE.
+
+10. Determine that this new model is overfitting.
+
+11. Create a pipeline to engineer polynomial features and train a ridge
+ or lasso model.
+
+12. Print out the R2 score and MSE.
+
+ The output will be as follows:
+
+
+
+
+
+ Caption: The R2 score and MSE of the ridge model
+
+13. Determine that this model is no longer overfitting. This is the
+ model to put into production.
+
+ The coefficients for the ridge model are as shown in the following
+ figure:
+
+
+
+
+
+Caption: The coefficients for the ridge model
+
+
+
+Summary
+=======
+
+
+In this lab, we studied the importance of withholding some of the
+available data to evaluate models. We also learned how to make use of
+all of the available data with a technique called cross-validation to
+find the best performing model from a set of models you are training. We
+also made use of evaluation metrics to determine when a model starts to
+overfit and made use of ridge and lasso regression to fix a model that
+is overfitting.
+
+In the next lab, we will go into hyperparameter tuning in depth. You
+will learn about various techniques for finding the best hyperparameters
+to train your models.
diff --git a/lab_guides/Lab_8.md b/lab_guides/Lab_8.md
new file mode 100644
index 0000000..f911134
--- /dev/null
+++ b/lab_guides/Lab_8.md
@@ -0,0 +1,1761 @@
+
+8. Hyperparameter Tuning
+========================
+
+
+
+Overview
+
+In this lab, each hyperparameter tuning strategy will be first
+broken down into its key steps before any high-level scikit-learn
+implementations are demonstrated. This is to ensure that you fully
+understand the concept behind each of the strategies before jumping to
+the more automated methods.
+
+By the end of this lab, you will be able to find further predictive
+performance improvements via the systematic evaluation of estimators
+with different hyperparameters. You will successfully deploy manual,
+grid, and random search strategies to find the optimal hyperparameters.
+You will be able to parameterize **k-nearest neighbors** (**k-NN**),
+**support vector machines** (**SVMs**), ridge regression, and random
+forest classifiers to optimize model performance.
+
+
+Introduction
+============
+
+
+In previous labs, we discussed several methods to arrive at a model
+that performs well. These include transforming the data via
+preprocessing, feature engineering and scaling, or simply choosing an
+appropriate estimator (algorithm) type from the large set of possible
+estimators made available to the users of scikit-learn.
+
+Depending on which estimator you eventually select, there may be
+settings that can be adjusted to improve overall predictive performance.
+These settings are known as hyperparameters, and deriving the best
+hyperparameters is known as tuning or optimizing. Properly tuning your
+hyperparameters can result in performance improvements well into the
+double-digit percentages, so it is well worth doing in any modeling
+exercise.
+
+This lab will discuss the concept of hyperparameter tuning and will
+present some simple strategies that you can use to help find the best
+hyperparameters for your estimators.
+
+In previous labs, we have seen some exercises that use a range of
+estimators, but we haven\'t conducted any hyperparameter tuning. After
+reading this lab, we recommend you revisit these exercises, apply
+the techniques taught, and see if you can improve the results.
+
+
+What Are Hyperparameters?
+=========================
+
+
+Hyperparameters can be thought of as a set of dials and switches for
+each estimator that change how the estimator works to explain
+relationships in the data.
+
+Have a look at *Figure 8.1*:
+
+
+
+Caption: How hyperparameters work
+
+If you read from left to right in the preceding figure, you can see that
+during the tuning process we change the value of the hyperparameter,
+which results in a change to the estimator. This in turn causes a change
+in model performance. Our objective is to find hyperparameterization
+that leads to the best model performance. This will be the *optimal*
+hyperparameterization.
+
+Estimators can have hyperparameters of varying quantities and types,
+which means that sometimes you can be faced with a very large number of
+possible hyperparameterizations to choose for an estimator.
+
+For instance, scikit-learn\'s implementation of the SVM classifier
+(`sklearn.svm.SVC`), which you will be introduced to later in
+the lab, is an estimator that has multiple possible
+hyperparameterizations. We will test out only a small subset of these,
+namely using a linear kernel or a polynomial kernel of degree 2, 3, or
+4.
+
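+For example, a rough sketch of two such hyperparameterizations (using only
+arguments documented for `sklearn.svm.SVC`) looks like this:
+
+```
+from sklearn import svm
+# a linear kernel is one hyperparameterization of the SVC estimator
+linear_svc = svm.SVC(kernel='linear')
+# a polynomial kernel of degree 3 is another
+poly_svc = svm.SVC(kernel='poly', degree=3)
+print(linear_svc.get_params()['kernel'])
+print(poly_svc.get_params()['degree'])
+```
+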
+Some of these hyperparameters are continuous in nature, while others are
+discrete, and the presence of continuous hyperparameters means that the
+number of possible hyperparameterizations is theoretically infinite. Of
+course, when it comes to producing a model with good predictive
+performance, some hyperparameterizations are much better than others,
+and it is your job as a data scientist to find them.
+
+In the next section, we will be looking at setting these hyperparameters
+in more detail. But first, some clarification of terms.
+
+
+
+Difference between Hyperparameters and Statistical Model Parameters
+-------------------------------------------------------------------
+
+In your reading on data science, particularly in the area of statistics,
+you will come across terms such as \"model parameters,\" \"parameter
+estimation,\" and \"(non)-parametric models.\" These terms relate to the
+parameters that feature in the mathematical formulation of models. The
+simplest example is that of the single variable linear model with no
+intercept term that takes the following form:
+
+
+
+Caption: Equation for a single variable linear model
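+
+In symbols, this model is simply y = 𝛽x, where y is the response, x is
+the single input variable, and 𝛽 is the coefficient to be estimated.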
+
+Here, 𝛽 is the statistical model parameter, and if this formulation is
+chosen, it is the data scientist\'s job to use data to estimate what
+value it takes. This could be achieved using **Ordinary Least Squares**
+(**OLS**) regression modeling, or it could be achieved through a method
+called median regression.
+
+Hyperparameters are different in that they are external to the
+mathematical form. An example of a hyperparameter in this case is the
+way in which 𝛽 will be estimated (OLS, or median regression). In some
+cases, hyperparameters can change the algorithm completely (that is,
+generating a completely different mathematical form). You will see
+examples of this occurring throughout this lab.
+
+In the next section, you will be looking at how to set a hyperparameter.
+
+
+
+Setting Hyperparameters
+-----------------------
+
+In *Lab 7*, *The Generalization of Machine Learning Models*, you
+were introduced to the k-NN model for classification and you saw how
+varying k, the number of nearest neighbors, resulted in changes in model
+performance with respect to the prediction of class labels. Here, k is a
+hyperparameter, and the act of manually trying different values of k is
+a simple form of hyperparameter tuning.
+
+Each time you initialize a scikit-learn estimator, it will take on a
+hyperparameterization as determined by the values you set for its
+arguments. If you specify no values, then the estimator will take on a
+default hyperparameterization. If you would like to see how the
+hyperparameters have been set for your estimator, and what
+hyperparameters you can adjust, simply print the output of the
+`estimator.get_params()` method.
+
+For instance, say we initialize a k-NN estimator without specifying any
+arguments (empty brackets). To see the default hyperparameterization, we
+can run:
+
+```
+from sklearn import neighbors
+# initialize with default hyperparameters
+knn = neighbors.KNeighborsClassifier()
+# examine the defaults
+print(knn.get_params())
+```
+You should get the following output:
+
+```
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5,
+ 'p': 2, 'weights': 'uniform'}
+```
+A dictionary of all the hyperparameters is now printed to the screen,
+revealing their default settings. Notice `k`, our number of
+nearest neighbors, is set to `5`.
+
+To get more information as to what these parameters mean, how they can
+be changed, and what their likely effect may be, you can run the
+following command and view the help file for the estimator in question.
+
+For our k-NN estimator:
+
+```
+?knn
+```
+
+The output will be as follows:
+
+
+
+Caption: Help file for the k-NN estimator
+
+If you look closely at the help file, you will see the default
+hyperparameterization for the estimator under the
+`String form` heading, along with an explanation of what each
+hyperparameter means under the `Parameters` heading.
+
+Coming back to our example, if we want to change the
+hyperparameterization from `k = 5` to `k = 15`, just
+re-initialize the estimator and set the `n_neighbors` argument
+to `15`, which will override the default:
+
+```
+"""
+initialize with k = 15 and all other hyperparameters as default
+"""
+knn = neighbors.KNeighborsClassifier(n_neighbors=15)
+# examine
+print(knn.get_params())
+```
+You should get the following output:
+
+```
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 15,
+ 'p': 2, 'weights': 'uniform'}
+```
+You may have noticed that k is not the only hyperparameter available for
+k-NN classifiers. Setting multiple hyperparameters is as easy as
+specifying the relevant arguments. For example, let\'s increase the
+number of neighbors from `5` to `15` and force the
+algorithm to take the distance of points in the neighborhood, rather
+than a simple majority vote, into account when training. For more
+information, see the description for the `weights` argument in
+the help file (`?knn`):
+
+```
+"""
+initialize with k = 15, weights = distance and all other
+hyperparameters as default
+"""
+knn = neighbors.KNeighborsClassifier(n_neighbors=15, \
+ weights='distance')
+# examine
+print(knn.get_params())
+```
+
+The output will be as follows:
+
+```
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 15,
+ 'p': 2, 'weights': 'distance'}
+```
+
+In the output, you can see `n_neighbors` (`k`) is
+now set to `15`, and `weights` is now set to
+`distance`, rather than `uniform`.
+
+
+
+A Note on Defaults
+------------------
+
+Generally, efforts have been made by the developers of machine learning
+libraries to set sensible default hyperparameters for estimators. That
+said, for certain datasets, significant performance improvements may be
+achieved through tuning.
+
+
+Finding the Best Hyperparameterization
+======================================
+
+
+The best hyperparameterization depends on your overall objective in
+building a machine learning model in the first place. In most cases,
+this is to find the model that has the highest predictive performance on
+unseen data, as measured by its ability to correctly label data points
+(classification) or predict a number (regression).
+
+The prediction of unseen data can be simulated using hold-out test sets
+or cross-validation, the former being the method used in this lab.
+Performance is evaluated differently in each case, for instance, **Mean
+Squared Error** (**MSE**) for regression and accuracy for
+classification. We seek to reduce the MSE or increase the accuracy of
+our predictions.
+
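+As a minimal sketch of the hold-out approach (using a generic k-NN
+classifier and standard scikit-learn utilities; the exact estimator and
+metric will depend on your problem), a single hyperparameterization can
+be evaluated like this:
+
+```
+from sklearn import datasets, model_selection, neighbors, metrics
+# load a small classification dataset
+X, y = datasets.load_breast_cancer(return_X_y=True)
+# hold out 20% of the data as an unseen test set
+train_X, test_X, train_y, test_y = model_selection.train_test_split\
+                                   (X, y, test_size=0.2, random_state=0)
+# fit a single hyperparameterization on the training portion only
+knn = neighbors.KNeighborsClassifier(n_neighbors=5)
+knn.fit(train_X, train_y)
+# score it on the held-out portion; higher accuracy is better
+print(metrics.accuracy_score(test_y, knn.predict(test_X)))
+```
+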
+Let\'s implement manual hyperparameterization in the following exercise.
+
+
+
+Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier
+-----------------------------------------------------------------
+
+In this exercise, we will manually tune a k-NN classifier, which was
+covered in *Lab 7, The Generalization of Machine Learning Models*,
+our goal being to predict incidences of malignant or benign breast
+cancer based on cell measurements sourced from the affected breast
+sample.
+
+
+These are the important attributes of the dataset:
+
+- ID number
+- Diagnosis (M = malignant, B = benign)
+- Attributes 3-32: thirty real-valued features
+
+Ten real-valued features are computed for each cell nucleus as follows:
+
+- Radius (mean of distances from the center to points on the
+ perimeter)
+
+- Texture (standard deviation of grayscale values)
+
+- Perimeter
+
+- Area
+
+- Smoothness (local variation in radius lengths)
+
+- Compactness (perimeter\^2 / area - 1.0)
+
+- Concavity (severity of concave portions of the contour)
+
+- Concave points (number of concave portions of the contour)
+
+- Symmetry
+
+- Fractal dimension (refers to the complexity of the tissue
+ architecture; \"coastline approximation\" - 1)
+
+
+The following steps will help you complete this exercise:
+
+1. Create a new notebook in Google Colab.
+
+2. Next, import `neighbors`, `datasets`, and
+ `model_selection` from scikit-learn:
+ ```
+ from sklearn import neighbors, datasets, model_selection
+ ```
+
+
+3. Load the data. We will call this object `cancer`, and
+ isolate the target `y`, and the features, `X`:
+ ```
+ # dataset
+ cancer = datasets.load_breast_cancer()
+ # target
+ y = cancer.target
+ # features
+ X = cancer.data
+ ```
+
+
+4. Initialize a k-NN classifier with its default hyperparameterization:
+ ```
+ # no arguments specified
+ knn = neighbors.KNeighborsClassifier()
+ ```
+
+
+5. Feed this classifier into a 10-fold cross-validation
+ (`cv`), calculating the precision score for each fold.
+ Assume that maximizing precision (the proportion of true positives
+ in all positive classifications) is the primary objective of this
+ exercise:
+ ```
+ # 10 folds, scored on precision
+ cv = model_selection.cross_val_score(knn, X, y, cv=10,\
+ scoring='precision')
+ ```
+
+
+6. Printing `cv` shows the precision score calculated for
+ each fold:
+
+ ```
+ # precision scores
+ print(cv)
+ ```
+
+
+ You will see the following output:
+
+ ```
+ [0.91666667 0.85 0.91666667 0.94736842 0.94594595
+ 0.94444444 0.97222222 0.92105263 0.96969697 0.97142857]
+ ```
+
+
+7. Calculate and print the mean precision score for all folds. This
+ will give us an idea of the overall performance of the model, as
+ shown in the following code snippet:
+
+ ```
+ # average over all folds
+ print(round(cv.mean(), 2))
+ ```
+
+
+ You should get the following output:
+
+ ```
+ 0.94
+ ```
+
+
+ You should see the mean score is close to 94%. Can this be improved
+ upon?
+
+8. Run everything again, this time setting hyperparameter `k`
+ to `15`. You can see that the result is actually
+ marginally worse (1% lower):
+
+ ```
+ # k = 15
+ knn = neighbors.KNeighborsClassifier(n_neighbors=15)
+ cv = model_selection.cross_val_score(knn, X, y, cv=10, \
+ scoring='precision')
+ print(round(cv.mean(), 2))
+ ```
+
+
+ The output will be as follows:
+
+ ```
+ 0.93
+ ```
+
+
+9. Try again with `k` = `7`, `3`, and
+ `1`. In this case, it seems reasonable that the default
+ value of 5 is the best option. To avoid repetition, you may like to
+ define and call a Python function as follows:
+
+ ```
+ def evaluate_knn(k):
+ knn = neighbors.KNeighborsClassifier(n_neighbors=k)
+ cv = model_selection.cross_val_score(knn, X, y, cv=10, \
+ scoring='precision')
+ print(round(cv.mean(), 2))
+ evaluate_knn(k=7)
+ evaluate_knn(k=3)
+ evaluate_knn(k=1)
+ ```
+
+
+ The output will be as follows:
+
+ ```
+ 0.93
+ 0.93
+ 0.92
+ ```
+
+
+ Nothing beats 94%.
+
+10. Let\'s alter a second hyperparameter. Setting `k = 5`,
+    what happens if we change the k-NN weighting scheme to depend on
+ `distance` rather than having `uniform` weights?
+ Run all code again, this time with the following
+ hyperparameterization:
+
+ ```
+ # k =5, weights evaluated using distance
+ knn = neighbors.KNeighborsClassifier(n_neighbors=5, \
+ weights='distance')
+ cv = model_selection.cross_val_score(knn, X, y, cv=10, \
+ scoring='precision')
+ print(round(cv.mean(), 2))
+ ```
+
+
+ Did performance improve?
+
+ You should see no further improvement on the default
+ hyperparameterization because the output is:
+
+ ```
+ 0.93
+ ```
+
+
+We therefore conclude that the default hyperparameterization is the
+optimal one in this case.
+
+
+
+
+Simple Demonstration of the Grid Search Strategy
+------------------------------------------------
+
+
+This time, instead of manually fitting models with different values of
+`k`, we just define the `k` values we would like to
+try, that is, `k = 1, 3, 5, 7`, in a Python dictionary. This
+dictionary will be the grid we will search through to find the optimal
+hyperparameterization.
+
+
+The code will be as follows:
+
+```
+from sklearn import neighbors, datasets, model_selection
+# load data
+cancer = datasets.load_breast_cancer()
+# target
+y = cancer.target
+# features
+X = cancer.data
+# hyperparameter grid
+grid = {'k': [1, 3, 5, 7]}
+```
+
+In the preceding code snippet, we have defined the candidate `k` values
+as a list stored inside a Python dictionary `{}`.
+
+In the next part of the code snippet, to conduct the search, we iterate
+through the grid, fitting a model for each value of `k`, each
+time evaluating the model through 10-fold cross-validation.
+
+At the end of each iteration, we extract, format, and report back the
+mean precision score after cross-validation via the `print`
+method:
+
+```
+# for every value of k in the grid
+for k in grid['k']:
+ # initialize the knn estimator
+ knn = neighbors.KNeighborsClassifier(n_neighbors=k)
+ # conduct a 10-fold cross-validation
+ cv = model_selection.cross_val_score(knn, X, y, cv=10, \
+ scoring='precision')
+ # calculate the average precision value over all folds
+ cv_mean = round(cv.mean(), 3)
+ # report the result
+ print('With k = {}, mean precision = {}'.format(k, cv_mean))
+```
+
+The output will be as follows:
+
+
+
+Caption: Average precisions for all folds
+
+We can see from the output that `k = 5` is the best
+hyperparameterization found, with a mean precision score of roughly 94%.
+Increasing `k` to `7` didn\'t significantly improve
+performance. It is important to note that the only parameter we are
+changing here is k and that each time the k-NN estimator is initialized,
+it is done with the remaining hyperparameters set to their default
+values.
+
+To make this point clear, we can run the same loop, this time just
+printing the hyperparameterization that will be tried:
+
+```
+# for every value of k in the grid
+for k in grid['k']:
+ # initialize the knn estimator
+ knn = neighbors.KNeighborsClassifier(n_neighbors=k)
+ # print the hyperparameterization
+ print(knn.get_params())
+```
+
+The output will be as follows:
+
+```
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 1,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 7,
+ 'p': 2, 'weights': 'uniform'}
+```
+You can see from the output that the only parameter we are changing is
+k; everything else remains the same in each iteration.
+
+Simple, single-loop structures are fine for a grid search of a single
+hyperparameter, but what if we would like to try a second one? Remember
+that for k-NN we also have weights that can take values
+`uniform` or `distance`, the choice of which
+influences how k-NN learns how to classify points.
+
+To proceed, all we need to do is create a dictionary containing both the
+values of k and the weight functions we would like to try as separate
+key/value pairs:
+
+```
+# hyperparameter grid
+grid = {'k': [1, 3, 5, 7],\
+ 'weight_function': ['uniform', 'distance']}
+# for every value of k in the grid
+for k in grid['k']:
+ # and every possible weight_function in the grid
+ for weight_function in grid['weight_function']:
+ # initialize the knn estimator
+ knn = neighbors.KNeighborsClassifier\
+ (n_neighbors=k, \
+ weights=weight_function)
+ # conduct a 10-fold cross-validation
+ cv = model_selection.cross_val_score(knn, X, y, cv=10, \
+ scoring='precision')
+ # calculate the average precision value over all folds
+ cv_mean = round(cv.mean(), 3)
+ # report the result
+ print('With k = {} and weight function = {}, '\
+ 'mean precision = {}'\
+ .format(k, weight_function, cv_mean))
+```
+
+The output will be as follows:
+
+
+
+Caption: Average precision values for all folds for different values
+of k
+
+You can see that the mean precision comes out highest when `k = 5`,
+the weight function is `uniform` (not distance-based), and all the
+other hyperparameters are kept at their default values. As we
+discussed earlier, if you would like to see the full set of
+hyperparameterizations evaluated for k-NN, just add
+`print(knn.get_params())` inside the `for` loop
+after the estimator is initialized:
+
+```
+# for every value of k in the grid
+for k in grid['k']:
+ # and every possible weight_function in the grid
+ for weight_function in grid['weight_function']:
+ # initialize the knn estimator
+ knn = neighbors.KNeighborsClassifier\
+ (n_neighbors=k, \
+ weights=weight_function)
+ # print the hyperparameterizations
+ print(knn.get_params())
+```
+
+The output will be as follows:
+
+```
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 1,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 1,
+ 'p': 2, 'weights': 'distance'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3,
+ 'p': 2, 'weights': 'distance'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5,
+ 'p': 2, 'weights': 'distance'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 7,
+ 'p': 2, 'weights': 'uniform'}
+{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
+ 'metric_params': None, 'n_jobs': None, 'n_neighbors': 7,
+ 'p': 2, 'weights': 'distance'}
+```
+This implementation, while great for demonstrating how the grid search
+process works, may not be practical when trying to evaluate estimators that
+have `3`, `4`, or even `10` different
+types of hyperparameters, each with a multitude of possible settings.
+
+To carry on in this way will mean writing and keeping track of multiple
+`for` loops, which can be tedious. Thankfully,
+`scikit-learn`\'s `model_selection` module gives us
+a method called `GridSearchCV` that is much more
+user-friendly. We will be looking at this in the topic ahead.
+
+
+GridSearchCV
+============
+
+
+`GridSearchCV` is a tuning method in which models are built and
+evaluated for every combination of parameters specified in a grid.
+In the following figure, we will see how `GridSearchCV` differs
+from a manual search and look at grid search in a more detailed
+way in a table format.
+
+
+
+Tuning using GridSearchCV
+-------------------------
+
+We can conduct a grid search much more easily in practice by leveraging
+`model_selection.GridSearchCV`.
+
+For the sake of comparison, we will use the same breast cancer dataset
+and k-NN classifier as before:
+
+```
+from sklearn import model_selection, datasets, neighbors
+# load the data
+cancer = datasets.load_breast_cancer()
+# target
+y = cancer.target
+# features
+X = cancer.data
+```
+
+The next thing we need to do after loading the data is to initialize the
+class of the estimator we would like to evaluate under different
+hyperparameterizations:
+
+```
+# initialize the estimator
+knn = neighbors.KNeighborsClassifier()
+```
+We then define the grid:
+
+```
+# grid contains k and the weight function
+grid = {'n_neighbors': [1, 3, 5, 7],\
+ 'weights': ['uniform', 'distance']}
+```
+To set up the search, we pass the freshly initialized estimator and our
+grid of hyperparameters to `model_selection.GridSearchCV()`.
+We must also specify a scoring metric, which is the method that will be
+used to evaluate the performance of the various hyperparameterizations
+tried during the search.
+
+The last thing to do is set the number of splits to be used in
+cross-validation via the `cv` argument. We will set this to
+`10`, thereby conducting 10-fold cross-validation:
+
+```
+"""
+ set up the grid search with scoring on precision and
+number of folds = 10
+"""
+gscv = model_selection.GridSearchCV(estimator=knn, \
+ param_grid=grid, \
+ scoring='precision', cv=10)
+```
+
+The last step is to feed data to this object via its `fit()`
+method. Once this has been done, the grid search process will be
+kick-started:
+
+```
+# start the search
+gscv.fit(X, y)
+```
+By default, information relating to the search will be printed to the
+screen, allowing you to see the exact estimator parameterizations that
+will be evaluated for the k-NN estimator:
+
+
+
+Caption: Estimator parameterizations for the k-NN estimator
+
+Once the search is complete, we can examine the results by accessing and
+printing the `cv_results_` attribute. `cv_results_`
+is a dictionary containing helpful information regarding model
+performance under each hyperparameterization, such as the mean test-set
+value of your scoring metric (`mean_test_score`; for a precision score,
+the higher the better), the complete list of hyperparameterizations tried
+(`params`), and the model ranks as they relate to the
+`mean_test_score` (`rank_test_score`).
+
+The best model found will have rank = 1, the second-best model will have
+rank = 2, and so on, as you can see in *Figure 8.8*. The model fitting
+times are reported through `mean_fit_time`.
+
+Although not usually a consideration for smaller datasets, this value
+can be important because in some cases you may find that a marginal
+increase in model performance through a certain hyperparameterization is
+associated with a significant increase in model fit time, which,
+depending on the computing resources you have available, may render that
+hyperparameterization infeasible because it will take too long to fit:
+
+```
+# view the results
+print(gscv.cv_results_)
+```
+
+The output will be as follows:
+
+
+
+Caption: GridsearchCV results
+
+The model ranks can be seen in the following image:
+
+
+
+Caption: Model ranks
+
+
+
+For example, say we are only interested in each hyperparameterization
+(`params`) and mean cross-validated test score
+(`mean_test_score`) for the top five highest-performing models:
+
+```
+import pandas as pd
+# convert the results dictionary to a dataframe
+results = pd.DataFrame(gscv.cv_results_)
+"""
+select just the hyperparameterizations tried,
+the mean test scores, order by score and show the top 5 models
+"""
+print(results.loc[:,['params','mean_test_score']]\
+ .sort_values('mean_test_score', ascending=False).head(5))
+```
+Running this code produces the following output:
+
+
+
+Caption: mean\_test\_score for top 5 models
+
+We can also use pandas to produce visualizations of the result as
+follows:
+
+```
+# visualise the result
+results.loc[:,['params','mean_test_score']]\
+ .plot.barh(x = 'params')
+```
+
+The output will be as follows:
+
+
+
+Caption: Using pandas to visualize the output
+
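+If fit time matters in your setting, the same `results` DataFrame can
+surface `mean_fit_time` alongside the score, so you can weigh a small
+performance gain against its training cost (a brief sketch reusing the
+objects created above):
+
+```
+# compare predictive performance against model fitting time
+print(results.loc[:, ['params', 'mean_test_score', 'mean_fit_time']]\
+      .sort_values('mean_test_score', ascending=False).head(5))
+```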
+
+
+Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM
+-----------------------------------------------------------
+
+In this exercise, we will employ a class of estimator called an SVM
+classifier and tune its hyperparameters using a grid search strategy.
+
+The supervised learning objective we will focus on here is the
+classification of handwritten digits (0-9) based solely on images. The
+dataset we will use contains 1,797 labeled images of handwritten digits.
+
+
+
+1. Create a new notebook in Google Colab.
+
+2. Import `datasets`, `svm`, and
+ `model_selection` from scikit-learn:
+ ```
+ from sklearn import datasets, svm, model_selection
+ ```
+
+
+3. Load the data. We will call this object images, and then we\'ll
+ isolate the target `y` and the features `X`. In
+ the training step, the SVM classifier will learn how `y`
+ relates to `X` and will therefore be able to predict new
+ `y` values when given new `X` values:
+ ```
+ # load data
+ digits = datasets.load_digits()
+ # target
+ y = digits.target
+ # features
+ X = digits.data
+ ```
+
+
+4. Initialize the estimator as a multi-class SVM classifier and set the
+ `gamma` argument to `scale`:
+
+ ```
+ # support vector machine classifier
+ clr = svm.SVC(gamma='scale')
+ ```
+
+
+5. Define our grid to cover four distinct hyperparameterizations of the
+ classifier with a linear kernel and with a polynomial kernel of
+ degrees `2`, `3,` and `4`. We want to
+ see which of the four hyperparameterizations leads to more accurate
+ predictions:
+ ```
+ # hyperparameter grid. contains linear and polynomial kernels
+ grid = [{'kernel': ['linear']},\
+ {'kernel': ['poly'], 'degree': [2, 3, 4]}]
+ ```
+
+
+6. Set up grid search k-fold cross-validation with `10` folds
+ and a scoring measure of accuracy. Make sure it has our
+ `grid` and `estimator` objects as inputs:
+ ```
+ """
+ setting up the grid search to score on accuracy and
+ evaluate over 10 folds
+ """
+ cv_spec = model_selection.GridSearchCV\
+ (estimator=clr, param_grid=grid, \
+ scoring='accuracy', cv=10)
+ ```
+
+
+7. Start the search by providing data to the `.fit()` method.
+ Details of the process, including the hyperparameterizations tried
+ and the scoring method selected, will be printed to the screen:
+
+ ```
+ # start the grid search
+ cv_spec.fit(X, y)
+ ```
+
+
+ You should see the following output:
+
+
+
+
+
+ Caption: Grid Search using the .fit() method
+
+8. To examine all of the results, simply print
+ `cv_spec.cv_results_` to the screen. You will see that the
+ results are structured as a dictionary, allowing you to access the
+ information you require using the keys:
+
+ ```
+ # what is the available information
+ print(cv_spec.cv_results_.keys())
+ ```
+
+
+ You will see the following information:
+
+
+
+
+
+ Caption: Results as a dictionary
+
+9. For this exercise, we are primarily concerned with the test-set
+ performance of each distinct hyperparameterization. You can see the
+ first hyperparameterization through
+ `cv_spec.cv_results_['mean_test_score']`, and the second
+ through `cv_spec.cv_results_['params']`.
+
+ Let\'s convert the results dictionary to a `pandas`
+ DataFrame and find the best hyperparameterization:
+
+ ```
+ import pandas as pd
+ # convert the dictionary of results to a pandas dataframe
+ results = pd.DataFrame(cv_spec.cv_results_)
+ # show hyperparameterizations
+ print(results.loc[:,['params','mean_test_score']]\
+ .sort_values('mean_test_score', ascending=False))
+ ```
+
+
+ You will see the following results:
+
+
+
+
+
+ Caption: Parameterization results
+
+ Note
+
+ You may get slightly different results. However, the values you
+ obtain should largely agree with those in the preceding output.
+
+10. It is best practice to visualize any results you produce.
+ `pandas` makes this easy. Run the following code to
+ produce a visualization:
+
+ ```
+ # visualize the result
+ (results.loc[:,['params','mean_test_score']]\
+ .sort_values('mean_test_score', ascending=True)\
+ .plot.barh(x='params', xlim=(0.8)))
+ ```
+
+
+ The output will be as follows:
+
+
+
+
+
+Caption: Using pandas to visualize the results
+
+
+
+Advantages and Disadvantages of Grid Search
+-------------------------------------------
+
+The primary advantage of the grid search compared to a manual search is
+that it is an automated process that one can simply set and forget.
+Additionally, you have the power to dictate the exact
+hyperparameterizations evaluated, which can be a good thing when you
+have prior knowledge of what kind of hyperparameterizations might work
+well in your context. It is also easy to understand exactly what will
+happen during the search thanks to the explicit definitions of the grid.
+
+The major drawback of the grid search strategy is that it is
+computationally very expensive; that is, when the number of
+hyperparameterizations to try increases substantially, processing times
+can be very slow. Also, when you define your grid, you may inadvertently
+omit a hyperparameterization that would in fact be optimal. If it is
+not specified in your grid, it will never be tried.
+
+To overcome these drawbacks, we will be looking at random search in the
+next section.
+
+
+Random Search
+=============
+
+
+Instead of searching through every hyperparameterization in a
+pre-defined set, as is the case with a grid search, in a random search
+we sample from a distribution of possibilities by assuming each
+hyperparameter to be a random variable. Before we go through the process
+in depth, it will be helpful to briefly review what random variables are
+and what we mean by a distribution.
+
+
+
+Random Variables and Their Distributions
+----------------------------------------
+
+A random variable is non-constant (its value can change) and its
+variability can be described in terms of distribution. There are many
+different types of distributions, but each falls into one of two broad
+categories: discrete and continuous. We use discrete distributions to
+describe random variables whose values can take only whole numbers, such
+as counts.
+
+An example is the count of visitors to a theme park in a day, or the
+number of attempted shots it takes a golfer to get a hole-in-one.
+
+We use continuous distributions to describe random variables whose
+values lie along a continuum made up of infinitely small increments.
+Examples include human height or weight, or outside air temperature.
+Distributions often have parameters that control their shape.
+
+Discrete distributions can be described mathematically using what\'s
+called a probability mass function, which defines the exact probability
+of the random variable taking a certain value. Common notation for the
+left-hand side of this function is `P(X=x)`, which in plain
+English means that the probability that the random variable
+`X` equals a certain value `x` is `P`.
+Remember that probabilities range between `0` (impossible) and
+`1` (certain).
+
+By definition, the summation of each `P(X=x)` for all possible
+`x`\'s will be equal to 1, or if expressed another way, the
+probability that `X` will take any value is 1. A simple
+example of this kind of distribution is the discrete uniform
+distribution, where the random variable `X` will take only one
+of a finite range of values and the probability of it taking any
+particular value is the same for all values, hence the term uniform.
+
+For example, if there are 10 possible values, the probability that
+`X` is any particular value is exactly 1/10. If there were 6
+possible values, as in the case of a standard 6-sided die, the
+probability would be 1/6, and so on. The probability mass function for
+the discrete uniform distribution is:
+
+
+
+Caption: Probability mass function for the discrete uniform
+distribution
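+
+That is, if X can take n equally likely values, then P(X = x) = 1/n for
+each of those values (and 0 otherwise).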
+
+The following code will allow us to see the form of this distribution
+with 10 possible values of X.
+
+First, we create a list of all the possible values `X` can
+take:
+
+```
+# list of all xs
+X = list(range(1, 11))
+print(X)
+```
+
+The output will be as follows:
+
+```
+ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+```
+We then calculate the probability that `X` will take any particular
+value `x`, that is, `P(X=x)`:
+
+```
+# pmf, 1/n * n = 1
+p_X_x = [1/len(X)] * len(X)
+# sums to 1
+print(p_X_x)
+```
+As discussed, the summation of probabilities will equal 1, and this is
+the case with any distribution. We now have everything we need to
+visualize the distribution:
+
+```
+import matplotlib.pyplot as plt
+plt.bar(X, p_X_x)
+plt.xlabel('X')
+plt.ylabel('P(X=x)')
+```
+
+The output will be as follows:
+
+
+
+Caption: Visualizing the bar chart
+
+In the visual output, we see that the probability of `X` being
+a specific whole number between 1 and 10 is equal to 1/10.
+
+Note
+
+Other discrete distributions you commonly see include the binomial,
+negative binomial, geometric, and Poisson distributions, all of which we
+encourage you to investigate. Type these terms into a search engine to
+find out more.
+
+Distributions of continuous random variables are a bit more challenging
+in that we cannot calculate an exact `P(X=x)` directly because
+`X` lies on a continuum. We can, however, use integration to
+approximate probabilities between a range of values, but this is beyond
+the scope of this book. The relationship between `X` and
+probability is described using a probability density function,
+`P(X)`. Perhaps the most well-known continuous distribution is
+the normal distribution, which visually takes the form of a bell.
+
+The normal distribution has two parameters that describe its shape: the
+mean (𝜇) and the variance (𝜎²). The
+probability density function for the normal distribution is:
+
+
+
+Caption: Probability density function for the normal distribution
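+
+Written out, this density is P(X) = (1 / (𝜎√(2𝜋))) · exp(−(X − 𝜇)² / (2𝜎²)).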
+
+The following code shows two normal distributions with the same mean
+(𝜇 = 0) but different variance parameters
+(𝜎² = 1 and 𝜎² = 2.25).
+Let\'s first generate 100 evenly spaced values from `-10` to
+`10` using NumPy\'s `.linspace` method:
+
+```
+import numpy as np
+# range of xs
+x = np.linspace(-10, 10, 100)
+```
+We then generate the approximate `X` probabilities for both
+normal distributions.
+
+Using `scipy.stats` is a good way to work with distributions,
+and its `pdf` method allows us to easily visualize the shape
+of probability density functions:
+
+```
+import scipy.stats as stats
+# first normal distribution with mean = 0, variance = 1 (std = 1.0)
+p_X_1 = stats.norm.pdf(x=x, loc=0.0, scale=1.0)
+# second normal distribution with mean = 0, variance = 2.25 (std = 1.5)
+p_X_2 = stats.norm.pdf(x=x, loc=0.0, scale=1.5)
+```
+Note
+
+In this case, `loc` corresponds to 𝜇, while `scale`
+corresponds to the standard deviation 𝜎, which is the square root of
+𝜎², hence why we pass 1.0 and 1.5 rather than the variances.
+
+We then visualize the result. Notice that 𝜎²
+controls how spread out the distribution is and therefore how variable the
+random variable is:
+
+```
+plt.plot(x,p_X_1, color='blue')
+plt.plot(x, p_X_2, color='orange')
+plt.xlabel('X')
+plt.ylabel('P(X)')
+```
+
+The output will be as follows:
+
+
+
+Caption: Visualizing the normal distribution
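+
+Although integrating the density to obtain exact probabilities is beyond
+the scope of this book, it is worth knowing that `scipy.stats` can do the
+integration for us through the cumulative distribution function. As a
+small optional sketch, the probability that a standard normal variable
+falls between -1 and 1 is roughly 0.68:
+
+```
+# sketch: P(-1 < X < 1) for a standard normal variable
+import scipy.stats as stats
+p = stats.norm.cdf(1.0, loc=0.0, scale=1.0) \
+    - stats.norm.cdf(-1.0, loc=0.0, scale=1.0)
+print(p)  # approximately 0.68
+```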
+
+
+
+Simple Demonstration of the Random Search Process
+-------------------------------------------------
+
+Again, before we get to the scikit-learn implementation of random search
+parameter tuning, we will step through the process using simple Python
+tools. Up until this point, we have only been using classification
+problems to demonstrate tuning concepts, but now we will look at a
+regression problem. Can we find a model that\'s able to predict the
+progression of diabetes in patients based on characteristics such as BMI
+and age?
+
+
+We first load the data:
+
+```
+from sklearn import datasets, linear_model, model_selection
+# load the data
+diabetes = datasets.load_diabetes()
+# target
+y = diabetes.target
+# features
+X = diabetes.data
+```
+To get a feel for the data, we can examine the disease progression for
+the first patient:
+
+```
+# the first patient has index 0
+print(y[0])
+```
+
+The output will be as follows:
+
+```
+ 151.0
+```
+Let\'s now examine their characteristics:
+
+```
+# let's look at the first patient's data
+print(dict(zip(diabetes.feature_names, X[0])))
+```
+We should see the following:
+
+
+
+Caption: Dictionary for patient characteristics
+
+
+
+
+For ridge regression, we believe the optimal 𝛼 to be somewhere near 1,
+becoming less likely as you move away from 1. A parameterization of the
+gamma distribution that reflects this idea is where k and 𝜃 are both
+equal to 1. To visualize the form of this distribution, we can run the
+following:
+
+```
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+# values of alpha
+x = np.linspace(1, 20, 100)
+# probabilities
+p_X = stats.gamma.pdf(x=x, a=1, loc=1, scale=2)
+plt.plot(x,p_X)
+plt.xlabel('alpha')
+plt.ylabel('P(alpha)')
+```
+
+The output will be as follows:
+
+
+
+Caption: Visualization of probabilities
+
+In the graph, you can see how probability decays sharply for smaller
+values of 𝛼, then decays more slowly for larger values.
+
+The next step in the random search process is to sample n values from
+the chosen distribution. In this example, we will draw 100 𝛼 values.
+Remember that the probability of drawing out a particular value of 𝛼 is
+related to its probability as defined by this distribution:
+
+```
+# n sample values
+n_iter = 100
+# sample from the gamma distribution
+samples = stats.gamma.rvs(a=1, loc=1, scale=2, \
+ size=n_iter, random_state=100)
+```
+Note
+
+We set a random state to ensure reproducible results.
+
+Plotting a histogram of the sample, as shown in the following figure,
+reveals a shape that approximately conforms to the distribution that we
+have sampled from. Note that as your sample size increases, the more
+closely the histogram conforms to the distribution:
+
+```
+# visualize the sample distribution
+plt.hist(samples)
+plt.xlabel('alpha')
+plt.ylabel('sample count')
+```
+
+The output will be as follows:
+
+
+
+Caption: Visualization of the sample distribution
+
+A model will then be fitted for each sampled value of 𝛼 and assessed for
+performance. As with the other approaches to hyperparameter
+tuning in this lab, performance will be assessed using k-fold
+cross-validation (with `k = 10`), but because we are dealing
+with a regression problem, the performance metric will be the test-set
+negative mean squared error (MSE).
+
+Using this metric means larger values are better. We will store the
+results in a dictionary with each 𝛼 value as the key and the
+corresponding cross-validated negative MSE as the value:
+
+```
+# we will store the results inside a dictionary
+result = {}
+# for each sample
+for sample in samples:
+ """
+ initialize a ridge regression estimator with alpha set
+ to the sample value
+ """
+ reg = linear_model.Ridge(alpha=sample)
+ """
+ conduct a 10-fold cross validation scoring on
+ negative mean squared error
+ """
+ cv = model_selection.cross_val_score\
+ (reg, X, y, cv=10, \
+ scoring='neg_mean_squared_error')
+ # retain the result in the dictionary
+ result[sample] = [cv.mean()]
+```
+
+Instead of examining the raw dictionary of results, we will convert it
+to a pandas DataFrame, transpose it, and give the columns sensible
+names. Sorting by descending negative mean squared error reveals that
+the optimal level of regularization for this problem is reached when 𝛼
+is approximately 1, meaning that we found no evidence that increasing
+regularization beyond the default value (𝛼 = 1) helps for this problem:
+
+```
+import pandas as pd
+"""
+convert the result dictionary to a pandas dataframe,
+transpose and reset the index
+"""
+df_result = pd.DataFrame(result).T.reset_index()
+# give the columns sensible names
+df_result.columns = ['alpha', 'mean_neg_mean_squared_error']
+print(df_result.sort_values('mean_neg_mean_squared_error', \
+ ascending=False).head())
+```
+
+The output will be as follows:
+
+
+
+Caption: Output for the random search process
+
+Note
+
+The results will be different, depending on the data used.
+
+It is always beneficial to visualize results where possible. Plotting 𝛼
+by negative mean squared error as a scatter plot makes it clear that
+venturing away from 𝛼 = 1 does not result in improvements in predictive
+performance:
+
+```
+plt.scatter(df_result.alpha, \
+ df_result.mean_neg_mean_squared_error)
+plt.xlabel('alpha')
+plt.ylabel('-MSE')
+```
+
+The output will be as follows:
+
+
+
+Caption: Plotting the scatter plot
+
+The fact that we found the optimal 𝛼 to be 1 (its default value) is a
+special case in hyperparameter tuning in that the optimal
+hyperparameterization is the default one.
+
+
+
+Tuning Using RandomizedSearchCV
+-------------------------------
+
+In practice, we can use the `RandomizedSearchCV` method inside
+scikit-learn\'s `model_selection` module to conduct the
+search. All you need to do is pass in your estimator, the
+hyperparameters you wish to tune along with their distributions, the
+number of samples you would like to draw from those distributions, and
+the metric by which you would like to assess model performance. These
+correspond to the `estimator`, `param_distributions`, `n_iter`,
+and `scoring` arguments, respectively.
+demonstration, let\'s conduct the search we completed earlier using
+`RandomizedSearchCV`. First, we load the data and initialize
+our ridge regression estimator:
+
+```
+from sklearn import datasets, model_selection, linear_model
+# load the data
+diabetes = datasets.load_diabetes()
+# target
+y = diabetes.target
+# features
+X = diabetes.data
+# initialise the ridge regression
+reg = linear_model.Ridge()
+```
+We then specify that the hyperparameter we would like to tune is
+`alpha` and that we would like 𝛼 to be distributed
+`gamma`, with `k = 1` and
+`𝜃 = 1`:
+
+```
+from scipy import stats
+# alpha ~ gamma(1,1)
+param_dist = {'alpha': stats.gamma(a=1, loc=1, scale=2)}
+```
+Next, we set up and run the random search process, which will sample 100
+values from our `gamma(1,1)` distribution, fit the ridge
+regression, and evaluate its performance using cross-validation scored
+on the negative mean squared error metric:
+
+```
+"""
+set up the random search to sample 100 values and
+score on negative mean squared error
+"""
+rscv = model_selection.RandomizedSearchCV\
+ (estimator=reg, param_distributions=param_dist, \
+ n_iter=100, scoring='neg_mean_squared_error')
+# start the search
+rscv.fit(X,y)
+```
+After completing the search, we can extract the results and generate a
+pandas DataFrame, as we have done previously. Sorting by
+`rank_test_score` and viewing the first five rows aligns with
+our earlier conclusion that alpha should stay at its default value of 1
+and that additional regularization does not seem to help for this problem:
+
+```
+import pandas as pd
+# convert the results dictionary to a pandas data frame
+results = pd.DataFrame(rscv.cv_results_)
+# show the top 5 hyperparameterizations
+print(results.loc[:,['params','rank_test_score']]\
+ .sort_values('rank_test_score').head(5))
+```
+
+The output will be as follows:
+
+
+
+Caption: Output for tuning using RandomizedSearchCV
+
+Note
+
+The preceding results may vary, depending on the data.
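+
+If you only need the single best hyperparameterization rather than the
+full results table, the fitted search object also exposes it directly.
+A minimal sketch, assuming `rscv` is the fitted search object from above:
+
+```
+# best hyperparameterization found and its cross-validated score
+print(rscv.best_params_)
+print(rscv.best_score_)
+```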
+
+
+
+Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier
+---------------------------------------------------------------------------------
+
+In this exercise, we will revisit the handwritten digit classification
+problem, this time using a random forest classifier with hyperparameters
+tuned using a random search strategy. The random forest is a popular
+method used for both single-class and multi-class classification
+problems. It learns by growing `n` simple tree models that
+each progressively split the dataset into areas that best separate the
+points of different classes.
+
+The final model produced can be thought of as the average of each of the
+n tree models. In this way, the random forest is an `ensemble`
+method. The parameters we will tune in this exercise are
+`criterion` and `max_features`.
+
+`criterion` refers to the way in which each split is evaluated
+from a class purity perspective (the purer the splits, the better) and
+`max_features` is the maximum number of features the random
+forest can use when finding the best splits.
+
+The following steps will help you complete the exercise.
+
+1. Create a new notebook in Google Colab.
+
+2. Import the data and isolate the features `X` and the
+ target `y`:
+ ```
+ from sklearn import datasets
+ # import data
+ digits = datasets.load_digits()
+ # target
+ y = digits.target
+ # features
+ X = digits.data
+ ```
+
+
+3. Initialize the random forest classifier estimator. We will set the
+ `n_estimators` hyperparameter to `100`, which
+ means the predictions of the final model will essentially be an
+ average of `100` simple tree models. Note the use of a
+ random state to ensure the reproducibility of results:
+ ```
+ from sklearn import ensemble
+ # an ensemble of 100 estimators
+ rfc = ensemble.RandomForestClassifier(n_estimators=100, \
+ random_state=100)
+ ```
+
+
+4. One of the parameters we will be tuning is `max_features`.
+ Let\'s find out the maximum value this could take:
+
+ ```
+ # how many features do we have in our dataset?
+ n_features = X.shape[1]
+ print(n_features)
+ ```
+
+
+ You should see that we have 64 features:
+
+ ```
+ 64
+ ```
+
+
+ Now that we know the maximum value of `max_features` we
+ are free to define our hyperparameter inputs to the randomized
+ search process. At this point, we have no reason to believe any
+ particular value of `max_features` is more optimal.
+
+5. Set a discrete uniform distribution covering the range `1`
+ to `64`. Remember the probability mass function,
+ `P(X=x) = 1/n`, for this distribution, so
+ `P(X=x) = 1/64` in our case. Because `criterion`
+ has only two discrete options, this will also be sampled as a
+ discrete uniform distribution with `P(X=x) = ½`:
+ ```
+ from scipy import stats
+ """
+    we would like to sample from criterion and
+ max_features as discrete uniform distributions
+ """
+ param_dist = {'criterion': ['gini', 'entropy'],\
+ 'max_features': stats.randint(low=1, \
+ high=n_features)}
+ ```
+
+
+6. We now have everything we need to set up the randomized search
+ process. As before, we will use accuracy as the metric of model
+ evaluation. Note the use of a random state:
+ ```
+ from sklearn import model_selection
+ """
+ setting up the random search sampling 50 times and
+ conducting 5-fold cross-validation
+ """
+ rscv = model_selection.RandomizedSearchCV\
+ (estimator=rfc, param_distributions=param_dist, \
+ n_iter=50, cv=5, scoring='accuracy' , random_state=100)
+ ```
+
+
+7. Let\'s kick off the process with the `.fit()` method. Please
+    note that both fitting random forests and cross-validation are
+    computationally expensive due to their internal
+    iteration. Generating a result may take some time:
+
+ ```
+ # start the process
+ rscv.fit(X,y)
+ ```
+
+
+ You should see the following:
+
+
+
+
+
+ Caption: RandomizedSearchCV results
+
+8. Next, you need to examine the results. Create a `pandas`
+ DataFrame from the `results` attribute, order by the
+ `rank_test_score`, and look at the top five model
+ hyperparameterizations. Note that because the random search draws
+ samples of hyperparameterizations at random, it is possible to have
+ duplication. We remove the duplicate entries from the DataFrame:
+
+ ```
+ import pandas as pd
+ # convert the dictionary of results to a pandas dataframe
+ results = pd.DataFrame(rscv.cv_results_)
+ # removing duplication
+ distinct_results = results.loc[:,['params',\
+ 'mean_test_score']]
+ # convert the params dictionaries to string data types
+ distinct_results.loc[:,'params'] = distinct_results.loc\
+ [:,'params'].astype('str')
+ # remove duplicates
+ distinct_results.drop_duplicates(inplace=True)
+    # look at the top 5 hyperparameterizations
+ distinct_results.sort_values('mean_test_score', \
+ ascending=False).head(5)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Top five hyperparameterizations
+
+ Note
+
+ You may get slightly different results. However, the values you
+ obtain should largely agree with those in the preceding output.
+
+9. The last step is to visualize the result. Including every
+ parameterization will result in a cluttered plot, so we will filter
+ on parameterizations that resulted in a mean test score \> 0.93:
+
+ ```
+ # top performing models
+ distinct_results[distinct_results.mean_test_score > 0.93]\
+ .sort_values('mean_test_score')\
+ .plot.barh(x='params', xlim=(0.9))
+ ```
+
+
+ The output will be as follows:
+
+
+
+
+
+Caption: Visualizing the test scores of the top-performing models
+
+
+
+Advantages and Disadvantages of a Random Search
+-----------------------------------------------
+
+Because a random search takes a finite sample from a range of possible
+hyperparameterizations (`n_iter` in
+`model_selection.RandomizedSearchCV`), it is feasible to
+expand the range of your hyperparameter search beyond what would be
+practical with a grid search. This is because a grid search has to try
+everything in the range, and setting a large range of values may be too
+slow to process. Searching this wider range gives you the chance of
+discovering a truly optimal solution.
+
+Compared to the manual and grid search strategies, you do sacrifice a
+level of control to obtain this benefit. The other consideration is that
+setting up random search is a bit more involved than other options in
+that you have to specify distributions. There is always a chance of
+getting this wrong. That said, if you are unsure about what
+distributions to use, stick with discrete or continuous uniform for the
+respective variable types as this will assign an equal probability of
+selection to all options.
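+
+For example, a search space that assigns equal probability to every
+option might look like the following sketch (the hyperparameter names
+here are illustrative, not a prescription):
+
+```
+from scipy import stats
+# discrete uniform over the integers 1 to 63 and a
+# continuous uniform over the interval [0.01, 0.21)
+param_dist = {'max_features': stats.randint(low=1, high=64),\
+              'min_samples_split': stats.uniform(loc=0.01, scale=0.2)}
+```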
+
+
+
+Activity 8.01: Is the Mushroom Poisonous?
+-----------------------------------------
+
+Imagine you are a data scientist working for the biology department at
+your local university. Your colleague who is a mycologist (a biologist
+who specializes in fungi) has requested that you help her develop a
+machine learning model capable of discerning whether a particular
+mushroom species is poisonous or not given attributes relating to its
+appearance.
+
+The objective of this activity is to employ the grid and randomized
+search strategies to find an optimal model for this purpose.
+
+
+
+1. Load the data into Python using the `pandas.read_csv()`
+ method, calling the object `mushrooms`.
+
+ Hint: The dataset is in CSV format and has no header. Set
+ `header=None` in `pandas.read_csv()`.
+
+2. Separate the target, `y` and features, `X` from
+ the dataset.
+
+ Hint: The target can be found in the first column
+ (`mushrooms.iloc[:,0]`) and the features in the remaining
+ columns (`mushrooms.iloc[:,1:]`).
+
+3. Recode the target, `y`, so that poisonous mushrooms are
+ represented as `1` and edible mushrooms as `0`.
+
+4. Transform the columns of the feature set `X` into a
+ `numpy` array with a binary representation. This is known
+    as one-hot encoding (a minimal encoding sketch is shown after these steps).
+
+ Hint: Use `preprocessing.OneHotEncoder()` to transform
+ `X`.
+
+5. Conduct both a grid and random search to find an optimal
+ hyperparameterization for a random forest classifier. Use accuracy
+ as your method of model evaluation. Make sure that when you
+ initialize the classifier and when you conduct your random search,
+ `random_state = 100`.
+
+ For the grid search, use the following:
+
+ ```
+ {'criterion': ['gini', 'entropy'],\
+ 'max_features': [2, 4, 6, 8, 10, 12, 14]}
+ ```
+
+
+ For the randomized search, use the following:
+
+ ```
+ {'criterion': ['gini', 'entropy'],\
+ 'max_features': stats.randint(low=1, high=max_features)}
+ ```
+
+
+6. Plot the mean test score versus hyperparameterization for the top 10
+ models found using random search.
+
+ You should see a plot similar to the following:
+
+
+
+Caption: Mean test score plot
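+
+If you are unsure how to approach the one-hot encoding step, the
+following minimal sketch (using a placeholder DataFrame of categorical
+features called `X`, not the activity solution) shows the general pattern:
+
+```
+from sklearn import preprocessing
+# fit the encoder on the categorical features and return a dense array
+X_encoded = preprocessing.OneHotEncoder().fit_transform(X).toarray()
+```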
+
+
+Summary
+=======
+
+
+In this lab, we have covered three strategies for hyperparameter
+tuning based on searching for estimator hyperparameterizations that
+improve performance.
+
+
+The grid search is an automated method that is the most systematic of
+the three but can be very computationally intensive to run when the
+range of possible hyperparameterizations increases.
+The random search, while the most complicated to set up, is based on
+sampling from distributions of hyperparameters.
\ No newline at end of file
diff --git a/lab_guides/Lab_9.md b/lab_guides/Lab_9.md
new file mode 100644
index 0000000..0b8ad5f
--- /dev/null
+++ b/lab_guides/Lab_9.md
@@ -0,0 +1,1565 @@
+
+9. Interpreting a Machine Learning Model
+========================================
+
+
+
+Overview
+
+This lab will show you how to interpret a machine learning model\'s
+results and get deeper insights into the patterns it found. By the end
+of the lab, you will be able to analyze weights from linear models
+and variable importance for `RandomForest`. You will be able
+to implement variable importance via permutation to analyze feature
+importance. You will use a partial dependence plot to analyze single
+variables and make use of the lime package for local interpretation.
+
+
+Introduction
+============
+
+
+In the previous lab, you saw how to find the optimal hyperparameters
+of some of the most popular machine learning algorithms in order to get
+better predictive performance (that is, more accurate predictions).
+
+Machine learning algorithms are often referred to as black boxes: we can
+see the inputs and outputs, but what happens inside the
+algorithm is opaque, so it is hard to know how a prediction
+was reached.
+
+With each passing day, the need for more
+transparency in machine learning models grows. In the last few years, we have
+seen some cases where algorithms have been accused of discriminating
+against groups of people. For instance, a few years ago, a
+not-for-profit news organization called ProPublica highlighted bias in
+the COMPAS algorithm, built by the Northpointe company. The objective of
+the algorithm is to assess the likelihood of re-offending for a
+criminal. It was shown that the algorithm was predicting a higher level
+of risk for specific groups of people based on their demographics rather
+than other features. This example highlighted the importance of
+interpreting the results of your model and its logic properly and
+clearly.
+
+Luckily, some machine learning algorithms provide methods to understand
+the parameters they learned for a given task and dataset. There are also
+techniques that are model-agnostic and can help us better
+understand the predictions made. In other words, the techniques used to
+interpret a model are either model-specific or model-agnostic.
+
+These techniques can also differ in their scope. In the literature, we
+either have a global or local interpretation. A global interpretation
+means we are looking at the variables for all observations from a
+dataset and we want to understand which features have the biggest
+overall influence on the target variable. For instance, if you are
+predicting customer churn for a telco company, you may find the most
+important features for your model are customer usage and the average
+monthly amount paid. Local interpretation, on the other hand, focuses
+only on a single observation and analyzes the impact of the different
+variables. We will look at a single specific case and see what led the
+model to make its final prediction. For example, you will look at a
+specific customer who is predicted to churn and will discover that they
+usually buy the new iPhone model every year, in September.
+
+In this lab, we will go through some techniques on how to interpret
+your models or their results.
+
+
+Linear Model Coefficients
+=========================
+
+
+In *Lab 2, Regression*, and *Lab 3, Binary Classification*, you
+saw that linear regression models learn function parameters in the form
+of the following:
+
+
+
+
+In `sklearn`, it is extremely easy to get the coefficient of a
+linear model; you just need to call the `coef_` attribute.
+Let\'s implement this on a real example with the Diabetes dataset from
+`sklearn`:
+
+```
+from sklearn.datasets import load_diabetes
+from sklearn.linear_model import LinearRegression
+data = load_diabetes()
+# fit a linear regression model to the data
+lr_model = LinearRegression()
+lr_model.fit(data.data, data.target)
+lr_model.coef_
+```
+
+The output will be as follows:
+
+
+
+Caption: Coefficients of the linear regression parameters
+
+Let\'s create a DataFrame with these values and column names:
+
+```
+import pandas as pd
+coeff_df = pd.DataFrame()
+coeff_df['feature'] = data.feature_names
+coeff_df['coefficient'] = lr_model.coef_
+coeff_df.head()
+```
+
+The output will be as follows:
+
+
+
+Caption: Coefficients of the linear regression model
+
+A large positive or a large negative number for a feature coefficient
+means it has a strong influence on the outcome. On the other hand, if
+the coefficient is close to 0, this means the variable does not have
+much impact on the prediction.
+
+From this table, we can see that column `s1` has a very low
+coefficient (a large negative number), so it negatively influences the
+final prediction: if `s1` increases by 1 unit, the
+predicted value will decrease by `792.184162`. On the other
+hand, `bmi` has a large positive coefficient
+(`519.839787`), so the risk of diabetes is
+highly linked to this feature: an increase in body mass index (BMI)
+means a significant increase in the risk of diabetes.
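+
+Since it is the magnitude of a coefficient, regardless of its sign, that
+indicates how strongly a feature influences the prediction, a convenient
+optional step is to rank the features by absolute coefficient. A minimal
+sketch, reusing the `coeff_df` DataFrame built above:
+
+```
+# rank features by the absolute size of their coefficient
+coeff_df['abs_coefficient'] = coeff_df['coefficient'].abs()
+print(coeff_df.sort_values('abs_coefficient', ascending=False))
+```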
+
+
+
+Exercise 9.01: Extracting the Linear Regression Coefficient
+-----------------------------------------------------------
+
+In this exercise, we will train a linear regression model to predict the
+customer drop-out ratio and extract its coefficients.
+
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the following packages: `pandas`,
+ `train_test_split` from
+ `sklearn.model_selection`, `StandardScaler` from
+ `sklearn.preprocessing`, `LinearRegression` from
+ `sklearn.linear_model`, `mean_squared_error`
+ from `sklearn.metrics`, and `altair`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.linear_model import LinearRegression
+ from sklearn.metrics import mean_squared_error
+ import altair as alt
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ to the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab09/Dataset/phpYYZ4Qc.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame called `df` using
+ `.read_csv()`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Print the first five rows of the DataFrame:
+
+ ```
+ df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: First five rows of the loaded DataFrame
+
+
+6. Extract the `rej` column using `.pop()` and save
+ it into a variable called `y`:
+ ```
+ y = df.pop('rej')
+ ```
+
+
+7. Print the summary of the DataFrame using `.describe()`.
+
+ ```
+ df.describe()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Statistical measures of the DataFrame
+
+ Note
+
+ The preceding figure is a truncated version of the output.
+
+ From this output, we can see the data is not standardized. The
+ variables have different scales.
+
+8. Split the DataFrame into training and testing sets using
+ `train_test_split()` with `test_size=0.3` and
+ `random_state = 1`:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.3, \
+ random_state=1)
+ ```
+
+
+9. Instantiate `StandardScaler`:
+ ```
+ scaler = StandardScaler()
+ ```
+
+
+10. Train `StandardScaler` on the training set and standardize
+ it using `.fit_transform()`:
+ ```
+ X_train = scaler.fit_transform(X_train)
+ ```
+
+
+11. Standardize the testing set using `.transform()`:
+ ```
+ X_test = scaler.transform(X_test)
+ ```
+
+
+12. Instantiate `LinearRegression` and save it to a variable
+ called `lr_model`:
+ ```
+ lr_model = LinearRegression()
+ ```
+
+
+13. Train the model on the training set using `.fit()`:
+
+ ```
+ lr_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of LinearRegression
+
+14. Predict the outcomes of the training and testing sets using
+ `.predict()`:
+ ```
+ preds_train = lr_model.predict(X_train)
+ preds_test = lr_model.predict(X_test)
+ ```
+
+
+15. Calculate the mean squared error on the training set and print its
+ value:
+
+ ```
+ train_mse = mean_squared_error(y_train, preds_train)
+ train_mse
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: MSE score of the training set
+
+ We achieved quite a low MSE score on the training set.
+
+16. Calculate the mean squared error on the testing set and print its
+ value:
+
+ ```
+ test_mse = mean_squared_error(y_test, preds_test)
+ test_mse
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: MSE score of the testing set
+
+ We also have a low MSE score on the testing set that is very similar
+ to the training one. So, our model is not overfitting.
+
+ Note
+
+ You may get slightly different outputs than those present here.
+ However, the values you would obtain should largely agree with those
+ obtained in this exercise.
+
+17. Print the coefficients of the linear regression model using
+ `.coef_`:
+
+ ```
+ lr_model.coef_
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Coefficients of the linear regression model
+
+18. Create an empty DataFrame called `coef_df`:
+ ```
+ coef_df = pd.DataFrame()
+ ```
+
+
+19. Create a new column called `feature` for this DataFrame
+ with the name of the columns of `df` using
+ `.columns`:
+ ```
+ coef_df['feature'] = df.columns
+ ```
+
+
+20. Create a new column called `coefficient` for this
+ DataFrame with the coefficients of the linear regression model using
+ `.coef_`:
+ ```
+ coef_df['coefficient'] = lr_model.coef_
+ ```
+
+
+21. Print the first five rows of `coef_df`:
+
+ ```
+ coef_df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: The first five rows of coef\_df
+
+    From this output, we can see the variables `a1sx` and
+    `a1sy` have the largest negative coefficients,
+    so they contribute more strongly (negatively) to the prediction than
+    the three other variables shown here.
+
+22. Plot a bar chart with Altair using `coef_df` and
+ `coefficient` as the `x` axis and
+ `feature` as the `y` axis:
+
+ ```
+ alt.Chart(coef_df).mark_bar().encode(x='coefficient',\
+ y="feature")
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+
+RandomForest Variable Importance
+================================
+
+
+After training `RandomForest`, you can assess its variable
+importance (or feature importance) with the
+`feature_importances_` attribute.
+
+Let\'s see how to extract this information from the Breast Cancer
+dataset from `sklearn`:
+
+```
+from sklearn.datasets import load_breast_cancer
+from sklearn.ensemble import RandomForestClassifier
+data = load_breast_cancer()
+X, y = data.data, data.target
+rf_model = RandomForestClassifier(random_state=168)
+rf_model.fit(X, y)
+rf_model.feature_importances_
+```
+
+The output will be as shown in the following figure:
+
+
+
+Caption: Feature importance of a Random Forest model
+
+Note
+
+Due to randomization, you may get a slightly different result.
+
+It might be a little difficult to evaluate which importance value
+corresponds to which variable from this output. Let\'s create a
+DataFrame that will contain these values with the name of the columns:
+
+```
+import pandas as pd
+varimp_df = pd.DataFrame()
+varimp_df['feature'] = data.feature_names
+varimp_df['importance'] = rf_model.feature_importances_
+varimp_df.head()
+```
+
+The output will be as follows:
+
+
+
+Caption: RandomForest variable importance for the first five
+features of the Breast Cancer dataset
+
+From this output, we can see that `mean radius` and
+`mean perimeter` have the highest scores, which means they are
+the most important in predicting the target variable. The
+`mean smoothness` column has a very low value, so it seems it
+doesn\'t influence the model much to predict the output.
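+
+To make the table easier to read, you can optionally sort
+`varimp_df` by importance. A one-line sketch:
+
+```
+# rank all features from most to least important
+print(varimp_df.sort_values('importance', ascending=False).head(10))
+```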
+
+Note
+
+The range of values of variable importance is different for datasets; it
+is not a standardized measure.
+
+Let\'s plot these variable importance values onto a graph using
+`altair`:
+
+```
+import altair as alt
+alt.Chart(varimp_df).mark_bar().encode(x='importance',\
+ y="feature")
+```
+
+The output will be as follows:
+
+
+
+Caption: Graph showing RandomForest variable importance
+
+
+Exercise 9.02: Extracting RandomForest Feature Importance
+---------------------------------------------------------
+
+In this exercise, we will extract the feature importance of a Random
+Forest classifier model trained to predict the customer drop-out ratio.
+
+We will be using the same dataset as in the previous exercise.
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the following packages: `pandas`,
+    `train_test_split` from
+    `sklearn.model_selection`,
+    `RandomForestRegressor` from `sklearn.ensemble`,
+    `mean_squared_error` from `sklearn.metrics`, and `altair`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestRegressor
+ from sklearn.metrics import mean_squared_error
+ import altair as alt
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ to the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab09/Dataset/phpYYZ4Qc.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame called `df` using
+ `.read_csv()`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Extract the `rej` column using `.pop()` and save
+ it into a variable called `y`:
+ ```
+ y = df.pop('rej')
+ ```
+
+
+6. Split the DataFrame into training and testing sets using
+ `train_test_split()` with `test_size=0.3` and
+ `random_state = 1`:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.3, \
+ random_state=1)
+ ```
+
+
+7. Instantiate `RandomForestRegressor` with
+ `random_state=1`, `n_estimators=50`,
+ `max_depth=6`, and `min_samples_leaf=60`:
+ ```
+ rf_model = RandomForestRegressor(random_state=1, \
+ n_estimators=50, max_depth=6,\
+ min_samples_leaf=60)
+ ```
+
+
+8. Train the model on the training set using `.fit()`:
+
+ ```
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of the Random Forest model
+
+9. Predict the outcomes of the training and testing sets using
+ `.predict()`:
+ ```
+ preds_train = rf_model.predict(X_train)
+ preds_test = rf_model.predict(X_test)
+ ```
+
+
+10. Calculate the mean squared error on the training set and print its
+ value:
+
+ ```
+ train_mse = mean_squared_error(y_train, preds_train)
+ train_mse
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: MSE score of the training set
+
+ We achieved quite a low MSE score on the training set.
+
+11. Calculate the MSE on the testing set and print its value:
+
+ ```
+ test_mse = mean_squared_error(y_test, preds_test)
+ test_mse
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: MSE score of the testing set
+
+ We also have a low MSE score on the testing set that is very similar
+ to the training one. So, our model is not overfitting.
+
+12. Print the variable importance using
+ `.feature_importances_`:
+
+ ```
+ rf_model.feature_importances_
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+    Caption: Feature importance of the Random Forest model
+
+13. Create an empty DataFrame called `varimp_df`:
+ ```
+ varimp_df = pd.DataFrame()
+ ```
+
+
+14. Create a new column called `feature` for this DataFrame
+ with the name of the columns of `df`, using
+ `.columns`:
+ ```
+ varimp_df['feature'] = df.columns
+ varimp_df['importance'] = rf_model.feature_importances_
+ ```
+
+
+15. Print the first five rows of `varimp_df`:
+
+ ```
+ varimp_df.head()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Variable importance of the first five variables
+
+ From this output, we can see the variables `a1cy` and
+ `a1sy` have the highest value, so they are more important
+ for predicting the target variable than the three other variables
+ shown here.
+
+16. Plot a bar chart with Altair using `varimp_df` and
+ `importance` as the `x` axis and
+ `feature` as the `y` axis:
+
+ ```
+ alt.Chart(varimp_df).mark_bar().encode(x='importance',\
+ y="feature")
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Graph showing the variable importance of the first five
+variables
+
+From this output, we can see the variables that impact the prediction
+the most for this Random Forest model are `a2pop`,
+`a1pop`, `a3pop`, `b1eff`, and
+`temp`, by decreasing order of importance.
+
+
+
+Variable Importance via Permutation
+===================================
+
+
+In the previous section, we saw how to extract feature importance for
+RandomForest. There is actually another technique that shares the same
+name, but its underlying logic is different and can be applied to any
+algorithm, not only tree-based ones.
+
+This technique can be referred to as variable importance via
+permutation. Let\'s say we trained a model to predict a target variable
+with five classes and achieved an accuracy of 0.95. One way to assess
+the importance of one of the features is to remove it, retrain the model,
+and see the new accuracy score. If the accuracy score dropped significantly,
+then we could infer that this variable has a significant impact on the
+prediction. On the other hand, if the score slightly decreased or stayed
+the same, we could say this variable is not very important and doesn\'t
+influence the final prediction much. So, we can use this difference
+in the model\'s performance to assess the importance of a variable.
+
+The drawback of this method is that you need to retrain a new model for
+each variable. If it took you a few hours to train the original model
+and you have 100 different features, it would take quite a while to
+compute the importance of each variable. It would be great if we didn\'t
+have to retrain different models. So, another solution would be to
+generate noise or new values for a given column and predict the final
+outcomes from this modified data and compare the accuracy score. For
+example, if you have a column with values between 0 and 100, you can
+take the original data and randomly generate new values for this column
+(keeping all other variables the same) and predict the class for them.
+
+This option also has a catch. The randomly generated values can be very
+different from the original data. Going back to the same example we saw
+before, if the original range of values for a column is between 0 and
+100 and we generate values that can be negative or take a very high
+value, it is not very representative of the real distribution of the
+original data. So, we will need to understand the distribution of each
+variable before generating new values.
+
+Rather than generating random values, we can simply swap (or permute)
+values of a column between different rows and use these modified cases
+for predictions. Then, we can calculate the related accuracy score and
+compare it with the original one to assess the importance of this
+variable. For example, we have the following rows in the original
+dataset:
+
+
+
+Caption: Example of the dataset
+
+We can swap the values for the X1 column and get a new dataset:
+
+
+
+Caption: Example of a swapped column from the dataset
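+
+Before reaching for a dedicated package, it can help to see what a single
+permutation round looks like when written by hand. The following is only
+a sketch (the function name `permutation_drop` and the fitted classifier
+`clf` are ours, not part of the lab), assuming `X` is a NumPy array of
+features and `y` the matching labels:
+
+```
+import numpy as np
+from sklearn.metrics import accuracy_score
+
+def permutation_drop(clf, X, y, column, random_state=42):
+    """Drop in accuracy after shuffling one column of X."""
+    rng = np.random.RandomState(random_state)
+    baseline = accuracy_score(y, clf.predict(X))
+    X_permuted = X.copy()
+    # swap the values of a single column between rows
+    X_permuted[:, column] = rng.permutation(X_permuted[:, column])
+    permuted = accuracy_score(y, clf.predict(X_permuted))
+    # the bigger the drop, the more important the feature
+    return baseline - permuted
+```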
+
+The `mlxtend` package provides a function to perform variable
+permutation and calculate variable importance values:
+`feature_importance_permutation`. Let\'s see how to use it
+with the Breast Cancer dataset from `sklearn`.
+
+First, let\'s load the data and train a Random Forest model:
+
+```
+from sklearn.datasets import load_breast_cancer
+from sklearn.ensemble import RandomForestClassifier
+
+data = load_breast_cancer()
+X, y = data.data, data.target
+rf_model = RandomForestClassifier(random_state=168)
+rf_model.fit(X, y)
+```
+
+Then, we will call the `feature_importance_permutation`
+function from `mlxtend.evaluate`. This function takes the
+following parameters:
+
+- `predict_method`: A function that will be called for model
+ prediction. Here, we will provide the `predict` method
+ from our trained `rf_model` model.
+- `X`: The features from the dataset. It needs to be in
+ NumPy array form.
+- `y`: The target variable from the dataset. It needs to be
+ in `Numpy` array form.
+- `metric`: The metric used for comparing the performance of
+ the model. For the classification task, we will use accuracy.
+- `num_rounds`: The number of times `mlxtend` will
+  permute the data and assess the change in performance.
+- `seed`: The seed set for getting reproducible results.
+
+Consider the following code snippet:
+
+```
+from mlxtend.evaluate import feature_importance_permutation
+imp_vals, _ = feature_importance_permutation\
+ (predict_method=rf_model.predict, X=X, y=y, \
+                metric='accuracy', num_rounds=1, seed=2)
+imp_vals
+```
+
+The output should be as follows:
+
+
+
+Caption: Variable importance by permutation
+
+Let\'s create a DataFrame containing these values and the names of the
+features and plot them on a graph with `altair`:
+
+```
+import pandas as pd
+varimp_df = pd.DataFrame()
+varimp_df['feature'] = data.feature_names
+varimp_df['importance'] = imp_vals
+varimp_df.head()
+import altair as alt
+alt.Chart(varimp_df).mark_bar().encode(x='importance',\
+ y="feature")
+```
+
+The output should be as follows:
+
+
+
+Caption: Graph showing variable importance by permutation
+
+These results are different from the ones we got from
+`RandomForest` in the previous section. Here, worst concave
+points is the most important, followed by worst area, and worst
+perimeter has a higher value than mean radius. So, we got the same list
+of the most important variables but in a different order. This confirms
+these three features are indeed the most important in predicting whether
+a tumor is malignant or not. The variable importance from
+`RandomForest` and the permutation have different logic,
+therefore, you might get different outputs when you run the code given
+in the preceding section.
+
+
+
+Exercise 9.03: Extracting Feature Importance via Permutation
+------------------------------------------------------------
+
+In this exercise, we will compute and extract feature importance via
+permutation for a Random Forest model trained to predict the
+customer drop-out ratio.
+
+We will be using the same dataset as in the previous exercise.
+
+The following steps will help you complete the exercise:
+
+1. Open a new Colab notebook.
+
+2. Import the following packages: `pandas`,
+ `train_test_split` from
+ `sklearn.model_selection`,
+ `RandomForestRegressor` from `sklearn.ensemble`,
+ `feature_importance_permutation` from
+ `mlxtend.evaluate`, and `altair`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestRegressor
+ from mlxtend.evaluate import feature_importance_permutation
+ import altair as alt
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ of the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab09/Dataset/phpYYZ4Qc.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame called `df` using
+ `.read_csv()`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Extract the `rej` column using `.pop()` and save
+ it into a variable called `y`:
+ ```
+ y = df.pop('rej')
+ ```
+
+
+6. Split the DataFrame into training and testing sets using
+ `train_test_split()` with `test_size=0.3` and
+ `random_state = 1`:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.3, \
+ random_state=1)
+ ```
+
+
+7. Instantiate `RandomForestRegressor` with
+ `random_state=1`, `n_estimators=50`,
+ `max_depth=6`, and `min_samples_leaf=60`:
+ ```
+ rf_model = RandomForestRegressor(random_state=1, \
+ n_estimators=50, max_depth=6, \
+ min_samples_leaf=60)
+ ```
+
+
+8. Train the model on the training set using `.fit()`:
+
+ ```
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest
+
+9. Extract the feature importance via permutation using
+ `feature_importance_permutation` from `mlxtend`
+ with the Random Forest model, the testing set, `r2` as the
+ metric used, `num_rounds=1`, and `seed=2`. Save
+ the results into a variable called `imp_vals` and print
+ its values:
+
+ ```
+ imp_vals, _ = feature_importance_permutation\
+ (predict_method=rf_model.predict, \
+ X=X_test.values, y=y_test.values, \
+ metric='r2', num_rounds=1, seed=2)
+ imp_vals
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Variable importance by permutation
+
+    It is quite hard to interpret the raw results. Let\'s plot the
+    variable importance via permutation on a graph.
+
+10. Create a DataFrame called `varimp_df` with two columns:
+ `feature` containing the name of the columns of
+ `df`, using `.columns` and
+ `'importance'` containing the values of
+ `imp_vals`:
+ ```
+ varimp_df = pd.DataFrame({'feature': df.columns, \
+ 'importance': imp_vals})
+ ```
+
+
+11. Plot a bar chart with Altair using `varimp_df` and
+ `importance` as the `x` axis and
+ `feature` as the `y` axis:
+
+ ```
+ alt.Chart(varimp_df).mark_bar().encode(x='importance',\
+ y="feature")
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Graph showing the variable importance by permutation
+
+
+
+Partial Dependence Plots
+========================
+
+
+Another tool that is model-agnostic is a partial dependence plot. It is
+a visual tool for analyzing the effect of a feature on the target
+variable. To achieve this, we can plot the values of the feature we are
+interested in analyzing on the `x`-axis and the target
+variable on the `y`-axis and then show all the observations
+from the dataset on this graph. Let\'s try it on the Breast Cancer
+dataset from `sklearn`:
+
+```
+from sklearn.datasets import load_breast_cancer
+import pandas as pd
+data = load_breast_cancer()
+df = pd.DataFrame(data.data, columns=data.feature_names)
+df['target'] = data.target
+```
+Now that we have loaded the data and converted it to a DataFrame, let\'s
+have a look at the worst concave points column:
+
+```
+import altair as alt
+alt.Chart(df).mark_circle(size=60)\
+ .encode(x='worst concave points', y='target')
+```
+
+The resulting plot is as follows:
+
+
+
+Caption: Scatter plot of the worst concave points and target
+variables
+
+Note
+
+The preceding code and figure are just examples. We encourage you to
+analyze different features by changing the values assigned to
+`x` and `y` in the preceding code. For example, you
+can possibly analyze worst concavity versus worst perimeter by setting
+`x='worst concavity'` and `y='worst perimeter'` in
+the preceding code.
+
+From this plot, we can see:
+
+- Most cases with 1 for the target variable have values under 0.16 for
+ the worst concave points column.
+- Cases with a 0 value for the target have values of over 0.08 for
+ worst concave points.
+
+With this plot, we are not too sure about which outcome (0 or 1) we will
+get for the values between 0.08 and 0.16 for worst concave points. There
+are multiple possible reasons why the outcome of the observations within
+this range of values is uncertain, such as the fact that there are not
+many records that fall into this case, or other variables might
+influence the outcome for these cases. This is where a partial
+dependence plot can help.
+
+The logic is very similar to variable importance via permutation but
+rather than randomly replacing the values in a column, we will test
+every possible value within that column for all observations and see
+what predictions it leads to. If we take the example from figure 9.21,
+from the three observations we had originally, this method will create
+six new observations by keeping columns `X2` and
+`X3` as they were and replacing the values of `X1`:
+
+
+
+Caption: Example of records generated from a partial dependence plot
+
+With this new data, we can see, for instance, whether the value 12
+really has a strong influence on predicting 1 for the target variable.
+The original records, with the values 42 and 1 for the `X1`
+column, lead to outcome 0 and only value 12 generated a prediction of 1.
+But if we take the same observations for `X1`, equal to 42 and
+1, and replace that value with 12, we can see whether the new
+predictions will lead to 1 for the target variable. This is exactly the
+logic behind a partial dependence plot, and it will assess all the
+permutations possible for a column and plot the average of
+the predictions.
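+
+Written out by hand, this averaging logic looks roughly like the
+following sketch (the function name `partial_dependence_curve` and the
+fitted classifier `model`, which must expose `predict_proba`, are ours,
+not from the lab; `X` is assumed to be a NumPy array of features):
+
+```
+import numpy as np
+
+def partial_dependence_curve(model, X, feature_index, grid):
+    """Average predicted probability of class 1 for each candidate value."""
+    averaged = []
+    for value in grid:
+        X_modified = X.copy()
+        # force every observation to take the same value for this feature
+        X_modified[:, feature_index] = value
+        averaged.append(model.predict_proba(X_modified)[:, 1].mean())
+    return np.array(averaged)
+```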
+
+`sklearn` provides a function called
+`plot_partial_dependence()` to display the partial dependence
+plot for a given feature. Let\'s see how to use it on the Breast Cancer
+dataset. First, we need to get the index of the column we are interested
+in. We will use the `.get_loc()` method from
+`pandas` to get the index for the
+`worst concave points` column:
+
+```
+import altair as alt
+from sklearn.inspection import plot_partial_dependence
+feature_index = df.columns.get_loc("worst concave points")
+```
+Now we can call the `plot_partial_dependence()` function. We
+need to provide the following parameters: the trained model, the
+training set, and the indices of the features to be analyzed:
+
+```
+plot_partial_dependence(rf_model, df[data.feature_names], \
+                        features=[feature_index])
+```
+
+
+Caption: Partial dependence plot for the worst concave points column
+
+This partial dependence plot shows us that, on average, all the
+observations with a value under 0.17 for the worst concave points column
+will most likely lead to a prediction of 1 for the target (probability
+over 0.5) and all the records over 0.17 will have a prediction of 0
+(probability under 0.5).
+
+
+
+Exercise 9.04: Plotting Partial Dependence
+------------------------------------------
+
+In this exercise, we will plot partial dependence plots for two
+variables, `a1pop` and `temp`, from a Random Forest
+classifier model trained to predict the customer drop-out ratio.
+
+We will be using the same dataset as in the previous exercise.
+
+1. Open a new Colab notebook.
+
+2. Import the following packages: `pandas`,
+ `train_test_split` from
+ `sklearn.model_selection`,
+ `RandomForestRegressor` from `sklearn.ensemble`,
+ `plot_partial_dependence` from
+ `sklearn.inspection`, and `altair`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestRegressor
+ from sklearn.inspection import plot_partial_dependence
+ import altair as alt
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ for the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab09/Dataset/phpYYZ4Qc.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame called `df` using
+ `.read_csv()`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Extract the `rej` column using `.pop()` and save
+ it into a variable called `y`:
+ ```
+ y = df.pop('rej')
+ ```
+
+
+6. Split the DataFrame into training and testing sets using
+ `train_test_split()` with `test_size=0.3` and
+ `random_state = 1`:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.3, \
+ random_state=1)
+ ```
+
+
+7. Instantiate `RandomForestRegressor` with
+ `random_state=1`, `n_estimators=50`,
+ `max_depth=6`, and `min_samples_leaf=60`:
+ ```
+ rf_model = RandomForestRegressor(random_state=1, \
+ n_estimators=50, max_depth=6,\
+ min_samples_leaf=60)
+ ```
+
+
+8. Train the model on the training set using `.fit()`:
+
+ ```
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest
+
+9. Plot the partial dependence plot using
+ `plot_partial_dependence()` from `sklearn` with
+ the Random Forest model, the testing set, and the index of the
+ `a1pop` column:
+
+ ```
+ plot_partial_dependence(rf_model, X_test, \
+ features=[df.columns.get_loc('a1pop')])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Partial dependence plot for a1pop
+
+ This partial dependence plot shows that, on average, the
+ `a1pop` variable doesn\'t affect the target variable much
+ when its value is below 2, but from there the target increases
+ linearly by 0.04 for each unit increase of `a1pop`. This
+ means if the population size of area 1 is below the value of 2, the
+ risk of churn is almost null. But over this limit, every increment
+ of population size for area 1 increases the chance of churn by
+ `4%`.
+
+10. Plot the partial dependence plot using
+ `plot_partial_dependence()` from `sklearn` with
+ the Random Forest model, the testing set, and the index of the
+ `temp` column:
+
+ ```
+ plot_partial_dependence(rf_model, X_test, \
+ features=[df.columns.get_loc('temp')])
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: Partial dependence plot for temp
+
+This partial dependence plot shows that, on average, the
+`temp` variable has a negative linear impact on the target
+variable: when `temp` increases by 1, the target variable will
+decrease by 0.12. This means if the temperature increases by a degree,
+the chance of leaving the queue decreases by 12%.
+
+
+
+Local Interpretation with LIME
+==============================
+
+
+LIME (Local Interpretable Model-agnostic Explanations) is one way to get
+more visibility in such cases. The underlying
+logic of LIME is to approximate the original, possibly nonlinear, model with a
+linear one. Then, it uses the coefficients of that linear model in order
+to explain the contribution of each variable, as we just saw in the
+preceding example. But rather than trying to approximate the entire
+model for the whole dataset, LIME tries to approximate it locally around
+the observation you are interested in. LIME uses the trained model to
+predict new data points near your observation and then fits a linear
+regression on those predictions.
+
+Let\'s see how we can use it on the Breast Cancer dataset. First, we
+will load the data and train a Random Forest model:
+
+```
+from sklearn.datasets import load_breast_cancer
+from sklearn.model_selection import train_test_split
+from sklearn.ensemble import RandomForestClassifier
+data = load_breast_cancer()
+X, y = data.data, data.target
+X_train, X_test, y_train, y_test = train_test_split\
+ (X, y, test_size=0.3, \
+ random_state=1)
+rf_model = RandomForestClassifier(random_state=168)
+rf_model.fit(X_train, y_train)
+```
+
+The `lime` package is not directly accessible on Google Colab,
+so we need to manually install it with the following command:
+
+```
+!pip install lime
+```
+
+The output will be as follows:
+
+
+
+Caption: Installation logs for the lime package
+
+Once installed, we will instantiate the `LimeTabularExplainer`
+class by providing the training data, the names of the features, the
+names of the classes to be predicted, and the task type (in this
+example, it is `classification`):
+
+```
+from lime.lime_tabular import LimeTabularExplainer
+lime_explainer = LimeTabularExplainer\
+ (X_train, feature_names=data.feature_names,\
+ class_names=data.target_names,\
+ mode='classification')
+```
+
+Then, we will call the `.explain_instance()` method with the
+observations we are interested in (here, it will be the second
+observation from the testing set) and the function that will predict the
+outcome probabilities (here, it is `.predict_proba()`).
+Finally, we will call the `.show_in_notebook()` method to
+display the results from `lime`:
+
+```
+exp = lime_explainer.explain_instance\
+ (X_test[1], rf_model.predict_proba, num_features=10)
+exp.show_in_notebook()
+```
+
+The output will be as follows:
+
+
+
+Caption: Output of LIME
+
+Note
+
+Your output may differ slightly. This is due to the random sampling
+process of LIME.
+
+There is a lot of information in the preceding output. Let\'s go through
+it a bit at a time. The left-hand side shows the prediction
+probabilities for the two classes of the target variable. For this
+observation, the model thinks there is a 0.85 probability that the
+predicted value will be malignant:
+
+
+
+Caption: Prediction probabilities from LIME
+
+The right-hand side shows the value of each feature for this
+observation. Each feature is color-coded to highlight its contribution
+toward the possible classes of the target variable. The list sorts the
+features by decreasing importance. In this example, the mean perimeter,
+mean area, and area error features pushed the model toward a higher
+probability of class 1, while all the other features pushed the model
+toward outcome 0:
+
+
+
+Caption: Value of the feature for the observation of interest
+
+Finally, the central part shows how each variable contributed to the
+final prediction. In this example, the `worst concave points`
+and `worst compactness` variables led to an increase of,
+respectively, 0.10 and 0.05 probability in predicting outcome 0. On the
+other hand, `mean perimeter` and `mean area` both
+contributed to an increase of 0.03 probability of predicting class 1:
+
+
+
+Caption: Contribution of each feature to the final prediction
+
+It\'s as simple as that. With LIME, we can easily see how each variable
+impacted the probabilities of predicting the different outcomes of the
+model. As you saw, the LIME package not only computes the local
+approximation but also provides a visual representation of its results.
+It is much easier to interpret than looking at raw outputs. It is also
+very useful for presenting your findings and illustrating how different
+features influenced the prediction of a single observation.
+
+
+
+Exercise 9.05: Local Interpretation with LIME
+---------------------------------------------
+
+In this exercise, we will analyze some predictions from a Random Forest
+classifier model trained to predict the customer drop-out ratio using
+LIME.
+
+We will be using the same dataset as in the previous exercise.
+
+1. Open a new Colab notebook.
+
+2. Import the following packages: `pandas`,
+ `train_test_split` from
+ `sklearn.model_selection`, and
+ `RandomForestRegressor` from `sklearn.ensemble`:
+ ```
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import RandomForestRegressor
+ ```
+
+
+3. Create a variable called `file_url` that contains the URL
+ of the dataset:
+ ```
+ file_url = 'https://raw.githubusercontent.com/'\
+ 'fenago/data-science/'\
+ 'master/Lab09/Dataset/phpYYZ4Qc.csv'
+ ```
+
+
+4. Load the dataset into a DataFrame called `df` using
+ `.read_csv()`:
+ ```
+ df = pd.read_csv(file_url)
+ ```
+
+
+5. Extract the `rej` column using `.pop()` and save
+ it into a variable called `y`:
+ ```
+ y = df.pop('rej')
+ ```
+
+
+6. Split the DataFrame into training and testing sets using
+ `train_test_split()` with `test_size=0.3` and
+ `random_state = 1`:
+ ```
+ X_train, X_test, y_train, y_test = train_test_split\
+ (df, y, test_size=0.3, \
+ random_state=1)
+ ```
+
+
+7. Instantiate `RandomForestRegressor` with
+ `random_state=1`, `n_estimators=50`,
+ `max_depth=6`, and `min_samples_leaf=60`:
+ ```
+ rf_model = RandomForestRegressor(random_state=1, \
+ n_estimators=50, max_depth=6,\
+ min_samples_leaf=60)
+ ```
+
+
+8. Train the model on the training set using `.fit()`:
+
+ ```
+ rf_model.fit(X_train, y_train)
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: Logs of RandomForest
+
+9. Install the lime package using the `!pip install` command:
+ ```
+ !pip install lime
+ ```
+
+
+10. Import `LimeTabularExplainer` from
+ `lime.lime_tabular`:
+ ```
+ from lime.lime_tabular import LimeTabularExplainer
+ ```
+
+
+11. Instantiate `LimeTabularExplainer` with the training set
+ and `mode='regression'`:
+ ```
+ lime_explainer = LimeTabularExplainer\
+ (X_train.values, \
+ feature_names=X_train.columns, \
+ mode='regression')
+ ```
+
+
+12. Display the LIME analysis on the first row of the testing set using
+ `.explain_instance()` and `.show_in_notebook()`:
+
+ ```
+ exp = lime_explainer.explain_instance\
+ (X_test.values[0], rf_model.predict)
+ exp.show_in_notebook()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+ Caption: LIME output for the first observation of the testing
+ set
+
+    This output shows that the predicted value for this observation is a
+    customer drop-out ratio of 0.02, and that the prediction was mainly
+    influenced by the `a1pop`, `a3pop`, `a2pop`, and
+    `b2eff` features. For instance, the fact that
+    `a1pop` was under 0.87 decreased the predicted value
+    by 0.01.
+
+13. Display the LIME analysis on the third row of the testing set using
+ `.explain_instance()` and `.show_in_notebook()`:
+
+ ```
+ exp = lime_explainer.explain_instance\
+ (X_test.values[2], rf_model.predict)
+ exp.show_in_notebook()
+ ```
+
+
+ You should get the following output:
+
+
+
+
+
+Caption: LIME output for the third observation of the testing set
+
+
+You have completed the last exercise of this lab. You saw how to use
+LIME to interpret the predictions for single observations. We learned
+that the `a1pop`, `a2pop`, and `a3pop` features
+have a strong negative impact on the first and third observations of the
+testing set.
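+
+If you want to share one of these explanations outside a notebook, the
+explanation object can also be exported as a standalone HTML page. This
+is a small sketch using the `exp` object from the previous step; the
+file name is an arbitrary choice:
+
+```
+# Export the interactive LIME view to a self-contained HTML file
+exp.save_to_file('lime_explanation.html')
+```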
+
+
+
+Activity 9.01: Train and Analyze a Network Intrusion Detection Model
+--------------------------------------------------------------------
+
+You are working for a cybersecurity company and you have been tasked
+with building a model that can recognize network intrusions, then
+analyzing its feature importance, plotting partial dependence, and
+performing local interpretation on a single observation using LIME.
+
+The dataset provided contains data from 7 weeks of network traffic.
+
+
+The following steps will help you to complete this activity:
+
+1. Download and load the dataset using `.read_csv()` from
+ `pandas`.
+
+2. Extract the response variable using `.pop()` from
+ `pandas`.
+
+3. Split the dataset into training and test sets using
+ `train_test_split()` from
+ `sklearn.model_selection`.
+
+4. Create a function that will instantiate and fit
+ `RandomForestClassifier` using `.fit()` from
+ `sklearn.ensemble`.
+
+5. Create a function that will predict the outcome for the training and
+ testing sets using `.predict()`.
+
+6. Create a function that will print the accuracy score for the
+ training and testing sets using `accuracy_score()` from
+ `sklearn.metrics`.
+
+7. Compute the feature importance via permutation with
+    `feature_importance_permutation()` and display it on a bar
+    chart using `altair`.
+
+8. Plot the partial dependence of the `src_bytes` variable
+    using `plot_partial_dependence` (a sketch of steps 7 and 8
+    is shown after this list).
+
+9. Install `lime` using `!pip install`.
+
+10. Perform a LIME analysis on row `99893` with
+ `explain_instance()`.
+
+ The output should be as follows:
+
+
+
+
+
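+A minimal, hedged sketch of what steps 7 and 8 could look like is shown
+below. It assumes that `rf_model`, `X_train`, `y_train`,
+`X_test`, and `y_test` already exist from steps 1 to 6, that the
+features are held in a pandas DataFrame, and that `mlxtend` and
+`altair` are installed; adapt the names to your own notebook:
+
+```
+import altair as alt
+import pandas as pd
+from mlxtend.evaluate import feature_importance_permutation
+from sklearn.inspection import plot_partial_dependence
+
+# Step 7: feature importance via permutation, measured on the testing set
+imp_vals, _ = feature_importance_permutation(
+    X=X_test.values, y=y_test.values,
+    predict_method=rf_model.predict,
+    metric='accuracy', num_rounds=1, seed=1)
+
+varimp_df = pd.DataFrame({'feature': X_train.columns,
+                          'importance': imp_vals})
+
+# Bar chart of the permutation importance (display it in its own cell)
+chart = alt.Chart(varimp_df).mark_bar().encode(x='importance',
+                                               y='feature')
+
+# Step 8: partial dependence of the prediction on src_bytes
+# (on recent scikit-learn versions, use
+# PartialDependenceDisplay.from_estimator instead; for a multi-class
+# target you may also need to pass the target= argument)
+plot_partial_dependence(rf_model, X_train, ['src_bytes'])
+```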
+
+Summary
+=======
+
+
+In this lab, we learned a few techniques for interpreting machine
+learning models. We saw that some techniques are specific to the model
+used: coefficients for linear models and variable importance for
+tree-based models. Others, such as variable importance via permutation,
+are model-agnostic. Finally, we used LIME to interpret the predictions
+made for individual observations.
diff --git a/lab_guides/logo.png b/lab_guides/logo.png
new file mode 100644
index 0000000..f30cbd1
Binary files /dev/null and b/lab_guides/logo.png differ
diff --git a/lab_guides/lab_overview.md b/lab_overview.md
similarity index 100%
rename from lab_guides/lab_overview.md
rename to lab_overview.md
diff --git a/logo.png b/logo.png
new file mode 100644
index 0000000..f30cbd1
Binary files /dev/null and b/logo.png differ