mirror of https://github.com/fenago/data-science.git (synced 2026-05-04 00:22:32 +00:00)

added
+1
-47
@@ -641,20 +641,7 @@ dataset, refer to the following note. Let\'s get started:


Caption: List of columns and their assigned data types

Note

The preceding output has been truncated.

From *Lab 10*, *Analyzing a Dataset*, you know that the `Id`, `MSSubClass`, `OverallQual`, and `OverallCond` columns have been incorrectly classified as numerical variables. They have a finite number of unique values and you can't perform any meaningful mathematical operations on them. For example, it doesn't make sense to add, subtract, multiply, or divide two different values from the `Id` column. Therefore, you need to convert them into categorical variables.

6. Using the `astype()` method, convert the `'Id'` column into a categorical variable, as shown in the following code
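The book's own snippet for this step sits outside this excerpt; a minimal sketch of the conversion, using a tiny stand-in DataFrame (the variable name `df` and its contents are assumptions):

```
import pandas as pd

# Tiny stand-in for the AMES Housing DataFrame used in the exercise.
df = pd.DataFrame({'Id': [1, 2, 3], 'MSSubClass': [20, 60, 20]})

# Convert the 'Id' column to the pandas 'category' dtype.
df['Id'] = df['Id'].astype('category')
print(df.dtypes)
```

The same `astype('category')` call can be repeated for `MSSubClass`, `OverallQual`, and `OverallCond` in the later steps.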
@@ -694,14 +681,6 @@ dataset, refer to the following note. Let\'s get started:


Caption: List of categories for the four newly converted variables

Now, these four columns have been converted into categorical variables. From the output of *Step 5*, we can see that there are a lot of variables of the `object` type. Let's have a look at them and see if they need to be converted as well.

9. Create a new DataFrame called `obj_df` that will only contain variables of the `object` type using the `select_dtypes` method along with the
@@ -1348,15 +1327,7 @@ You should get the following output:


Caption: Rows with missing values in CustomerID

This time, all the transactions look normal, except that they are missing values for the `CustomerID` column; all the other variables have been filled in with values that seem genuine. There is no other way to infer the missing values for the `CustomerID` column. These rows represent almost 25% of the dataset, so we can't remove them.

However, most algorithms require a value for each observation, so you need to provide one for these cases. We will use the `.fillna()` method from `pandas` to do this. Provide the value to be imputed as `Missing` and use
@@ -1385,15 +1356,6 @@ You should get the following output:


Caption: Summary of missing values for each variable

You have successfully fixed all the missing values in this dataset. These methods also work when we want to handle missing numerical variables. We will look at this in the following exercise. All you need to do is provide a numerical value when you want to impute a value with `.fillna()`.
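As a quick, hedged illustration of the imputation described above (the DataFrame contents here are invented, not the Online Retail data):

```
import numpy as np
import pandas as pd

# Stand-in for the Online Retail DataFrame used in this section.
df = pd.DataFrame({'InvoiceNo': ['536365', '536366', '536367'], \
                   'CustomerID': [17850.0, np.nan, 13047.0]})

# Impute the value 'Missing' wherever CustomerID is absent.
df['CustomerID'] = df['CustomerID'].fillna('Missing')

# For a numerical column, pass a number instead, for example the median.
print(df.isna().sum())
```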
Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
-----------------------------------------------------------------
@@ -1699,48 +1661,40 @@ The following figure illustrates a list of unique values for gaming:


Caption: List of unique values for gaming

The following figure displays the data types of each column:



Caption: Data types of each column

The following figure displays the updated data types of each column:



Caption: Data types of each column

The following figure displays the number of missing values for numerical variables:



Caption: Number of missing values for numerical variables

The following figure displays the list of unique values for `int_corr`:



Caption: List of unique values for 'int_corr'

The following figure displays the list of unique values for numerical variables:



Caption: List of unique values for numerical variables

The following figure displays the number of missing values for numerical variables:



Caption: Number of missing values for numerical variables

Summary
=======
@@ -38,14 +38,7 @@ You should get the following output.


Caption: First five rows of the Online Retail dataset

Next, we are going to load all the public holidays in the UK into another `pandas` DataFrame. From *Lab 10*, *Analyzing a Dataset*, we know the records of this dataset only cover the years 2010 and 2011, so we are going to extract public holidays for those two years. We need to do this in two separate calls because the API provided by `date.nager` only returns one year at a time.

Let's focus on 2010 first:
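The loading code itself falls outside this excerpt; a hedged sketch of what such a call could look like is shown below. The exact `date.nager` endpoint URL and version are assumptions, not taken from the book:

```
import pandas as pd

# Assumed endpoint format for the date.nager public-holiday API (UK = 'GB').
url_2010 = "https://date.nager.at/api/v3/PublicHolidays/2010/GB"

# read_json accepts a URL and loads the JSON response into a DataFrame.
uk_holidays_2010 = pd.read_json(url_2010)
print(uk_holidays_2010.head())
```

The same call with `2011` in the URL would cover the second year.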
@@ -759,17 +752,6 @@ You should get the following output:
```
30
```

`30` is the number of unique values for the `Country_bin` column. So we reduced the number of unique values in this column from `38` to `30`.

We just saw how to group categorical values together, but the same process can be applied to numerical values as well. For instance, it is quite common to group people's ages into bins such as 20s (20 to 29 years old), 30s (30 to 39), and so on.
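A hedged sketch of such numerical binning with `pandas.cut`; the ages below are invented purely for illustration:

```
import pandas as pd

# Illustrative ages only; not from the book's datasets.
ages = pd.Series([23, 35, 31, 47, 52, 29])

# Bin the ages into decades: [20, 30) -> '20s', [30, 40) -> '30s', and so on.
age_bins = pd.cut(ages, bins=[20, 30, 40, 50, 60], \
                  right=False, labels=['20s', '30s', '40s', '50s'])
print(age_bins.value_counts())
```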
Have a look at *Exercise 12.02*, *Binning the YearBuilt variable from the AMES Housing dataset*.
@@ -1768,8 +1750,3 @@ of a dataset are and identifying data quality issues. We saw how to
handle and fix some of the most frequent issues (duplicate rows, type conversion, value replacement, and missing values) using `pandas`' APIs. Finally, we went through several feature engineering techniques.

The next lab opens a new part of this course that presents data science use cases end to end. *Lab 13*, *Imbalanced Datasets*, will walk you through an example of an imbalanced dataset and how to deal with such a situation.
@@ -148,10 +148,6 @@ Classification*, and you will look closely at the metrics:
```

After the categorical values are transformed, they must be combined with the scaled numerical values of the data frame to get the feature-engineered dataset.

10. Create the independent variables, `X`, and dependent variables, `Y`, from the combined dataset for modeling, as in the following code snippet:
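The snippet itself is outside this excerpt; a hedged sketch of such a step is given below. All of the column and variable names (`scaled_num`, `encoded_cat`, `approved`) are invented for illustration:

```
import pandas as pd

# Stand-ins for the scaled numerical and encoded categorical frames
# produced by the earlier feature engineering steps.
scaled_num = pd.DataFrame({'income_scaled': [0.2, 0.8], 'age_scaled': [0.5, 0.1]})
encoded_cat = pd.DataFrame({'housing_yes': [1, 0], 'housing_no': [0, 1]})
label = pd.Series([1, 0], name='approved')

# Combine the feature blocks, then split into X (features) and Y (label).
combined = pd.concat([scaled_num, encoded_cat, label], axis=1)
X = combined.drop('approved', axis=1)
Y = combined['approved']
print(X.shape, Y.shape)
```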
@@ -171,13 +167,9 @@ Classification*, and you will look closely at the metrics:
The output is as follows:



Caption: The independent variables and the combined data (truncated)

We are now ready for the modeling task. Let's first import the necessary packages.
@@ -1693,45 +1693,6 @@ The following steps will help you complete this exercise:
From this exercise, you may come up with a few questions:

- How do you think we can improve the classification results using ICA?
- Increasing the number of components results in a marginal increase in the accuracy metrics.
- Are there any other side effects because of the strategy adopted to improve the results?

Increasing the number of components also results in a longer training time for the logistic regression model.

Factor Analysis
---------------

Factor analysis is a technique that achieves dimensionality reduction by grouping variables that are highly correlated. Let's look at an example from our context of predicting advertisements.

In our dataset, there could be many features that describe the geometry (the size and shape of an image in the ad) of the images on a web page. These features can be correlated because they refer to specific characteristics of an image.

Similarly, there could be many features that describe the anchor text or phrases occurring in a URL, which are highly correlated. Factor analysis looks at correlated groups such as these in the data and then groups them into latent factors. Therefore, if there are 10 raw features describing the geometry of an image, factor analysis will group them into one feature that characterizes the geometry of an image. Each of these groups is called a factor. As many correlated features are combined to form a group, the resulting number of features will be much smaller in comparison with the original dimensions of the dataset.

Let's now see how factor analysis can be implemented as a technique for dimensionality reduction.
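Before the exercise, here is a minimal, hedged sketch of the scikit-learn API involved; the toy data and the choice of two factors are assumptions for illustration only:

```
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Toy data: 100 samples, 10 correlated "geometry-like" features.
rng = np.random.RandomState(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 2)) for _ in range(5)])

# Reduce the 10 correlated columns to 2 latent factors.
fa = FactorAnalysis(n_components=2, random_state=42)
X_factors = fa.fit_transform(X)
print(X_factors.shape)  # (100, 2)
```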
Exercise 14.06: Dimensionality Reduction Using Factor Analysis
@@ -2015,18 +1976,7 @@ You should get the following output:


Caption: Sample data frame

What we will do next is sample some data points with the same shape as the data frame we created.

Let's sample some data points from a normal distribution that has a mean of `0` and a standard deviation of `0.1`. We touched briefly on normal distributions in *Lab 3, Binary Classification*. A normal distribution has two parameters. The first one is the mean, which is the average of all the data in the distribution, and the second one is the standard deviation, which is a measure of how spread out the data points are.
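To make these two parameters concrete, here is a hedged sketch of drawing such samples with NumPy; the `(5, 3)` shape is an arbitrary choice for illustration:

```
import numpy as np

# Samples from a normal distribution with mean 0 and standard deviation 0.1.
rng = np.random.RandomState(8)
samples = rng.normal(loc=0, scale=0.1, size=(5, 3))
print(samples)
```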
By assuming a mean and standard deviation, we will be able to draw samples from a normal distribution using the
@@ -19,109 +19,6 @@ where we will try to predict whether a credit card application will be
approved.

Introduction
============

In the previous lab, we learned various techniques, such as the backward elimination technique, factor analysis, and so on, that helped us to deal with high-dimensional datasets.

In this lab, we will further enhance our repertoire of skills with another set of techniques, called **ensemble learning**. We will be dealing with different ensemble learning techniques such as the following:

- Averaging
- Weighted averaging
- Max voting
- Bagging
- Boosting
- Blending

Ensemble Learning
=================

Ensemble learning, as the name suggests, is a method that combines several machine learning models to generate a superior model, thereby decreasing variability/variance and bias, and boosting performance.

Before we explore what ensemble learning is, let's look at the concepts of bias and variance with the help of the classical bias-variance quadrant, as shown here:



Caption: Bias-variance quadrant

Variance
--------

Variance is the measure of how spread out data is. In the context of machine learning, models with high variance imply that the predictions generated on the same test set will differ considerably when different training sets are used to fit the model. The underlying reason for high variability could be attributed to the model being attuned to specific nuances of the training data rather than generalizing the relationship between input and output. Ideally, we want every machine learning model to have low variance.

Bias
----

Bias is the difference between the ground truth and the average value of our predictions. A low bias indicates that the predictions are very close to the actual values. A high bias implies that the model has oversimplified the relationship between the inputs and outputs, leading to high error rates on test sets, which again is an undesirable outcome.
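A tiny numeric illustration of these two definitions, using invented numbers rather than anything from the lab:

```
import numpy as np

# Predictions for the same test point produced by a model trained on five
# different training sets; the ground truth is 10.
truth = 10.0
preds = np.array([7.9, 8.1, 8.0, 8.2, 7.8])

bias = preds.mean() - truth    # how far the average prediction is from the truth
variance = preds.var()         # how much the predictions spread around their mean
print(bias, variance)          # roughly -2.0 and 0.02
```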
*Figure 15.1* helps us to visualize the trade-off between bias and variance. The top-left corner is the depiction of a scenario where the bias is high and the variance is low. The top-right quadrant displays a scenario where both bias and variance are high. From the figure, we can see that when the bias is high, the predictions land further away from the truth, which in this case is the *bull's eye*. The amount of variance is manifested in whether the arrows are spread out or congregated in one spot.

Ensemble models combine many weaker models that differ in variance and bias, thereby creating a better model that outperforms the individual weaker models. Ensemble models exemplify the adage *the wisdom of the crowds*. In this lab, we will learn about different ensemble techniques, which can be classified into two types, that is, simple and advanced techniques:



Caption: Different ensemble learning methods

Business Context
----------------

You are working in the credit card division of your bank. The operations head of your company has requested your help in determining whether a customer is creditworthy or not. You have been provided with credit card operations data.

This dataset contains credit card applications with around 15 variables. The variables are a mix of continuous and categorical data pertaining to credit card operations. The label for the dataset is a flag, which indicates whether the application has been approved or not.

You want to fit some benchmark models and try some ensemble learning methods on the dataset to address the problem and come up with a tool for predicting whether or not a given customer should be approved for their credit application.

Exercise 15.01: Loading, Exploring, and Cleaning the Data
---------------------------------------------------------
@@ -783,71 +680,6 @@ the new combination of weights in *iteration 2*:


Caption: Classification report

In this exercise, we implemented the weighted averaging technique for ensemble learning. We did two iterations with the weights. We saw that in the second iteration, where we increased the weight of the logistic regression prediction from `0.6` to `0.7`, the accuracy actually improved from `0.89` to `0.90`. This is a validation of our assumption about the prominence of the logistic regression model in the ensemble. To check whether there is more room for improvement, we should again change the weights, just as we did in iteration `2`, and then validate against the metrics. We should continue these iterations until no further improvement is noticed in the metrics.
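A hedged sketch of the weighted-averaging idea described above; the probability arrays and the 0.7/0.3 weights are illustrative assumptions, not the lab's actual values:

```
import numpy as np

# Predicted probabilities of class 1 from two individual learners (toy values).
pred_logistic = np.array([0.8, 0.3, 0.6, 0.9])
pred_knn = np.array([0.6, 0.4, 0.7, 0.5])

# Weighted average, giving the logistic regression model more weight.
ensemble_prob = 0.7 * pred_logistic + 0.3 * pred_knn
ensemble_pred = (ensemble_prob >= 0.5).astype(int)
print(ensemble_pred)
```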
Comparing it with the metrics from the averaging method, we can see that the accuracy level has gone down from `0.91` to `0.90`. However, the recall value of class `1` has gone up from `0.91` to `0.92`, and the corresponding value for class `0` has gone down from `0.91` to `0.88`. It could be that the weights we applied have resulted in a marginal degradation of the results compared to what we got from the averaging method.

Looking at the results from a business perspective, we can see that with the increase in the recall value of class `1`, the card division is getting more creditworthy customers. However, this has come at the cost of increasing the risk, with `12%` (`100% - 88%`) of unworthy customers being tagged as creditworthy.

### Max Voting

The max voting method works on the principle of majority rule. In this method, the opinion of the majority rules the roost. In this technique, individual models, or, in ensemble learning jargon, individual learners, are fit on the training set and their predictions are then generated on the test set. Each individual learner's prediction is considered to be a vote. On the test set, whichever class gets the maximum number of votes is the ultimate winner. Let's demonstrate this with a toy example.

Let's say we have three individual learners that have learned on the training set. Each of them generates its predictions on the test set, which are tabulated in the following table. The predictions are either for class '1' or class '0':



Caption: Predictions for learners

In the preceding example, we can see that for `Example 1` and `Example 3`, the majority vote is for class '1', and for the other two examples, the majority vote is for class '0'. The final predictions are based on which class gets the majority vote. This method of voting, where we output a class, is called "hard" voting.

When implementing the max voting method using the `scikit-learn` library, we use a special function called `VotingClassifier()`. We provide the individual learners as input to `VotingClassifier` to create the ensemble model. This ensemble model is then used to fit the training set and finally to predict on the test set. We will explore the dynamics of max voting in *Exercise 15.04*, *Ensemble Model Using Max Voting*.
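A minimal, hedged sketch of hard voting with `VotingClassifier`; the choice of base learners and the synthetic data are assumptions for illustration:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: each learner casts one vote per test example.
ensemble = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)), \
                                        ('knn', KNeighborsClassifier()), \
                                        ('rf', RandomForestClassifier(random_state=42))], \
                            voting='hard')
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```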
Exercise 15.04: Ensemble Model Using Max Voting
@@ -967,101 +799,6 @@ regression, KNN, and random forest:


Caption: Classification report

Advanced Techniques for Ensemble Learning
=========================================

Having learned simple techniques for ensemble learning, let's now explore some advanced techniques. Among the advanced techniques, we will be dealing with three different kinds of ensemble learning:

- Bagging
- Boosting
- Stacking/blending

Before we deal with each of them, there are some basic dynamics of these advanced ensemble learning techniques that need to be deciphered. As described at the beginning of the lab, the essence of ensemble learning lies in combining individual models to form a superior model. There are some subtle nuances in the way the superior model is generated in the advanced techniques. In these techniques, the individual models or learners generate predictions, and those predictions are used to form the final predictions. The individual models or learners that generate the first set of predictions are called **base learners** or **base estimators**, and the model that combines the predictions of the base learners is called the **meta learner** or **meta estimator**. The way in which the meta learner learns from the base learners differs for each of the advanced techniques. Let's understand each of the advanced techniques in detail.

Bagging
-------

Bagging is short for **B**ootstrap **Agg**regat**ing**. Before we explain how bagging works, let's describe what bootstrapping is. Bootstrapping has its etymological origins in the phrase *pulling oneself up by one's bootstraps*. The essence of this phrase is to make the best use of the available resources. In the statistical context, bootstrapping entails taking samples from the available dataset with replacement. Let's look at this concept with a toy example.

Suppose we have a dataset consisting of the 10 numbers from 1 to 10. We now need to create 4 different datasets of 10 numbers each from the available dataset. How do we do this? This is where the concept of bootstrapping comes in handy. In this method, we take samples from the available dataset one by one, replacing each number before taking the next sample. We continue with this until we get a sample with the number of data points we need.
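A hedged sketch of this toy example using NumPy's sampling with replacement; the random seed is an arbitrary choice:

```
import numpy as np

rng = np.random.RandomState(42)
data = np.arange(1, 11)  # the numbers 1 to 10

# Draw 4 bootstrapped samples of size 10, sampling with replacement each time.
bootstrap_samples = [rng.choice(data, size=10, replace=True) for _ in range(4)]
for sample in bootstrap_samples:
    print(sample)  # some numbers appear more than once, others not at all
```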
As we are replacing each number after it is selected, there is a chance that we might have more than one occurrence of a given data point in a sample. This is explained by the following figure:



Caption: Bootstrapping

Now that we have understood bootstrapping, let's apply this concept to a machine learning context. Earlier in the lab, we discussed that ensemble learning helps in reducing the variance of predictions. One way that variance can be reduced is by averaging out the predictions from multiple learners. In bagging, multiple subsets of the data are created using bootstrapping. On each of these subsets of data, a base learner is fitted and predictions are generated. These predictions from all the base learners are then averaged to get the meta learner or the final predictions.

When implementing bagging, we use a function called `BaggingClassifier()`, which is available in the `scikit-learn` library. Some of the important arguments that are provided when creating an ensemble model include the following:

- `base_estimator`: This argument defines the base estimator to be used.
- `n_estimators`: This argument defines the number of base estimators that will be used in the ensemble.
- `max_samples`: The maximum size of the bootstrapped sample for fitting the base estimator is defined using this argument. It is expressed as a proportion (0.8, 0.7, and so on).
- `max_features`: When fitting multiple individual learners, it has been found that randomly selecting the features to be used for each dataset results in superior performance. The `max_features` argument indicates the number of features to be used. For example, if there were 10 features in the dataset and the `max_features` argument was defined as 0.8, then only 8 (0.8 x 10) features would be used to fit a model using the base learner.

Let's explore ensemble learning with bagging in *Exercise 15.05*, *Ensemble Learning Using Bagging*.
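Before the exercise, here is a minimal, hedged sketch of these arguments in use; the base learner, synthetic data, and the 0.8 proportions are assumptions (note that newer scikit-learn releases rename `base_estimator` to `estimator`):

```
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 50 logistic regression base learners, each fit on a bootstrapped sample
# drawn from 80% of the rows and using 80% of the features.
bagging = BaggingClassifier(base_estimator=LogisticRegression(max_iter=1000), \
                            n_estimators=50, max_samples=0.8, \
                            max_features=0.8, random_state=42)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```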
Exercise 15.05: Ensemble Learning Using Bagging
-----------------------------------------------
@@ -232,12 +232,6 @@ The following steps will help you to complete this exercise:
```

The use of the backslash character, `\`, on *line 4* in the preceding code snippet is to enforce the continuation of code onto a new line in Python. The `\` character is not required if you are entering the full line of code on a single line in your notebook.

You should get the following output:
+1
-18
@@ -81,12 +81,6 @@ The following steps will help you to complete this exercise:
```

Note

The `#` symbol in the code snippet above denotes a code comment. Comments are added to code to help explain specific bits of logic.

The `pd.read_csv()` function's arguments are the filename as a string and the field delimiter of the CSV, which is `";"`. After reading the file, the DataFrame is printed
@@ -289,23 +283,12 @@ their age. We will be using a line graph for this exercise.
The following steps will help you to complete this exercise:

1. Begin by defining the hypothesis.

The first step in the verification process will be to define a hypothesis about the relationship. A hypothesis can be based on your experiences, domain knowledge, some published pieces of knowledge, or your business intuitions.

Let's first define our hypothesis on age and propensity to buy term deposits:

*The propensity to buy term deposits is higher among elderly customers than among younger ones*. This is our hypothesis.

Now that we have defined our hypothesis, it is time to verify its veracity with the data. One of the best ways to get business intuitions from data is by taking cross-sections of our data and visualizing them.

2. Import the pandas and altair packages:

```
+7
-130
@@ -75,21 +75,12 @@ from the DataFrame:
```
target = df.pop('Activity')
```

Now the response variable is contained in the variable called `target` and all the features are in the DataFrame called `df`.

Now we are going to split the dataset into training and testing sets. The model uses the training set to learn the parameters relevant to predicting the response variable. The test set is used to check whether a model can accurately predict unseen data. We say the model is overfitting when it has learned patterns relevant only to the training set and makes incorrect predictions on the testing set. In this case, the model's performance will be much higher on the training set compared to the testing one. Ideally, we want a very similar level of performance on the training and testing sets. This topic will be covered in more depth in *Lab 7*, *The Generalization of Machine Learning Models*.

The `sklearn` package provides a function called `train_test_split()` to randomly split the dataset into two
@@ -116,6 +107,7 @@ class from `sklearn.ensemble`:
```
from sklearn.ensemble import RandomForestClassifier
```

Now we can instantiate the Random Forest classifier with some hyperparameters. Remember from *Lab 1, Introduction to Data Science in Python*, a hyperparameter is a type of parameter the model can't
@@ -203,15 +195,9 @@ The output will be as follows:


Caption: Accuracy score on the training set

Remember, in the last section, we split the dataset into training and testing sets. We used the training set to fit the model and assess its predictive power on it. But the model hasn't seen the observations from the testing set at all, so we can use them to assess whether our model is capable of generalizing to unseen data. Let's calculate the accuracy score for the testing set:

```
test_preds = rf_model.predict(X_test)
@@ -438,94 +424,15 @@ score:
Number of Trees Estimator
-------------------------

Now that we know how to fit a Random Forest classifier and assess its performance, it is time to dig into the details. In the coming sections, we will learn how to tune some of the most important hyperparameters for this algorithm. As mentioned in *Lab 1, Introduction to Data Science in Python*, hyperparameters are parameters that are not learned automatically by machine learning algorithms. Their values have to be set by data scientists. These hyperparameters can have a huge impact on the performance of a model, its ability to generalize to unseen data, and the time taken to learn patterns from the data.

The first hyperparameter you will look at in this section is called `n_estimators`. This hyperparameter is responsible for defining the number of trees that will be trained by the `RandomForest` algorithm.

Before looking at how to tune this hyperparameter, we need to understand what a tree is and why it is so important for the `RandomForest` algorithm.

A tree is a logical graph that maps a decision and its outcomes at each of its nodes. Simply speaking, it is a series of yes/no (or true/false) questions that lead to different outcomes.

A leaf is a special type of node where the model will make a prediction. There will be no split after a leaf. A single node split of a tree may look like this:



Caption: Example of a single tree node

A tree node is composed of a question and two outcomes, depending on whether the condition defined by the question is met or not. In the preceding example, the question is `is avg_rss12 > 41?` If the answer is yes, the outcome is the `bending_1` leaf and if not, it will be the `sitting` leaf.

A tree is just a series of nodes and leaves combined together:



Caption: Example of a tree

In the preceding example, the tree is composed of three nodes with different questions. Now, for an observation to be predicted as `sitting`, it will need to meet the conditions: `avg_rss13 <= 41`, `var_rss > 0.7`, and `avg_rss13 <= 16.25`.

The `RandomForest` algorithm will build this kind of tree based on the training data it sees. We will not go through the mathematical details of how it defines the split for each node but, basically, it will go through every column of the dataset and see which split value will best help to separate the data into two groups of similar classes. Taking the preceding example, the first node with the `avg_rss13 > 41` condition will help to get the group of data on the left-hand side with mostly the `bending_1` class. The `RandomForest` algorithm usually builds several trees of this kind, and this is the reason why it is called a forest.

As you may have guessed by now, the `n_estimators` hyperparameter is used to specify the number of trees the `RandomForest` algorithm will build. For example (as in the previous exercise), say we ask it to build 10 trees. For a given observation, it will ask each tree to make a prediction. Then, it will average those predictions and use the result as the final prediction for this input. For instance, if, out of 10 trees, 8 of them predict the outcome `sitting`, then the `RandomForest` algorithm will use this outcome as the final prediction.
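A hedged sketch of setting this hyperparameter; the synthetic data below merely stands in for the activity dataset used in the lab:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy data standing in for the activity dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Build a forest of 10 trees, as in the previous exercise.
rf_model = RandomForestClassifier(n_estimators=10, random_state=1)
rf_model.fit(X_train, y_train)
print(accuracy_score(y_test, rf_model.predict(X_test)))
```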
Note

If you don't pass in a specific `n_estimators` hyperparameter, it will use the default value. The default depends on the version of scikit-learn you're using. In early versions, the default value is 10. From version 0.22 onwards, the default is 100. You can find out which version you are using by executing the following code:

`import sklearn`

`sklearn.__version__`

For more information, see here: <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>

In general, the higher the number of trees, the better the performance you will get. Let's see what happens with
@@ -1118,31 +1025,8 @@ print(accuracy_score(y_test, test_preds9))
The output will be as follows:



Caption: Accuracy scores for the training and testing sets for min_samples_leaf=25

Both accuracies for the training and testing sets decreased, but they are quite close to each other now. So, we will keep this value (`25`) as the optimal one for this dataset, as the performance is still OK and we are not overfitting too much.

When choosing the optimal value for this hyperparameter, you need to be careful: a value that's too low will increase the chance of the model overfitting, but on the other hand, setting a very high value will lead to underfitting (the model will not accurately predict the right outcome).

For instance, if you have a dataset of `1000` rows and you set `min_samples_leaf` to `400`, then the model will not be able to find good splits to predict `5` different classes. In this case, the model can only create one single split and will only be able to predict two different classes instead of `5`. It is good practice to start with low values first and then progressively increase them until you reach satisfactory performance.
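A hedged sketch of that "start low, then increase" practice; the candidate values and the synthetic data are assumptions for illustration:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Progressively increase min_samples_leaf and compare train vs. test accuracy.
for leaf in [1, 5, 10, 25, 50]:
    rf = RandomForestClassifier(random_state=1, min_samples_leaf=leaf)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    print(leaf, round(train_acc, 3), round(test_acc, 3))
```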
@@ -1258,13 +1142,6 @@ We will be using the same zoo dataset as in the previous exercise.


Caption: Accuracy scores for the training and testing sets

The accuracy score decreased for both the training and testing sets compared to the best result we got in the previous exercise. Now the difference between the training and testing sets' accuracy scores is much smaller, so our model is overfitting less.

11. Instantiate another `RandomForestClassifier` with `random_state=42`, `n_estimators=30`, `max_depth=2`, and `min_samples_leaf=7`, and
@@ -77,13 +77,6 @@ The following steps will help you complete the exercise:


Caption: The car dataset without headers

Note

Alternatively, you can enter the dataset URL in the browser to view the dataset.

`CSV` files normally have the name of each column written in the first row of the data. For instance, have a look at this dataset's CSV file, which you used in *Lab 3, Binary
@@ -1375,19 +1368,6 @@ The following steps will help you accomplish this task:


Caption: Reading the dataset

You will need to do a few things to work with this file: skip 15 rows, specify the column headers, and read the file without an index.

The code shows how you do that by creating a Python list to hold your column headers and then reading in the file using `read_csv()`. The parameters that you pass in are the file's location, the column headers as a Python list, the name of the index column (in this case, it is None), and the number of rows to skip.
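A hedged sketch of such a call; the file path and column names here are placeholders, not the dataset's actual ones:

```
import pandas as pd

# Placeholder header names and path for illustration only.
headers = ['col_1', 'col_2', 'col_3']
df = pd.read_csv('data/raw_file.csv', names=headers, \
                 index_col=None, skiprows=15)
df.head()
```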
The `head()` method will print out the top five rows, and the output should look similar to the following:
+5
-53
@@ -784,44 +784,11 @@ dataset we will use contains 1,797 labeled images of handwritten digits.


Caption: Using pandas to visualize the results

Advantages and Disadvantages of Grid Search
-------------------------------------------

The primary advantage of the grid search compared to a manual search is that it is an automated process that one can simply set and forget. Additionally, you have the power to dictate the exact hyperparameterizations evaluated, which can be a good thing when you have prior knowledge of what kind of hyperparameterizations might work well in your context. It is also easy to understand exactly what will happen during the search thanks to the explicit definition of the grid.

The major drawback of the grid search strategy is that it is computationally very expensive; that is, when the number of hyperparameterizations to try increases substantially, processing times can be very slow. Also, when you define your grid, you may inadvertently omit a hyperparameterization that would in fact be optimal. If it is not specified in your grid, it will never be tried.

To overcome these drawbacks, we will be looking at random search in the next section.

Random Search
=============

Instead of searching through every hyperparameterization in a pre-defined set, as is the case with a grid search, in a random search we sample from a distribution of possibilities by assuming each hyperparameter to be a random variable. Before we go through the process in depth, it will be helpful to briefly review what random variables are and what we mean by a distribution.

Random Variables and Their Distributions
----------------------------------------
@@ -831,6 +798,7 @@ Random Variables and Their Distributions
Caption: Probability mass function for the discrete uniform distribution

The following code will allow us to see the form of this distribution with 10 possible values of X.
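The code itself is outside this excerpt; a hedged sketch of what it could look like with `scipy.stats` follows (the plotting details are assumptions):

```
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Discrete uniform distribution over the 10 values 1..10.
x = np.arange(1, 11)
p_X = stats.randint.pmf(k=x, low=1, high=11)  # each value has probability 0.1

plt.bar(x, p_X)
plt.xlabel('X')
plt.ylabel('P(X)')
plt.show()
```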
@@ -900,9 +868,7 @@ p_X_1 = stats.norm.pdf(x=x, loc=0.0, scale=1.0**2)
p_X_2 = stats.norm.pdf(x=x, loc=0.0, scale=1.5**2)
```

**Note:** In this case, `loc` corresponds to 𝜇, while `scale` corresponds to the standard deviation, which is the square root of 𝜎², hence why we square the inputs.
@@ -1017,9 +983,7 @@ samples = stats.gamma.rvs(a=1, loc=1, scale=2, \
size=n_iter, random_state=100)
```

**Note:** We set a random state to ensure reproducible results.

Plotting a histogram of the sample, as shown in the following figure, reveals a shape that approximately conforms to the distribution that we
@@ -1086,17 +1050,14 @@ The output will be as follows:


Caption: Output for the random search process

Note

The results will be different, depending on the data used.

It is always beneficial to visualize results where possible. Plotting 𝛼 against the negative mean squared error as a scatter plot makes it clear that venturing away from 𝛼 = 1 does not result in improvements in predictive performance:

```
plt.scatter(df_result.alpha, \
            df_result.mean_neg_mean_squared_error)
@@ -1108,7 +1069,6 @@ The output will be as follows:


Caption: Plotting the scatter plot

The fact that we found the optimal 𝛼 to be 1 (its default value) is a special case in hyperparameter tuning in that the optimal
@@ -1189,9 +1149,7 @@ The output will be as follows:
Caption: Output for tuning using RandomizedSearchCV

**Note:** The preceding results may vary, depending on the data.
@@ -1351,12 +1309,6 @@ The following steps will help you complete the exercise.


Caption: Top five hyperparameterizations

Note

You may get slightly different results. However, the values you obtain should largely agree with those in the preceding output.

9. The last step is to visualize the result. Including every parameterization will result in a cluttered plot, so we will filter
+2
-15
@@ -364,11 +364,8 @@ The output will be as shown in the following figure:


Caption: Feature importance of a Random Forest model

**Note:** Due to randomization, you may get a slightly different result.

It might be a little difficult to evaluate which importance value corresponds to which variable from this output. Let's create a
@@ -1378,19 +1375,9 @@ We will be using the same dataset as in the previous exercise.
You should get the following output:



Caption: LIME output for the third observation of the testing set

You have completed the last exercise of this lab. You saw how to use LIME to interpret the predictions of single observations. We learned that the `a1pop`, `a2pop`, and `a3pop` features have a strong negative impact on the first and third observations of the training set.
Activity 9.01: Train and Analyze a Network Intrusion Detection Model