This commit is contained in:
fenago
2021-02-08 20:53:32 +05:00
parent 50426ea94e
commit e8f9b371e1
11 changed files with 17 additions and 634 deletions
+1 -47
@@ -641,20 +641,7 @@ dataset, refer to the following note. Let's get started:
![](./images/B15019_11_15.jpg)
Caption: List of columns and their assigned data types
Note
The preceding output has been truncated.
From *Lab 10*, *Analyzing a Dataset*, you know that the
`Id`, `MSSubClass`, `OverallQual`, and
`OverallCond` columns have been incorrectly classified as
numerical variables. They have a finite number of unique values and
you can't perform any meaningful mathematical operations on them. For
example, it doesn't make sense to add, subtract, multiply, or divide two
different values from the `Id` column. Therefore, you need
to convert them into categorical variables.
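In practice, the conversion can be done column by column with `astype()`. A rough sketch of the idea (the DataFrame name `df` is an assumption; the next steps walk through the same conversion one column at a time):
```
# df is assumed to hold the housing data loaded in the earlier steps
for col in ['Id', 'MSSubClass', 'OverallQual', 'OverallCond']:
    df[col] = df[col].astype('category')

# Confirm the new data types
print(df[['Id', 'MSSubClass', 'OverallQual', 'OverallCond']].dtypes)
```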
6. Using the `astype()` method, convert the `'Id'`
column into a categorical variable, as shown in the following code
@@ -694,14 +681,6 @@ dataset, refer to the following note. Let's get started:
![](./images/B15019_11_16.jpg)
Caption: List of categories for the four newly converted
variables
Now, these four columns have been converted into categorical
variables. From the output of *Step 5*, we can see that there are a
lot of variables of the `object` type. Let's have a look
at them and see if they need to be converted as well.
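A minimal sketch of isolating those columns, assuming the DataFrame is named `df`:
```
# Keep only the columns whose dtype is object
obj_df = df.select_dtypes(include='object')
print(obj_df.columns)
```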
9. Create a new DataFrame called `obj_df` that will only
contain variables of the `object` type using the
`select_dtypes` method along with the
@@ -1348,15 +1327,7 @@ You should get the following output:
![](./images/B15019_11_38.jpg)
Caption: Rows with missing values in CustomerID
This time, all the transactions look normal, except that they are missing
values for the `CustomerID` column; all the other variables
have been filled in with values that seem genuine. There is no other way
to infer the missing values for the `CustomerID` column. These
rows represent almost 25% of the dataset, so we can't remove them.
However, most algorithms require a value for each observation, so you
need to provide one for these cases. We will use the
`.fillna()` method from `pandas` to do this. Provide
the value to be imputed as `Missing` and use
@@ -1385,15 +1356,6 @@ You should get the following output:
![](./images/B15019_11_40.jpg)
Caption: Summary of missing values for each variable
You have successfully fixed all the missing values in this dataset.
These methods also work when we want to handle missing numerical
variables. We will look at this in the following exercise. All you need
to do is provide a numerical value when you want to impute a value with
`.fillna()`.
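As a rough illustration of both cases (the DataFrame name `df` is assumed, and the numerical column name below is purely hypothetical):
```
# Impute a categorical column with a placeholder string
df['CustomerID'] = df['CustomerID'].fillna('Missing')

# Impute a hypothetical numerical column with its mean instead
df['some_numeric_col'] = df['some_numeric_col'].fillna(df['some_numeric_col'].mean())
```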
Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
-----------------------------------------------------------------
@@ -1699,48 +1661,40 @@ The following figure illustrates a list of unique values for gaming:
![](./images/B15019_11_49.jpg)
Caption: List of unique values for gaming
The following figure displays the data types of each column:
![](./images/B15019_11_50.jpg)
Caption: Data types of each column
The following figure displays the updated data types of each column:
![](./images/B15019_11_51.jpg)
Caption: Data types of each column
The following figure displays the number of missing values for numerical
variables:
![](./images/B15019_11_52.jpg)
Caption: Number of missing values for numerical variables
The following figure displays the list of unique values for
`int_corr`:
![](./images/B15019_11_53.jpg)
Caption: List of unique values for 'int_corr'
The following figure displays the list of unique values for numerical
variables:
![](./images/B15019_11_54.jpg)
Caption: List of unique values for numerical variables
The following figure displays the number of missing values for numerical
variables:
![](./images/B15019_11_55.jpg)
Caption: Number of missing values for numerical variables
Summary
=======
-23
@@ -38,14 +38,7 @@ You should get the following output.
![](./images/B15019_12_01.jpg)
Caption: First five rows of the Online Retail dataset
Next, we are going to load all the public holidays in the UK into
another `pandas` DataFrame. From *Lab 10*, *Analyzing a
Dataset*, we know the records of this dataset only cover the years 2010
and 2011. So we are going to extract public holidays for those two
years, but we need to do so in two separate steps, as the API provided
by `date.nager` only serves one year at a time.
Let's focus on 2010 first:
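A minimal sketch of such a request (the exact endpoint URL is an assumption about the `date.nager` API, and `pandas` is assumed to already be imported as `pd`):
```
# The endpoint below is an assumption about the date.nager API
uk_holidays_2010 = pd.read_json('https://date.nager.at/api/v2/publicholidays/2010/GB')

# Repeating the call with 2011 in the URL returns the second year;
# pd.concat() can then stack the two DataFrames together.
```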
@@ -759,17 +752,6 @@ You should get the following output:
```
30
```
`30` is the number of unique values for the
`Country_bin` column. So we reduced the number of unique
values in this column from `38` to `30`.
We just saw how to group categorical values together, but the same
process can be applied to numerical values as well. For instance, it is
quite common to group people's ages into bins such as 20s (20 to 29
years old), 30s (30 to 39), and so on.
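For example, a hedged sketch of binning a hypothetical `age` column into decades with `pandas.cut()`:
```
# 'age' is a hypothetical numerical column; df is the assumed DataFrame
bins = [20, 30, 40, 50, 60, 70]
labels = ['20s', '30s', '40s', '50s', '60s']
df['age_bin'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
```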
Have a look at *Exercise 12.02*, *Binning the YearBuilt variable from
the AMES Housing dataset*.
@@ -1768,8 +1750,3 @@ of a dataset are and identifying data quality issues. We saw how to
handle and fix some of the most frequent issues (duplicate rows, type
conversion, value replacement, and missing values) using
`pandas`' APIs. Finally, we went through several feature engineering techniques.
The next lab opens a new part of this course that presents data
science use cases end to end. *Lab 13*, *Imbalanced Datasets*, will
walk you through an example of an imbalanced dataset and how to deal
with such a situation.
+1 -9
@@ -148,10 +148,6 @@ Classification*, and you will look closely at the metrics:
```
After the categorical values are transformed, they must be combined
with the scaled numerical values of the data frame to get the
feature-engineered dataset.
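A rough sketch of that combination step (the variable names `scaled_features` and `encoded_features` are assumptions):
```
# Combine the scaled numerical features and the encoded categorical features
combined_df = pd.concat([scaled_features, encoded_features], axis=1)
```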
10. Create the independent variables, `X`, and dependent
variables, `Y`, from the combined dataset for modeling, as
in the following code snippet:
@@ -171,13 +167,9 @@ Classification*, and you will look closely at the metrics:
The output is as follows:
![](./images/B15019_13_03.jpg)
Caption: The independent variables and the combined data (truncated)
We are now ready for the modeling task. Let's first import the
necessary packages.
-50
@@ -1693,45 +1693,6 @@ The following steps will help you complete this exercise:
From this exercise, you may come up with a few questions:
- How do you think we can improve the classification results using
ICA?
- Increasing the number of components results in a marginal increase
in the accuracy metrics.
- Are there any other side effects because of the strategy adopted to
improve the results?
Increasing the number of components also results in a longer training
time for the logistic regression model.
Factor Analysis
---------------
Factor analysis is a technique that achieves dimensionality reduction by
grouping variables that are highly correlated. Let's look at an example
from our context of predicting advertisements.
In our dataset, there could be many features that describe the geometry
(the size and shape of an image in the ad) of the images on a web page.
These features can be correlated because they refer to specific
characteristics of an image.
Similarly, there could be many features that describe the anchor text or
phrases occurring in a URL, which are highly correlated. Factor analysis
looks at correlated groups such as these from the data and then groups
them into latent factors. Therefore, if there are 10 raw features
describing the geometry of an image, factor analysis will group them
into one feature that characterizes the geometry of an image. Each of
these groups is called a factor. As many correlated features are combined
to form a group, the resulting number of features will be much smaller
than the original dimensionality of the dataset.
Let's now see how factor analysis can be implemented as a technique for
dimensionality reduction.
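A minimal sketch using `scikit-learn` (the number of components and the feature matrix `X` are assumptions):
```
from sklearn.decomposition import FactorAnalysis

# X is assumed to hold the scaled features; n_components is illustrative
fa = FactorAnalysis(n_components=20, random_state=123)
X_factors = fa.fit_transform(X)
print(X_factors.shape)
```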
Exercise 14.06: Dimensionality Reduction Using Factor Analysis
@@ -2015,18 +1976,7 @@ You should get the following output:
![](./images/B15019_14_35.jpg)
Caption: Sample data frame
What we will do next is sample some data points with the same shape as
the data frame we created.
Let's sample some data points from a normal distribution that has a mean
of `0` and a standard deviation of `0.1`. We touched
briefly on normal distributions in *Lab 3, Binary Classification.* A
normal distribution has two parameters. The first one is the mean, which
is the average of all the data in the distribution, and the second one
is the standard deviation, which is a measure of how spread out the data
points are.
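As a quick illustration of these two parameters, one way to draw such samples with NumPy (the target shape `df.shape` is an assumption):
```
import numpy as np

# df.shape is assumed to be the target shape for the sampled noise
noise = np.random.normal(loc=0, scale=0.1, size=df.shape)
```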
By assuming a mean and standard deviation, we will be able to draw
samples from a normal distribution using the
-263
@@ -19,109 +19,6 @@ where we will try to predict whether a credit card application will be
approved.
Introduction
============
In the previous lab, we learned various techniques, such as the
backward elimination technique, factor analysis, and so on, that helped
us to deal with high-dimensional datasets.
In this lab, we will further enhance our repertoire of skills with
another set of techniques, called **ensemble learning**. We will be
dealing with different ensemble learning techniques such as the
following:
- Averaging
- Weighted averaging
- Max voting
- Bagging
- Boosting
- Blending
Ensemble Learning
=================
Ensemble learning, as the name denotes, is a method that combines
several machine learning models to generate a superior model, thereby
decreasing variance and bias and boosting performance.
Before we explore ensemble learning further, let's look at the concepts
of bias and variance with the help of the classical bias-variance
quadrant, as shown here:
![](./images/B15019_15_01.jpg)
Caption: Bias-variance quadrant
Variance
--------
Variance is the measure of how spread out data is. In the context of
machine learning, models with high variance imply that the predictions
generated on the same test set will differ considerably when different
training sets are used to fit the model. The underlying reason for high
variability could be attributed to the model being attuned to specific
nuances of training data rather than generalizing the relationship
between input and output. Ideally, we want every machine learning model
to have low variance.
Bias
----
Bias is the difference between the ground truth and the average value of
our predictions. A low bias will indicate that the predictions are very
close to the actual values. A high bias implies that the model has
oversimplified the relationship between the inputs and outputs, leading
to high error rates on test sets, which again is an undesirable outcome.
*Figure 15.1* helps us to visualize the trade-off between bias and
variance. The top-left corner depicts a scenario where the bias is high
and the variance is low. The top-right quadrant displays a scenario
where both bias and variance are high. From the figure, we can see that
when the bias is high, the predictions land further away from the truth,
which in this case is the *bull's eye*. The variance manifests itself in
whether the arrows are spread out or clustered in one spot.
Ensemble models combine many weaker models that differ in variance and
bias, thereby creating a better model, outperforming the individual
weaker models. Ensemble models exemplify the adage *the wisdom of the
crowds*. In this lab, we will learn about different ensemble
techniques, which can be classified into two types, that is, simple and
advanced techniques:
![](./images/B15019_15_02.jpg)
Caption: Different ensemble learning methods
Business Context
----------------
You are working in the credit card division of your bank. The operations
head of your company has requested your help in determining whether a
customer is creditworthy or not. You have been provided with credit card
operations data.
This dataset contains credit card applications with around 15 variables.
The variables are a mix of continuous and categorical data pertaining to
credit card operations. The label for the dataset is a flag, which
indicates whether the application has been approved or not.
You want to fit some benchmark models and try some ensemble learning
methods on the dataset to address the problem and come up with a tool
for predicting whether or not a given customer should be approved for
their credit application.
Exercise 15.01: Loading, Exploring, and Cleaning the Data
---------------------------------------------------------
@@ -783,71 +680,6 @@ the new combination of weights in *iteration 2*:
![](./images/B15019_15_21.jpg)
Caption: Classification report
In this exercise, we implemented the weighted averaging technique for
ensemble learning. We did two iterations with the weights. We saw that
in the second iteration, where we increased the weight of the logistic
regression prediction from `0.6` to `0.7`, the
accuracy actually improved from `0.89` to `0.90`.
This is a validation of our assumption about the prominence of the
logistic regression model in the ensemble. To check whether there is
more room for improvement, we should again change the weights, just like
we did in iteration `2`, and then validate against the
metrics. We should continue these iterations until there is no further
improvement noticed in the metrics.
Comparing it with the metrics from the averaging method, we can see that
the accuracy level has gone down from `0.91` to
`0.90`. However, the recall value of class `1` has
gone up from `0.91` to `0.92`, and the corresponding
value for class `0` has gone down from `0.91` to
`0.88`. It could be that the weights that we applied have
resulted in a marginal degradation of the results from what we got from
the averaging method.
Looking at the results from a business perspective, we can see that with
the increase in the recall value of class `1`, the card
division is approving more creditworthy customers. However, this has
come at the cost of increased risk, with `12%`
(`100% - 88%`) of non-creditworthy customers now being tagged
as creditworthy.
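As a reminder of the mechanics, a minimal sketch of weighted averaging (the probability arrays and weights below are illustrative, not the exercise's exact values):
```
import numpy as np

# pred_lr and pred_knn are assumed arrays of predicted probabilities
# from two base models; the weights are illustrative
ensemble_proba = 0.7 * np.array(pred_lr) + 0.3 * np.array(pred_knn)

# Turn the averaged probabilities into class labels with a 0.5 threshold
ensemble_labels = (ensemble_proba >= 0.5).astype(int)
```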
### Max Voting
The max voting method works on the principle of majority rule. In this
method, the opinion of the majority rules the roost. In this technique,
individual models, or, in ensemble learning jargon, individual learners,
are fit on the training set and their predictions are then generated on
the test set. Each individual learner's prediction is considered to be
a vote. On the test set, whichever class gets the most votes is the
ultimate winner. Let's demonstrate this with a toy example.
Let's say we have three individual learners who learned on the training
set. Each of them generates their predictions on the test set, which are
tabulated in the following table. The predictions are either for class
'1' or class '0':
![](./images/B15019_15_22.jpg)
Caption: Predictions for learners
In the preceding example, we can see that for `Example 1` and
`Example 3`, the majority vote is for class '1', and for the
other two examples, the majority vote is for class '0'. The
final predictions are based on which class gets the majority vote. This
method of voting, where we output a class, is called "hard" voting.
When implementing the max voting method using the
`scikit-learn` library, we use a special function called
`VotingClassifier()`. We provide individual learners as input
to `VotingClassifier` to create the ensemble model. This
ensemble model is then fitted on the training set and finally used to
predict on the test set. We will explore the dynamics of max
voting in *Exercise 15.04*, *Ensemble Model Using Max Voting*.
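A minimal sketch of such an ensemble (the choice of base models and variable names such as `X_train` are assumptions):
```
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, and X_test are assumed from an earlier split
voting_model = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)),
                                            ('knn', KNeighborsClassifier()),
                                            ('rf', RandomForestClassifier(random_state=123))],
                                voting='hard')
voting_model.fit(X_train, y_train)
voting_preds = voting_model.predict(X_test)
```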
Exercise 15.04: Ensemble Model Using Max Voting
@@ -967,101 +799,6 @@ regression, KNN, and random forest:
![](./images/B15019_15_24.jpg)
Caption: Classification report
Advanced Techniques for Ensemble Learning
=========================================
Having learned simple techniques for ensemble learning, let's now
explore some advanced techniques. Among the advanced techniques, we will
be dealing with three different kinds of ensemble learning:
- Bagging
- Boosting
- Stacking/blending
Before we deal with each of them, there are some basic dynamics of these
advanced ensemble learning techniques that need to be deciphered. As
described at the beginning of the lab, the essence of ensemble
learning is in combining individual models to form a superior model.
There are some subtle nuances in the way the superior model is generated
in the advanced techniques. In these techniques, the individual models
or learners generate predictions and those predictions are used to form
the final predictions. The individual models or learners, which generate
the first set of predictions, are called **base** **learners** or
**base** **estimators** and the model, which is a combination of the
predictions of the base learners, is called the **meta** **learner** or
**meta estimator**. The way in which the meta learners learn from the
base learners differs for each of the advanced techniques. Let\'s
understand each of the advanced techniques in detail.
Bagging
-------
Bagging is shorthand for **B**ootstrap **Agg**regat**ing**. Before we
explain how bagging works, let's describe what bootstrapping is.
Bootstrapping has its etymological origins in the phrase, *Pulling
oneself up by one's bootstraps*. The essence of this phrase is to make
the best use of the available resources. In the statistical context,
bootstrapping entails taking samples from the available dataset with
replacement. Let's look at this concept with a toy example.
Suppose we have a dataset consisting of 10 numbers from 1 to 10. We now
need to create 4 different datasets of 10 each from the available
dataset. How do we do this? This is where the concept of bootstrapping
comes in handy. In this method, we take samples from the available
dataset one by one and then replace the number we took before taking the
next sample. We continue with this until we get a sample with the number
of data points we need.
As we are replacing each number after it is selected, there is a chance
that we might have more than one of a given data point in a sample. This
is explained by the following figure:
![](./images/B15019_15_25.jpg)
Caption: Bootstrapping
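A quick sketch of this toy example in code (purely illustrative):
```
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1, 11)  # the numbers 1 to 10

# Draw 4 bootstrapped samples of size 10, sampling with replacement
bootstrap_samples = [rng.choice(data, size=10, replace=True) for _ in range(4)]
print(bootstrap_samples)
```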
Now that we have understood bootstrapping, let's apply this concept to
a machine learning context. Earlier in the lab, we discussed that
ensemble learning helps in reducing the variance of predictions. One way
that variance can be reduced is by averaging out the predictions from
multiple learners. In bagging, multiple subsets of the data are created
using bootstrapping. On each of these subsets of data, a base learner is
fitted and predictions are generated. These predictions from all the
base learners are then averaged to form the meta learner, or the final
predictions.
When implementing bagging, we use a function called
`BaggingClassifier()`, which is available in the
`scikit-learn` library. Some of the important arguments that
are provided when creating an ensemble model include the following
(a short sketch follows this list):
- `base_estimator`: This argument is to define the base
estimator to be used.
- `n_estimators`: This argument defines the number of base
    estimators that will be used in the ensemble.
- `max_samples`: The maximum size of the bootstrapped sample
for fitting the base estimator is defined using this argument. This
is represented as a proportion (0.8, 0.7, and so on).
- `max_features`: When fitting multiple individual learners,
it has been found that randomly selecting the features to be used in
each dataset results in superior performance. The
`max_features` argument indicates the number of features
to be used. For example, if there were 10 features in the dataset
and the `max_features` argument was to be defined as 0.8,
then only 8 (0.8 x 10) features would be used to fit a model using
the base learner.
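A minimal sketch putting these arguments together (the argument values and variable names are illustrative; note that newer versions of scikit-learn rename `base_estimator` to `estimator`):
```
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# X_train, y_train, and X_test are assumed from an earlier split
bagging_model = BaggingClassifier(base_estimator=LogisticRegression(max_iter=1000),
                                  n_estimators=10,
                                  max_samples=0.8,
                                  max_features=0.8,
                                  random_state=123)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
```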
Let's explore ensemble learning with bagging in *Exercise 15.05*,
*Ensemble Learning Using Bagging*.
Exercise 15.05: Ensemble Learning Using Bagging
-----------------------------------------------
-6
@@ -232,12 +232,6 @@ The following steps will help you to complete this exercise:
```
The use of the backslash character, `\`, on *line 4* in
the preceding code snippet is to enforce the continuation of code on
to the next line in Python. The `\` character is not required
if you enter the full statement on a single line in your notebook.
You should get the following output:
+1 -18
@@ -81,12 +81,6 @@ The following steps will help you to complete this exercise:
```
Note
The `#` symbol in the code snippet above denotes a code
comment. Comments are added into code to help explain specific bits
of logic.
The `pd.read_csv()` function's arguments are the filename
as a string and the field separator (delimiter) of the CSV file, which is
`";"`. After reading the file, the DataFrame is printed.
@@ -289,23 +283,12 @@ their age. We will be using a line graph for this exercise.
The following steps will help you to complete this exercise:
1. Begin by defining the hypothesis.
The first step in the verification process will be to define a
hypothesis about the relationship. A hypothesis can be based on your
experience, domain knowledge, published findings, or your business
intuition.
Let's first define our hypothesis on age and propensity to buy term
deposits:
*The propensity to buy term deposits is higher among elderly customers
than among younger ones*. This is our hypothesis.
Now that we have defined our hypothesis, it is time to verify its
veracity with the data. One of the best ways to get business
intuitions from data is by taking cross-sections of our data and
visualizing them.
2. Import the pandas and altair packages:
```
+7 -130
@@ -75,21 +75,12 @@ from the DataFrame:
```
target = df.pop('Activity')
```
Now the response variable is contained in the variable called
`target` and all the features are in the DataFrame called
`df`.
Now we are going to split the dataset into training and testing sets.
The model uses the training set to learn relevant parameters in
predicting the response variable. The test set is used to check whether
a model can accurately predict unseen data. We say the model is
overfitting when it has learned the patterns relevant only to the
training set and makes incorrect predictions about the testing set. In
this case, the model performance will be much higher for the training
set compared to the testing one. Ideally, we want to have a very similar
level of performance for the training and testing sets. This topic will
be covered in more depth in *Lab 7*, *The Generalization of Machine
Learning Models*.
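One common way to perform such a split, as a sketch (the `test_size` and `random_state` values are illustrative):
```
from sklearn.model_selection import train_test_split

# df holds the features and target holds the labels, as prepared above
X_train, X_test, y_train, y_test = train_test_split(df, target,
                                                    test_size=0.3,
                                                    random_state=42)
```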
The `sklearn` package provides a function called
`train_test_split()` to randomly split the dataset into two
@@ -116,6 +107,7 @@ class from `sklearn.ensemble`:
```
from sklearn.ensemble import RandomForestClassifier
```
Now we can instantiate the Random Forest classifier with some
hyperparameters. Remember from *Lab 1, Introduction to Data Science
in Python*, a hyperparameter is a type of parameter the model can't
@@ -203,15 +195,9 @@ The output will be as follows:
![](./images/B15019_04_06.jpg)
Caption: Accuracy score on the training set
Remember, in the last section, we split the dataset into training and
testing sets. We used the training set to fit the model and assess its
predictive power on it. But the model hasn't seen the observations from
the testing set at all, so we can use it to assess whether our model is
capable of generalizing to unseen data. Let's calculate the accuracy
score for the testing set:
```
test_preds = rf_model.predict(X_test)
@@ -438,94 +424,15 @@ score:
Number of Trees Estimator
-------------------------
Now that we know how to fit a Random Forest classifier and assess its
performance, it is time to dig into the details. In the coming sections,
we will learn how to tune some of the most important hyperparameters for
this algorithm. As mentioned in *Lab 1, Introduction to Data Science
in Python*, hyperparameters are parameters that are not learned
automatically by machine learning algorithms. Their values have to be
set by data scientists. These hyperparameters can have a huge impact on
the performance of a model, its ability to generalize to unseen data,
and the time taken to learn patterns from the data.
The first hyperparameter you will look at in this section is called
`n_estimators`. This hyperparameter is responsible for
defining the number of trees that will be trained by the
`RandomForest` algorithm.
Before looking at how to tune this hyperparameter, we need to understand
what a tree is and why it is so important for the
`RandomForest` algorithm.
A tree is a logical graph that maps a decision and its outcomes at each
of its nodes. Simply speaking, it is a series of yes/no (or true/false)
questions that lead to different outcomes.
A leaf is a special type of node where the model will make a prediction.
There will be no split after a leaf. A single node split of a tree may
look like this:
![](./images/B15019_04_14.jpg)
Caption: Example of a single tree node
A tree node is composed of a question and two outcomes depending on
whether the condition defined by the question is met or not. In the
preceding example, the question is `is avg_rss12 > 41?` If the
answer is yes, the outcome is the `bending_1` leaf and if not,
it will be the `sitting` leaf.
A tree is just a series of nodes and leaves combined together:
![](./images/B15019_04_15.jpg)
Caption: Example of a tree
In the preceding example, the tree is composed of three nodes with
different questions. Now, for an observation to be predicted as
`sitting`, it will need to meet the conditions:
`avg_rss13 <= 41`, `var_rss > 0.7`, and
`avg_rss13 <= 16.25`.
The `RandomForest` algorithm will build this kind of tree
based on the training data it sees. We will not go through the
mathematical details about how it defines the split for each node but,
basically, it will go through every column of the dataset and see which
split value will best help to separate the data into two groups of
similar classes. Taking the preceding example, the first node with the
`avg_rss13 > 41` condition will help to get the group of data
on the left-hand side with mostly the `bending_1` class. The
`RandomForest` algorithm usually builds several trees of this kind,
and this is the reason why it is called a forest.
As you may have guessed by now, the `n_estimators` hyperparameter
is used to specify the number of trees the `RandomForest`
algorithm will build. For example (as in the previous exercise), say we
ask it to build 10 trees. For a given observation, it will ask each tree
to make a prediction. Then, it will average those predictions and use
the result as the final prediction for this input. For instance, if, out
of 10 trees, 8 of them predict the outcome `sitting`, then the
`RandomForest` algorithm will use this outcome as the final
prediction.
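As a sketch, instantiating the classifier with 10 trees looks like this (variable names are assumptions):
```
from sklearn.ensemble import RandomForestClassifier

# X_train and y_train are assumed from an earlier split
rf_model = RandomForestClassifier(n_estimators=10, random_state=42)
rf_model.fit(X_train, y_train)
```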
Note
If you don't pass in a specific `n_estimators`
hyperparameter, it will use the default value. The default depends on
the version of scikit-learn you're using. In early versions, the
default value is 10. From version 0.22 onwards, the default is 100. You
can find out which version you are using by executing the following
code:
`import sklearn`
`sklearn.__version__`
For more information, see here:
<https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>
In general, the higher the number of trees, the better the
performance you will get. Let's see what happens with
@@ -1118,31 +1025,8 @@ print(accuracy_score(y_test, test_preds9))
The output will be as follows:
![](./images/B15019_04_31.jpg)
Caption: Accuracy scores for the training and testing sets for min_samples_leaf=25
Both accuracies for the training and testing sets decreased but they are
quite close to each other now. So, we will keep this value
(`25`) as the optimal one for this dataset as the performance
is still OK and we are not overfitting too much.
When choosing the optimal value for this hyperparameter, you need to be
careful: a value that's too low will increase the chance of the model
overfitting, but on the other hand, setting a very high value will lead
to underfitting (the model will not accurately predict the right
outcome).
For instance, if you have a dataset of `1000` rows and you set
`min_samples_leaf` to `400`, then the model will not
be able to find good splits to predict `5` different classes.
In this case, the model can only create one single split and will only
be able to predict two different classes instead of
`5`. It is good practice to start with low values first and
then progressively increase them until you reach satisfactory
performance.
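A hedged sketch of that progressive search (the candidate values and variable names are illustrative):
```
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_test, and y_test are assumed from an earlier split
for leaf_size in [1, 5, 10, 25, 50]:
    rf = RandomForestClassifier(random_state=42, n_estimators=10,
                                min_samples_leaf=leaf_size)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    print(leaf_size, round(train_acc, 3), round(test_acc, 3))
```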
@@ -1258,13 +1142,6 @@ We will be using the same zoo dataset as in the previous exercise.
![](./images/B15019_04_33.jpg)
Caption: Accuracy scores for the training and testing sets
The accuracy score decreased for both the training and testing sets
compared to the best result we got in the previous exercise. Now the
difference between the training and testing sets' accuracy scores
is much smaller, so our model is overfitting less.
11. Instantiate another `RandomForestClassifier` with
`random_state=42`, `n_estimators=30`,
`max_depth=2`, and `min_samples_leaf=7`, and
-20
@@ -77,13 +77,6 @@ The following steps will help you complete the exercise:
![](./images/B15019_06_01.jpg)
Caption: The car dataset without headers
Note
Alternatively, you can enter the dataset URL in the browser to view
the dataset.
`CSV` files normally have the name of each column written
in the first row of the data. For instance, have a look at this
dataset's CSV file, which you used in *Lab 3, Binary
@@ -1375,19 +1368,6 @@ The following steps will help you accomplish this task:
![](./images/B15019_06_35.jpg)
Caption: Reading the dataset
You will need to do a few things to work with this file: skip 15
rows, specify the column headers, and read the file without an
index.
The code shows how you do that by creating a Python list to hold
your column headers and then reading in the file using
`read_csv()`. The parameters that you pass in are the
file's location, the column headers as a Python list, the name of
the index column (in this case, it is None), and the number of rows
to skip.
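A minimal sketch of such a call (the file name and column names are placeholders):
```
import pandas as pd

# File location and column names below are placeholders
headers = ['col_1', 'col_2', 'col_3']
df = pd.read_csv('data_file.csv', names=headers, index_col=None, skiprows=15)
df.head()
```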
The `head()` method will print out the top five rows and
should look similar to the following:
+5 -53
@@ -784,44 +784,11 @@ dataset we will use contains 1,797 labeled images of handwritten digits.
![](./images/B15019_08_14.jpg)
Caption: Using pandas to visualize the results
Advantages and Disadvantages of Grid Search
-------------------------------------------
The primary advantage of the grid search compared to a manual search is
that it is an automated process that one can simply set and forget.
Additionally, you have the power to dictate the exact
hyperparameterizations evaluated, which can be a good thing when you
have prior knowledge of what kind of hyperparameterizations might work
well in your context. It is also easy to understand exactly what will
happen during the search thanks to the explicit definitions of the grid.
The major drawback of the grid search strategy is that it is
computationally very expensive; that is, when the number of
hyperparameterizations to try increases substantially, processing times
can be very slow. Also, when you define your grid, you may inadvertently
omit a hyperparameterization that would in fact be optimal. If it is
not specified in your grid, it will never be tried.
To overcome these drawbacks, we will be looking at random search in the
next section.
Random Search
=============
Instead of searching through every hyperparameterization in a
pre-defined set, as is the case with a grid search, in a random search
we sample from a distribution of possibilities by assuming each
hyperparameter to be a random variable. Before we go through the process
in depth, it will be helpful to briefly review what random variables are
and what we mean by a distribution.
Random Variables and Their Distributions
----------------------------------------
@@ -831,6 +798,7 @@ Random Variables and Their Distributions
Caption: Probability mass function for the discrete uniform distribution
The following code will allow us to see the form of this distribution
with 10 possible values of X.
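A hedged sketch of what such code can look like, using `scipy.stats.randint` (the lab's exact snippet may differ):
```
import numpy as np
from scipy import stats

# 10 possible values of X: 0 through 9
x = np.arange(0, 10)
p_X = stats.randint.pmf(k=x, low=0, high=10)
print(p_X)  # each value has probability 0.1
```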
@@ -900,9 +868,7 @@ p_X_1 = stats.norm.pdf(x=x, loc=0.0, scale=1.0**2)
p_X_2 = stats.norm.pdf(x=x, loc=0.0, scale=1.5**2)
```
**Note:** In this case, `loc` corresponds to 𝜇, while `scale`
corresponds to the standard deviation, which is the square root of
𝜎², hence why we square the inputs.
@@ -1017,9 +983,7 @@ samples = stats.gamma.rvs(a=1, loc=1, scale=2, \
size=n_iter, random_state=100)
```
**Note:** We set a random state to ensure reproducible results.
Plotting a histogram of the sample, as shown in the following figure,
reveals a shape that approximately conforms to the distribution that we
@@ -1086,17 +1050,14 @@ The output will be as follows:
![](./images/B15019_08_22.jpg)
Caption: Output for the random search process
Note
The results will be different, depending on the data used.
It is always beneficial to visualize results where possible. Plotting 𝛼
by negative mean squared error as a scatter plot makes it clear that
venturing away from 𝛼 = 1 does not result in improvements in predictive
performance:
```
plt.scatter(df_result.alpha, \
df_result.mean_neg_mean_squared_error)
@@ -1108,7 +1069,6 @@ The output will be as follows:
![](./images/B15019_08_23.jpg)
Caption: Plotting the scatter plot
The fact that we found the optimal 𝛼 to be 1 (its default value) is a
special case in hyperparameter tuning in that the optimal
@@ -1189,9 +1149,7 @@ The output will be as follows:
Caption: Output for tuning using RandomizedSearchCV
**Note:** The preceding results may vary, depending on the data.
@@ -1351,12 +1309,6 @@ The following steps will help you complete the exercise.
![](./images/B15019_08_26.jpg)
Caption: Top five hyperparameterizations
Note
You may get slightly different results. However, the values you
obtain should largely agree with those in the preceding output.
9. The last step is to visualize the result. Including every
parameterization will result in a cluttered plot, so we will filter
+2 -15
@@ -364,11 +364,8 @@ The output will be as shown in the following figure:
![](./images/B15019_09_12.jpg)
Caption: Feature importance of a Random Forest model
**Note:** Due to randomization, you may get a slightly different result.
It might be a little difficult to evaluate which importance value
corresponds to which variable from this output. Let's create a
@@ -1378,19 +1375,9 @@ We will be using the same dataset as in the previous exercise.
You should get the following output:
![](./images/B15019_09_41.jpg)
Caption: LIME output for the third observation of the testing set
You have completed the last exercise of this lab. You saw how to use
LIME to interpret the prediction of single observations. We learned that
the `a1pop`, `a2pop`, and `a3pop` features
have a strong negative impact on the first and third observations of the
training set.
Activity 9.01: Train and Analyze a Network Intrusion Detection Model