mirror of
https://github.com/fenago/data-science.git
synced 2026-05-05 00:51:50 +00:00
1472 lines
40 KiB
Markdown
<img align="right" src="./logo.png">
|
||
|
||
|
||
Lab 4. Multiclass Classification with RandomForest
==============================================
|
||
|
||
|
||
|
||
Overview
|
||
|
||
This lab will show you how to train a multiclass classifier using
the Random Forest algorithm. You will also see how to evaluate the
performance of multiclass models.

By the end of the lab, you will be able to implement a Random Forest
classifier, as well as tune hyperparameters in order to improve model
performance.
|
||
|
||
|
||
|
||
|
||
Training a Random Forest Classifier
===================================
|
||
|
||
|
||
|
||
Let\'s see how we can train a Random Forest classifier on this dataset.
|
||
First, we need to load the data from the GitHub repository using
|
||
`pandas` and then we will print its first five rows using the
|
||
`head()` method.
|
||
|
||
|
||
|
||
```
import pandas as pd

file_url = 'https://raw.githubusercontent.com/fenago'\
           '/data-science/master/Lab04/'\
           'Dataset/activity.csv'
df = pd.read_csv(file_url)
df.head()
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: First five rows of the dataset
|
||
|
||
Each row represents an activity that was performed by a person and the
|
||
name of the activity is stored in the `Activity` column. There
|
||
are seven different activities in this variable: `bending1`,
|
||
`bending2`, `cycling`, `lying`,
|
||
`sitting`, `standing`, and `Walking`. The
|
||
other six columns are different measurements taken from sensor data.
|
||
|
||
In this example, you will accurately predict the target variable
|
||
(`'Activity'`) from the features (the six other columns) using
|
||
Random Forest. For example, for the first row of the preceding example,
|
||
the model will receive the following features as input and will predict
|
||
the `'bending1'` class:
|
||
|
||

|
||
|
||
Caption: Features for the first row of the dataset
|
||
|
||
Before that, we need to do a bit of data preparation. The `sklearn`
package, which we will use to train the Random Forest model, requires
the target variable and the features to be separated. So, we need to
extract the response variable using the `.pop()` method from `pandas`.
The `.pop()` method extracts the specified column and removes it from
the DataFrame:
|
||
|
||
```
target = df.pop('Activity')
```
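
To make the behaviour of `.pop()` concrete, here is a minimal sketch on a toy DataFrame. The data and the `sensor_a`/`sensor_b` column names are made up for illustration; only the `Activity` column name mirrors the lab's dataset.

```python
import pandas as pd

# Toy data: hypothetical feature columns plus an 'Activity' target column
toy = pd.DataFrame({
    "sensor_a": [39.25, 42.00, 36.50],
    "sensor_b": [0.43, 0.71, 0.50],
    "Activity": ["bending1", "cycling", "lying"],
})

# pop() returns the column as a Series and removes it from the DataFrame
toy_target = toy.pop("Activity")

print(list(toy.columns))    # only the feature columns remain
print(toy_target.tolist())  # the extracted target values
```

After the call, the DataFrame holds only features and the returned Series holds the target, which is the separation `sklearn` expects.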
|
||
|
||
|
||
The `sklearn` package provides a function called
|
||
`train_test_split()` to randomly split the dataset into two
|
||
different sets. We need to specify the following parameters for this
|
||
function: the feature and target variables, the ratio of the testing set
|
||
(`test_size`), and `random_state` in order to get
|
||
reproducible results if we have to run the code again:
|
||
|
||
```
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.33, random_state=42
)
```
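
As a quick sanity check of what `test_size` and `random_state` do, the following sketch splits a small synthetic array (not the lab's dataset; the `X_demo` and `y_demo` names are illustrative) and confirms that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)
print(X_tr.shape, X_te.shape)  # 60% of rows for training, 40% for testing

# Calling the function again with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)
print(np.array_equal(X_te, X_te2))
```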
|
||
|
||
|
||
Now that we have got our training and testing sets, we are ready for
|
||
modeling. Let\'s first import the `RandomForestClassifier`
|
||
class from `sklearn.ensemble`:
|
||
|
||
```
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
```
|
||
|
||
Next, we instantiate the classifier with a couple of hyperparameters.
Hyperparameters will be covered in more depth in *Lab 8, Hyperparameter
Tuning*. For now, we will just specify the `random_state` and
`n_estimators` values. We will walk you through some of the key
hyperparameters in the following sections:
|
||
|
||
|
||
```
rf_model = RandomForestClassifier(random_state=1,
                                  n_estimators=10)
```
|
||
|
||
The next step is to train (also called fit) the model with the training
|
||
data. During this step, the model will try to learn the relationship
|
||
between the response variable and the independent variables and save the
|
||
parameters learned. We need to specify the features and target variables
|
||
as parameters:
|
||
|
||
```
|
||
rf_model.fit(X_train, y_train)
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: Logs of the trained RandomForest
|
||
|
||
Now that the model has completed its training, we can use the parameters
|
||
it learned to make predictions on the input data we will provide. In the
|
||
following example, we are using the features from the training set:
|
||
|
||
```
|
||
preds = rf_model.predict(X_train)
|
||
```
|
||
Now we can print these predictions:
|
||
|
||
```
|
||
preds
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
|
||
Evaluating the Model\'s Performance
===================================
|
||
|
||
|
||
The accuracy score is the proportion of correct predictions among all
the predictions made. If your model made 950 correct predictions out of
1,000 cases, then the accuracy score would be 950/1000 = 0.95. This
would mean that your model was 95% accurate on that dataset. The
`sklearn` package provides a function called `accuracy_score()` that
calculates this score automatically. We need to import it first:
|
||
|
||
```
|
||
from sklearn.metrics import accuracy_score
|
||
```
|
||
|
||
Then, we just need to provide the true values of the target variable
for some observations and the corresponding predictions. We will use
the `y_train` and `preds` variables, which respectively contain the
response variable (also known as the target) for the training set and
the predictions made by the Random Forest model in the previous
section:
|
||
|
||
```
|
||
accuracy_score(y_train, preds)
|
||
```
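
To show what the score measures, here is a minimal pure-Python sketch of the same calculation. The `manual_accuracy` helper and the toy labels are illustrative and not part of `sklearn`.

```python
def manual_accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels: three of the four predictions are correct
y_true_demo = ["lying", "sitting", "cycling", "lying"]
y_pred_demo = ["lying", "standing", "cycling", "lying"]

print(manual_accuracy(y_true_demo, y_pred_demo))  # 3 / 4 = 0.75
```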
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
|
||
|
||
Let\'s calculate the accuracy score for the testing set:
|
||
|
||
```
|
||
test_preds = rf_model.predict(X_test)
|
||
accuracy_score(y_test, test_preds)
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
|
||
|
||
Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance
-----------------------------------------------------------------------------------------
|
||
|
||
In this exercise, we will train a Random Forest classifier to predict
|
||
the type of an animal based on its attributes and check its accuracy
|
||
score:
|
||
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Import the `pandas` package:
|
||
```
|
||
import pandas as pd
|
||
```
|
||
|
||
|
||
3. Create a variable called `file_url` that contains the URL
|
||
of the dataset:
|
||
```
|
||
file_url = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab04/Dataset'\
|
||
'/openml_phpZNNasq.csv'
|
||
```
|
||
|
||
|
||
4. Load the dataset into a DataFrame using the `.read_csv()`
|
||
method from pandas:
|
||
```
|
||
df = pd.read_csv(file_url)
|
||
```
|
||
|
||
|
||
5. Print the first five rows of the DataFrame:
|
||
|
||
```
|
||
df.head()
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: First five rows of the DataFrame
|
||
|
||
We will be using the `type` column as our target variable.
|
||
We will need to remove the `animal` column from the
|
||
DataFrame and only use the remaining columns as features.
|
||
|
||
6. Remove the `'animal'` column using the `.drop()`
|
||
method from `pandas` and specify the
|
||
`columns='animal'` and `inplace=True` parameters
|
||
(to directly update the original DataFrame):
|
||
```
|
||
df.drop(columns='animal', inplace=True)
|
||
```
|
||
|
||
|
||
7. Extract the `'type'` column using the `.pop()`
|
||
method from `pandas`:
|
||
```
|
||
y = df.pop('type')
|
||
```
|
||
|
||
|
||
8. Print the first five rows of the updated DataFrame:
|
||
|
||
```
|
||
df.head()
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: First five rows of the DataFrame
|
||
|
||
9. Import the `train_test_split` function from
|
||
`sklearn.model_selection`:
|
||
```
|
||
from sklearn.model_selection import train_test_split
|
||
```
|
||
|
||
|
||
10. Split the dataset into training and testing sets with the
|
||
`df`, `y`, `test_size=0.4`, and
|
||
`random_state=188` parameters:
|
||
```
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(df, y, test_size=0.4, \
|
||
random_state=188)
|
||
```
|
||
|
||
|
||
11. Import `RandomForestClassifier` from
|
||
`sklearn.ensemble`:
|
||
```
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
```
|
||
|
||
|
||
12. Instantiate the `RandomForestClassifier` object with
`random_state` equal to `42`. Set the `n_estimators` value to an
initial value of `10`. We'll discuss later how changing this value
affects the result.
|
||
```
|
||
rf_model = RandomForestClassifier(random_state=42, \
|
||
n_estimators=10)
|
||
```
|
||
|
||
|
||
13. Fit `RandomForestClassifier` with the training set:
|
||
|
||
```
|
||
rf_model.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForestClassifier
|
||
|
||
14. Predict the outcome of the training set with the `.predict()`
method, save the results in a variable called `train_preds`, and print
its value:
|
||
|
||
```
|
||
train_preds = rf_model.predict(X_train)
|
||
train_preds
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Predictions on the training set
|
||
|
||
15. Import the `accuracy_score` function from
|
||
`sklearn.metrics`:
|
||
```
|
||
from sklearn.metrics import accuracy_score
|
||
```
|
||
|
||
|
||
16. Calculate the accuracy score on the training set, save the result in
|
||
a variable called `train_acc`, and print its value:
|
||
|
||
```
|
||
train_acc = accuracy_score(y_train, train_preds)
|
||
print(train_acc)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Accuracy score on the training set
|
||
|
||
|
||
17. Predict the outcome of the testing set with the
|
||
`.predict()` method and save the results into a variable
|
||
called `test_preds`:
|
||
```
|
||
test_preds = rf_model.predict(X_test)
|
||
```
|
||
|
||
|
||
18. Calculate the accuracy score on the testing set, save the result in
|
||
a variable called `test_acc`, and print its value:
|
||
|
||
```
|
||
test_acc = accuracy_score(y_test, test_preds)
|
||
print(test_acc)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
|
||
|
||
You can find out which version of `sklearn` you are using by executing the following code:

`import sklearn`

`sklearn.__version__`
|
||
|
||
|
||
|
||
The `n_estimators` hyperparameter sets the number of trees in the
forest. In general, more trees give better and more stable performance,
at the cost of a longer training time. Let's see what happens with
`n_estimators = 2` on the Activity Recognition dataset:
|
||
|
||
```
|
||
rf_model2 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=2)
|
||
rf_model2.fit(X_train, y_train)
|
||
preds2 = rf_model2.predict(X_train)
|
||
test_preds2 = rf_model2.predict(X_test)
|
||
print(accuracy_score(y_train, preds2))
|
||
print(accuracy_score(y_test, test_preds2))
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: Accuracy of RandomForest with n\_estimators = 2
|
||
|
||
As expected, the accuracy is significantly lower than the previous
|
||
example with `n_estimators = 10`. Let\'s now try with
|
||
`50` trees:
|
||
|
||
```
|
||
rf_model3 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50)
|
||
rf_model3.fit(X_train, y_train)
|
||
preds3 = rf_model3.predict(X_train)
|
||
test_preds3 = rf_model3.predict(X_test)
|
||
print(accuracy_score(y_train, preds3))
|
||
print(accuracy_score(y_test, test_preds3))
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: Accuracy of RandomForest with n\_estimators = 50
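
To build intuition for why more trees tend to give steadier predictions, here is a simplified sketch of combining the predictions of individual trees by majority vote. The `majority_vote` helper and the vote list are illustrative. Note that scikit-learn itself averages the class probabilities predicted by the trees rather than counting hard votes, so this is a conceptual approximation.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one observation
votes = ["sitting", "lying", "sitting", "sitting", "standing"]

print(majority_vote(votes))  # 'sitting' wins with three of five votes
```

With only one or two trees, a single unusual tree decides the outcome; with many trees, individual mistakes are more likely to be outvoted.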
|
||
|
||
|
||
|
||
Exercise 4.02: Tuning n\_estimators to Reduce Overfitting
---------------------------------------------------------
|
||
|
||
In this exercise, we will train a Random Forest classifier to predict
|
||
the type of an animal based on its attributes and will try two different
|
||
values for the `n_estimators` hyperparameter:
|
||
|
||
We will be using the same zoo dataset as in the previous exercise.
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Import the `pandas` package, as well as `train_test_split`,
`RandomForestClassifier`, and `accuracy_score` from `sklearn`:
|
||
```
|
||
import pandas as pd
|
||
from sklearn.model_selection import train_test_split
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
from sklearn.metrics import accuracy_score
|
||
```
|
||
|
||
|
||
3. Create a variable called `file_url` that contains the URL
|
||
to the dataset:
|
||
```
|
||
file_url = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab04/Dataset'\
|
||
'/openml_phpZNNasq.csv'
|
||
```
|
||
|
||
|
||
4. Load the dataset into a DataFrame using the `.read_csv()`
|
||
method from `pandas`:
|
||
```
|
||
df = pd.read_csv(file_url)
|
||
```
|
||
|
||
|
||
5. Remove the `animal` column using `.drop()` and
|
||
then extract the `type` target variable into a new
|
||
variable called `y` using `.pop()`:
|
||
```
|
||
df.drop(columns='animal', inplace=True)
|
||
y = df.pop('type')
|
||
```
|
||
|
||
|
||
6. Split the data into training and testing sets with
|
||
`train_test_split()` and the `test_size=0.4` and
|
||
`random_state=188` parameters:
|
||
```
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(df, y, test_size=0.4, \
|
||
random_state=188)
|
||
```
|
||
|
||
|
||
7. Instantiate `RandomForestClassifier` with
|
||
`random_state=42` and `n_estimators=1`, and then
|
||
fit the model with the training set:
|
||
|
||
```
|
||
rf_model = RandomForestClassifier(random_state=42, \
|
||
n_estimators=1)
|
||
rf_model.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForestClassifier
|
||
|
||
8. Make predictions on the training and testing sets with
|
||
`.predict()` and save the results into two new variables
|
||
called `train_preds` and `test_preds`:
|
||
```
|
||
train_preds = rf_model.predict(X_train)
|
||
test_preds = rf_model.predict(X_test)
|
||
```
|
||
|
||
|
||
9. Calculate the accuracy score for the training and testing sets and
|
||
save the results in two new variables called `train_acc`
|
||
and `test_acc`:
|
||
```
|
||
train_acc = accuracy_score(y_train, train_preds)
|
||
test_acc = accuracy_score(y_test, test_preds)
|
||
```
|
||
|
||
|
||
10. Print the accuracy scores: `train_acc` and
|
||
`test_acc`:
|
||
|
||
```
|
||
print(train_acc)
|
||
print(test_acc)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Accuracy scores for the training and testing sets
|
||
|
||
The accuracy score decreased for both the training and testing sets.
|
||
But now the difference is smaller compared to the results from
|
||
*Exercise 4.01*, *Building a Model for Classifying Animal Type and
|
||
Assessing Its Performance*.
|
||
|
||
11. Instantiate another `RandomForestClassifier` with
|
||
`random_state=42` and `n_estimators=30`, and
|
||
then fit the model with the training set:
|
||
|
||
```
|
||
rf_model2 = RandomForestClassifier(random_state=42, \
|
||
n_estimators=30)
|
||
rf_model2.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForest with n\_estimators = 30
|
||
|
||
12. Make predictions on the training and testing sets with
|
||
`.predict()` and save the results into two new variables
|
||
called `train_preds2` and `test_preds2`:
|
||
```
|
||
train_preds2 = rf_model2.predict(X_train)
|
||
test_preds2 = rf_model2.predict(X_test)
|
||
```
|
||
|
||
|
||
13. Calculate the accuracy score for the training and testing sets and
|
||
save the results in two new variables called `train_acc2`
|
||
and `test_acc2`:
|
||
```
|
||
train_acc2 = accuracy_score(y_train, train_preds2)
|
||
test_acc2 = accuracy_score(y_test, test_preds2)
|
||
```
|
||
|
||
|
||
14. Print the accuracy scores: `train_acc2` and `test_acc2`:
|
||
|
||
```
|
||
print(train_acc2)
|
||
print(test_acc2)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Accuracy scores for the training and testing sets
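
The comparison carried out in this exercise can also be written as a loop that records the gap between training and testing accuracy for each value of `n_estimators`. The sketch below runs on synthetic data from `make_classification` rather than the zoo dataset, so the numbers it prints will differ from the figures above; the `gaps` dictionary is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for a real dataset
X_syn, y_syn = make_classification(n_samples=200, n_features=8,
                                   n_informative=5, n_redundant=0,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.4,
                                          random_state=188)

gaps = {}
for n_trees in (1, 30):
    model = RandomForestClassifier(random_state=42, n_estimators=n_trees)
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    # A large positive gap suggests the model is overfitting
    gaps[n_trees] = (train_acc, test_acc, train_acc - test_acc)
    print(n_trees, round(train_acc, 3), round(test_acc, 3))
```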
|
||
|
||
|
||
|
||
Maximum Depth
=============
|
||
|
||
|
||
In the previous section, we learned how Random Forest builds multiple
|
||
trees to make predictions. Increasing the number of trees does improve
|
||
model performance but it usually doesn\'t help much to decrease the risk
|
||
of overfitting. Our model in the previous example is still performing
|
||
much better on the training set (data it has already seen) than on the
|
||
testing set (unseen data).
|
||
|
||
So, we are not confident enough yet to say the model will perform well
|
||
in production. There are different hyperparameters that can help to
|
||
lower the risk of overfitting for Random Forest and one of them is
|
||
called `max_depth`.
|
||
|
||
This hyperparameter defines the maximum depth of the trees built by
Random Forest. Basically, it tells the Random Forest model how many
nodes (questions) it can chain together before making a prediction. But
how will that help to reduce overfitting, you may ask. Well, let's say
you built a single tree and set the `max_depth` hyperparameter to
`50`. This would mean that there would be some cases where you could
ask up to 50 different questions in a row before reaching a leaf node
and making a prediction. So, the logic would be
`IF X1 > value1 AND X2 > value2 AND X1 <= value3 AND … AND X3 > value50 THEN predict class A`.
|
||
|
||
As you can imagine, this is a very specific rule. In the end, it may
|
||
apply to only a few observations in the training set, with this case
|
||
appearing very infrequently. Therefore, your model would be overfitting.
|
||
By default, the value of this `max_depth` parameter is
|
||
`None`, which means there is no limit set for the depth of the
|
||
trees.
|
||
|
||
What you really want is to find some rules that are generic enough to be
|
||
applied to bigger groups of observations. This is why it is recommended
|
||
to not create deep trees with Random Forest. Let\'s try several values
|
||
for this hyperparameter on the Activity Recognition dataset:
|
||
`3`, `10`, and `50`:
|
||
|
||
```
|
||
rf_model4 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50, max_depth=3)
|
||
rf_model4.fit(X_train, y_train)
|
||
preds4 = rf_model4.predict(X_train)
|
||
test_preds4 = rf_model4.predict(X_test)
|
||
print(accuracy_score(y_train, preds4))
|
||
print(accuracy_score(y_test, test_preds4))
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Accuracy scores for the training and testing sets and a
|
||
max\_depth of 3
|
||
|
||
For a `max_depth` of `3`, we got extremely similar
|
||
results for the training and testing sets but the overall performance
|
||
decreased drastically to `0.61`. Our model is not overfitting
|
||
anymore, but it is now underfitting; that is, it is not predicting the
|
||
target variable very well (only in `61%` of cases). Let\'s
|
||
increase `max_depth` to `10`:
|
||
|
||
```
|
||
rf_model5 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50, \
|
||
max_depth=10)
|
||
rf_model5.fit(X_train, y_train)
|
||
preds5 = rf_model5.predict(X_train)
|
||
test_preds5 = rf_model5.predict(X_test)
|
||
print(accuracy_score(y_train, preds5))
|
||
print(accuracy_score(y_test, test_preds5))
|
||
```
|
||

|
||
|
||
Caption: Accuracy scores for the training and testing sets and a
|
||
max\_depth of 10
|
||
|
||
The accuracy of the training set increased and is relatively close to
|
||
the testing set. We are starting to get some good results, but the model
|
||
is still slightly overfitting. Now we will see the results for
|
||
`max_depth = 50`:
|
||
|
||
```
|
||
rf_model6 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50, \
|
||
max_depth=50)
|
||
rf_model6.fit(X_train, y_train)
|
||
preds6 = rf_model6.predict(X_train)
|
||
test_preds6 = rf_model6.predict(X_test)
|
||
print(accuracy_score(y_train, preds6))
|
||
print(accuracy_score(y_test, test_preds6))
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: Accuracy scores for the training and testing sets and a
|
||
max\_depth of 50
|
||
|
||
The accuracy jumped to `0.99` for the training set but it
|
||
didn\'t improve much for the testing set. So, the model is overfitting
|
||
with `max_depth = 50`. It seems the sweet spot to get good
|
||
predictions and not much overfitting is around `10` for the
|
||
`max_depth` hyperparameter in this dataset.
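
To check that `max_depth` really caps how deep each tree grows, the following sketch fits forests on synthetic data (not the Activity Recognition dataset) and reads back the depth of the deepest tree through the `estimators_` attribute and each tree's `get_depth()` method. The `depths` dictionary is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data with six features
X_syn, y_syn = make_classification(n_samples=300, n_features=6,
                                   n_informative=4, n_redundant=0,
                                   n_classes=3, random_state=1)

depths = {}
for depth in (3, 10, None):  # None means no limit on depth
    forest = RandomForestClassifier(n_estimators=20, max_depth=depth,
                                    random_state=1)
    forest.fit(X_syn, y_syn)
    # Depth of the deepest tree in the fitted forest
    depths[depth] = max(tree.get_depth() for tree in forest.estimators_)

print(depths)
```

Comparing the limited forests with the unlimited one shows how far the trees would grow if left unconstrained.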
|
||
|
||
|
||
|
||
Exercise 4.03: Tuning max\_depth to Reduce Overfitting
------------------------------------------------------
|
||
|
||
In this exercise, we will keep tuning our RandomForest classifier that
|
||
predicts animal type by trying two different values for the
|
||
`max_depth` hyperparameter:
|
||
|
||
We will be using the same zoo dataset as in the previous exercise.
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Import the `pandas` package, `train_test_split`,
|
||
`RandomForestClassifier`, and `accuracy_score`
|
||
from `sklearn`:
|
||
```
|
||
import pandas as pd
|
||
from sklearn.model_selection import train_test_split
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
from sklearn.metrics import accuracy_score
|
||
```
|
||
|
||
|
||
3. Create a variable called `file_url` that contains the URL
|
||
to the dataset:
|
||
```
|
||
file_url = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab04/Dataset'\
|
||
'/openml_phpZNNasq.csv'
|
||
```
|
||
|
||
|
||
4. Load the dataset into a DataFrame using the `.read_csv()`
|
||
method from `pandas`:
|
||
```
|
||
df = pd.read_csv(file_url)
|
||
```
|
||
|
||
|
||
5. Remove the `animal` column using `.drop()` and
|
||
then extract the `type` target variable into a new
|
||
variable called `y` using `.pop()`:
|
||
```
|
||
df.drop(columns='animal', inplace=True)
|
||
y = df.pop('type')
|
||
```
|
||
|
||
|
||
6. Split the data into training and testing sets with
|
||
`train_test_split()` and the parameters
|
||
`test_size=0.4` and `random_state=188`:
|
||
```
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(df, y, test_size=0.4, \
|
||
random_state=188)
|
||
```
|
||
|
||
|
||
7. Instantiate `RandomForestClassifier` with
|
||
`random_state=42`, `n_estimators=30`, and
|
||
`max_depth=5`, and then fit the model with the training
|
||
set:
|
||
|
||
```
|
||
rf_model = RandomForestClassifier(random_state=42, \
|
||
n_estimators=30, \
|
||
max_depth=5)
|
||
rf_model.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForest
|
||
|
||
8. Make predictions on the training and testing sets with
|
||
`.predict()` and save the results into two new variables
|
||
called `train_preds` and `test_preds`:
|
||
```
|
||
train_preds = rf_model.predict(X_train)
|
||
test_preds = rf_model.predict(X_test)
|
||
```
|
||
|
||
|
||
9. Calculate the accuracy score for the training and testing sets and
|
||
save the results in two new variables called `train_acc`
|
||
and `test_acc`:
|
||
```
|
||
train_acc = accuracy_score(y_train, train_preds)
|
||
test_acc = accuracy_score(y_test, test_preds)
|
||
```
|
||
|
||
|
||
10. Print the accuracy scores: `train_acc` and
|
||
`test_acc`:
|
||
|
||
```
|
||
print(train_acc)
|
||
print(test_acc)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Accuracy scores for the training and testing sets
|
||
|
||
We got the exact same accuracy scores as for the best result we
|
||
obtained in the previous exercise. This value for the
|
||
`max_depth` hyperparameter hasn\'t impacted the model\'s
|
||
performance.
|
||
|
||
11. Instantiate another `RandomForestClassifier` with
|
||
`random_state=42`, `n_estimators=30`, and
|
||
`max_depth=2`, and then fit the model with the training
|
||
set:
|
||
|
||
```
|
||
rf_model2 = RandomForestClassifier(random_state=42, \
|
||
n_estimators=30, \
|
||
max_depth=2)
|
||
rf_model2.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForestClassifier with max\_depth = 2
|
||
|
||
12. Make predictions on the training and testing sets with
|
||
`.predict()` and save the results into two new variables
|
||
called `train_preds2` and `test_preds2`:
|
||
```
|
||
train_preds2 = rf_model2.predict(X_train)
|
||
test_preds2 = rf_model2.predict(X_test)
|
||
```
|
||
|
||
|
||
13. Calculate the accuracy scores for the training and testing sets and
|
||
save the results in two new variables called `train_acc2`
|
||
and `test_acc2`:
|
||
```
|
||
train_acc2 = accuracy_score(y_train, train_preds2)
|
||
test_acc2 = accuracy_score(y_test, test_preds2)
|
||
```
|
||
|
||
|
||
14. Print the accuracy scores: `train_acc2` and `test_acc2`:
|
||
|
||
```
|
||
print(train_acc2)
|
||
print(test_acc2)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
|
||
Minimum Sample in Leaf
======================
|
||
|
||
|
||
It would be great if we could let the model know to not create such
|
||
specific rules that happen quite infrequently. Luckily,
|
||
`RandomForest` has such a hyperparameter and, you guessed it,
|
||
it is `min_samples_leaf`. This hyperparameter will specify the
|
||
minimum number of observations (or samples) that will have to fall under
|
||
a leaf node to be considered in the tree. For instance, if we set
|
||
`min_samples_leaf` to `3`, then
|
||
`RandomForest` will only consider a split that leads to at
|
||
least three observations on both the left and right leaf nodes. If this
|
||
condition is not met for a split, the model will not consider it and
|
||
will exclude it from the tree. The default value in `sklearn`
|
||
for this hyperparameter is `1`. Let\'s try to find the optimal
|
||
value for `min_samples_leaf` for the Activity Recognition
|
||
dataset:
|
||
|
||
```
|
||
rf_model7 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50, \
|
||
max_depth=10, \
|
||
min_samples_leaf=3)
|
||
rf_model7.fit(X_train, y_train)
|
||
preds7 = rf_model7.predict(X_train)
|
||
test_preds7 = rf_model7.predict(X_test)
|
||
print(accuracy_score(y_train, preds7))
|
||
print(accuracy_score(y_test, test_preds7))
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: Accuracy scores for the training and testing sets for
|
||
min\_samples\_leaf=3
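
The split rule described above can be sketched in a few lines of plain Python. The `split_is_allowed` helper is hypothetical and only mirrors the condition in the text; it is not how `sklearn` implements the check internally.

```python
def split_is_allowed(n_left, n_right, min_samples_leaf=1):
    """A split is kept only if both children hold enough observations."""
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf

# A split sending 2 observations left and 40 right
print(split_is_allowed(2, 40, min_samples_leaf=3))   # False: left side too small
print(split_is_allowed(5, 37, min_samples_leaf=3))   # True: both sides have >= 3
print(split_is_allowed(2, 40))                       # True with the default of 1
```

Raising `min_samples_leaf` therefore discards the very specific rules that would only cover a handful of training observations.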
|
||
|
||
With `min_samples_leaf=3`, the accuracy for both the training
|
||
and testing sets didn\'t change much compared to the best model we found
|
||
in the previous section. Let\'s try increasing it to `10`:
|
||
|
||
```
|
||
rf_model8 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50, \
|
||
max_depth=10, \
|
||
min_samples_leaf=10)
|
||
rf_model8.fit(X_train, y_train)
|
||
preds8 = rf_model8.predict(X_train)
|
||
test_preds8 = rf_model8.predict(X_test)
|
||
print(accuracy_score(y_train, preds8))
|
||
print(accuracy_score(y_test, test_preds8))
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
Caption: Accuracy scores for the training and testing sets for
|
||
min\_samples\_leaf=10
|
||
|
||
Now the accuracy of the training set dropped a bit but increased for the
|
||
testing set and their difference is smaller now. So, our model is
|
||
overfitting less. Let\'s try another value for this hyperparameter --
|
||
`25`:
|
||
|
||
```
|
||
rf_model9 = RandomForestClassifier(random_state=1, \
|
||
n_estimators=50, \
|
||
max_depth=10, \
|
||
min_samples_leaf=25)
|
||
rf_model9.fit(X_train, y_train)
|
||
preds9 = rf_model9.predict(X_train)
|
||
test_preds9 = rf_model9.predict(X_test)
|
||
print(accuracy_score(y_train, preds9))
|
||
print(accuracy_score(y_test, test_preds9))
|
||
```
|
||
|
||
The output will be as follows:
|
||
|
||

|
||
|
||
|
||
|
||
|
||
Exercise 4.04: Tuning min\_samples\_leaf
----------------------------------------
|
||
|
||
In this exercise, we will keep tuning our Random Forest classifier that
|
||
predicts animal type by trying two different values for the
|
||
`min_samples_leaf` hyperparameter:
|
||
|
||
We will be using the same zoo dataset as in the previous exercise.
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Import the `pandas` package, `train_test_split`,
|
||
`RandomForestClassifier`, and `accuracy_score`
|
||
from `sklearn`:
|
||
```
|
||
import pandas as pd
|
||
from sklearn.model_selection import train_test_split
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
from sklearn.metrics import accuracy_score
|
||
```
|
||
|
||
|
||
3. Create a variable called `file_url` that contains the URL
|
||
to the dataset:
|
||
```
|
||
file_url = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab04/Dataset/openml_phpZNNasq.csv'
|
||
```
|
||
|
||
|
||
4. Load the dataset into a DataFrame using the `.read_csv()`
|
||
method from `pandas`:
|
||
```
|
||
df = pd.read_csv(file_url)
|
||
```
|
||
|
||
|
||
5. Remove the `animal` column using `.drop()` and
|
||
then extract the `type` target variable into a new
|
||
variable called `y` using `.pop()`:
|
||
```
|
||
df.drop(columns='animal', inplace=True)
|
||
y = df.pop('type')
|
||
```
|
||
|
||
|
||
6. Split the data into training and testing sets with
|
||
`train_test_split()` and the parameters
|
||
`test_size=0.4` and `random_state=188`:
|
||
```
|
||
X_train, X_test, \
|
||
y_train, y_test = train_test_split(df, y, test_size=0.4, \
|
||
random_state=188)
|
||
```
|
||
|
||
|
||
7. Instantiate `RandomForestClassifier` with
|
||
`random_state=42`, `n_estimators=30`,
|
||
`max_depth=2`, and `min_samples_leaf=3`, and
|
||
then fit the model with the training set:
|
||
|
||
```
|
||
rf_model = RandomForestClassifier(random_state=42, \
|
||
n_estimators=30, \
|
||
max_depth=2, \
|
||
min_samples_leaf=3)
|
||
rf_model.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForest
|
||
|
||
8. Make predictions on the training and testing sets with
|
||
`.predict()` and save the results into two new variables
|
||
called `train_preds` and `test_preds`:
|
||
```
|
||
train_preds = rf_model.predict(X_train)
|
||
test_preds = rf_model.predict(X_test)
|
||
```
|
||
|
||
|
||
9. Calculate the accuracy score for the training and testing sets and
|
||
save the results in two new variables called `train_acc`
|
||
and `test_acc`:
|
||
```
|
||
train_acc = accuracy_score(y_train, train_preds)
|
||
test_acc = accuracy_score(y_test, test_preds)
|
||
```
|
||
|
||
|
||
10. Print the accuracy score -- `train_acc` and
|
||
`test_acc`:
|
||
|
||
```
|
||
print(train_acc)
|
||
print(test_acc)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
11. Instantiate another `RandomForestClassifier` with
|
||
`random_state=42`, `n_estimators=30`,
|
||
`max_depth=2`, and `min_samples_leaf=7`, and
|
||
then fit the model with the training set:
|
||
|
||
```
|
||
rf_model2 = RandomForestClassifier(random_state=42, \
|
||
n_estimators=30, \
|
||
max_depth=2, \
|
||
min_samples_leaf=7)
|
||
rf_model2.fit(X_train, y_train)
|
||
```
|
||
|
||
|
||
You should get the following output:



Caption: Logs of RandomForest with min\_samples\_leaf=7

12. Make predictions on the training and testing sets with
`.predict()` and save the results into two new variables
called `train_preds2` and `test_preds2`:

```
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
```

13. Calculate the accuracy scores for the training and testing sets and
save the results in two new variables called `train_acc2`
and `test_acc2`:

```
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
```

14. Print the accuracy scores (`train_acc2` and `test_acc2`):

```
print(train_acc2)
print(test_acc2)
```

You should get the following output:



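The two-model comparison in this exercise can also be written as a loop over candidate values of `min_samples_leaf`. The following is only a minimal sketch: it uses synthetic data from `make_classification` as a stand-in for the zoo dataset (an assumption, so the numbers will not match the figures above), and the candidate values `3`, `7`, and `15` are illustrative. A large gap between the training and testing accuracy for a given value suggests overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 rows, 16 features, 4 classes.
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=4, random_state=188)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=188)

scores = {}
for leaf in (3, 7, 15):  # illustrative candidate values
    model = RandomForestClassifier(random_state=42, n_estimators=30,
                                   max_depth=2, min_samples_leaf=leaf)
    model.fit(X_train, y_train)
    # Accuracy on data the model has seen versus data it has not seen.
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    scores[leaf] = (train_acc, test_acc)
    print(leaf, round(train_acc, 3), round(test_acc, 3))
```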
Maximum Features
================

Let's try three different values on the activity dataset. First, we
will specify the maximum number of features as two:

```
rf_model10 = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    max_depth=10, \
                                    min_samples_leaf=25, \
                                    max_features=2)
rf_model10.fit(X_train, y_train)
preds10 = rf_model10.predict(X_train)
test_preds10 = rf_model10.predict(X_test)
print(accuracy_score(y_train, preds10))
print(accuracy_score(y_test, test_preds10))
```

The output will be as follows:



Caption: Accuracy scores for the training and testing sets for
max\_features=2

We got results similar to those of the best model we trained in the
previous section. This is not really surprising as we were using the
default value of `max_features` at that time, which is
`sqrt`. The dataset has six features, and the square root of `6` is
about `2.45`, which is quite close to `2`. This time,
let's try with the ratio `0.7`:

```
rf_model11 = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    max_depth=10, \
                                    min_samples_leaf=25, \
                                    max_features=0.7)
rf_model11.fit(X_train, y_train)
preds11 = rf_model11.predict(X_train)
test_preds11 = rf_model11.predict(X_test)
print(accuracy_score(y_train, preds11))
print(accuracy_score(y_test, test_preds11))
```

The output will be as follows:



With this ratio, both accuracy scores increased for the training and
testing sets, and the difference between them is smaller. Our model is
overfitting less now and has slightly improved its predictive power.
Let's give it a shot with the `log2` option:

```
rf_model12 = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    max_depth=10, \
                                    min_samples_leaf=25, \
                                    max_features='log2')
rf_model12.fit(X_train, y_train)
preds12 = rf_model12.predict(X_train)
test_preds12 = rf_model12.predict(X_test)
print(accuracy_score(y_train, preds12))
print(accuracy_score(y_test, test_preds12))
```

The output will be as follows:



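To see how the different kinds of `max_features` values translate into a number of features per split, here is a minimal sketch. The helper `n_features_per_split` is hypothetical (it is not part of scikit-learn), and it assumes the rule that, to the best of our knowledge, scikit-learn applies: an integer is used as is, a float is a fraction of the feature count truncated to an integer, and `sqrt` and `log2` are truncated as well, with a minimum of one feature. With the six features of the activity dataset, `2`, `sqrt`, and `log2` all resolve to two features, while `0.7` resolves to four.

```python
import math


def n_features_per_split(max_features, n_features):
    """Hypothetical helper: features considered at each split."""
    if max_features is None:
        return n_features  # None means all features are considered
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features  # an integer is an absolute count
    # A float is treated as a fraction of the total number of features.
    return max(1, int(max_features * n_features))


# The activity dataset has six features.
for setting in (2, 0.7, "sqrt", "log2"):
    print(setting, "->", n_features_per_split(setting, n_features=6))
```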
Exercise 4.05: Tuning max\_features
-----------------------------------

In this exercise, we will keep tuning our RandomForest classifier that
predicts animal type by trying two different values for the
`max_features` hyperparameter.

We will be using the same zoo dataset as in the previous exercise.

1. Open a new Jupyter notebook.

2. Import the `pandas` package, `train_test_split`,
`RandomForestClassifier`, and `accuracy_score`
from `sklearn`:

```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
```

3. Create a variable called `file_url` that contains the URL
to the dataset:

```
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab04/Dataset/openml_phpZNNasq.csv'
```

4. Load the dataset into a DataFrame using the `.read_csv()`
method from `pandas`:

```
df = pd.read_csv(file_url)
```

5. Remove the `animal` column using `.drop()` and
then extract the `type` target variable into a new
variable called `y` using `.pop()`:

```
df.drop(columns='animal', inplace=True)
y = df.pop('type')
```

6. Split the data into training and testing sets with
`train_test_split()` and the parameters
`test_size=0.4` and `random_state=188`:

```
X_train, X_test, \
y_train, y_test = train_test_split(df, y, test_size=0.4, \
                                   random_state=188)
```

7. Instantiate `RandomForestClassifier` with
`random_state=42`, `n_estimators=30`,
`max_depth=2`, `min_samples_leaf=7`, and
`max_features=10`, and then fit the model with the
training set:

```
rf_model = RandomForestClassifier(random_state=42, \
                                  n_estimators=30, \
                                  max_depth=2, \
                                  min_samples_leaf=7, \
                                  max_features=10)
rf_model.fit(X_train, y_train)
```

You should get the following output:



Caption: Logs of RandomForest

8. Make predictions on the training and testing sets with
`.predict()` and save the results into two new variables
called `train_preds` and `test_preds`:

```
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)
```

9. Calculate the accuracy scores for the training and testing sets and
save the results in two new variables called `train_acc`
and `test_acc`:

```
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
```

10. Print the accuracy scores (`train_acc` and `test_acc`):

```
print(train_acc)
print(test_acc)
```

You should get the following output:



Caption: Accuracy scores for the training and testing sets

11. Instantiate another `RandomForestClassifier` with
`random_state=42`, `n_estimators=30`,
`max_depth=2`, `min_samples_leaf=7`, and
`max_features=0.2`, and then fit the model with the
training set:

```
rf_model2 = RandomForestClassifier(random_state=42, \
                                   n_estimators=30, \
                                   max_depth=2, \
                                   min_samples_leaf=7, \
                                   max_features=0.2)
rf_model2.fit(X_train, y_train)
```

You should get the following output:



Caption: Logs of RandomForest with max\_features = 0.2

12. Make predictions on the training and testing sets with
`.predict()` and save the results into two new variables
called `train_preds2` and `test_preds2`:

```
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
```

13. Calculate the accuracy scores for the training and testing sets and
save the results in two new variables called `train_acc2`
and `test_acc2`:

```
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
```

14. Print the accuracy scores (`train_acc2` and `test_acc2`):

```
print(train_acc2)
print(test_acc2)
```

You should get the following output:



Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
---------------------------------------------------------------------

You are working for a technology company and they are planning to launch
a new voice assistant product. You have been tasked with building a
classification model that will recognize the letters spelled out by a
user based on the signal frequencies captured. Each sound can be
captured and represented as a signal composed of multiple frequencies.

The following steps will help you to complete this activity:

1. Download and load the dataset using `.read_csv()` from
`pandas`.
2. Extract the response variable using `.pop()` from
`pandas`.
3. Split the dataset into training and test sets using
`train_test_split()` from
`sklearn.model_selection`.
4. Create a function that will instantiate and fit a
`RandomForestClassifier` using `.fit()` from
`sklearn.ensemble`.
5. Create a function that will predict the outcome for the training and
testing sets using `.predict()`.
6. Create a function that will print the accuracy score for the
training and testing sets using `accuracy_score()` from
`sklearn.metrics`.
7. Train and get the accuracy score for a range of different
hyperparameters. Here are some options you can try:
    - `n_estimators = 20` and `50`
    - `max_depth = 5` and `10`
    - `min_samples_leaf = 10` and `50`
    - `max_features = 0.5` and `0.3`
8. Select the best hyperparameter values.

These are the accuracy scores for the best model we trained:



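The function-based workflow described in steps 4 to 7 could be organized along the following lines. This is only a sketch, not the reference solution: the function names `train_rf`, `get_preds`, and `print_accuracy` are hypothetical, the data is synthetic (generated with `make_classification` as a stand-in for ISOLET), and only two of the four hyperparameters are varied to keep the loop short. Ranking the combinations by test accuracy simply mirrors the activity; with more data, a separate validation set would be a safer basis for the choice.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_rf(X_train, y_train, random_state=888, **hyperparams):
    """Instantiate and fit a RandomForestClassifier (step 4)."""
    model = RandomForestClassifier(random_state=random_state, **hyperparams)
    model.fit(X_train, y_train)
    return model


def get_preds(model, X_train, X_test):
    """Predict on the training and testing sets (step 5)."""
    return model.predict(X_train), model.predict(X_test)


def print_accuracy(y_train, y_test, train_preds, test_preds):
    """Print and return both accuracy scores (step 6)."""
    train_acc = accuracy_score(y_train, train_preds)
    test_acc = accuracy_score(y_test, test_preds)
    print(f"train={train_acc:.3f} test={test_acc:.3f}")
    return train_acc, test_acc


# Synthetic stand-in for the ISOLET data (assumption, not the real dataset).
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
# Try every combination of two hyperparameters (step 7).
for n_est, depth in product((20, 50), (5, 10)):
    model = train_rf(X_train, y_train, n_estimators=n_est, max_depth=depth,
                     min_samples_leaf=10, max_features=0.5)
    train_preds, test_preds = get_preds(model, X_train, X_test)
    results[(n_est, depth)] = print_accuracy(y_train, y_test,
                                             train_preds, test_preds)

# Keep the combination with the highest test accuracy (step 8).
best = max(results, key=lambda k: results[k][1])
print("best (n_estimators, max_depth):", best)
```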
Summary
=======

We have finally reached the end of this lab on multiclass
classification with Random Forest. We learned that multiclass
classification is an extension of binary classification: instead of
predicting only two classes, target variables can have many more values.
We saw how we can train a Random Forest model in just a few lines of
code and assess its performance by calculating the accuracy score for
the training and testing sets. Finally, we learned how to tune some of
its most important hyperparameters: `n_estimators`,
`max_depth`, `min_samples_leaf`, and
`max_features`. We also saw how their values can have a
significant impact on the predictive power of a model but also on its
ability to generalize to unseen data.