The Generalization of Machine Learning Models
=============================================
Overview
This lab will teach you how to make use of the data you have to train better models, either by splitting your data if it is sufficient or by making use of cross-validation if it is not. By the end of this lab, you will know how to split your data into training, validation, and test datasets. You will be able to choose an appropriate ratio for splitting your data and understand the factors to consider while splitting. You will also be able to implement cross-validation to make the most of limited data, and to use regularization to reduce overfitting in models.
Exercise 7.01: Importing and Splitting Data
The goal of this exercise is to import data from a repository and to split it into a training and an evaluation set. We will be using the Cars dataset from the UCI Machine Learning Repository.
This dataset is about the cost of owning cars with certain attributes. The abstract from the website states: "Derived from simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods." Here are some of the key attributes of this dataset:
- CAR: car acceptability
  - PRICE: overall price
    - buying: buying price
    - maint: price of the maintenance
  - TECH: technical characteristics
    - COMFORT: comfort
      - doors: number of doors
      - persons: capacity in terms of persons to carry
      - lug_boot: the size of the luggage boot
    - safety: estimated safety of the car
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook file.

2. Import the necessary libraries:

    ```python
    # import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    ```

    In this step, you have imported `pandas` and aliased it as `pd`. As you know, `pandas` is required to read in the file. You also import `train_test_split` from `sklearn.model_selection` to split the data into two parts.
3. Before reading the file into your notebook, open and inspect the file (`car.data`) with an editor. You should see an output similar to the following:

    Caption: Car data

    You will notice from the preceding screenshot that the file doesn't have a first row containing the headers.
4. Create a Python list to hold the headers for the data:

    ```python
    # data doesn't have headers, so let's create headers
    _headers = ['buying', 'maint', 'doors', 'persons',
                'lug_boot', 'safety', 'car']
    ```
5. Now, import the data as shown in the following code snippet:

    ```python
    # read in cars dataset
    df = pd.read_csv('https://raw.githubusercontent.com/'
                     'fenago/data-science/'
                     'master/Lab07/Dataset/car.data',
                     names=_headers, index_col=None)
    ```

    You then proceed to import the data into a variable called `df` by using `pd.read_csv`. You specify the location of the data file, as well as the list of column headers. You also specify that the data does not have a column index.
6. Inspect the DataFrame:

    ```python
    df.info()
    ```

    In order to get information about the columns in the data, as well as the number of records, you make use of the `info()` method. You should get an output similar to the following:

    Caption: Summary information of the DataFrame

    The `RangeIndex` value shows the number of records, which is `1728`.
7. Now, you need to split the data contained in `df` into a training dataset and an evaluation dataset:

    ```python
    # split the data into 80% for training and 20% for evaluation
    training_df, eval_df = train_test_split(df, train_size=0.8,
                                            random_state=0)
    ```

    In this step, you make use of `train_test_split` to create two new DataFrames called `training_df` and `eval_df`. You specify a value of `0.8` for `train_size` so that `80%` of the data is assigned to `training_df`. `random_state` ensures that your experiments are reproducible. Without `random_state`, the data is split differently every time using a different random number. With `random_state`, the data is split the same way every time. We will be studying `random_state` in depth in the next lab.
8. Check the information of `training_df`:

    ```python
    training_df.info()
    ```

    In this step, you make use of `.info()` to get the details of `training_df`. This will print out the column names as well as the number of records. You should get an output similar to the following:

    Caption: Information on training_df

    You should observe that the column names match those in `df`, but you should have `80%` of the records that you did in `df`, which is `1382` out of `1728`.
9. Check the information of `eval_df`:

    ```python
    eval_df.info()
    ```

    In this step, you print out the information about `eval_df`. This will give you the column names and the number of records. The output should be similar to the following:

    Caption: Information on eval_df
Random State
Caption: Numbers generated using random state
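To see this reproducibility for yourself, here is a minimal sketch (not part of the exercises; the toy DataFrame is an illustrative assumption) showing that two splits with the same `random_state` produce identical partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# a small toy DataFrame; any data would do
toy = pd.DataFrame({'x': range(10)})

# two splits with the same random_state produce identical partitions
a_train, a_eval = train_test_split(toy, train_size=0.8, random_state=1)
b_train, b_eval = train_test_split(toy, train_size=0.8, random_state=1)
print(a_train.index.equals(b_train.index))  # prints True

# omitting random_state generally yields a different split on each run
c_train, c_eval = train_test_split(toy, train_size=0.8)
```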
Exercise 7.02: Setting a Random State When Splitting Data
The goal of this exercise is to have a reproducible way of splitting the data that you imported in Exercise 7.01, Importing and Splitting Data.
Note
We are going to refactor the code from the previous exercise. Hence, if you are using a new Jupyter notebook, make sure you copy the code from the previous exercise. Alternatively, you can make a copy of the notebook used in Exercise 7.01 and revise the code as suggested in the following steps.
The following steps will help you complete the exercise:
1. Continue from the Exercise 7.01 notebook.

2. Set the random state as `1` and split the data:

    ```python
    """
    split the data into 80% for training and 20% for evaluation
    using a random state
    """
    training_df, eval_df = train_test_split(df, train_size=0.8,
                                            random_state=1)
    ```

    In this step, you pass a `random_state` value of `1` to the `train_test_split` function.
Now, view the top five records in
training_df:#view the head of training_eval training_df.head()In this step, you print out the first five records in
training_df.The output should be similar to the following:
Caption: The top five rows for the training evaluation set
4. View the top five records in `eval_df`:

    ```python
    # view the top of eval_df
    eval_df.head()
    ```

    In this step, you print out the first five records in `eval_df`. The output should be similar to the following:

    Caption: The top five rows of the evaluation set
Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset
The goal of this exercise is to create a five-fold cross-validation dataset from the data that you imported in Exercise 7.01, Importing and Splitting Data.
Note
If you are using a new Jupyter notebook then make sure you copy the code from Exercise 7.01, Importing and Splitting Data. Alternatively, you can make a copy of the notebook used in Exercise 7.01 and then use the code as suggested in the following steps.
The following steps will help you complete the exercise:
1. Continue from the notebook file of Exercise 7.01.

2. Import all the necessary libraries:

    ```python
    from sklearn.model_selection import KFold
    ```

    In this step, you import `KFold` from `sklearn.model_selection`.
3. Now create an instance of the class:

    ```python
    _kf = KFold(n_splits=5)
    ```

    In this step, you create an instance of `KFold` and assign it to a variable called `_kf`. You specify a value of `5` for the `n_splits` parameter so that it splits the dataset into five parts.
4. Now split the data as shown in the following code snippet:

    ```python
    indices = _kf.split(df)
    ```

    In this step, you call the `.split()` method on `_kf`. The result is stored in a variable called `indices`.
5. Find out what data type `indices` has:

    ```python
    print(type(indices))
    ```

    In this step, you inspect the type of the object returned by the call to `split()`. The output should be a `generator`, as seen in the following output:

    Caption: Data type for indices
6. Get the first set of indices:

    ```python
    # first set
    train_indices, val_indices = next(indices)
    ```

    In this step, you call `next()` on the generator to obtain the first pair of training and validation indices.
7. Create a training dataset as shown in the following code snippet:

    ```python
    train_df = df.drop(val_indices)
    train_df.info()
    ```

    In this step, you create a new DataFrame called `train_df` by dropping the validation indices from `df`, the DataFrame that contains all of the data. This is a subtractive operation similar to what is done in set theory. The `df` set is a union of `train` and `val`. Once you know what `val` is, you can work backward to determine `train` by subtracting `val` from `df`. If you consider `df` to be a set called `A`, `val` to be a set called `B`, and `train` to be a set called `C`, then the following holds true:

    A = B ∪ C

    Similarly, set `C` is the difference between set `A` and set `B`:

    C = A − B

    The way to accomplish this with a pandas DataFrame is to drop the rows with the indices of the elements of `B` from `A`, which is what you see in the preceding code snippet. You can see the result of this by calling the `info()` method on the new DataFrame. The result of that call should be similar to the following screenshot:

    Caption: Information on the new dataframe
8. Create a validation dataset:

    ```python
    val_df = df.drop(train_indices)
    val_df.info()
    ```

    In this step, you create the `val_df` validation dataset by dropping the training indices from the `df` DataFrame. Again, you can see the details of this new DataFrame by calling the `info()` method. The output should be similar to the following:

    Caption: Information for the validation dataset
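To make the subtractive mechanics above concrete, here is a small self-contained sketch (not part of the exercise) that runs `KFold` on a ten-row DataFrame and reconstructs the training fold by dropping the validation indices, exactly as done above:

```python
import pandas as pd
from sklearn.model_selection import KFold

toy = pd.DataFrame({'x': range(10)})
kf = KFold(n_splits=5)

# the generator yields (train_indices, val_indices) pairs
train_idx, val_idx = next(kf.split(toy))
print(val_idx)          # the first validation fold, e.g. [0 1]

# subtractive reconstruction: train = all rows minus the validation rows
train_fold = toy.drop(val_idx)
print(len(train_fold))  # 8
```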
Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls
The goal of this exercise is to create a five-fold cross-validation dataset from the data that you imported in Exercise 7.01, Importing and Splitting Data. You will make use of a loop to call the generator repeatedly.
The following steps will help you complete this exercise:
1. Open a new Jupyter notebook and repeat the steps you used to import data in Exercise 7.01, Importing and Splitting Data.

2. Define the number of splits you would like:

    ```python
    from sklearn.model_selection import KFold

    # define number of splits
    n_splits = 5
    ```

    In this step, you set the number of splits to `5` and store this in a variable called `n_splits`.
3. Create an instance of `KFold`:

    ```python
    # create an instance of KFold
    _kf = KFold(n_splits=n_splits)
    ```

    In this step, you create an instance of `KFold` and assign it to a variable called `_kf`.
4. Generate the split indices:

    ```python
    # create splits as _indices
    _indices = _kf.split(df)
    ```

    In this step, you call the `split()` method on `_kf`, the instance of `KFold` that you defined earlier. You provide `df` as a parameter so that the splits are performed on the data contained in the DataFrame called `df`. The resulting generator is stored as `_indices`.
5. Create two Python lists:

    ```python
    _t, _v = [], []
    ```

    In this step, you create two Python lists. The first, called `_t`, holds the training DataFrames, and the second, called `_v`, holds the validation DataFrames.
6. Iterate over the generator, storing the indices in `train_idx` and `val_idx`, and create the DataFrames `_train_df` and `_val_df`:

    ```python
    # iterate over _indices
    for i in range(n_splits):
        train_idx, val_idx = next(_indices)
        _train_df = df.drop(val_idx)
        _t.append(_train_df)
        _val_df = df.drop(train_idx)
        _v.append(_val_df)
    ```

    In this step, you create a loop using `range` to determine the number of iterations, providing `n_splits` as a parameter to `range()`. On every iteration, you execute `next()` on the `_indices` generator and store the results in `train_idx` and `val_idx`. You then create `_train_df` by dropping the validation indices, `val_idx`, from `df`, and `_val_df` by dropping the training indices from `df`.
7. Iterate over the training list:

    ```python
    for d in _t:
        print(d.info())
    ```

    In this step, you verify that the loop created the DataFrames by iterating over the list and using the `.info()` method to print out the details of each element. Each element in the list is a DataFrame with 1,382 entries. The output is similar to the following screenshot:

    Caption: Iterating over the training list

    Note

    The preceding output is a truncated version of the actual output.
8. Iterate over the validation list:

    ```python
    for d in _v:
        print(d.info())
    ```

    In this step, you iterate over the validation list and make use of `.info()` to print out the details of each element. Each element is a DataFrame with 346 entries. The output is similar to the following screenshot, which is truncated due to its size:
Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation
The goal of this exercise is to create a five-fold cross-validation dataset from the data that you imported in Exercise 7.01, Importing and Splitting Data. You will then use `cross_val_score` to get the scores of models trained on those datasets.
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook and repeat steps 1-6 that you took to import data in Exercise 7.01, Importing and Splitting Data.

2. Encode the categorical variables in the dataset:

    ```python
    # encode categorical variables
    _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',
                                      'persons', 'lug_boot',
                                      'safety'])
    _df.head()
    ```

    In this step, you make use of `pd.get_dummies()` to convert the categorical variables into one-hot encodings. You store the result in a new DataFrame variable called `_df` and then take a look at the first five records. The result should look similar to the following:

    Caption: Encoding categorical variables
3. Split the data into features and labels:

    ```python
    # separate features and labels
    features = _df.drop(['car'], axis=1).values
    labels = _df[['car']].values
    ```

    In this step, you create a `features` array by dropping `car` from `_df`, and a `labels` array by selecting only the `car` column. Note that `.values` yields NumPy arrays rather than DataFrames.
4. Create an instance of the `LogisticRegression` class to be used later:

    ```python
    from sklearn.linear_model import LogisticRegression

    # create an instance of LogisticRegression
    _lr = LogisticRegression()
    ```

    In this step, you import `LogisticRegression` from `sklearn.linear_model`. We use `LogisticRegression` because it lets us create a classification model, as you learned in Lab 3, Binary Classification. You then proceed to create an instance and store it as `_lr`.
5. Import the `cross_val_score` function:

    ```python
    from sklearn.model_selection import cross_val_score
    ```

    In this step, you import `cross_val_score`, which you will make use of to compute the scores of the models.
6. Compute the cross-validation scores:

    ```python
    _scores = cross_val_score(_lr, features, labels, cv=5)
    ```

    In this step, you compute the cross-validation scores using `cross_val_score` and store the result in an array called `_scores`. The function requires the following four parameters: the model to make use of (in our case, it's called `_lr`); the features of the dataset; the labels of the dataset; and the number of cross-validation splits to create (five, in our case).
7. Now, display the scores as shown in the following code snippet:

    ```python
    print(_scores)
    ```

    In this step, you display the scores using `print()`. The output should look similar to the following:

    Caption: Printing the cross-validation scores
LogisticRegressionCV
`LogisticRegressionCV` is a class that implements cross-validation inside it. It trains multiple `LogisticRegression` models and returns the best one.
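As a minimal sketch of how this looks in code (assuming the `features` and `labels` arrays from the previous exercise), the class cross-validates over a grid of regularization strengths, `Cs`, and refits on the full data with the best value found:

```python
from sklearn.linear_model import LogisticRegressionCV

# cross-validates over a grid of regularization strengths internally,
# then refits on all of the data using the best value found
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=2000)
clf.fit(features, labels.ravel())
print(clf.C_)  # the regularization strength chosen for each class
```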
Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation
The goal of this exercise is to train a logistic regression model using cross-validation and obtain the optimal accuracy score. We will be making use of the Cars dataset that you worked with previously.
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook.

2. Import the necessary libraries:

    ```python
    # import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    ```

    In this step, you import `pandas` and alias it as `pd`. You will make use of `pandas` to read in the file you will be working with.
3. Create headers for the data:

    ```python
    # data doesn't have headers, so let's create headers
    _headers = ['buying', 'maint', 'doors', 'persons',
                'lug_boot', 'safety', 'car']
    ```

    In this step, you create a Python list to hold the column headers for the file you will be working with. You store this list as `_headers`.
4. Read the data:

    ```python
    # read in cars dataset
    df = pd.read_csv('https://raw.githubusercontent.com/'
                     'fenago/data-science/'
                     'master/Lab07/Dataset/car.data',
                     names=_headers, index_col=None)
    ```

    You then proceed to read in the file and store it as `df`. This is a DataFrame.
5. Inspect the DataFrame:

    ```python
    df.info()
    ```

    You look at the summary of the DataFrame using `.info()`. The output looks similar to the following:

    Caption: Summary information of the DataFrame
6. Encode the categorical variables as shown in the following code snippet:

    ```python
    # encode categorical variables
    _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',
                                      'persons', 'lug_boot',
                                      'safety'])
    _df.head()
    ```

    In this step, you convert categorical variables into encodings using the `get_dummies()` method from pandas. You supply the original DataFrame as a parameter and also specify the columns you would like to encode. Finally, you take a peek at the top five rows. The output looks similar to the following:

    Caption: Encoding categorical variables
7. Split the DataFrame into features and labels:

    ```python
    # separate features and labels
    features = _df.drop(['car'], axis=1).values
    labels = _df[['car']].values
    ```

    In this step, you create two NumPy arrays. The first, called `features`, contains the independent variables. The second, called `labels`, contains the values that the model learns to predict. These are also called targets.
Import logistic regression with cross-validation:
from sklearn.linear_model import LogisticRegressionCVIn this step, you import the
LogisticRegressionCVclass. -
Instantiate
LogisticRegressionCVas shown in the following code snippet:model = LogisticRegressionCV(max_iter=2000, multi_class='auto',\ cv=5)In this step, you create an instance of
LogisticRegressionCV. You specify the following parameters:max_iter: You set this to2000so that the trainer continues training for2000iterations to find better weights.multi_class: You set this toautoso that the model automatically detects that your data has more than two classes.cv: You set this to5, which is the number of cross-validation sets you would like to train on. -
Now fit the model:
model.fit(features, labels.ravel())In this step, you train the model. You pass in
featuresandlabels. Becauselabelsis a 2D array, you make use ofravel()to convert it into a 1D array or vector.The interpreter produces an output similar to the following:
Caption: Fitting the model
    In the preceding output, you see that the model fits the training data. The output shows you the parameters that were used in training, so you are not taken by surprise. Notice, for example, that `max_iter` is `2000`, which is the value that you set. Other parameters you didn't set make use of default values, which you can find out more about from the documentation.
11. Evaluate the training accuracy:

    ```python
    print(model.score(features, labels.ravel()))
    ```

    In this step, we make use of the training dataset to compute an accuracy score. While we didn't set aside a specific validation dataset, note that each model trained during cross-validation saw only 80% of the data; the final model, however, is refit on all of the data, so treat this figure as a training score. The output looks similar to the following:

    Caption: Computing the accuracy score
Hyperparameter Tuning with GridSearchCV
`GridSearchCV` will take a model and parameters and train one model for each permutation of the parameters. At the end of the training, it will provide access to the parameters and the model scores. This is called hyperparameter tuning, and you will be looking at this in much more depth in Lab 8, Hyperparameter Tuning.
The usual practice is to make use of a small training set to find the optimal parameters using hyperparameter tuning and then to train a final model with all of the data.
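As a schematic illustration (not the exercise code), the number of models trained is the product of the grid sizes and the number of cross-validation folds:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# 3 depths x 2 criteria = 6 combinations; with cv=5, 6 * 5 = 30 fits
param_grid = {'max_depth': [1, 2, 3], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, cv=5)
```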
Before the next exercise, let's take a brief look at decision trees, which are a type of model or estimator.
Decision Trees
A decision tree works by generating a separating hyperplane, or threshold, for the features in the data. It does this by considering every feature and measuring how well the spread of values in that feature separates the label that you are trying to predict.
Consider the following data about balloons. The label you need to predict is called `inflated`. This dataset is used for predicting whether the balloon is inflated or deflated given the features. The features are:

- `color`
- `size`
- `act`
- `age`
The following table displays the distribution of features:
Caption: Tabular data for balloon features
Now consider the following charts, which are visualized depending on the spread of the features against the label:
- If you consider the `Color` feature, the values are `PURPLE` and `YELLOW`, but the number of observations is the same, so you can't infer whether the balloon is inflated or not based on the color, as you can see in the following figure:

Caption: Barplot for the color feature
- The `Size` feature has two values: `LARGE` and `SMALL`. These are equally spread, so we can't infer whether the balloon is inflated or not based on the size, as you can see in the following figure:

Caption: Barplot for the size feature
- The `Act` feature has two values: `DIP` and `STRETCH`. You can see from the chart that the majority of the `STRETCH` values are inflated. If you had to make a guess, you could easily say that if `Act` is `STRETCH`, then the balloon is inflated. Consider the following figure:

Caption: Barplot for the act feature
- Finally, the `Age` feature also has two values: `ADULT` and `CHILD`. It's also visible from the chart that the `ADULT` value constitutes the majority of inflated balloons:

Caption: Barplot for the age feature
The two features that are useful to the decision tree are `Act` and `Age`. The tree could start by considering whether `Act` is `STRETCH`. If it is, the prediction will be true. This tree would look like the following figure:

Caption: Decision tree with depth=1

The left side evaluates to the condition being false, and the right side evaluates to the condition being true. This tree has a depth of 1. F means that the prediction is false, and T means that the prediction is true.
To get better results, the decision tree could introduce a second level. The second level would utilize the `Age` feature and evaluate whether the value is `ADULT`. It would look like the following figure:

Caption: Decision tree with depth=2

This tree has a depth of 2. At the first level, it predicts true if `Act` is `STRETCH`. If `Act` is not `STRETCH`, it checks whether `Age` is `ADULT`. If it is, it predicts true; otherwise, it predicts false.
The decision tree can have as many levels as you like but starts to overfit at a certain point. As with everything in data science, the optimal depth depends on the data and is a hyperparameter, meaning you need to try different values to find the optimal one.
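To ground this, here is a hedged sketch that fits a depth-2 `DecisionTreeClassifier` on balloon-style data; the rows below are illustrative assumptions constructed to match the description above, not the original dataset file:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# toy rows mimicking the balloons data described above (assumed values):
# STRETCH rows are all inflated; DIP rows are inflated only for ADULT
data = pd.DataFrame({
    'color': ['YELLOW', 'YELLOW', 'PURPLE', 'PURPLE'] * 3,
    'size':  ['SMALL', 'LARGE'] * 6,
    'act':   ['STRETCH'] * 6 + ['DIP'] * 6,
    'age':   ['ADULT', 'CHILD'] * 6,
    'inflated': ['T'] * 6 + ['T', 'F'] * 3,
})

# one-hot encode the categorical features, as in the exercises above
X = pd.get_dummies(data.drop(['inflated'], axis=1))
y = data['inflated']

# limit the depth to 2 so the tree mirrors the figure described above
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```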
In the following exercise, we will be making use of grid search with cross-validation to find the best parameters for a decision tree estimator.
Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model
The goal of this exercise is to make use of grid search to find the best parameters for a `DecisionTreeClassifier`. We will be making use of the Cars dataset that you worked with previously.
The following steps will help you complete the exercise:
1. Open a Jupyter notebook file.

2. Import `pandas`:

    ```python
    import pandas as pd
    ```

    In this step, you import `pandas` and alias it as `pd`. `pandas` is used to read in the data you will work with subsequently.
3. Create headers:

    ```python
    _headers = ['buying', 'maint', 'doors', 'persons',
                'lug_boot', 'safety', 'car']
    ```

4. Read in the data:

    ```python
    # read in cars dataset
    df = pd.read_csv('https://raw.githubusercontent.com/'
                     'fenago/data-science/'
                     'master/Lab07/Dataset/car.data',
                     names=_headers, index_col=None)
    ```
5. Inspect the DataFrame:

    ```python
    df.info()
    ```

    The output looks similar to the following:

    Caption: Summary information of the DataFrame
6. Encode the categorical variables:

    ```python
    _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',
                                      'persons', 'lug_boot',
                                      'safety'])
    _df.head()
    ```

    In this step, you utilize `.get_dummies()` to convert the categorical variables into encodings. The `.head()` method instructs the Python interpreter to output the top five rows. The output is similar to the following:

    Caption: Encoding categorical variables
7. Separate `features` and `labels`:

    ```python
    features = _df.drop(['car'], axis=1).values
    labels = _df[['car']].values
    ```

    In this step, you create two `numpy` arrays, `features` and `labels`, the first containing independent variables or predictors, and the second containing dependent variables or targets.
8. Import more libraries -- `numpy`, `DecisionTreeClassifier`, and `GridSearchCV`:

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import GridSearchCV
    ```

    In this step, you import `numpy`, a numerical computation library, and alias it as `np`. You also import `DecisionTreeClassifier`, which you use to create decision trees. Finally, you import `GridSearchCV`, which will use cross-validation to train multiple models.
9. Instantiate the decision tree:

    ```python
    clf = DecisionTreeClassifier()
    ```

    In this step, you create an instance of `DecisionTreeClassifier` as `clf`. This instance will be used repeatedly by the grid search.
10. Create the parameters -- `max_depth`:

    ```python
    params = {'max_depth': np.arange(1, 8)}
    ```

    In this step, you create a dictionary of parameters. There are two parts to this dictionary:

    - The key of the dictionary is a parameter that is passed into the model. In this case, `max_depth` is a parameter that `DecisionTreeClassifier` takes.
    - The value is an iterable of values that grid search iterates over and passes to the model. In this case, we create an array that starts at 1 and ends at 7, inclusive.
11. Instantiate the grid search as shown in the following code snippet:

    ```python
    clf_cv = GridSearchCV(clf, param_grid=params, cv=5)
    ```

    In this step, you create an instance of `GridSearchCV`. The first parameter is the model to train. The second parameter is the parameters to search over. The third parameter is the number of cross-validation splits to create.
12. Now train the models:

    ```python
    clf_cv.fit(features, labels)
    ```

    In this step, you train the models using the features and labels. Depending on the type of model, this could take a while. Because we are using a decision tree, it trains quickly. The output is similar to the following:

    Caption: Training the model

    You can learn a lot by reading the output, such as the number of cross-validation datasets created (called `cv` and equal to `5`), the estimator used (`DecisionTreeClassifier`), and the parameter search space (called `param_grid`).
13. Print the best parameter:

    ```python
    print("Tuned Decision Tree Parameters: {}"
          .format(clf_cv.best_params_))
    ```

    In this step, you print out what the best parameter is. In this case, what we were looking for was the best `max_depth`. The output looks like the following:

    Caption: Printing the best parameter

    In the preceding output, you see that the best performing model is one with a `max_depth` of `2`. Accessing `best_params_` lets you train another model with the best-known parameters using a larger training dataset.
14. Print the best score:

    ```python
    print("Best score is {}".format(clf_cv.best_score_))
    ```

    In this step, you print out the accuracy score of the best performing model. The output is similar to the following:

    Best score is 0.7777777777777778

    In the preceding output, you see that the best performing model has an accuracy of `0.778`.
15. Access the best model:

    ```python
    model = clf_cv.best_estimator_
    model
    ```

    In this step, you access the best model (or estimator) using `best_estimator_`. This will let you analyze the model, or optionally use it to make predictions and find other metrics. Instructing the Python interpreter to print the best estimator will yield an output similar to the following:

    Caption: Accessing the model

    In the preceding output, you see that the best model is `DecisionTreeClassifier` with a `max_depth` of `2`.
Hyperparameter Tuning with RandomizedSearchCV
Grid search goes over the entire search space and trains a model or estimator for every combination of parameters. Randomized search goes over only some of the combinations. This is a more optimal use of resources and still provides the benefits of hyperparameter tuning and cross-validation. You will be looking at this in depth in Lab 8, Hyperparameter Tuning.
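As a minimal sketch of the API (the `n_iter` value here is an illustrative assumption; the exercise below relies on the default), you can cap how many parameter combinations get sampled:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {'n_estimators': [500, 1000, 2000],
          'max_depth': np.arange(1, 8)}

# sample only n_iter combinations instead of the full 3 x 7 grid
search = RandomizedSearchCV(RandomForestClassifier(),
                            param_distributions=params,
                            n_iter=5, cv=5, random_state=0)
# search.fit(features, labels.ravel())  # assumes the arrays from earlier
```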
Have a look at the following exercise.
Exercise 7.08: Using Randomized Search for Hyperparameter Tuning
The goal of this exercise is to perform hyperparameter tuning using randomized search and cross-validation.
The following steps will help you complete this exercise:
1. Open a new Jupyter notebook file.

2. Import `pandas`:

    ```python
    import pandas as pd
    ```

    In this step, you import `pandas`. You will make use of it in the next step.
3. Create headers:

    ```python
    _headers = ['buying', 'maint', 'doors', 'persons',
                'lug_boot', 'safety', 'car']
    ```

4. Read in the data:

    ```python
    # read in cars dataset
    df = pd.read_csv('https://raw.githubusercontent.com/'
                     'fenago/data-science/'
                     'master/Lab07/Dataset/car.data',
                     names=_headers, index_col=None)
    ```
5. Inspect the DataFrame:

    ```python
    df.info()
    ```

    You need to provide a Python list of column headers because the data does not contain column headers. You also inspect the DataFrame that you created. The output is similar to the following:

    Caption: Summary information of the DataFrame
6. Encode categorical variables as shown in the following code snippet:

    ```python
    _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',
                                      'persons', 'lug_boot',
                                      'safety'])
    _df.head()
    ```

    In this step, you find a numerical representation of the text data using one-hot encoding. The operation results in a new DataFrame. The resulting data structure looks similar to the following:

    Caption: Encoding categorical variables
7. Separate the data into independent and dependent variables, namely `features` and `labels`:

    ```python
    features = _df.drop(['car'], axis=1).values
    labels = _df[['car']].values
    ```

    In this step, you separate the DataFrame into two `numpy` arrays called `features` and `labels`. `features` contains the independent variables, while `labels` contains the target or dependent variables.
8. Import additional libraries -- `numpy`, `RandomForestClassifier`, and `RandomizedSearchCV`:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV
    ```

    In this step, you import `numpy` for numerical computations, `RandomForestClassifier` to create an ensemble of estimators, and `RandomizedSearchCV` to perform a randomized search with cross-validation.
9. Create an instance of `RandomForestClassifier`:

    ```python
    clf = RandomForestClassifier()
    ```

    In this step, you instantiate `RandomForestClassifier`. A random forest classifier is a voting classifier. It makes use of multiple decision trees, which are trained on different subsets of the data. The results from the trees contribute to the output of the random forest by using a voting mechanism.
10. Specify the parameters:

    ```python
    params = {'n_estimators': [500, 1000, 2000],
              'max_depth': np.arange(1, 8)}
    ```

    `RandomForestClassifier` accepts many parameters, but we specify two: the number of trees in the forest, called `n_estimators`, and the maximum depth of each tree, called `max_depth`.
11. Instantiate a randomized search:

    ```python
    clf_cv = RandomizedSearchCV(clf, param_distributions=params,
                                cv=5)
    ```

    In this step, you specify three parameters when you instantiate `RandomizedSearchCV`: the estimator, or model, to use, which is the random forest classifier; `param_distributions`, the parameter search space; and `cv`, the number of cross-validation datasets to create.
12. Perform the search:

    ```python
    clf_cv.fit(features, labels.ravel())
    ```

    In this step, you perform the search by calling `fit()`. This operation trains different models using the cross-validation datasets and various combinations of the hyperparameters. The output from this operation is similar to the following:

    Caption: Output of the search operation

    In the preceding output, you see that the randomized search will be carried out using cross-validation with five splits (`cv=5`). The estimator to be used is `RandomForestClassifier`.
13. Print the best parameter combination:

    ```python
    print("Tuned Random Forest Parameters: {}"
          .format(clf_cv.best_params_))
    ```

    In this step, you print out the best hyperparameters. The output is similar to the following:

    Caption: Printing the best parameter combination

    In the preceding output, you see that the best estimator is a random forest classifier with 1,000 trees (`n_estimators=1000`) and `max_depth=5`. You can print the best score by executing `print("Best score is {}".format(clf_cv.best_score_))`. For this exercise, this value is ~`0.76`.
14. Inspect the best model:

    ```python
    model = clf_cv.best_estimator_
    model
    ```

    In this step, you find the best performing estimator (or model) and print out its details. The output is similar to the following:

    Caption: Inspecting the model

    In the preceding output, you see that the best estimator is `RandomForestClassifier` with `n_estimators=1000` and `max_depth=5`.
Exercise 7.09: Fixing Model Overfitting Using Lasso Regression
The goal of this exercise is to teach you how to identify when your model starts overfitting, and to use lasso regression to fix overfitting in your model.
This exercise uses the Combined Cycle Power Plant dataset. The attribute information states: "Features consist of hourly average ambient variables:

- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (EP) in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization."
The following steps will help you complete the exercise:
1. Open a Jupyter notebook.

2. Import the required libraries:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Lasso
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
    ```
3. Read in the data:

    ```python
    _df = pd.read_csv('https://raw.githubusercontent.com/'
                      'fenago/data-science/'
                      'master/Lab07/Dataset/ccpp.csv')
    ```
4. Inspect the DataFrame:

    ```python
    _df.info()
    ```

    The `.info()` method prints out a summary of the DataFrame, including the names of the columns and the number of records. The output might be similar to the following:

    Caption: Inspecting the dataframe

    You can see from the preceding figure that the DataFrame has 5 columns and 9,568 records. All columns contain numeric data, and the columns have the following names: `AT`, `V`, `AP`, `RH`, and `PE`.
5. Extract the features into a variable called `X`:

    ```python
    X = _df.drop(['PE'], axis=1).values
    ```

6. Extract the labels into a variable called `y`:

    ```python
    y = _df['PE'].values
    ```
7. Split the data into training and evaluation sets:

    ```python
    train_X, eval_X, train_y, eval_y = train_test_split(
        X, y, train_size=0.8, random_state=0)
    ```
8. Create an instance of a `LinearRegression` model:

    ```python
    lr_model_1 = LinearRegression()
    ```
9. Fit the model on the training data:

    ```python
    lr_model_1.fit(train_X, train_y)
    ```

    The output from this step should look similar to the following:

    Caption: Fitting the model on training data
10. Use the model to make predictions on the evaluation dataset:

    ```python
    lr_model_1_preds = lr_model_1.predict(eval_X)
    ```
11. Print out the R2 score of the model:

    ```python
    print('lr_model_1 R2 Score: {}'
          .format(lr_model_1.score(eval_X, eval_y)))
    ```

    The output of this step should look similar to the following:

    Caption: Printing the R2 score

    You will notice that the R2 score for this model is `0.933`. You will make use of this figure to compare it with the next model you train. Recall that this is an evaluation metric.
12. Print out the mean squared error (MSE) of this model:

    ```python
    print('lr_model_1 MSE: {}'
          .format(mean_squared_error(eval_y, lr_model_1_preds)))
    ```

    The output of this step should look similar to the following:

    Caption: Printing the MSE

    You will notice that the MSE is `19.73`. This is an evaluation metric that you will use to compare this model to subsequent models.

    The first model was trained on the four raw features. You will now train a new model on polynomial features of degree 3.
13. Create a list of tuples to serve as the steps of a pipeline:

    ```python
    steps = [('scaler', MinMaxScaler()),
             ('poly', PolynomialFeatures(degree=3)),
             ('lr', LinearRegression())]
    ```

    In this step, you create a list with three tuples. The first tuple represents a scaling operation that makes use of `MinMaxScaler`. The second tuple represents a feature engineering step and makes use of `PolynomialFeatures`. The third tuple represents a `LinearRegression` model. The first element of each tuple is the name of the step, while the second element is the class that performs a transformation or an estimator.
14. Create an instance of a pipeline:

    ```python
    lr_model_2 = Pipeline(steps)
    ```
15. Train the instance of the pipeline:

    ```python
    lr_model_2.fit(train_X, train_y)
    ```

    The pipeline implements a `.fit()` method, which is also implemented in all instances of transformers and estimators. The `.fit()` method causes `.fit_transform()` to be called on transformers and `.fit()` to be called on estimators. The output of this step is similar to the following:

    Caption: Training the instance of the pipeline

    You can see from the output that a pipeline was trained. The first steps are made up of `MinMaxScaler` and `PolynomialFeatures`, and the final step is made up of `LinearRegression`.
16. Print out the R2 score of the model:

    ```python
    print('lr_model_2 R2 Score: {}'
          .format(lr_model_2.score(eval_X, eval_y)))
    ```

    The output is similar to the following:

    Caption: The R2 score of the model

    You can see from the preceding output that the R2 score is `0.944`, which is better than the R2 score of the first model, which was `0.933`. You can start to observe that the metrics suggest that this model is better than the first one.
17. Use the model to predict on the evaluation data:

    ```python
    lr_model_2_preds = lr_model_2.predict(eval_X)
    ```
18. Print the MSE of the second model:

    ```python
    print('lr_model_2 MSE: {}'
          .format(mean_squared_error(eval_y, lr_model_2_preds)))
    ```

    The output is similar to the following:

    Caption: The MSE of the second model

    You can see from the output that the MSE of the second model is `16.27`. This is less than the MSE of the first model, which was `19.73`. You can safely conclude that the second model is better than the first.
19. Inspect the model coefficients (also called weights):

    ```python
    print(lr_model_2[-1].coef_)
    ```

    Note that `lr_model_2` is a pipeline. The final object in this pipeline is the model, so you make use of list addressing to access it by setting the index to `-1`. Once you have the model, you make use of `.coef_` to get the model coefficients. The output is similar to the following:

    Caption: Printing the model coefficients

    You will note from the preceding output that the majority of the values are in the tens, some values are in the hundreds, and one value has a really small magnitude.
20. Check the number of coefficients in this model:

    ```python
    print(len(lr_model_2[-1].coef_))
    ```

    The output for this step is similar to the following:

    35

    You can see that the second model has `35` coefficients.
21. Create a `steps` list with `PolynomialFeatures` of degree `10`:

    ```python
    steps = [('scaler', MinMaxScaler()),
             ('poly', PolynomialFeatures(degree=10)),
             ('lr', LinearRegression())]
    ```

22. Create a third model from the preceding steps:

    ```python
    lr_model_3 = Pipeline(steps)
    ```
23. Fit the third model on the training data:

    ```python
    lr_model_3.fit(train_X, train_y)
    ```

    The output from this step is similar to the following:

    Caption: Fitting the third model on the data

    You can see from the output that the pipeline makes use of `PolynomialFeatures` of degree `10`. You are doing this in the hope of getting a better model.
24. Print out the R2 score of this model:

    ```python
    print('lr_model_3 R2 Score: {}'
          .format(lr_model_3.score(eval_X, eval_y)))
    ```

    The output of this step is similar to the following:

    Caption: R2 score of the model

    You can see from the preceding figure that the R2 score is now `0.56`. The previous model had an R2 score of `0.944`. This model's R2 score is considerably worse than that of the previous model, `lr_model_2`. This happens when your model is overfitting.
25. Use `lr_model_3` to predict on the evaluation data:

    ```python
    lr_model_3_preds = lr_model_3.predict(eval_X)
    ```
26. Print out the MSE for `lr_model_3`:

    ```python
    print('lr_model_3 MSE: {}'
          .format(mean_squared_error(eval_y, lr_model_3_preds)))
    ```

    The output for this step might be similar to the following:

    Caption: The MSE of the model

    You can see from the preceding figure that the MSE is also considerably worse. The MSE is `126.25`, as compared to `16.27` for the previous model.
27. Print out the number of coefficients (also called weights) in this model:

    ```python
    print(len(lr_model_3[-1].coef_))
    ```

    The output might resemble the following:

    Caption: Printing the number of coefficients

    You can see that the model has 1,001 coefficients.
28. Inspect the first 35 coefficients to get a sense of their individual magnitudes:

    ```python
    print(lr_model_3[-1].coef_[:35])
    ```

    The output might be similar to the following:

    Caption: Inspecting the first 35 coefficients

    You can see from the output that the coefficients have significantly larger magnitudes than the coefficients from `lr_model_2`. In the next steps, you will train a lasso regression model on the same set of features to reduce overfitting.
29. Create a list of steps for the pipeline you will create later on:

    ```python
    steps = [('scaler', MinMaxScaler()),
             ('poly', PolynomialFeatures(degree=10)),
             ('lr', Lasso(alpha=0.01))]
    ```

    Note that the third step in this list is an instance of `Lasso`. The parameter called `alpha` in the call to `Lasso()` is the regularization parameter. You can play around with values between 0 and 1 to see how it affects the performance of the model that you train.
30. Create an instance of a pipeline:

    ```python
    lasso_model = Pipeline(steps)
    ```
31. Fit the pipeline on the training data:

    ```python
    lasso_model.fit(train_X, train_y)
    ```

    The output from this operation might be similar to the following:

    Caption: Fitting the pipeline on the training data

    You can see from the output that the pipeline trained a lasso model in the final step. The regularization parameter was `0.01`, and the model trained for a maximum of 1,000 iterations.
32. Print the R2 score of `lasso_model`:

    ```python
    print('lasso_model R2 Score: {}'
          .format(lasso_model.score(eval_X, eval_y)))
    ```

    The output of this step might be similar to the following:

    Caption: R2 score

    You can see that the R2 score has climbed back up to `0.94`, which is considerably better than the score of `0.56` that `lr_model_3` had. This is already looking like a better model.
33. Use `lasso_model` to predict on the evaluation data:

    ```python
    lasso_preds = lasso_model.predict(eval_X)
    ```
34. Print the MSE of `lasso_model`:

    ```python
    print('lasso_model MSE: {}'
          .format(mean_squared_error(eval_y, lasso_preds)))
    ```

    The output might be similar to the following:

    Caption: MSE of the lasso model

    You can see from the output that the MSE is `17.01`, which is far lower than the MSE value of `126.25` that `lr_model_3` had. You can safely conclude that this is a much better model.
35. Print out the number of coefficients in `lasso_model`:

    ```python
    print(len(lasso_model[-1].coef_))
    ```

    The output might be similar to the following:

    1001

    You can see that this model has 1,001 coefficients, which is the same number of coefficients that `lr_model_3` had.
36. Print out the values of the first 35 coefficients:

    ```python
    print(lasso_model[-1].coef_[:35])
    ```

    The output might be similar to the following:

    Caption: Printing the values of 35 coefficients

    You can see from the preceding output that some of the coefficients are set to 0. This has the effect of ignoring the corresponding column of data in the input. You can also see that the remaining coefficients have magnitudes of less than 100. This goes to show that the model is no longer overfitting.
This exercise taught you how to fix overfitting by using lasso regression to train a new model.
In the next section, you will learn about using ridge regression to solve overfitting in a model.
Ridge Regression
You just learned about lasso regression, which introduces a penalty and tries to eliminate certain features from the data. Ridge regression takes an alternative approach by introducing a penalty that penalizes large weights. As a result, the optimization process tries to reduce the magnitude of the coefficients without completely eliminating them.
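As a quick reference (standard textbook definitions, not taken from the lab itself), the two penalized objectives differ only in how the coefficient magnitudes enter the penalty term:

```
Lasso:  minimize  ||y - Xw||^2 + alpha * sum_i |w_i|
Ridge:  minimize  ||y - Xw||^2 + alpha * sum_i (w_i)^2
```

The absolute-value penalty can drive coefficients exactly to zero (which is why some of the lasso coefficients printed earlier were 0), while the squared penalty only shrinks them toward zero.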
Exercise 7.10: Fixing Model Overfitting Using Ridge Regression
The goal of this exercise is to teach you how to identify when your model starts overfitting, and to use ridge regression to fix overfitting in your model.
Note
You will be using the same dataset as in Exercise 7.09, Fixing Model Overfitting Using Lasso Regression.
The following steps will help you complete the exercise:
1. Open a Jupyter notebook.

2. Import the required libraries:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
    ```
3. Read in the data:

    ```python
    _df = pd.read_csv('https://raw.githubusercontent.com/'
                      'fenago/data-science/'
                      'master/Lab07/Dataset/ccpp.csv')
    ```
4. Inspect the DataFrame:

    ```python
    _df.info()
    ```

    The `.info()` method prints out a summary of the DataFrame, including the names of the columns and the number of records. The output might be similar to the following:

    Caption: Inspecting the dataframe

    You can see from the preceding figure that the DataFrame has 5 columns and 9,568 records. All columns contain numeric data, and the columns have the names `AT`, `V`, `AP`, `RH`, and `PE`.
5. Extract the features into a variable called `X`:

    ```python
    X = _df.drop(['PE'], axis=1).values
    ```

6. Extract the labels into a variable called `y`:

    ```python
    y = _df['PE'].values
    ```
7. Split the data into training and evaluation sets:

    ```python
    train_X, eval_X, train_y, eval_y = train_test_split(
        X, y, train_size=0.8, random_state=0)
    ```
8. Create an instance of a `LinearRegression` model:

    ```python
    lr_model_1 = LinearRegression()
    ```
9. Fit the model on the training data:

    ```python
    lr_model_1.fit(train_X, train_y)
    ```

    The output from this step should look similar to the following:

    Caption: Fitting the model on data
10. Use the model to make predictions on the evaluation dataset:

    ```python
    lr_model_1_preds = lr_model_1.predict(eval_X)
    ```
11. Print out the R2 score of the model:

    ```python
    print('lr_model_1 R2 Score: {}'
          .format(lr_model_1.score(eval_X, eval_y)))
    ```

    The output of this step should look similar to the following:

    Caption: R2 score

    You will notice that the R2 score for this model is `0.933`. You will make use of this figure to compare it with the next model you train. Recall that this is an evaluation metric.
12. Print out the MSE of this model:

    ```python
    print('lr_model_1 MSE: {}'
          .format(mean_squared_error(eval_y, lr_model_1_preds)))
    ```

    The output of this step should look similar to the following:

    Caption: The MSE of the model

    You will notice that the MSE is `19.734`. This is an evaluation metric that you will use to compare this model to subsequent models.

    The first model was trained on the four raw features. You will now train a new model on polynomial features of degree 3.
13. Create a list of tuples to serve as the steps of a pipeline:

    ```python
    steps = [('scaler', MinMaxScaler()),
             ('poly', PolynomialFeatures(degree=3)),
             ('lr', LinearRegression())]
    ```

    In this step, you create a list with three tuples. The first tuple represents a scaling operation that makes use of `MinMaxScaler`. The second tuple represents a feature engineering step and makes use of `PolynomialFeatures`. The third tuple represents a `LinearRegression` model. The first element of each tuple is the name of the step, while the second element is the class that performs a transformation or an estimation.
14. Create an instance of a pipeline:

    ```python
    lr_model_2 = Pipeline(steps)
    ```
15. Train the instance of the pipeline:

    ```python
    lr_model_2.fit(train_X, train_y)
    ```

    The pipeline implements a `.fit()` method, which is also implemented in all instances of transformers and estimators. The `.fit()` method causes `.fit_transform()` to be called on transformers and `.fit()` to be called on estimators. The output of this step is similar to the following:

    Caption: Training the instance of a pipeline

    You can see from the output that a pipeline was trained. The first steps are made up of `MinMaxScaler` and `PolynomialFeatures`, and the final step is made up of `LinearRegression`.
16. Print out the R2 score of the model:

    ```python
    print('lr_model_2 R2 Score: {}'
          .format(lr_model_2.score(eval_X, eval_y)))
    ```

    The output is similar to the following:

    Caption: R2 score

    You can see from the preceding output that the R2 score is `0.944`, which is better than the R2 score of the first model, which was `0.933`. You can start to observe that the metrics suggest that this model is better than the first one.
17. Use the model to predict on the evaluation data:

    ```python
    lr_model_2_preds = lr_model_2.predict(eval_X)
    ```
18. Print the MSE of the second model:

    ```python
    print('lr_model_2 MSE: {}'
          .format(mean_squared_error(eval_y, lr_model_2_preds)))
    ```

    The output is similar to the following:

    Caption: The MSE of the model

    You can see from the output that the MSE of the second model is `16.272`. This is less than the MSE of the first model, which was `19.734`. You can safely conclude that the second model is better than the first.
19. Inspect the model coefficients (also called weights):

    ```python
    print(lr_model_2[-1].coef_)
    ```

    Note that `lr_model_2` is a pipeline. The final object in this pipeline is the model, so you make use of list addressing to access it by setting the index to `-1`. Once you have the model, you make use of `.coef_` to get the model coefficients. The output is similar to the following:

    Caption: Printing model coefficients

    You will note from the preceding output that the majority of the values are in the tens, some values are in the hundreds, and one value has a really small magnitude.
20. Check the number of coefficients in this model:

    ```python
    print(len(lr_model_2[-1].coef_))
    ```

    The output of this step is similar to the following:

    Caption: Checking the number of coefficients

    You will see from the preceding output that the second model has 35 coefficients.
21. Create a `steps` list with `PolynomialFeatures` of degree `10`:

    ```python
    steps = [('scaler', MinMaxScaler()),
             ('poly', PolynomialFeatures(degree=10)),
             ('lr', LinearRegression())]
    ```

22. Create a third model from the preceding steps:

    ```python
    lr_model_3 = Pipeline(steps)
    ```
23. Fit the third model on the training data:

    ```python
    lr_model_3.fit(train_X, train_y)
    ```

    The output from this step is similar to the following:

    Caption: Fitting lr_model_3 on the training data

    You can see from the output that the pipeline makes use of `PolynomialFeatures` of degree `10`. You are doing this in the hope of getting a better model.
24. Print out the R2 score of this model:

    ```python
    print('lr_model_3 R2 Score: {}'
          .format(lr_model_3.score(eval_X, eval_y)))
    ```

    The output of this step is similar to the following:

    Caption: R2 score

    You can see from the preceding figure that the R2 score is now `0.568`. The previous model had an R2 score of `0.944`. This model's R2 score is worse than that of the previous model, `lr_model_2`. This happens when your model is overfitting.
25. Use `lr_model_3` to predict on the evaluation data:

    ```python
    lr_model_3_preds = lr_model_3.predict(eval_X)
    ```
26. Print out the MSE for `lr_model_3`:

    ```python
    print('lr_model_3 MSE: {}'
          .format(mean_squared_error(eval_y, lr_model_3_preds)))
    ```

    The output of this step might be similar to the following:

    Caption: The MSE of lr_model_3

    You can see from the preceding figure that the MSE is also worse. The MSE is `126.254`, as compared to `16.272` for the previous model.
27. Print out the number of coefficients (also called weights) in this model:

    ```python
    print(len(lr_model_3[-1].coef_))
    ```

    The output might resemble the following:

    1001

    You can see that the model has `1,001` coefficients.

28. Inspect the first `35` coefficients to get a sense of their individual magnitudes:

    ```python
    print(lr_model_3[-1].coef_[:35])
    ```

    The output might be similar to the following:

    Caption: Inspecting 35 coefficients

    You can see from the output that the coefficients have significantly larger magnitudes than the coefficients from `lr_model_2`. In the next steps, you will train a ridge regression model on the same set of features to reduce overfitting.
29. Create a list of steps for the pipeline you will create later on:

    ```python
    steps = [('scaler', MinMaxScaler()),
             ('poly', PolynomialFeatures(degree=10)),
             ('lr', Ridge(alpha=0.9))]
    ```

    Note that the third step in this list is an instance of `Ridge`. The parameter called `alpha` in the call to `Ridge()` is the regularization parameter. You can play around with values between 0 and 1 to see how it affects the performance of the model that you train.
30. Create an instance of a pipeline:

    ```python
    ridge_model = Pipeline(steps)
    ```
31. Fit the pipeline on the training data:

    ```python
    ridge_model.fit(train_X, train_y)
    ```

    The output of this operation might be similar to the following:

    Caption: Fitting the pipeline on training data

    You can see from the output that the pipeline trained a ridge model in the final step, with a regularization parameter of `0.9`.
32. Print the R2 score of `ridge_model`:

    ```python
    print('ridge_model R2 Score: {}'
          .format(ridge_model.score(eval_X, eval_y)))
    ```

    The output of this step might be similar to the following:

    Caption: R2 score

    You can see that the R2 score has climbed back up to `0.945`, which is far better than the score of `0.568` that `lr_model_3` had. This is already looking like a better model.
33. Use `ridge_model` to predict on the evaluation data:

    ```python
    ridge_model_preds = ridge_model.predict(eval_X)
    ```
34. Print the MSE of `ridge_model`:

    ```python
    print('ridge_model MSE: {}'
          .format(mean_squared_error(eval_y, ridge_model_preds)))
    ```

    The output might be similar to the following:

    Caption: The MSE of ridge_model

    You can see from the output that the MSE is `16.030`, which is lower than the MSE value of `126.254` that `lr_model_3` had. You can safely conclude that this is a much better model.
35. Print out the number of coefficients in `ridge_model`:

    ```python
    print(len(ridge_model[-1].coef_))
    ```

    The output might be similar to the following:

    Caption: The number of coefficients in the ridge model

    You can see that this model has `1001` coefficients, which is the same number of coefficients that `lr_model_3` had.
36. Print out the values of the first 35 coefficients:

    ```python
    print(ridge_model[-1].coef_[:35])
    ```

    The output might be similar to the following:

    Caption: The values of the first 35 coefficients

    Unlike lasso, ridge shrinks the coefficients toward zero without setting them exactly to zero.

This exercise taught you how to fix overfitting by using ridge regression to train a new model.
Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors
You work as a data scientist for a cable manufacturer. Management has decided to start shipping low-resistance cables to clients around the world. To ensure that the right cables are shipped to the right countries, they would like to predict the critical temperatures of various cables based on certain observed readings.
In this activity, you will train a linear regression model and compute the R2 score and the MSE. You will proceed to engineer new features using polynomial features of degree 3. You will compare the R2 score and MSE of this new model to those of the first model to determine overfitting. You will then use regularization to train a model that generalizes to previously unseen data.
The steps to accomplish this task are as follows; a brief solution sketch appears after the list:
1. Open a Jupyter notebook.

2. Load the necessary libraries.

3. Read in the data from the `superconduct` folder.

4. Prepare the `X` and `y` variables.

5. Split the data into training and evaluation sets.

6. Create a baseline linear regression model.

7. Print out the R2 score and MSE of the model.

8. Create a pipeline to engineer polynomial features and train a linear regression model.

9. Print out the R2 score and MSE.

10. Determine that this new model is overfitting.

11. Create a pipeline to engineer polynomial features and train a ridge or lasso model.

12. Print out the R2 score and MSE.
The output will be as follows:
Caption: The R2 score and MSE of the ridge model
13. Determine that this model is no longer overfitting. This is the model to put into production.
The coefficients for the ridge model are as shown in the following figure:
Caption: The coefficients for the ridge model
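Here is a minimal solution sketch following the steps above; the file name `superconduct/train.csv` and the target column `critical_temp` are assumptions about the dataset layout, so adjust them to match your copy:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# assumed file name and target column -- adjust to your dataset
df = pd.read_csv('superconduct/train.csv')
X = df.drop(['critical_temp'], axis=1).values
y = df['critical_temp'].values

train_X, eval_X, train_y, eval_y = train_test_split(
    X, y, train_size=0.8, random_state=0)

def report(name, model):
    # fit the model, then print the evaluation R2 score and MSE
    model.fit(train_X, train_y)
    preds = model.predict(eval_X)
    print(name, 'R2:', model.score(eval_X, eval_y),
          'MSE:', mean_squared_error(eval_y, preds))

report('baseline', LinearRegression())
report('poly', Pipeline([('scaler', MinMaxScaler()),
                         ('poly', PolynomialFeatures(degree=3)),
                         ('lr', LinearRegression())]))
report('ridge', Pipeline([('scaler', MinMaxScaler()),
                          ('poly', PolynomialFeatures(degree=3)),
                          ('lr', Ridge(alpha=0.9))]))
```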
Summary
In this lab, we studied the importance of withholding some of the available data to evaluate models. We also learned how to make use of all of the available data with a technique called cross-validation to find the best performing model from a set of models you are training. We also made use of evaluation metrics to determine when a model starts to overfit and made use of ridge and lasso regression to fix a model that is overfitting.
In the next lab, we will go into hyperparameter tuning in depth. You will learn about various techniques for finding the best hyperparameters to train your models.