Lab 6. How to Assess Performance

Overview

This lab will introduce you to model evaluation, where you evaluate or assess the performance of each model that you train before you decide to put it into production. By the end of this lab, you will be able to create an evaluation dataset. You will be equipped to assess the performance of linear regression models using mean absolute error (MAE) and mean squared error (MSE). You will also be able to evaluate the performance of logistic regression models using accuracy, precision, recall, and F1 score.

Exercise 6.01: Importing and Splitting Data

In this exercise, you will import data from a repository and split it into a training and an evaluation set to train a model. Splitting your data is required so that you can evaluate the model later. This exercise will get you familiar with the process of splitting data; this is something you will be doing frequently.

Note: The Car dataset that you will be using in this lab was taken from the UCI Machine Learning Repository.

The following steps will help you complete the exercise:

  1. Open a new Jupyter notebook.

  2. Import the required libraries:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
  3. Create a Python list:

    # data doesn't have headers, so let's create headers
    _headers = ['buying', 'maint', 'doors', 'persons', \
                'lug_boot', 'safety', 'car']
    

    The data that you are reading in is stored as a CSV file. If you download the file and open it in a text editor, you will see the raw comma-separated values.

    CSV files normally have the name of each column written in the first row of the data, but this dataset's CSV file, which you used in *Lab 3, Binary Classification*, does not. That is why you created the `_headers` list in the previous step.

  4. Read the data:

    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/car.data', \
                     names=_headers, index_col=None)
    
  5. Print out the top five records:

    df.head()
    

    The code in this step is used to print the top five rows of the DataFrame. The output from that operation is shown in the following screenshot:

Caption: The top five rows of the DataFrame
  6. Create a training and an evaluation DataFrame:

    training, evaluation = train_test_split(df, test_size=0.3, \
                                            random_state=0)
    

    Note

    The third parameter random_state is set to 0 to ensure reproducibility of results.

  7. Create a validation and test dataset:

    validation, test = train_test_split(evaluation, test_size=0.5, \
                                        random_state=0)
    

    This code is similar to the code in Step 6. In this step, the code splits our evaluation data into two equal parts because we specified 0.5, which means 50%.
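
The two splits above leave 70% of the rows for training and 15% each for validation and testing. A minimal standalone sketch (using a toy DataFrame in place of the car dataset) confirms the proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy DataFrame with 100 rows standing in for the car dataset
df = pd.DataFrame({'feature': range(100), 'car': ['acc'] * 100})

# hold out 30% of the data for evaluation
training, evaluation = train_test_split(df, test_size=0.3, random_state=0)
# split the held-out 30% in half: 15% validation, 15% test
validation, test = train_test_split(evaluation, test_size=0.5, random_state=0)

print(len(training), len(validation), len(test))  # → 70 15 15
```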

Data Structures -- Vectors and Matrices

In this section, we will look at different data structures, as follows.

Scalars

A scalar is a single numerical value. You assign scalars to variables, such as in the following expression:

temperature = 23

If you had to store the temperature for 5 days, you would need to store the values in 5 different variables, such as in the following code snippet:

temp_1 = 23
temp_2 = 24
temp_3 = 23
temp_4 = 22
temp_5 = 22

Vectors

Consider the following code snippet for creating a Python list:

temps_list = [23, 24, 23, 22, 22]

You can create a vector from the list using the array() function from numpy by first importing numpy and then using the following snippet:

import numpy as np
temps_ndarray = np.array(temps_list)

You can proceed to verify the data type using the following code snippet:

print(type(temps_ndarray))

The code snippet will cause the interpreter to print out the following:

Caption: The temps_ndarray vector data type

You may inspect the contents of the vector using the following code snippet:

print(temps_ndarray)

This generates the following output:

Caption: The temps_ndarray vector

You can compare this with the original Python list by printing it:

print(temps_list)

The code snippet yields the following output:

Caption: List of elements in temps_list

Vectors have a shape and a dimension. Both of these can be determined by using the following code snippet:

print(temps_ndarray.shape)

The output is a Python data structure called a tuple and looks like this:

Caption: Shape of the temps_ndarray vector

Matrices

To convert temps_ndarray into a matrix with five rows and one column, you would use the following snippet:

temps_matrix = temps_ndarray.reshape(-1, 1)

To see the new shape, use the following snippet:

print(temps_matrix.shape)

You will get the following output:

Caption: Shape of the matrix

You can print out the value of the matrix using the following snippet:

print(temps_matrix)

The output of the code is as follows:

Caption: Elements of the matrix

You may reshape the matrix to contain 1 row and 5 columns and print out the value using the following code snippet:

print(temps_matrix.reshape(1,5))

The output will be as follows:

Caption: Reshaping the matrix

Finally, you can convert the matrix back into a vector by dropping the column using the following snippet:

vector = temps_matrix.reshape(-1)

You can print out the value of the vector to confirm the conversion.
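
The whole vector-to-matrix round trip can be summarized in a short, self-contained sketch:

```python
import numpy as np

temps_ndarray = np.array([23, 24, 23, 22, 22])
print(temps_ndarray.shape)   # → (5,), a vector

# five rows, one column
temps_matrix = temps_ndarray.reshape(-1, 1)
print(temps_matrix.shape)    # → (5, 1)

# one row, five columns
wide = temps_matrix.reshape(1, 5)
print(wide.shape)            # → (1, 5)

# drop the column to recover the original vector
vector = temps_matrix.reshape(-1)
print(np.array_equal(vector, temps_ndarray))  # → True
```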

Exercise 6.02: Computing the R² Score of a Linear Regression Model

As mentioned in the preceding sections, the R² score is an important factor in evaluating the performance of a model. Thus, in this exercise, we will be creating a linear regression model and then calculating the R² score for it.

The following attributes are useful for our task:

  • CIC0: information indices
  • SM1_Dz(Z): 2D matrix-based descriptors
  • GATS1i: 2D autocorrelations
  • NdsCH: atom-type counts
  • NdssC: atom-type counts
  • MLOGP: molecular properties
  • Quantitative response, LC50 [-LOG(mol/L)]: This attribute represents the concentration that causes death in 50% of test fish (Pimephales promelas, the fathead minnow) over a test duration of 96 hours.

The following steps will help you to complete the exercise:

  1. Open a new Jupyter notebook to write and execute your code.

  2. Next, import the libraries mentioned in the following code snippet:

    # import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    

    In this step, you import pandas, which you will use to read your data. You also import train_test_split(), which you will use to split your data into training and validation sets, and you import LinearRegression, which you will use to train your model.

  3. Now, read the data from the dataset:

    # column headers
    _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
                'MLOGP', 'response']
    # read in data
    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/'\
                     'qsar_fish_toxicity.csv', \
                     names=_headers, sep=';')
    

    In this step, you create a Python list to hold the names of the columns in your data. You do this because the CSV file containing the data does not have a first row that contains the column headers. You proceed to read in the file and store it in a variable called df using the read_csv() method in pandas. You specify the list containing column headers by passing it into the names parameter. This CSV uses semi-colons as column separators, so you specify that using the sep parameter. You can use df.head() to see what the DataFrame looks like:

Caption: The first five rows of the DataFrame
  4. Split the data into features and labels and into training and evaluation datasets:

    # Let's split our data
    features = df.drop('response', axis=1).values
    labels = df[['response']].values
    X_train, X_eval, y_train, y_eval = train_test_split\
                                       (features, labels, \
                                        test_size=0.2, \
                                        random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
                                                    random_state=0)
    

    In this step, you create two numpy arrays called features and labels. You then proceed to split them twice. The first split produces a training set and an evaluation set. The second split creates a validation set and a test set.

  5. Create a linear regression model:

    model = LinearRegression()
    

    In this step, you create an instance of LinearRegression and store it in a variable called model. You will make use of this to train on the training dataset.

  6. Train the model:

    model.fit(X_train, y_train)
    

    You should get an output similar to the following:

Caption: Training the model
  7. Make a prediction, as shown in the following code snippet:

    y_pred = model.predict(X_val)
    

    In this step, you make use of the validation dataset to make a prediction. This is stored in y_pred.

  8. Compute the R² score:

    r2 = model.score(X_val, y_val)
    print('R^2 score: {}'.format(r2))
    

    In this step, you compute r2, which is the R² score of the model. The R² score is computed using the score() method of the model. The next line causes the interpreter to print out the R² score.

    The output is similar to the following:

Caption: R² score
  9. You see that the R² score we achieved is 0.56238, which is not close to 1. In the next step, we will be making comparisons.

  10. Compare the predictions to the actual ground truth:

    _ys = pd.DataFrame(dict(actuals=y_val.reshape(-1), \
                            predicted=y_pred.reshape(-1)))
    _ys.head()
    

    The output looks similar to the following:
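
As a cross-check, the value that score() returns for a regressor is the same R² that r2_score from sklearn.metrics computes. A minimal sketch on toy data (not the fish toxicity dataset) illustrates this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# toy regression data standing in for the fish toxicity features
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

model = LinearRegression().fit(X, y)
# score() on a regressor returns the same value as r2_score()
print(model.score(X, y))
print(r2_score(y, model.predict(X)))
```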

Exercise 6.03: Computing the MAE of a Model

The goal of this exercise is to find the score and loss of a model using the same dataset as Exercise 6.02, Computing the R² Score of a Linear Regression Model.

In this exercise, we will be calculating the MAE of a model.

The following steps will help you with this exercise:

  1. Open a new Jupyter notebook file.

  2. Import the necessary libraries:

    # Import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    

    In this step, you import the function called mean_absolute_error from sklearn.metrics.

  3. Import the data:

    # column headers
    _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
                'MLOGP', 'response']
    # read in data
    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/'\
                     'qsar_fish_toxicity.csv', \
                     names=_headers, sep=';')
    

    In the preceding code, you read in your data. This data is hosted online and contains some information about fish toxicity. The data is stored as a CSV but does not contain any headers. Also, the columns in this file are not separated by a comma, but rather by a semi-colon. The Python list called _headers contains the names of the column headers.

  4. Split the data into features and labels and into training and evaluation sets:

    # Let's split our data
    features = df.drop('response', axis=1).values
    labels = df[['response']].values
    X_train, X_eval, y_train, y_eval = train_test_split\
                                       (features, labels, \
                                        test_size=0.2, \
                                        random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
                                                    random_state=0)
    
  5. Create a simple linear regression model and train it:

    # create a simple Linear Regression model
    model = LinearRegression()
    # train the model
    model.fit(X_train, y_train)
    

    In this step, you make use of your training data to train a model. In the first line, you create an instance of LinearRegression, which you call model. In the second line, you train the model using X_train and y_train. X_train contains the features, while y_train contains the labels.

  6. Now predict the values of our validation dataset:

    # let's use our model to predict on our validation dataset
    y_pred = model.predict(X_val)
    
  7. Compute the MAE:

    # Let's compute our MEAN ABSOLUTE ERROR
    mae = mean_absolute_error(y_val, y_pred)
    print('MAE: {}'.format(mae))
    

    In this step, you compute the MAE of the model by using the mean_absolute_error function and passing in y_val and y_pred. y_val is the label that was provided with your training data, and y_pred is the prediction from the model. The preceding code should give you an MAE value of ~ 0.72434:

Caption: MAE score
  8. Compute the R² score of the model:

    # Let's get the R2 score
    r2 = model.score(X_val, y_val)
    print('R^2 score: {}'.format(r2))
    

    You should get an output similar to the following:

In this exercise, we have calculated the MAE, which is a significant parameter when it comes to evaluating models.

You will now train a second model and compare its R² score and MAE with those of the first model to evaluate which one performs better.
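
MAE is simply the average of the absolute differences between predictions and ground truth, which can be verified by hand on toy values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# toy ground truth and predictions, not the fish toxicity data
y_true = np.array([3.0, 2.5, 4.0, 1.0])
y_pred = np.array([2.5, 3.0, 4.0, 2.0])

# MAE is the mean of the absolute errors
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)
print(manual_mae, sklearn_mae)  # → 0.5 0.5
```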

Exercise 6.04: Computing the Mean Absolute Error of a Second Model

In this exercise, we will be engineering new features and finding the score and loss of a new model.

The following steps will help you with this exercise:

  1. Open a new Jupyter notebook file.

  2. Import the required libraries:

    # Import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    # pipeline
    from sklearn.pipeline import Pipeline
    # preprocessing
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import PolynomialFeatures
    
  3. Read in the data from the dataset:

    # column headers
    _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
                'MLOGP', 'response']
    # read in data
    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/'\
                     'qsar_fish_toxicity.csv', \
                     names=_headers, sep=';')
    
  4. Split the data into training and evaluation sets:

    # Let's split our data
    features = df.drop('response', axis=1).values
    labels = df[['response']].values
    X_train, X_eval, y_train, y_eval = train_test_split\
                                       (features, labels, \
                                        test_size=0.2, \
                                        random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
                                                    random_state=0)
    

    In this step, you begin by splitting the DataFrame called df into two. The first DataFrame is called features and contains all of the independent variables that you will use to make your predictions. The second is called labels and contains the values that you are trying to predict.

    In the third line, you split features and labels into four sets using train_test_split. X_train and y_train contain 80% of the data and are used for training your model. X_eval and y_eval contain the remaining 20%.

    In the fourth line, you split X_eval and y_eval into two additional sets. X_val and y_val contain 75% of the data because you did not specify a ratio or size. X_test and y_test contain the remaining 25%.

  5. Define the steps of the pipeline:

    # create a pipeline and engineer quadratic features
    steps = [('scaler', MinMaxScaler()),\
             ('poly', PolynomialFeatures(2)),\
             ('model', LinearRegression())]
    
  6. Create the pipeline from these steps:

    # create a simple Linear Regression model with a pipeline
    model = Pipeline(steps)
    
  7. Train the model:

    # train the model
    model.fit(X_train, y_train)
    

    When you call the fit method with X_train and y_train as parameters, three operations happen because the model is a pipeline. First, X_train is scaled. Next, additional features are engineered. Finally, training happens using the LinearRegression model. The output from this step is similar to the following:

Caption: Training the model
  8. Predict using the validation dataset:

    # let's use our model to predict on our validation dataset
    y_pred = model.predict(X_val)
    
  9. Compute the MAE of the model:

    # Let's compute our MEAN ABSOLUTE ERROR
    mae = mean_absolute_error(y_val, y_pred)
    print('MAE: {}'.format(mae))
    

    In the first line, you make use of mean_absolute_error to compute the mean absolute error. You supply y_val and y_pred, and the result is stored in the mae variable. In the following line, you print out mae:

Caption: MAE score

The loss that you compute at this step is called a validation loss
because you make use of the validation dataset. This is different
from a training loss that is computed using the training dataset.
This distinction is important to note as you study other
documentation or books, which might refer to both.
  10. Compute the R² score:

    # Let's get the R2 score
    r2 = model.score(X_val, y_val)
    print('R^2 score: {}'.format(r2))
    

    In the final two lines, you compute the R² score and display it.
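
To see what the PolynomialFeatures(2) step in the pipeline does, a standalone sketch shows how it expands two features into their quadratic combinations (applied to the six fish toxicity features, the same transformation produces 28 columns):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])   # one sample with two features, a and b
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)

# columns: bias, a, b, a^2, a*b, b^2
print(X_poly)        # → [[1. 2. 3. 4. 6. 9.]]
print(X_poly.shape)  # → (1, 6)
```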

Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics

In this exercise, you will create a classification model that you will make use of later on for model assessment.

You will make use of the cars dataset from the UCI Machine Learning Repository. You will use this dataset to classify cars as either acceptable or unacceptable based on the following categorical features:

  • buying: the purchase price of the car
  • maint: the maintenance cost of the car
  • doors: the number of doors on the car
  • persons: the carrying capacity of the vehicle
  • lug_boot: the size of the luggage boot
  • safety: the estimated safety of the car

The following steps will help you achieve the task:

  1. Open a new Jupyter notebook.

  2. Import the libraries you will need:

    # import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    
  3. Import your data:

    # data doesn't have headers, so let's create headers
    _headers = ['buying', 'maint', 'doors', 'persons', \
                'lug_boot', 'safety', 'car']
    # read in cars dataset
    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/car.data', \
                     names=_headers, index_col=None)
    df.head()
    

    You should get an output similar to the following:

Caption: Inspecting the DataFrame
  4. Encode categorical variables as shown in the following code snippet:

    # encode categorical variables
    _df = pd.get_dummies(df, columns=['buying', 'maint', 'doors',\
                                      'persons', 'lug_boot', \
                                      'safety'])
    _df.head()
    

    The output should now resemble the following screenshot:

Caption: Encoding categorical variables
  5. Split the data into training and validation sets:

    # split data into training and evaluation datasets
    features = _df.drop('car', axis=1).values
    labels = _df['car'].values
    X_train, X_eval, y_train, y_eval = train_test_split\
                                       (features, labels, \
                                        test_size=0.3, \
                                        random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
                                                    test_size=0.5, \
                                                    random_state=0)
    
  6. Train a logistic regression model:

    # train a Logistic Regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    

    In this step, you create an instance of LogisticRegression and train the model on your training data by passing in X_train and y_train to the fit method.

    You should get an output that looks similar to the following:

Caption: Training a logistic regression model
  7. Make a prediction:

    # make predictions for the validation set
    y_pred = model.predict(X_val)
    

    In this step, you make a prediction on the validation dataset, X_val, and store the result in y_pred. A look at the first 10 predictions (by executing y_pred[0:10]) should provide an output similar to the following:

Caption: Prediction for the validation set

Exercise 6.06: Generating a Confusion Matrix for the Classification Model

The goal of this exercise is to create a confusion matrix for the classification model you trained in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.

Note

You should continue this exercise in the same notebook as that used in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics. If you wish to use a new notebook, make sure you copy and run the entire code from Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics, and then begin with the execution of the code of this exercise.

The following steps will help you achieve the task:

  1. Continue in the Jupyter notebook that you used for Exercise 6.05.

  2. Import confusion_matrix:

    from sklearn.metrics import confusion_matrix
    

    In this step, you import confusion_matrix from sklearn.metrics. This function will let you generate a confusion matrix.

  3. Generate a confusion matrix:

    confusion_matrix(y_val, y_pred)
    

    In this step, you generate a confusion matrix by supplying y_val, the actual classes, and y_pred, the predicted classes.

    The output should look similar to the following:
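
To make the matrix easier to read, here is a toy example with only two classes (hand-made labels, not the cars data); rows correspond to actual classes and columns to predicted classes:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# rows: actual 0, actual 1; columns: predicted 0, predicted 1
cm = confusion_matrix(y_true, y_pred)
print(cm)
# → [[2 0]
#    [1 1]]
```

Reading it off: both actual 0s were predicted correctly, while one of the two actual 1s was misclassified as 0.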

Exercise 6.07: Computing Precision for the Classification Model

In this exercise, you will be computing the precision for the classification model you trained in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.

The following steps will help you achieve the task:

  1. Import the required libraries:

    from sklearn.metrics import precision_score
    

    In this step, you import precision_score from sklearn.metrics.

  2. Next, compute the precision score as shown in the following code snippet:

    precision_score(y_val, y_pred, average='macro')
    

    In this step, you compute the precision score using precision_score.

    The output is a floating-point number between 0 and 1. It might look like this:
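
With average='macro', precision is computed for each class separately and the unweighted mean is taken. A toy sketch (hand-made labels, not the cars data):

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# class 0: 3 predictions, 2 correct -> 2/3
# class 1: 1 prediction, 1 correct  -> 1.0
# macro average: (2/3 + 1.0) / 2 = 5/6
macro_precision = precision_score(y_true, y_pred, average='macro')
print(macro_precision)  # → 0.8333333333333333
```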

Exercise 6.08: Computing Recall for the Classification Model

The goal of this exercise is to compute the recall for the classification model you trained in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.

The following steps will help you accomplish the task:

  1. Open a new Jupyter notebook file.

  2. Now, import the required libraries:

    from sklearn.metrics import recall_score
    

    In this step, you import recall_score from sklearn.metrics. This is the function that you will make use of in the second step.

  3. Compute the recall:

    recall_score(y_val, y_pred, average='macro')
    

    You should get an output that looks like the following:

Exercise 6.09: Computing the F1 Score for the Classification Model

In this exercise, you will compute the F1 score for the classification model you trained in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.

Note

You should continue this exercise in the same notebook as that used in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics. If you wish to use a new notebook, make sure you copy and run the entire code from Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics, and then begin with the execution of the code of this exercise.

The following steps will help you accomplish the task:

  1. Continue in the Jupyter notebook that you used for Exercise 6.05.

  2. Import the necessary modules:

    from sklearn.metrics import f1_score
    

    In this step, you import the f1_score function from sklearn.metrics. This function will let you compute the F1 score.

  3. Compute the F1 score:

    f1_score(y_val, y_pred, average='macro')
    

    In this step, you compute the F1 score by passing in y_val and y_pred. You also specify average='macro' because this is not binary classification.

    You should get an output similar to the following:
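
The F1 score of a class is the harmonic mean of its precision and recall, F1 = 2 * (precision * recall) / (precision + recall); with average='macro', the per-class F1 scores are averaged. A toy sketch (hand-made labels, not the cars data):

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# class 0: precision = 2/3, recall = 1   -> F1 = 0.8
# class 1: precision = 1,   recall = 1/2 -> F1 = 2/3
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average='macro')
print(per_class)  # per-class F1 scores
print(macro)      # unweighted mean of the per-class scores, about 0.733
```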

Exercise 6.10: Computing Model Accuracy for the Classification Model

The goal of this exercise is to compute the accuracy score of the classification model trained in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.

The following steps will help you accomplish the task:

  1. Continue from where the code for Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics, ends in your notebook.

  2. Import accuracy_score():

    from sklearn.metrics import accuracy_score
    

    In this step, you import accuracy_score(), which you will use to compute the model accuracy.

  3. Compute the accuracy:

    _accuracy = accuracy_score(y_val, y_pred)
    print(_accuracy)
    

    In this step, you compute the model accuracy by passing in y_val and y_pred as parameters to accuracy_score(). The interpreter assigns the result to a variable called _accuracy. The print() function causes the interpreter to render the value of _accuracy.

    The result is similar to the following:

Exercise 6.11: Computing the Log Loss for the Classification Model

The goal of this exercise is to predict the log loss of the model trained in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics.

Note

You should continue this exercise in the same notebook as that used in Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics. If you wish to use a new notebook, make sure you copy and run the entire code from Exercise 6.05 and then begin with the execution of the code of this exercise.

The following steps will help you accomplish the task:

  1. Open your Jupyter notebook and continue from where Exercise 6.05, Creating a Classification Model for Computing Evaluation Metrics, stopped.

  2. Import the required libraries:

    from sklearn.metrics import log_loss
    

    In this step, you import log_loss() from sklearn.metrics.

  3. Compute the log loss:

    _loss = log_loss(y_val, model.predict_proba(X_val))
    print(_loss)
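
This computes the log loss from the class probabilities that the model predicts. Log loss heavily penalizes confident wrong predictions, which a small sketch on hand-made labels and probabilities (not the caesarian data) makes visible:

```python
from sklearn.metrics import log_loss

y_true = [1, 0]
# each row holds [P(class 0), P(class 1)] for one sample
confident_right = [[0.1, 0.9], [0.8, 0.2]]
confident_wrong = [[0.9, 0.1], [0.2, 0.8]]

# log loss is the mean negative log probability assigned to the true class
loss_right = log_loss(y_true, confident_right)
loss_wrong = log_loss(y_true, confident_wrong)
print(loss_right)  # small loss, about 0.16
print(loss_wrong)  # much larger loss, about 1.96
```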
    

Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem

The goal of this exercise is to plot the ROC curve for a binary classification problem. The data for this problem is used to predict whether or not a mother will require a caesarian section to give birth.

From the UCI Machine Learning Repository, the abstract for this dataset follows: "This dataset contains information about caesarian section results of 80 pregnant women with the most important characteristics of delivery problems in the medical field." The attributes of interest are age, delivery number, delivery time, blood pressure, and heart status.

The following steps will help you accomplish this task:

  1. Open a Jupyter notebook file.

  2. Import the required libraries:

    # import libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve
    from sklearn.metrics import auc
    

    In this step, you import pandas, which you will use to read in data. You also import train_test_split for creating training and validation datasets, LogisticRegression for creating a model, and roc_curve and auc from sklearn.metrics for computing the ROC curve and the area under it.

  3. Read in the data:

    # data doesn't have headers, so let's create headers
    _headers = ['Age', 'Delivery_Nbr', 'Delivery_Time', \
                'Blood_Pressure', 'Heart_Problem', 'Caesarian']
    # read in the caesarian dataset
    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/caesarian.csv.arff',\
                     names=_headers, index_col=None, skiprows=15)
    df.head()
    # target column is 'Caesarian'
    

The `head()` method will print out the top five rows and
should look similar to the following:

Caption: The top five rows of the DataFrame
  4. Split the data:

    # target column is 'Caesarian'
    features = df.drop(['Caesarian'], axis=1).values
    labels = df[['Caesarian']].values
    # split 80% for training and 20% into an evaluation set
    X_train, X_eval, y_train, y_eval = train_test_split\
                                       (features, labels, \
                                        test_size=0.2, \
                                        random_state=0)
    """
    further split the evaluation set into validation and test sets 
    of 10% each
    """
    X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
                                                    test_size=0.5, \
                                                    random_state=0)
    

    In this step, you begin by creating two numpy arrays, which you call features and labels. You then split these arrays into a training and an evaluation dataset. You further split the evaluation dataset into validation and test datasets.

  5. Now, train and fit a logistic regression model:

    model = LogisticRegression()
    model.fit(X_train, y_train)
    

    In this step, you begin by creating an instance of a logistic regression model. You then proceed to train or fit the model on the training dataset.

    The output should be similar to the following:

Caption: Training a logistic regression model
  6. Predict the probabilities, as shown in the following code snippet:

    y_proba = model.predict_proba(X_val)
    

    In this step, the model predicts the probabilities for each entry in the validation dataset. It stores the results in y_proba.

  7. Compute the true positive rate, the false positive rate, and the thresholds:

    _false_positive, _true_positive, _thresholds = roc_curve\
                                                   (y_val, \
                                                    y_proba[:, 1])
    

    In this step, you make a call to roc_curve() and specify the ground truth and the second column of the predicted probabilities, which holds the probability of the positive class. The result is a tuple of false positive rates, true positive rates, and thresholds.

  8. Explore the false positive rates:

    print(_false_positive)
    

    In this step, you instruct the interpreter to print out the false positive rate. The output should be similar to the following:

Caption: False positive rates

Note

The false positive rates can vary, depending on the data.
  9. Explore the true positive rates:

    print(_true_positive)
    

    In this step, you instruct the interpreter to print out the true positive rates. This should be similar to the following:

Caption: True positive rates
  10. Explore the thresholds:

    print(_thresholds)
    

    In this step, you instruct the interpreter to display the thresholds. The output should be similar to the following:

Caption: Thresholds
  11. Now, plot the ROC curve:

    # Plot the ROC curve
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.plot(_false_positive, _true_positive, lw=2, \
             label='Receiver Operating Characteristic')
    plt.xlim(0.0, 1.2)
    plt.ylim(0.0, 1.2)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.show()
    

    The output should look similar to the following:

Caption: ROC curve
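The calls above can be condensed into a self-contained sketch. The toy labels and scores below are illustrative and not drawn from any dataset in this lab:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy ground truth and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns (fpr, tpr, thresholds), in that order
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)         # false positive rate at each threshold
print(tpr)         # true positive rate at each threshold
print(thresholds)  # decision thresholds, in descending order
```

Plotting `fpr` against `tpr`, as in the exercise, traces the ROC curve from (0, 0) to (1, 1).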

Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset

The goal of this exercise is to compute the ROC AUC for the binary classification model that you trained in Exercise 6.12, Computing and Plotting ROC Curve for a Binary Classification Problem.

Note

You should continue this exercise in the same notebook as that used in Exercise 6.12, Computing and Plotting ROC Curve for a Binary Classification Problem. If you wish to use a new notebook, make sure you copy and run the entire code from Exercise 6.12 and then begin with the execution of the code of this exercise.

The following steps will help you accomplish the task:

  1. Open the Jupyter notebook containing the code for Exercise 6.12, Computing and Plotting ROC Curve for a Binary Classification Problem, and continue writing your code below it.

  2. Predict the probabilities:

    y_proba = model.predict_proba(X_val)
    

    In this step, you compute the probabilities of the classes in the validation dataset. You store the result in y_proba.

  3. Compute the ROC AUC:

    from sklearn.metrics import roc_auc_score
    _auc = roc_auc_score(y_val, y_proba[:, 1])
    print(_auc)
    

    In this step, you compute the ROC AUC and store the result in _auc. You then proceed to print this value out. The result should look similar to the following:

Caption: Computing the ROC AUC

Note

The AUC can be different, depending on the data.
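As a sanity check on what the ROC AUC measures, consider a toy example: the AUC equals the fraction of positive/negative pairs that the scores rank correctly, so flipping the scores flips the AUC around 0.5. (The values below are illustrative.)

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # probabilities of the positive class

# 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
print(roc_auc_score(y_true, y_score))  # 0.75

# Passing the probabilities of the wrong class inverts the ranking
inverted = [1 - s for s in y_score]
print(roc_auc_score(y_true, inverted))  # 0.25
```

This is why the column of `predict_proba` that you pass matters: an AUC well below 0.5 usually means the scores for the wrong class were supplied.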

Saving and Loading Models

You will eventually need to transfer some of the models you have trained to a different computer so that they can be put into production. There are various utilities for doing this; the one we will discuss is called joblib.

joblib supports saving and loading scikit-learn models efficiently. Note that it uses a Python-specific, pickle-based format; it is not an interchange format such as ONNX, so a saved model should be loaded with compatible versions of Python and scikit-learn.

In older versions of scikit-learn, joblib was exposed as sklearn.externals.joblib; in current versions, you install the standalone joblib package and import it directly.
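A minimal round trip looks like the following sketch; the filename model.joblib and the toy data are arbitrary choices for illustration:

```python
import joblib
from sklearn.linear_model import LinearRegression

# Fit a trivial model that learns y = 2x
model = LinearRegression()
model.fit([[1], [2], [3]], [2, 4, 6])

joblib.dump(model, 'model.joblib')      # serialize the fitted model to disk
restored = joblib.load('model.joblib')  # deserialize it back

print(restored.predict([[4]]))  # [8.] (up to floating point)
```

The reloaded object carries the learned coefficients, so it can predict immediately without retraining.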

Exercise 6.14: Saving and Loading a Model

In this exercise, you will train a simple model and use it for prediction. You will then proceed to save the model and then load it back in. You will use the loaded model for a second prediction, and then compare the predictions from the first model to those from the second model. You will make use of the car dataset for this exercise.

The following steps will guide you toward the goal:

  1. Open a Jupyter notebook.

  2. Import the required libraries:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    
  3. Read in the data:

    _headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', \
                'MLOGP', 'response']
    # read in data
    df = pd.read_csv('https://raw.githubusercontent.com/'\
                     'fenago/data-science/'\
                     'master/Lab06/Dataset/'\
                     'qsar_fish_toxicity.csv', \
                     names=_headers, sep=';')
    
  4. Inspect the data:

    df.head()
    

    The output should be similar to the following:

Caption: Inspecting the first five rows of the DataFrame
  5. Split the data into features and labels, and then into training, validation, and test sets:

    features = df.drop('response', axis=1).values
    labels = df[['response']].values
    X_train, X_eval, y_train, y_eval = train_test_split\
                                       (features, labels, \
                                        test_size=0.2, \
                                        random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval,\
                                                    random_state=0)
    
  6. Create a linear regression model:

    model = LinearRegression()
    print(model)
    

    The output will be as follows:

Caption: Creating a linear regression model
  7. Fit the training data to the model:

    model.fit(X_train, y_train)
    
  8. Use the model for prediction:

    y_pred = model.predict(X_val)
    
  9. Import joblib:

    # sklearn.externals.joblib was removed in scikit-learn 0.23;
    # use the standalone joblib package instead
    import joblib
    
  10. Save the model:

    joblib.dump(model, './model.joblib')
    

    The output should be similar to the following:

Caption: Saving the model
  11. Load it as a new model:

    m2 = joblib.load('./model.joblib')
    
  12. Use the new model for predictions:

    m2_preds = m2.predict(X_val)
    
  13. Compare the predictions:

    ys = pd.DataFrame(dict(predicted=y_pred.reshape(-1), \
                           m2=m2_preds.reshape(-1)))
    ys.head()
    

    The output should be similar to the following:

Caption: Comparing predictions
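Beyond inspecting the first few rows side by side, you can verify numerically that the reloaded model reproduces the original predictions exactly; here is a minimal sketch on toy data (not the fish toxicity dataset):

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # y = 2x + 1

model = LinearRegression().fit(X, y)
joblib.dump(model, 'model.joblib')
m2 = joblib.load('model.joblib')

# The reloaded model carries the same coefficients, so the
# predictions agree to machine precision
print(np.allclose(model.predict(X), m2.predict(X)))  # True
```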

Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model

You work as a data scientist at a bank. The bank would like to implement a model that predicts the likelihood of a customer purchasing a term deposit. The bank provides you with a dataset, which is the same as the one in Lab 3, Binary Classification. You have previously learned how to train a logistic regression model for binary classification. You have also heard about other non-parametric modeling techniques and would like to try out a decision tree as well as a random forest to see how well they perform against the logistic regression models you have been training.

In this activity, you will train a logistic regression model and compute a classification report. You will then proceed to train a decision tree classifier and compute a classification report. You will compare the models using the classification reports. Finally, you will train a random forest classifier and generate the classification report. You will then compare the logistic regression model with the random forest using the classification reports to determine which model you should put into production.

The steps to accomplish this task are:

  1. Open a Jupyter notebook.

  2. Load the necessary libraries.

  3. Read in the data.

  4. Explore the data.

  5. Convert categorical variables using pandas.get_dummies().

  6. Prepare the X and y variables.

  7. Split the data into training and evaluation sets.

  8. Create an instance of LogisticRegression.

  9. Fit the training data to the LogisticRegression model.

  10. Use the evaluation set to make a prediction.

  11. Use the prediction from the LogisticRegression model to compute the classification report.

  12. Create an instance of DecisionTreeClassifier:

    dt_model = DecisionTreeClassifier(max_depth=6)
    
  13. Fit the training data to the DecisionTreeClassifier model:

    dt_model.fit(train_X, train_y)
    
  14. Using the DecisionTreeClassifier model, make a prediction on the evaluation dataset:

    dt_preds = dt_model.predict(val_X)
    
  15. Use the prediction from the DecisionTreeClassifier model to compute the classification report:

    dt_report = classification_report(val_y, dt_preds)
    print(dt_report)
    

    Note

    We will be studying decision trees in detail in Lab 7, The Generalization of Machine Learning Models.

  16. Compare the classification report from the logistic regression model and the classification report from the decision tree classifier to determine which is the better model.

  17. Create an instance of RandomForestClassifier.

  18. Fit the training data to the RandomForestClassifier model.

  19. Using the RandomForestClassifier model, make a prediction on the evaluation dataset.

  20. Using the prediction from the random forest classifier, compute the classification report.

  21. Compare the classification report from the logistic regression model with the classification report from the random forest classifier to decide which model to keep or improve upon.

  22. Compare the accuracy scores of all three models. The output should be similar to the following:
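The modelling loop in this activity can be sketched end to end. The synthetic dataset below stands in for the bank data, and the variable names mirror those used in the steps; the hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the bank marketing dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(max_depth=6),
    'random forest': RandomForestClassifier(n_estimators=100,
                                            random_state=0),
}

# Train each model and print its classification report for comparison
for name, clf in models.items():
    clf.fit(train_X, train_y)
    preds = clf.predict(val_X)
    print(name)
    print(classification_report(val_y, preds))
```

Comparing the precision, recall, and F1 score columns across the three reports is what lets you pick the model to put into production.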

Summary

In this lab, we observed that some evaluation metrics for classification models require a binary classification model. We saw that when we work with more than two classes, we need to use the one-versus-all approach, which builds one model for each class and predicts the probability that the input belongs to that class; the input is then assigned to the class whose model produces the highest probability. We also split our evaluation dataset into two because X_test and y_test are reserved for a single, final evaluation of the model's performance. You can use them just before putting your model into production to see how it would perform in a production environment.
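The one-versus-all approach described above can be demonstrated with scikit-learn's OneVsRestClassifier, which fits one binary model per class and assigns each input to the class with the highest predicted probability. The iris dataset is used here purely as a convenient multiclass example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

# One binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

proba = ovr.predict_proba(X[:5])  # one column of probabilities per class
print(proba.shape)                # (5, 3)

# The predicted class is the one whose model gives the highest probability
print(proba.argmax(axis=1))
```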