Lab 9. Interpreting a Machine Learning Model
Overview
This lab will show you how to interpret a machine learning model's
results and get deeper insights into the patterns it found. By the end
of the lab, you will be able to analyze weights from linear models
and variable importance for RandomForest. You will be able
to implement variable importance via permutation to analyze feature
importance. You will use a partial dependence plot to analyze single
variables and make use of the lime package for local interpretation.
In this lab, we will go through some techniques on how to interpret your models or their results.
Linear Model Coefficients
In sklearn, it is easy to get the coefficients of a fitted
linear model: you just need to read its coef_ attribute.
Let's implement this on a real example with the Diabetes dataset from
sklearn:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
data = load_diabetes()
# fit a linear regression model to the data
lr_model = LinearRegression()
lr_model.fit(data.data, data.target)
lr_model.coef_
The output will be as follows:
Caption: Coefficients of the linear regression parameters
Let's create a DataFrame with these values and column names:
import pandas as pd
coeff_df = pd.DataFrame()
coeff_df['feature'] = data.feature_names
coeff_df['coefficient'] = lr_model.coef_
coeff_df.head()
The output will be as follows:
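When reading coefficients, the magnitude usually matters more than the sign. The following is a minimal sketch, not part of the original lab, that ranks the Diabetes coefficients by absolute value. It relies on the fact that the features returned by load_diabetes() are already centered and scaled, so their coefficients are directly comparable. The abs_coefficient column and the ranked_df name are introduced here purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
lr_model = LinearRegression().fit(data.data, data.target)

coeff_df = pd.DataFrame({
    'feature': data.feature_names,
    'coefficient': lr_model.coef_,
})

# abs_coefficient is a helper column added for this illustration only
coeff_df['abs_coefficient'] = coeff_df['coefficient'].abs()

# Largest absolute coefficient first: these features move the prediction the most
ranked_df = coeff_df.sort_values('abs_coefficient', ascending=False)\
                    .reset_index(drop=True)
print(ranked_df)
```

On a dataset whose features are not on a common scale, standardize them first (as Exercise 9.01 does with StandardScaler) before comparing coefficients this way.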
Exercise 9.01: Extracting the Linear Regression Coefficient
In this exercise, we will train a linear regression model to predict the customer drop-out ratio and extract its coefficients.
The following steps will help you complete the exercise:
-
Open a new Jupyter notebook.
-
Import the following packages: pandas, train_test_split from sklearn.model_selection, StandardScaler from sklearn.preprocessing, LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and altair:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import altair as alt
-
Create a variable called file_url that contains the URL to the dataset:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab09/Dataset/phpYYZ4Qc.csv'
-
Load the dataset into a DataFrame called df using .read_csv():
df = pd.read_csv(file_url)
-
Print the first five rows of the DataFrame:
df.head()
You should get the following output:
Caption: First five rows of the loaded DataFrame
-
Extract the rej column using .pop() and save it into a variable called y:
y = df.pop('rej')
-
Print the summary of the DataFrame using .describe():
df.describe()
You should get the following output:
Caption: Statistical measures of the DataFrame
-
Split the DataFrame into training and testing sets using train_test_split() with test_size=0.3 and random_state=1:
X_train, X_test, y_train, y_test = train_test_split\
    (df, y, test_size=0.3,\
     random_state=1)
-
Instantiate StandardScaler:
scaler = StandardScaler()
-
Train StandardScaler on the training set and standardize it using .fit_transform():
X_train = scaler.fit_transform(X_train)
-
Standardize the testing set using .transform():
X_test = scaler.transform(X_test)
-
Instantiate LinearRegression and save it to a variable called lr_model:
lr_model = LinearRegression()
-
Train the model on the training set using .fit():
lr_model.fit(X_train, y_train)
You should get the following output:
Caption: Logs of LinearRegression
-
Predict the outcomes of the training and testing sets using .predict():
preds_train = lr_model.predict(X_train)
preds_test = lr_model.predict(X_test)
-
Calculate the mean squared error on the training set and print its value:
train_mse = mean_squared_error(y_train, preds_train)
train_mse
You should get the following output:
Caption: MSE score of the training set
We achieved quite a low MSE score on the training set.
-
Calculate the mean squared error on the testing set and print its value:
test_mse = mean_squared_error(y_test, preds_test)
test_mse
You should get the following output:
Caption: MSE score of the testing set
We also have a low MSE score on the testing set that is very similar
to the training one. So, our model is not overfitting.
Note
You may get slightly different outputs than those present here.
However, the values you would obtain should largely agree with those
obtained in this exercise.
-
Print the coefficients of the linear regression model using .coef_:
lr_model.coef_
You should get the following output:
Caption: Coefficients of the linear regression model
-
Create an empty DataFrame called coef_df:
coef_df = pd.DataFrame()
-
Create a new column called feature for this DataFrame with the names of the columns of df using .columns:
coef_df['feature'] = df.columns
-
Create a new column called coefficient for this DataFrame with the coefficients of the linear regression model using .coef_:
coef_df['coefficient'] = lr_model.coef_
-
Print the first five rows of coef_df:
coef_df.head()
You should get the following output:
Caption: The first five rows of coef_df
From this output, we can see that the variables a1sx and a1sy have
the most negative coefficients. Because the features were standardized,
the coefficient magnitudes are comparable, so these two variables
contribute more to the prediction than the three other variables
shown here.
-
Plot a bar chart with Altair using coef_df, with coefficient as the x axis and feature as the y axis:
alt.Chart(coef_df).mark_bar().encode(x='coefficient',\
                                     y="feature")
You should get the following output:
RandomForest Variable Importance
After training RandomForest, you can assess its variable
importance (or feature importance) with the
feature_importances_ attribute.
Let's see how to extract this information from the Breast Cancer
dataset from sklearn:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
X, y = data.data, data.target
rf_model = RandomForestClassifier(random_state=168)
rf_model.fit(X, y)
rf_model.feature_importances_
The output will be as shown in the following figure:
Note: Due to randomization, you may get a slightly different result.
It might be a little difficult to evaluate which importance value corresponds to which variable from this output. Let's create a DataFrame that will contain these values with the name of the columns:
import pandas as pd
varimp_df = pd.DataFrame()
varimp_df['feature'] = data.feature_names
varimp_df['importance'] = rf_model.feature_importances_
varimp_df.head()
The output will be as follows:
Caption: RandomForest variable importance for the first five features of the Breast Cancer dataset
From this output, we can see that mean radius and
mean perimeter have the highest scores, which means they are
the most important in predicting the target variable. The
mean smoothness column has a very low value, so it seems it
doesn't influence the model much to predict the output.
Note
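sklearn normalizes these importance values so that they sum to 1 across all features. Individual values therefore depend on the dataset and on the number of features, so they should not be compared across datasets.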
Let's plot these variable importance values onto a graph using
altair:
import altair as alt
alt.Chart(varimp_df).mark_bar().encode(x='importance',\
y="feature")
The output will be as follows:
Caption: Graph showing RandomForest variable importance
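With 30 features, the first five rows of the DataFrame are not necessarily the most informative ones. As a small sketch (the top5_df and total_importance names are introduced here for illustration), we can sort the values to list the highest-ranked features and check that the importances sum to 1:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf_model = RandomForestClassifier(random_state=168)
rf_model.fit(data.data, data.target)

varimp_df = pd.DataFrame({
    'feature': data.feature_names,
    'importance': rf_model.feature_importances_,
})

# The importances of all features add up to 1 (up to rounding)
total_importance = varimp_df['importance'].sum()
print(round(total_importance, 6))

# Five features with the highest importance, most important first
top5_df = varimp_df.sort_values('importance', ascending=False).head(5)
print(top5_df)
```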
Exercise 9.02: Extracting RandomForest Feature Importance
In this exercise, we will extract the feature importance of a Random Forest regressor model trained to predict the customer drop-out ratio.
We will be using the same dataset as in the previous exercise.
The following steps will help you complete the exercise:
-
Open a new Jupyter notebook.
-
Import the following packages: pandas, train_test_split from sklearn.model_selection, RandomForestRegressor from sklearn.ensemble, mean_squared_error from sklearn.metrics, and altair:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import altair as alt
-
Create a variable called file_url that contains the URL to the dataset:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab09/Dataset/phpYYZ4Qc.csv'
-
Load the dataset into a DataFrame called df using .read_csv():
df = pd.read_csv(file_url)
-
Extract the rej column using .pop() and save it into a variable called y:
y = df.pop('rej')
-
Split the DataFrame into training and testing sets using train_test_split() with test_size=0.3 and random_state=1:
X_train, X_test, y_train, y_test = train_test_split\
    (df, y, test_size=0.3,\
     random_state=1)
-
Instantiate RandomForestRegressor with random_state=1, n_estimators=50, max_depth=6, and min_samples_leaf=60:
rf_model = RandomForestRegressor(random_state=1,\
                                 n_estimators=50, max_depth=6,\
                                 min_samples_leaf=60)
-
Train the model on the training set using .fit():
rf_model.fit(X_train, y_train)
You should get the following output:
Caption: Logs of the Random Forest model
-
Predict the outcomes of the training and testing sets using .predict():
preds_train = rf_model.predict(X_train)
preds_test = rf_model.predict(X_test)
-
Calculate the mean squared error on the training set and print its value:
train_mse = mean_squared_error(y_train, preds_train)
train_mse
You should get the following output:
Caption: MSE score of the training set
We achieved quite a low MSE score on the training set.
-
Calculate the MSE on the testing set and print its value:
test_mse = mean_squared_error(y_test, preds_test)
test_mse
You should get the following output:
Caption: MSE score of the testing set
We also have a low MSE score on the testing set that is very similar
to the training one. So, our model is not overfitting.
-
Print the variable importance using .feature_importances_:
rf_model.feature_importances_
You should get the following output:
Caption: Variable importance of the Random Forest model
-
Create an empty DataFrame called varimp_df:
varimp_df = pd.DataFrame()
-
Create two new columns for this DataFrame: feature, with the names of the columns of df (using .columns), and importance, with the values of .feature_importances_:
varimp_df['feature'] = df.columns
varimp_df['importance'] = rf_model.feature_importances_
-
Print the first five rows of varimp_df:
varimp_df.head()
You should get the following output:
Caption: Variable importance of the first five variables
From this output, we can see the variables a1cy and
a1sy have the highest value, so they are more important
for predicting the target variable than the three other variables
shown here.
-
Plot a bar chart with Altair using varimp_df, with importance as the x axis and feature as the y axis:
alt.Chart(varimp_df).mark_bar().encode(x='importance',\
                                       y="feature")
You should get the following output:
Caption: Graph showing the variable importance of all variables
From this output, we can see the variables that impact the prediction
the most for this Random Forest model are a2pop,
a1pop, a3pop, b1eff, and
temp, by decreasing order of importance.
Variable Importance via Permutation
In the previous section, we saw how to extract feature importance for RandomForest. There is actually another technique that shares the same name, but its underlying logic is different and can be applied to any algorithm, not only tree-based ones.
This technique can be referred to as variable importance via permutation. Let's say we trained a model to predict a target variable with five classes and achieved an accuracy of 0.95. One way to assess the importance of one of the features is to remove it, retrain the model, and look at the new accuracy score. If the accuracy score drops significantly, then we can infer that this variable has a significant impact on the prediction. On the other hand, if the score decreases only slightly or stays the same, we can say this variable is not very important and doesn't influence the final prediction much. So, we can use the difference in the model's performance to assess the importance of a variable.
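The remove-and-retrain idea can be sketched as follows. This is an illustration only: it uses the Breast Cancer dataset (two classes rather than five), a held-out test set, and a hypothetical drop_importance dictionary, and it only loops over the first five features to keep the run short.

```python
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1)

# Baseline model trained on all features
base_model = RandomForestClassifier(n_estimators=50, random_state=168)
base_model.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, base_model.predict(X_test))

# drop_importance is a name made up for this sketch
drop_importance = {}
for col in range(5):  # first five features only, to keep the example quick
    keep = [c for c in range(X_train.shape[1]) if c != col]
    # clone() returns an unfitted model with the same hyperparameters
    model = clone(base_model).fit(X_train[:, keep], y_train)
    acc = accuracy_score(y_test, model.predict(X_test[:, keep]))
    # Importance = how much accuracy is lost without this feature
    drop_importance[data.feature_names[col]] = baseline_acc - acc

print(baseline_acc)
print(drop_importance)
```

Note that one model is trained per feature, which is exactly what makes this approach expensive on large models.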
The drawback of this method is that you need to retrain a new model for each variable. If it took you a few hours to train the original model and you have 100 different features, it would take quite a while to compute the importance of each variable. It would be great if we didn't have to retrain different models. So, another solution would be to generate noise or new values for a given column and predict the final outcomes from this modified data and compare the accuracy score. For example, if you have a column with values between 0 and 100, you can take the original data and randomly generate new values for this column (keeping all other variables the same) and predict the class for them.
This option also has a catch. The randomly generated values can be very different from the original data. Going back to the same example we saw before, if the original range of values for a column is between 0 and 100 and we generate values that can be negative or take a very high value, it is not very representative of the real distribution of the original data. So, we will need to understand the distribution of each variable before generating new values.
Rather than generating random values, we can simply swap (or permute) values of a column between different rows and use these modified cases for predictions. Then, we can calculate the related accuracy score and compare it with the original one to assess the importance of this variable. For example, we have the following rows in the original dataset:
Caption: Example of the dataset
We can swap the values for the X1 column and get a new dataset:
Caption: Example of a swapped column from the dataset
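To make the mechanics concrete, here is a minimal NumPy sketch of the swap-and-score procedure. It is not the mlxtend implementation; the permutation_importance_manual function and its arguments are hypothetical names used for illustration, and accuracy is assumed as the metric.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def permutation_importance_manual(model, X, y, seed=0):
    """Accuracy drop observed when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        X_perm = X.copy()
        # Shuffle one column across rows; all other columns stay untouched
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        permuted_score = accuracy_score(y, model.predict(X_perm))
        importances[col] = baseline - permuted_score
    return importances

data = load_breast_cancer()
X, y = data.data, data.target
rf_model = RandomForestClassifier(random_state=168).fit(X, y)

manual_imp = permutation_importance_manual(rf_model, X, y, seed=2)
print(manual_imp)
```

No retraining happens inside the loop: the same fitted model is simply asked to predict on the shuffled data, which is why this approach is much cheaper than removing a feature and training again.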
The mlxtend package provides a function to perform variable
permutation and calculate variable importance values:
feature_importance_permutation. Let's see how to use it
with the Breast Cancer dataset from sklearn.
First, let's load the data and train a Random Forest model:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
X, y = data.data, data.target
rf_model = RandomForestClassifier(random_state=168)
rf_model.fit(X, y)
Then, we will call the feature_importance_permutation
function from mlxtend.evaluate. This function takes the
following parameters:
-
predict_method: A function that will be called for model prediction. Here, we will provide the predict method from our trained rf_model model.
-
X: The features from the dataset. It needs to be in NumPy array form.
-
y: The target variable from the dataset. It needs to be in NumPy array form.
-
metric: The metric used for comparing the performance of the model. For the classification task, we will use accuracy.
-
num_rounds: The number of rounds mlxtend will perform permutation on the data and assess the performance change.
-
seed: The seed set for getting reproducible results.
Consider the following code snippet:
from mlxtend.evaluate import feature_importance_permutation
imp_vals, _ = feature_importance_permutation\
    (predict_method=rf_model.predict, X=X, y=y,\
     metric='accuracy', num_rounds=1, seed=2)
imp_vals
The output should be as follows:
Caption: Variable importance by permutation
Let's create a DataFrame containing these values and the names of the
features and plot them on a graph with altair:
import pandas as pd
varimp_df = pd.DataFrame()
varimp_df['feature'] = data.feature_names
varimp_df['importance'] = imp_vals
varimp_df.head()
import altair as alt
alt.Chart(varimp_df).mark_bar().encode(x='importance',\
y="feature")
The output should be as follows:
Caption: Graph showing variable importance by permutation
These results are different from the ones we got from
RandomForest in the previous section. Here, worst concave
points is the most important, followed by worst area, and worst
perimeter has a higher value than mean radius. So, we got the same list
of the most important variables but in a different order. This supports
the view that these features are indeed the most important in predicting
whether a tumor is malignant or not. The variable importance from
RandomForest and the one obtained via permutation rely on different
logic; therefore, the two methods can produce different values and
rankings for the same model.
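mlxtend is not the only option. As a sketch of an alternative, scikit-learn (version 0.22 and later) ships its own permutation_importance function in sklearn.inspection. The example below is an assumption-laden variation on the code above rather than a reproduction of it: it scores on a held-out test set and repeats the shuffling five times so that an average and a standard deviation are available for each feature.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1)

rf_model = RandomForestClassifier(random_state=168).fit(X_train, y_train)

# n_repeats plays the same role as num_rounds in mlxtend
result = permutation_importance(rf_model, X_test, y_test,
                                scoring='accuracy', n_repeats=5,
                                random_state=2)

# Indices of the features, most important first
order = result.importances_mean.argsort()[::-1]
for idx in order[:5]:
    print(data.feature_names[idx],
          round(result.importances_mean[idx], 4),
          round(result.importances_std[idx], 4))
```

Evaluating on data the model has not seen measures how much each feature matters for generalization rather than for fitting the training set.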
Exercise 9.03: Extracting Feature Importance via Permutation
In this exercise, we will compute and extract feature importance via permutation for a Random Forest regressor model trained to predict the customer drop-out ratio.
We will be using the same dataset as in the previous exercise.
The following steps will help you complete the exercise:
-
Open a new Jupyter notebook.
-
Import the following packages: pandas, train_test_split from sklearn.model_selection, RandomForestRegressor from sklearn.ensemble, feature_importance_permutation from mlxtend.evaluate, and altair:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from mlxtend.evaluate import feature_importance_permutation
import altair as alt
-
Create a variable called file_url that contains the URL of the dataset:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab09/Dataset/phpYYZ4Qc.csv'
-
Load the dataset into a DataFrame called df using .read_csv():
df = pd.read_csv(file_url)
-
Extract the rej column using .pop() and save it into a variable called y:
y = df.pop('rej')
-
Split the DataFrame into training and testing sets using train_test_split() with test_size=0.3 and random_state=1:
X_train, X_test, y_train, y_test = train_test_split\
    (df, y, test_size=0.3,\
     random_state=1)
-
Instantiate RandomForestRegressor with random_state=1, n_estimators=50, max_depth=6, and min_samples_leaf=60:
rf_model = RandomForestRegressor(random_state=1,\
                                 n_estimators=50, max_depth=6,\
                                 min_samples_leaf=60)
-
Train the model on the training set using .fit():
rf_model.fit(X_train, y_train)
You should get the following output:
Caption: Logs of RandomForest
-
Extract the feature importance via permutation using feature_importance_permutation from mlxtend with the Random Forest model, the testing set, r2 as the metric used, num_rounds=1, and seed=2. Save the results into a variable called imp_vals and print its values:
imp_vals, _ = feature_importance_permutation\
    (predict_method=rf_model.predict,\
     X=X_test.values, y=y_test.values,\
     metric='r2', num_rounds=1, seed=2)
imp_vals
You should get the following output:
Caption: Variable importance by permutation
It is quite hard to interpret the raw results. Let's plot the
variable importance by permutation on a graph.
-
Create a DataFrame called varimp_df with two columns: feature, containing the names of the columns of df (using .columns), and importance, containing the values of imp_vals:
varimp_df = pd.DataFrame({'feature': df.columns,\
                          'importance': imp_vals})
-
Plot a bar chart with Altair using varimp_df, with importance as the x axis and feature as the y axis:
alt.Chart(varimp_df).mark_bar().encode(x='importance',\
                                       y="feature")
You should get the following output:
Caption: Graph showing the variable importance by permutation
Partial Dependence Plots
Another tool that is model-agnostic is a partial dependence plot. It is
a visual tool for analyzing the effect of a feature on the target
variable. To achieve this, we can plot the values of the feature we are
interested in analyzing on the x-axis and the target
variable on the y-axis and then show all the observations
from the dataset on this graph. Let's try it on the Breast Cancer
dataset from sklearn:
from sklearn.datasets import load_breast_cancer
import pandas as pd
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
Now that we have loaded the data and converted it to a DataFrame, let's have a look at the worst concave points column:
import altair as alt
alt.Chart(df).mark_circle(size=60)\
.encode(x='worst concave points', y='target')
The resulting plot is as follows:
Caption: Scatter plot of the worst concave points and target variables
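sklearn provides a function called
plot_partial_dependence() to display the partial dependence
plot for a given feature. Let's see how to use it on the Breast Cancer
dataset with the Random Forest model (rf_model) trained on data.data in
the previous section. First, we need to get the index of the column we
are interested in. We will use the .get_loc() method from
pandas to get the index for the
worst concave points column:
from sklearn.inspection import plot_partial_dependence
feature_index = df.columns.get_loc("worst concave points")
Now we can call the plot_partial_dependence() function. We
need to provide the following parameters: the trained model, the
data used to train it, and the indices of the features to be analyzed.
Because df now also contains the target column, which the model was not
trained on, we pass the feature matrix data.data rather than df:
plot_partial_dependence(rf_model, data.data,\
                        features=[feature_index])
Note
plot_partial_dependence() was deprecated in scikit-learn 1.0 and removed in version 1.2. On newer versions, PartialDependenceDisplay.from_estimator() from sklearn.inspection accepts the same three arguments.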
Caption: Partial dependence plot for the worst concave points column
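To see what such a plot represents, the values behind it can be computed by hand: for each value on a grid, overwrite the feature with that value in every row, predict, and average the predictions. The following is a simplified sketch under stated assumptions (a 20-point grid between the 5th and 95th percentiles, and the probability of class 1, which is benign in this dataset), not the exact sklearn implementation; grid and pd_values are names chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
rf_model = RandomForestClassifier(random_state=168).fit(X, y)

feature_index = list(data.feature_names).index('worst concave points')

# Grid of values to evaluate, trimmed to avoid extreme outliers
grid = np.linspace(np.percentile(X[:, feature_index], 5),
                   np.percentile(X[:, feature_index], 95), 20)

pd_values = []
for value in grid:
    X_mod = X.copy()
    # Force every observation to take the same value for this feature
    X_mod[:, feature_index] = value
    # Average predicted probability of class 1 (benign) over all rows
    pd_values.append(rf_model.predict_proba(X_mod)[:, 1].mean())
pd_values = np.array(pd_values)

for value, avg_proba in zip(grid, pd_values):
    print(round(value, 3), round(avg_proba, 3))
```

Plotting pd_values against grid gives the partial dependence curve: it shows how the average prediction changes as the feature varies while all other features keep their observed values.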
Exercise 9.04: Plotting Partial Dependence
In this exercise, we will plot partial dependence plots for two
variables, a1pop and temp, from a Random Forest
regressor model trained to predict the customer drop-out ratio.
We will be using the same dataset as in the previous exercise.
-
Open a new Jupyter notebook.
-
Import the following packages: pandas, train_test_split from sklearn.model_selection, RandomForestRegressor from sklearn.ensemble, plot_partial_dependence from sklearn.inspection, and altair:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import plot_partial_dependence
import altair as alt
-
Create a variable called file_url that contains the URL for the dataset:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab09/Dataset/phpYYZ4Qc.csv'
-
Load the dataset into a DataFrame called df using .read_csv():
df = pd.read_csv(file_url)
-
Extract the rej column using .pop() and save it into a variable called y:
y = df.pop('rej')
-
Split the DataFrame into training and testing sets using train_test_split() with test_size=0.3 and random_state=1:
X_train, X_test, y_train, y_test = train_test_split\
    (df, y, test_size=0.3,\
     random_state=1)
-
Instantiate RandomForestRegressor with random_state=1, n_estimators=50, max_depth=6, and min_samples_leaf=60:
rf_model = RandomForestRegressor(random_state=1,\
                                 n_estimators=50, max_depth=6,\
                                 min_samples_leaf=60)
-
Train the model on the training set using .fit():
rf_model.fit(X_train, y_train)
You should get the following output:
Caption: Logs of RandomForest
-
Plot the partial dependence plot using plot_partial_dependence() from sklearn with the Random Forest model, the testing set, and the index of the a1pop column:
plot_partial_dependence(rf_model, X_test,\
                        features=[df.columns.get_loc('a1pop')])
You should get the following output:
Caption: Partial dependence plot for a1pop
-
Plot the partial dependence plot using plot_partial_dependence() from sklearn with the Random Forest model, the testing set, and the index of the temp column:
plot_partial_dependence(rf_model, X_test,\
                        features=[df.columns.get_loc('temp')])
You should get the following output:
Caption: Partial dependence plot for temp
Local Interpretation with LIME
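LIME (Local Interpretable Model-agnostic Explanations) explains a single prediction by fitting a simple, interpretable model around the observation of interest. Let's see how we can use it on the Breast Cancer dataset. First, we will load the data and train a Random Forest model: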
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split\
(X, y, test_size=0.3, \
random_state=1)
rf_model = RandomForestClassifier(random_state=168)
rf_model.fit(X_train, y_train)
The lime package is not installed by default in most Jupyter
environments, so we need to install it manually with the following command:
!pip install lime
The output will be as follows:
Caption: Installation logs for the lime package
Once installed, we will instantiate the LimeTabularExplainer
class by providing the training data, the names of the features, the
names of the classes to be predicted, and the task type (in this
example, it is classification):
from lime.lime_tabular import LimeTabularExplainer
lime_explainer = LimeTabularExplainer\
(X_train, feature_names=data.feature_names,\
class_names=data.target_names,\
mode='classification')
Then, we will call the .explain_instance() method with the
observations we are interested in (here, it will be the second
observation from the testing set) and the function that will predict the
outcome probabilities (here, it is .predict_proba()).
Finally, we will call the .show_in_notebook() method to
display the results from lime:
exp = lime_explainer.explain_instance\
(X_test[1], rf_model.predict_proba, num_features=10)
exp.show_in_notebook()
The output will be as follows:
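The idea behind a local explanation can be illustrated without the lime package. The sketch below is a deliberately simplified stand-in, not the algorithm lime actually runs: it perturbs the observation with Gaussian noise, asks the Random Forest for probabilities on those neighbours, weights each neighbour by its proximity, and fits a weighted Ridge regression whose coefficients act as local feature effects. The sample count, kernel width, and Ridge penalty are all assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1)
rf_model = RandomForestClassifier(random_state=168).fit(X_train, y_train)

rng = np.random.default_rng(0)
instance = X_test[1]            # second observation of the testing set
scale = X_train.std(axis=0)     # per-feature spread of the training data

# 1. Sample neighbours around the observation (noise is in standard units)
n_samples = 2000
noise = rng.normal(size=(n_samples, X_train.shape[1]))
neighbours = instance + noise * scale

# 2. Ask the black-box model for its predictions on the neighbours
target_proba = rf_model.predict_proba(neighbours)[:, 1]

# 3. Give more weight to neighbours that are close to the observation
distances = np.sqrt((noise ** 2).sum(axis=1))
kernel_width = 0.75 * np.sqrt(X_train.shape[1])  # assumed value for the sketch
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 4. Fit an interpretable surrogate on the standardized offsets
surrogate = Ridge(alpha=1.0)
surrogate.fit(noise, target_proba, sample_weight=weights)

# Features with the largest local effect on the predicted probability
top = np.argsort(np.abs(surrogate.coef_))[::-1][:5]
for idx in top:
    print(data.feature_names[idx], round(surrogate.coef_[idx], 4))
```

The coefficients only describe the model's behaviour near this one observation; a different observation can yield a different ranking, which is what distinguishes local interpretation from the global techniques seen earlier.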
Exercise 9.05: Local Interpretation with LIME
In this exercise, we will use LIME to analyze some predictions from a Random Forest regressor model trained to predict the customer drop-out ratio.
We will be using the same dataset as in the previous exercise.
-
Open a new Jupyter notebook.
-
Import the following packages: pandas, train_test_split from sklearn.model_selection, and RandomForestRegressor from sklearn.ensemble:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
-
Create a variable called file_url that contains the URL of the dataset:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab09/Dataset/phpYYZ4Qc.csv'
-
Load the dataset into a DataFrame called df using .read_csv():
df = pd.read_csv(file_url)
-
Extract the rej column using .pop() and save it into a variable called y:
y = df.pop('rej')
-
Split the DataFrame into training and testing sets using train_test_split() with test_size=0.3 and random_state=1:
X_train, X_test, y_train, y_test = train_test_split\
    (df, y, test_size=0.3,\
     random_state=1)
-
Instantiate RandomForestRegressor with random_state=1, n_estimators=50, max_depth=6, and min_samples_leaf=60:
rf_model = RandomForestRegressor(random_state=1,\
                                 n_estimators=50, max_depth=6,\
                                 min_samples_leaf=60)
-
Train the model on the training set using .fit():
rf_model.fit(X_train, y_train)
You should get the following output:
Caption: Logs of RandomForest
-
Install the lime package using the !pip install command:
!pip install lime
-
Import LimeTabularExplainer from lime.lime_tabular:
from lime.lime_tabular import LimeTabularExplainer
-
Instantiate LimeTabularExplainer with the training set and mode='regression':
lime_explainer = LimeTabularExplainer\
    (X_train.values,\
     feature_names=X_train.columns,\
     mode='regression')
-
Display the LIME analysis on the first row of the testing set using .explain_instance() and .show_in_notebook():
exp = lime_explainer.explain_instance\
    (X_test.values[0], rf_model.predict)
exp.show_in_notebook()
You should get the following output:
Caption: LIME output for the first observation of the testing set
This output shows that the predicted value for this observation is a
0.02 chance of customer drop-out and it has been mainly influenced
by the a1pop, a3pop, a2pop, and
b2eff features. For instance, the fact that
a1pop was under 0.87 has decreased the value of the
target variable by 0.01.
-
Display the LIME analysis on the third row of the testing set using .explain_instance() and .show_in_notebook():
exp = lime_explainer.explain_instance\
    (X_test.values[2], rf_model.predict)
exp.show_in_notebook()
You should get the following output:
Activity 9.01: Train and Analyze a Network Intrusion Detection Model
You are working for a cybersecurity company and have been tasked with building a model that can recognize network intrusions. You will then analyze its feature importance, plot partial dependence, and perform local interpretation on a single observation using LIME.
The dataset provided contains data from 7 weeks of network traffic.
The following steps will help you to complete this activity:
-
Download and load the dataset using .read_csv() from pandas.
-
Extract the response variable using .pop() from pandas.
-
Split the dataset into training and test sets using train_test_split() from sklearn.model_selection.
-
Create a function that will instantiate and fit RandomForestClassifier using .fit() from sklearn.ensemble.
-
Create a function that will predict the outcome for the training and testing sets using .predict().
-
Create a function that will print the accuracy score for the training and testing sets using accuracy_score() from sklearn.metrics.
-
Compute the feature importance via permutation with feature_importance_permutation() and display it on a bar chart using altair.
-
Plot the partial dependence plot using plot_partial_dependence on the src_bytes variable.
-
Install lime using !pip install.
-
Perform a LIME analysis on row 99893 with explain_instance().
The output should be as follows:
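The three helper functions requested in the steps above could be sketched along these lines. This is one possible shape, not the reference solution: the function names (fit_rf, get_preds, print_accuracy) are hypothetical, and the sketch runs on synthetic data from make_classification because the activity dataset is not loaded here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_rf(X_train, y_train, random_state=1, **params):
    """Instantiate and fit a RandomForestClassifier."""
    model = RandomForestClassifier(random_state=random_state, **params)
    model.fit(X_train, y_train)
    return model

def get_preds(model, X_train, X_test):
    """Predict the outcome for the training and testing sets."""
    return model.predict(X_train), model.predict(X_test)

def print_accuracy(y_train, y_test, preds_train, preds_test):
    """Print and return the accuracy on the training and testing sets."""
    train_acc = accuracy_score(y_train, preds_train)
    test_acc = accuracy_score(y_test, preds_test)
    print(f'Training accuracy: {train_acc:.4f}')
    print(f'Testing accuracy: {test_acc:.4f}')
    return train_acc, test_acc

# Synthetic stand-in for the activity dataset (an assumption of this sketch)
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

rf_model = fit_rf(X_train, y_train, n_estimators=50, max_depth=6)
preds_train, preds_test = get_preds(rf_model, X_train, X_test)
train_acc, test_acc = print_accuracy(y_train, y_test,
                                     preds_train, preds_test)
```

Wrapping these steps in functions makes it easy to retrain with different hyperparameters and compare the training and testing scores after each attempt.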
Summary
In this lab, we learned a few techniques for interpreting machine learning models. We saw that there are techniques that are specific to the model used: coefficients for linear models and variable importance for tree-based models. There are also some methods that are model-agnostic, such as variable importance via permutation, partial dependence plots, and local interpretation with LIME.




































