Lab 3. Binary Classification
In this lab, we will be using a real-world dataset and a supervised learning technique called classification to generate business outcomes.
Exercise 3.01: Loading and Exploring the Data from the Dataset
In this exercise, we will load the dataset in our Jupyter notebook and do
some basic exploration, such as printing the dimensions of the dataset
using the .shape attribute and generating summary
statistics of the dataset using the .describe() method.
The following steps will help you to complete this exercise:
-   Open a new Jupyter notebook.

-   Now, import pandas in your Jupyter notebook:

```
import pandas as pd
```

-   Assign the link to the dataset to a variable called `file_url`:

```
file_url = 'https://raw.githubusercontent.com/fenago'\
           '/data-science/master/Lab03'\
           '/bank-full.csv'
```

-   Now, read the file using the `pd.read_csv()` function from pandas:

```
# Loading the data using pandas
bankData = pd.read_csv(file_url, sep=";")
bankData.head()
```

You should get the following output:

Caption: Loading data into a Jupyter notebook

Here, we loaded the `CSV` file and then stored it as a
pandas DataFrame for further analysis.

-   Next, print the shape of the dataset, as mentioned in the following code snippet:

```
# Printing the shape of the data
print(bankData.shape)
```

The `.shape` attribute is used to find the overall dimensions of the dataset. You should get the following output:

```
(45211, 17)
```

-   Now, find the summary of the numerical raw data as a table output using the `.describe()` method in pandas, as mentioned in the following code snippet:

```
# Summarizing the statistics of the numerical raw data
bankData.describe()
```

You should get the following output:
Stacked bar charts
Let's create some dummy data and generate a stacked bar chart to
check the proportion of jobs in different sectors.
**Note:** Do not execute any of the following code snippets until the final
step. Enter all the code in the same cell.
Import the library files required for the task:
```
# Importing library files
import matplotlib.pyplot as plt
import numpy as np
```
Next, create some sample data detailing a list of jobs:
```
# Create a simple list of categories
jobList = ['admin','scientist','doctor','management']
```
Each job will have two categories to be plotted, `yes` and
`No`, with some proportion split between the two. These are detailed as
follows:
```
# Getting two categories ('yes', 'No') for each of the jobs
jobYes = [20,60,70,40]
jobNo = [80,40,30,60]
```
In the next steps, the length of the job list is taken as the number
of x-axis labels, and their indexes are arranged using the
`np.arange()` function:
```
# Get the number of x axis labels and arrange their indexes
xlabels = len(jobList)
ind = np.arange(xlabels)
```
Next, let's define the width of each bar and do the plotting. In
plot `p2`, we specify `bottom=jobYes` so that, when stacking,
`yes` is at the bottom and `No` is on top:
```
# Get width of each bar
width = 0.35
# Getting the plots
p1 = plt.bar(ind, jobYes, width)
p2 = plt.bar(ind, jobNo, width, bottom=jobYes)
```
Define the labels for the *Y* axis and the title of the plot:
```
# Getting the labels for the plots
plt.ylabel('Proportion of Jobs')
plt.title('Job')
```
The indexes for the *X* and *Y* axes are defined next. For the *X*
axis, the list of jobs is given, and, for the *Y* axis, the tick marks
run from `0` to `100` in increments
of `10` (0, 10, 20, 30, and so on):
```
# Defining the x label indexes and y label indexes
plt.xticks(ind, jobList)
plt.yticks(np.arange(0, 100, 10))
```
The last step is to define the legends and to rotate the axis labels
to `90` degrees. The plot is finally displayed:
```
# Defining the legends
plt.legend((p1[0], p2[0]), ('Yes', 'No'))
# To rotate the axis labels
plt.xticks(rotation=90)
plt.show()
```
Here is what a stacked bar chart looks like based on the preceding example:
Caption: Example of a stacked bar plot
Let's use these graphs in the following exercises and activities.
Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
The goal of this exercise is to define a hypothesis to check the propensity for an individual to purchase a term deposit plan against their age. We will be using a line graph for this exercise.
The following steps will help you to complete this exercise:
-   Let's first define our hypothesis on age and the propensity to buy term deposits. Our hypothesis is as follows: the propensity to buy term deposits is higher among elderly customers than among younger ones.
-   Import the pandas and altair packages:

```
import pandas as pd
import altair as alt
```

-   Next, load the dataset, just like you loaded the dataset in Exercise 3.01, Loading and Exploring the Data from the Dataset:

```
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")
```

Note

Steps 2-3 will be repeated in the following exercises for this lab.

We will be verifying how the purchased term deposits are distributed by age.

-   Next, we will count the number of records for each age group, using a combination of the `.groupby()`, `.agg()`, and `.reset_index()` methods from `pandas`.

Note

You will see further details of these methods in Lab 12, Feature Engineering.

```
filter_mask = bankData['y'] == 'yes'
bankSub1 = bankData[filter_mask]\
           .groupby('age')['y'].agg(agegrp='count')\
           .reset_index()
```

-   Now, plot a line chart using altair's `.Chart().mark_line().encode()` methods, defining the `x` and `y` variables, as shown in the following code snippet:

```
# Visualising the relationship using altair
alt.Chart(bankSub1).mark_line().encode(x='age', y='agegrp')
```

You should get the following output:

-   Group the data by age using the `groupby()` method and find the total number of customers under each age group using the `agg()` method:

```
# Getting another perspective
ageTot = bankData.groupby('age')['y']\
         .agg(ageTot='count').reset_index()
ageTot.head()
```

The output is as follows:

Caption: Customers per age group

-   Now, group the data by both age and propensity of purchase and find the total counts under each category of propensity, which are `yes` and `no`:

```
# Getting all the details in one place
ageProp = bankData.groupby(['age','y'])['y']\
          .agg(ageCat='count').reset_index()
ageProp.head()
```

The output is as follows:

Caption: Propensity by age group

-   Merge both of these DataFrames based on the `age` variable using the `pd.merge()` function, and then divide each category of propensity within each age group by the total customers in the respective age group to get the proportion of customers, as shown in the following code snippet:

```
# Merging both the data frames
ageComb = pd.merge(ageProp, ageTot, left_on=['age'], \
                   right_on=['age'])
ageComb['catProp'] = (ageComb.ageCat / ageComb.ageTot) * 100
ageComb.head()
```

The output is as follows:

-   Now, display the proportions by plotting both categories (yes and no) as separate plots. This can be achieved through a method within `altair` called `facet()`:

```
# Visualising the relationship using altair
alt.Chart(ageComb).mark_line()\
   .encode(x='age', y='catProp').facet(column='y')
```

This function makes as many plots as there are categories within the variable. Here, we pass the `'y'` variable, which holds the `yes` and `no` categories, to the `facet()` function, and we get two different plots: one for `yes` and another for `no`. You should get the following output:

Caption: Visualizing normalized relationships
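For reference, the same normalized proportions can be obtained without an explicit merge by using a grouped transform. This is a minimal alternative sketch, not part of the original exercise; it assumes `bankData` is loaded as above:

```
# Alternative: proportion of yes/no within each age group,
# using a grouped transform instead of pd.merge()
ageCounts = bankData.groupby(['age', 'y'])['y']\
            .agg(ageCat='count').reset_index()
ageCounts['catProp'] = ageCounts['ageCat'] \
    / ageCounts.groupby('age')['ageCat'].transform('sum') * 100
ageCounts.head()
```

Both routes produce the same `catProp` values; the merge-based version in the exercise makes each intermediate table explicit, which is easier to inspect.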
Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
You are working as a data scientist for a bank. You are provided with historical data from the management of the bank and are asked to try to formulate a hypothesis between employment status and the propensity to buy term deposits.
In this activity, we will verify the relationship between employment status and term deposit purchase propensity.
The steps are as follows:
-   Formulate the hypothesis between employment status and the propensity for term deposits. Let the hypothesis be as follows: high-paid employees prefer term deposits more than other categories of employees.

-   Open a Jupyter notebook similar to the one used in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan, and install and import the necessary libraries, such as `pandas` and `altair`.

-   From the banking DataFrame, `bankData`, find the distribution of employment status using the `.groupby()`, `.agg()`, and `.reset_index()` methods: group the data by employment status using `.groupby()` and find the total count of propensities for each employment status using `.agg()`.

-   Now, merge both DataFrames using the `pd.merge()` function and then create a new variable for the propensity proportion by dividing each propensity count by the total count for the respective employment status.

-   Plot the data and summarize your intuitions from the plot using `matplotlib`. Use a stacked bar chart for this activity (see the solution sketch after the expected output below).
Expected output: The final plot of the propensity to buy with respect to employment status will be similar to the following plot:
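In case you want to check your approach, here is a minimal solution sketch. It assumes `bankData` is loaded as in the earlier exercises; the names `jobTot`, `jobProp`, and `jobComb` are illustrative choices, not prescribed by the activity:

```
import pandas as pd
import matplotlib.pyplot as plt

# Total customers per employment status
jobTot = bankData.groupby('job')['y']\
         .agg(jobTot='count').reset_index()
# Propensity counts per employment status
jobProp = bankData.groupby(['job', 'y'])['y']\
          .agg(jobCat='count').reset_index()
# Merge and convert the counts to proportions
jobComb = pd.merge(jobProp, jobTot, on=['job'])
jobComb['catProp'] = (jobComb.jobCat / jobComb.jobTot) * 100

# Pivot so 'yes' and 'no' become columns, then stack the bars
pivot = jobComb.pivot(index='job', columns='y', values='catProp')
p1 = plt.bar(pivot.index, pivot['yes'], width=0.35)
p2 = plt.bar(pivot.index, pivot['no'], width=0.35, \
             bottom=pivot['yes'])
plt.ylabel('Proportion of customers')
plt.title('Propensity by employment status')
plt.legend((p1[0], p2[0]), ('Yes', 'No'))
plt.xticks(rotation=90)
plt.show()
```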
Exercise 3.03: Feature Engineering -- Exploration of Individual Features
In this exercise, we will explore the relationship between two variables, which are whether an individual owns a house and whether an individual has a loan, to the propensity for term deposit purchases by these individuals.
The following steps will help you to complete this exercise:
-   Open a new Jupyter notebook.

-   Import the `pandas` package:

```
import pandas as pd
```

-   Assign the link to the dataset to a variable called `file_url`:

```
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
```

-   Read the banking dataset using the `.read_csv()` function:

```
# Reading the banking data
bankData = pd.read_csv(file_url, sep=";")
```

-   Next, we will find a relationship between housing and the propensity for term deposits, as mentioned in the following code snippet:

```
# Relationship between housing and propensity for term deposits
bankData.groupby(['housing', 'y'])['y']\
        .agg(houseTot='count').reset_index()
```

You should get the following output:
-   Explore the `loan` variable to find its relationship with the propensity for a term deposit, as mentioned in the following code snippet:

```
"""
Relationship between having a loan
and propensity for term deposits
"""
bankData.groupby(['loan', 'y'])['y']\
        .agg(loanTot='count').reset_index()
```

Note

The triple quotes (`"""`) shown in the code snippet above are used to denote the start and end points of a multi-line code comment. This is an alternative to using the `#` symbol.

You should get the following output:

Caption: Loan versus term deposit propensity

In the case of loan portfolios, the propensity to buy term deposits
is higher for customers without loans:
4805 / (4805 + 33162) ≈ 12.7%, compared with 484 / (484 + 6760) ≈ 6.7% for customers with loans.
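These proportions can also be read off directly with `pd.crosstab`, which normalizes the counts within each row. This is a small supplementary sketch, assuming `bankData` is loaded as above:

```
# Proportion of 'yes'/'no' within each loan category;
# normalize='index' divides each row by its row total
pd.crosstab(bankData['loan'], bankData['y'], \
            normalize='index') * 100
```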
Housing and loans were categorical data and finding a relationship
was straightforward. However, bank balance data is numerical and to
analyze it, we need to have a different strategy. One common
strategy is to convert the continuous numerical data into ordinal
data and look at how the propensity varies across each category.
-   To convert numerical values into ordinal values, we first find the quantile values and take them as threshold values. The quantiles are obtained using the following code snippet:

```
# Taking the quantiles for 25%, 50% and 75% of the balance data
import numpy as np
np.quantile(bankData['balance'], [0.25, 0.5, 0.75])
```

You should get the following output:

Caption: Quantiles for bank balance data

-   Now, convert the numerical values of bank balances into categorical values, as mentioned in the following code snippet:

```
bankData['balanceClass'] = 'Quant1'
bankData.loc[(bankData['balance'] > 72) \
             & (bankData['balance'] < 448), \
             'balanceClass'] = 'Quant2'
bankData.loc[(bankData['balance'] > 448) \
             & (bankData['balance'] < 1428), \
             'balanceClass'] = 'Quant3'
bankData.loc[bankData['balance'] > 1428, \
             'balanceClass'] = 'Quant4'
bankData.head()
```

You should get the following output:

Caption: New features from bank balance data
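The same binning can be written more compactly with `pd.cut`. This is an optional alternative sketch, assuming the quantile thresholds printed above (72, 448, and 1428):

```
# Equivalent binning with pd.cut, using the quantile thresholds
bankData['balanceClass'] = pd.cut(\
    bankData['balance'], \
    bins=[-float('inf'), 72, 448, 1428, float('inf')], \
    labels=['Quant1', 'Quant2', 'Quant3', 'Quant4'])
```

Note that `pd.cut` includes the right edge of each bin by default, so values falling exactly on a threshold are assigned slightly differently than with the chained `.loc` masks above.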
-   Next, we need to find the propensity of term deposit purchases based on each quantile the customers fall into. This task is similar to what we did in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan:

```
# Calculating the customers under each quantile
balanceTot = bankData.groupby(['balanceClass'])['y']\
             .agg(balanceTot='count').reset_index()
balanceTot
```

You should get the following output:

Caption: Classification based on quantiles

-   Calculate the total number of customers categorized by quantile and propensity classification, as mentioned in the following code snippet:

```
"""
Calculating the total customers categorised as per
quantile and propensity classification
"""
balanceProp = bankData.groupby(['balanceClass', 'y'])['y']\
              .agg(balanceCat='count').reset_index()
balanceProp
```

You should get the following output:

Caption: Total number of customers categorized by quantile and propensity classification

-   Now, merge both DataFrames:

```
# Merging both the data frames
balanceComb = pd.merge(balanceProp, balanceTot, \
                       on=['balanceClass'])
balanceComb['catProp'] = (balanceComb.balanceCat \
                          / balanceComb.balanceTot) * 100
balanceComb
```

You should get the following output:

Caption: Propensity versus balance category
In the next exercise, we will use these intuitions to derive a new feature.
Exercise 3.04: Feature Engineering -- Creating New Features from Existing Ones
In this exercise, we will combine the individual variables we analyzed in Exercise 3.03, Feature Engineering -- Exploration of Individual Features to derive a new feature called an asset index. One methodology to create an asset index is by assigning weights based on the asset or liability of the customer.
For instance, a higher bank balance or home ownership has a positive bearing on the overall asset index and is therefore assigned a higher weight. In contrast, the presence of a loan is a liability and is therefore assigned a lower weight. Let's give a weight of 5 if the customer has a house and 1 in its absence. Similarly, we can give a weight of 1 if the customer has a loan and 5 in the case of no loans:
-   Open a new Jupyter notebook.

-   Import the pandas and numpy packages:

```
import pandas as pd
import numpy as np
```

-   Assign the link to the dataset to a variable called `file_url`:

```
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
```

-   Read the banking dataset using the `.read_csv()` function:

```
# Reading the banking data
bankData = pd.read_csv(file_url, sep=";")
```

-   The first step we will follow is to normalize the numerical variables. This is implemented using the following code snippet:

```
# Normalizing data
from sklearn import preprocessing
x = bankData[['balance']].values.astype(float)
```

-   As the bank balance data contains numerical values, we need to normalize it first. The purpose of normalization is to bring all of the variables that we are using to create the new feature onto a common scale. One effective method we can use here is `MinMaxScaler()`, which rescales all of the numerical data to a range between 0 and 1. The `MinMaxScaler` class is available within the `preprocessing` module in `sklearn`:

```
minmaxScaler = preprocessing.MinMaxScaler()
```

-   Transform the balance data by normalizing it with `minmaxScaler`:

```
bankData['balanceTran'] = minmaxScaler.fit_transform(x)
```

In this step, we created a new feature called `balanceTran` to store the normalized bank balance values.

-   Print the head of the data using the `.head()` function:

```
bankData.head()
```

You should get the following output:

Caption: Normalizing the bank balance data
-   After creating the normalized variable, add a small numerical constant of `0.00001` so as to eliminate the 0 values in the variable. This is mentioned in the following code snippet:

```
# Adding a small numerical constant to eliminate 0 values
bankData['balanceTran'] = bankData['balanceTran'] + 0.00001
```

-   Now, add two additional columns to introduce the transformed variables for loans and housing, as per the weighting approach discussed at the start of this exercise:

```
# Let us transform values for loan data
bankData['loanTran'] = 1
# Giving a weight of 5 if there is no loan
bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5
bankData.head()
```

You should get the following output:

-   Now, transform the values for the housing data, as mentioned here:

```
# Let us transform values for Housing data
bankData['houseTran'] = 5
```

-   Give a weight of `1` if the customer does not have a house and print the results, as mentioned in the following code snippet:

```
bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1
print(bankData.head())
```

You should get the following output:

-   Now, create a new variable, which is the product of all of the transformed variables:

```
"""
Let us now create the new variable which is
a product of all these
"""
bankData['assetIndex'] = bankData['balanceTran'] \
                         * bankData['loanTran'] \
                         * bankData['houseTran']
bankData.head()
```

You should get the following output:

Caption: Creating a composite index
-   Explore the propensity with respect to the composite index.

We observe the relationship between the asset index and the propensity for term deposit purchases. We adopt a similar strategy of converting the numerical values of the asset index into ordinal values by taking the quantiles and then mapping the quantiles to the propensity of term deposit purchases, as mentioned in Exercise 3.03, Feature Engineering -- Exploration of Individual Features:

```
# Finding the quantile
np.quantile(bankData['assetIndex'], [0.25, 0.5, 0.75])
```

You should get the following output:

Caption: Conversion of numerical values into ordinal values

-   Next, create quantiles from the `assetIndex` data, as mentioned in the following code snippet:

```
bankData['assetClass'] = 'Quant1'
bankData.loc[(bankData['assetIndex'] > 0.38) \
             & (bankData['assetIndex'] < 0.57), \
             'assetClass'] = 'Quant2'
bankData.loc[(bankData['assetIndex'] > 0.57) \
             & (bankData['assetIndex'] < 1.9), \
             'assetClass'] = 'Quant3'
bankData.loc[bankData['assetIndex'] > 1.9, \
             'assetClass'] = 'Quant4'
bankData.head()
```

You should get the following output:

Caption: Quantiles for the asset index
-   Calculate the total of each asset class and the category-wise counts, as mentioned in the following code snippet:

```
# Calculating total of each asset class
assetTot = bankData.groupby('assetClass')['y']\
           .agg(assetTot='count').reset_index()
# Calculating the category wise counts
assetProp = bankData.groupby(['assetClass', 'y'])['y']\
            .agg(assetCat='count').reset_index()
```

-   Next, merge both DataFrames:

```
# Merging both the data frames
assetComb = pd.merge(assetProp, assetTot, on=['assetClass'])
assetComb['catProp'] = (assetComb.assetCat \
                        / assetComb.assetTot) * 100
assetComb
```

You should get the following output:

Caption: Composite index relationship mapping
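To see the pattern at a glance, you can also plot the merged proportions, mirroring the faceted chart from Exercise 3.02. This is an optional sketch, assuming `assetComb` from the previous step; a bar mark suits the ordinal quantile classes better than a line:

```
import altair as alt

# Proportion of yes/no within each asset quantile
alt.Chart(assetComb).mark_bar()\
   .encode(x='assetClass', y='catProp').facet(column='y')
```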
A Quick Peek at Data Types and a Descriptive Summary
Looking at the data types such as categorical or numeric and then deriving summary statistics is a good way to take a quick peek into data before you do some of the downstream feature engineering steps. Let's take a look at an example from our dataset:
```
# Looking at Data types
print(bankData.dtypes)
# Looking at descriptive statistics
print(bankData.describe())
```
You should get the following output:
The following output is the descriptive summary statistics, which
display basic measures such as the mean,
standard deviation, count, and the
quantile values of the respective features:
Correlation Matrix and Visualization
Let's look at how data correlation can be generated and then visualized in the following exercise.
Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
In this exercise, we will be creating a correlation plot and analyzing the results of the bank dataset.
The following steps will help you to complete the exercise:
-   Open a new Jupyter notebook, import the `pandas` package, and load the banking data:

```
import pandas as pd
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")
```

-   Now, import the `set_option` function from `pandas`, as mentioned here:

```
from pandas import set_option
```

The `set_option` function is used to define the display options for many operations.

-   Next, create a variable that stores the numerical variables `'age'`, `'balance'`, `'day'`, `'duration'`, `'campaign'`, `'pdays'`, and `'previous'`, as mentioned in the following code snippet. A correlation plot can be generated only from numerical data, which is why the numerical data has to be extracted separately:

```
bankNumeric = bankData[['age','balance','day','duration',\
                        'campaign','pdays','previous']]
```

-   Now, use the `.corr()` function to find the correlation matrix for the dataset:

```
set_option('display.width', 150)
set_option('display.precision', 3)
bankCorr = bankNumeric.corr(method='pearson')
bankCorr
```

You should get the following output:

-   Now, plot the data:

```
from matplotlib import pyplot
corFig = pyplot.figure()
figAxis = corFig.add_subplot(111)
corAx = figAxis.matshow(bankCorr, vmin=-1, vmax=1)
corFig.colorbar(corAx)
pyplot.show()
```

You should get the following output:

Caption: Correlation plot
Skewness of Data
Let's take a look at the following example. Here, we use the
.skew() method to find the skewness in the data. For instance,
to find the skewness of the data in our bank-full.csv dataset,
we perform the following:

```
# Skewness of numeric attributes
bankNumeric.skew()
```
Note
This code refers to the bankNumeric data, so you should
ensure you are working in the same notebook as the previous exercise.
You should get the following output:
Histograms
Histograms are an effective way to plot the distribution of data and to identify skewness in data, if any.
```
# Histograms
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 2)
axs[0].hist(bankNumeric['age'])
axs[0].set_title('Distribution of age')
axs[1].hist(bankNumeric['balance'])
axs[1].set_title('Distribution of Balance')
# Ensure plots do not overlap
plt.tight_layout()
```
You should get the following output:
Density Plots
Density plots help in visualizing the distribution of data. A density
plot can be created using the kind='density' parameter:

```
from matplotlib import pyplot as plt
# Density plots
bankNumeric['age'].plot(kind='density', subplots=False, \
                        layout=(1, 1))
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Normalised age distribution')
plt.show()
```
You should get the following output:
Caption: Code showing the generation of a density plot
Other Feature Engineering Methods
Normalization and standard scaling are important feature engineering steps that are applied to the data before the downstream modeling steps. Let's look at both of these techniques:
```
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
from numpy import set_printoptions
scaling = StandardScaler().fit(bankNumeric)
rescaledNum = scaling.transform(bankNumeric)
set_printoptions(precision=3)
print(rescaledNum)
```
You should get the following output:
Caption: Output from standardizing the data
The following code uses the Normalizer data transformation technique:
```
# Normalizing Data (Length of 1)
from sklearn.preprocessing import Normalizer
normaliser = Normalizer().fit(bankNumeric)
normalisedNum = normaliser.transform(bankNumeric)
set_printoptions(precision=3)
print(normalisedNum)
```
You should get the following output:
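The two transforms are easy to confuse: `StandardScaler` works column-wise (each feature ends up with zero mean and unit variance), whereas `Normalizer` works row-wise (each sample is rescaled to unit length). Here is a tiny illustration on made-up data, for intuition only:

```
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
# Column-wise: each column now has mean 0 and unit variance
print(StandardScaler().fit_transform(toy))
# Row-wise: each row now has Euclidean length 1
print(Normalizer().fit_transform(toy))
```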
Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
In this exercise, we will build a logistic regression model, which will be used for predicting the propensity of term deposit purchases. This exercise will have three parts. The first part will be the preprocessing of the data, the second part will deal with the training process, and the last part will be spent on prediction, analysis of metrics, and deriving strategies for further improvement of the model.
You begin with data preprocessing.
In this part, we will first load the data, convert the ordinal data into dummy data, and then split the data into training and test sets for the subsequent training phase:
-   Open a Jupyter notebook, install the necessary packages, and load the data, as in previous exercises:

```
import pandas as pd
import altair as alt
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")
```

-   Now, load the library functions:

```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```

-   Now, find the data types:

```
bankData.dtypes
```

You should get the following output:

Caption: Data types

-   Convert the categorical data into dummy variables.

The value of each dummy variable is either 1 or 0, depending on whether that category is present in the example. Let's look at the code for doing that:

```
"""
Converting all the categorical variables to dummy variables
"""
bankCat = pd.get_dummies\
          (bankData[['job','marital',\
                     'education','default','housing',\
                     'loan','contact','month','poutcome']])
bankCat.shape
```

You should get the following output:

```
(45211, 44)
```

-   Now, separate the numerical variables:

```
bankNum = bankData[['age','balance','day','duration',\
                    'campaign','pdays','previous']]
bankNum.shape
```

You should get the following output:

```
(45211, 7)
```

-   Now, prepare the `X` and `Y` variables and print their shapes. The `X` variable is the concatenation of the transformed categorical variables and the separated numerical data:

```
# Preparing the X variables
X = pd.concat([bankCat, bankNum], axis=1)
print(X.shape)
# Preparing the Y variable
Y = bankData['y']
print(Y.shape)
X.head()
```

The output shown below is truncated:

-   Split the data into training and test sets:

```
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split\
                                   (X, Y, test_size=0.3, \
                                    random_state=123)
```

Now, the data is all prepared for the modeling task. Next, we begin with modeling. In this part, we will train the model using the training set we created in the earlier step. First, we call the logistic regression function and then fit the model with the training set data.
-   Define the `LogisticRegression` model:

```
bankModel = LogisticRegression()
bankModel.fit(X_train, y_train)
```

You should get the following output:

Caption: Parameters of the model that fits

-   Now that the model is created, use it to predict on the test set and then get the accuracy of the predictions:

```
pred = bankModel.predict(X_test)
print('Accuracy of Logistic regression model '\
      'prediction on test set: {:.2f}'\
      .format(bankModel.score(X_test, y_test)))
```

You should get the following output:

Caption: Prediction with the model

-   At first glance, an accuracy of 90% gives the impression that the model has done a decent job of approximating the data-generating process. But has it? Let's take a closer look at the details of the prediction by generating metrics for the model. We will use two metric-generating functions: the confusion matrix and the classification report:

```
# Confusion Matrix for the model
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)
```

You should get output in the following format; however, the values can vary as the modeling task involves variability:

Caption: Generation of the confusion matrix

-   Next, let's generate a `classification_report`:

```
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
```

You should get a similar output; however, with different values due to variability in the modeling process:
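One quick way to see why accuracy alone can mislead here is to check the class balance of the target: in this dataset, roughly 88% of customers did not buy a term deposit, so a model that always predicts `no` would already score close to 88% accuracy. A small supplementary check, assuming `bankData` and `y_test` from the steps above:

```
# How imbalanced is the target?
print(bankData['y'].value_counts(normalize=True))

# Accuracy of a trivial model that always predicts 'no'
baseline = (y_test == 'no').mean()
print('Always-no baseline accuracy: {:.2f}'.format(baseline))
```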
Activity 3.02: Model Iteration 2 -- Logistic Regression Model with Feature Engineered Variables
As the data scientist of the bank, you created a benchmark model to
predict which customers are likely to buy a term deposit. However,
management wants to improve the results you got in the benchmark model.
In Exercise 3.04, Feature Engineering -- Creating New Features from
Existing Ones, you discussed the business scenario with the marketing
and operations teams and created a new variable, assetIndex,
by feature engineering three raw variables. You are now fitting another
logistic regression model on the feature engineered variables and are
trying to improve the results.
In this activity, you will be feature engineering some of the variables to verify their effects on the predictions.
The steps are as follows:
-   Open the Jupyter notebook used for the feature engineering in Exercise 3.04, Feature Engineering -- Creating New Features from Existing Ones, and execute all of the steps from that exercise.

-   Create dummy variables for the categorical variables using the `pd.get_dummies()` function. Exclude the original raw variables, such as loan and housing, that were used to create the new variable, `assetIndex`.

-   Select the numerical variables, including the new feature engineered variable, `assetIndex`, that was created.

-   Transform some of the numerical variables by normalizing them using the `MinMaxScaler()` function.

-   Concatenate the numerical variables and categorical variables using the `pd.concat()` function and then create the `X` and `Y` variables.

-   Split the dataset using the `train_test_split()` function and then fit a new model using the `LogisticRegression()` model on the new features.

-   Analyze the results after generating the confusion matrix and classification report (a compact solution sketch follows the expected output below).
You should get the following output:
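If you get stuck, here is a compact solution sketch. It assumes the notebook state from Exercise 3.04 (so `bankData` already contains `assetIndex`) plus the imports from Exercise 3.06, and, for brevity, it skips the optional `MinMaxScaler()` normalization of the other numerical variables:

```
# Dummy variables, excluding loan and housing
# (both were consumed by assetIndex)
bankCat = pd.get_dummies\
          (bankData[['job','marital','education','default',\
                     'contact','month','poutcome']])
# Numerical variables, including the engineered assetIndex
bankNum = bankData[['age','balance','day','duration',\
                    'campaign','pdays','previous','assetIndex']]
# Concatenating and creating the X and Y variables
X = pd.concat([bankCat, bankNum], axis=1)
Y = bankData['y']
# Splitting, fitting, and evaluating the new model
X_train, X_test, y_train, y_test = train_test_split\
                                   (X, Y, test_size=0.3, \
                                    random_state=123)
bankModel2 = LogisticRegression()
bankModel2.fit(X_train, y_train)
pred = bankModel2.predict(X_test)
print(classification_report(y_test, pred))
```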
Summary
In this lab, we learned about binary classification using logistic regression from the perspective of solving a use case. Let's summarize our learnings in this lab. We were introduced to classification problems and specifically binary classification problems. We also looked at the classification problem from the perspective of predicting term deposit propensity through a business discovery process. In the business discovery process, we identified different business drivers that influence business outcomes.