Lab 3. Binary Classification
In this lab, we will be using a real-world dataset and a supervised learning technique called classification to generate business outcomes.
Exercise 3.01: Loading and Exploring the Data from the Dataset
In this exercise, we will load the dataset in our Jupyter notebook and do
some basic exploration, such as printing the dimensions of the dataset
using the .shape attribute and generating summary
statistics of the dataset using the .describe() method.
The following steps will help you to complete this exercise:
-   Open a new Jupyter notebook.

-   Now, import pandas in your Jupyter notebook:

```
import pandas as pd
```

-   Assign the link to the dataset to a variable called `file_url`:

```
file_url = 'https://raw.githubusercontent.com/fenago'\
           '/data-science/master/Lab03'\
           '/bank-full.csv'
```

-   Now, read the file using the `pd.read_csv()` function from pandas:

```
# Loading the data using pandas
bankData = pd.read_csv(file_url, sep=";")
bankData.head()
```

You should get the following output:

Caption: Loading data into a Jupyter notebook

Here, we loaded the `CSV` file and then stored it as a
pandas DataFrame for further analysis.

-   Next, print the shape of the dataset, as mentioned in the following code snippet:

```
# Printing the shape of the data
print(bankData.shape)
```

The `.shape` attribute is used to find the overall dimensions of the dataset. You should get the following output:

```
(45211, 17)
```

-   Now, find the summary of the numerical raw data as a table output using the `.describe()` method in pandas, as mentioned in the following code snippet:

```
# Summarizing the statistics of the numerical raw data
bankData.describe()
```

You should get the following output:
Stacked bar charts
Let's create some dummy data and generate a stacked bar chart to
check the proportion of jobs in different sectors.
**Note:** Do not execute any of the following code snippets until the final
step. Enter all the code in the same cell.
Import the library files required for the task:
```
# Importing library files
import matplotlib.pyplot as plt
import numpy as np
```
Next, create some sample data detailing a list of jobs:
```
# Create a simple list of categories
jobList = ['admin','scientist','doctor','management']
```
Each job will have two categories to be plotted, `yes` and
`No`, with some proportion split between the two. These are detailed as
follows:
```
# Getting two categories ('yes', 'No') for each of the jobs
jobYes = [20,60,70,40]
jobNo = [80,40,30,60]
```
In the next steps, the length of the job list is taken as the number
of x-axis labels, and their indexes are arranged using the
`np.arange()` function:
```
# Get the number of x axis labels and arrange their indexes
xlabels = len(jobList)
ind = np.arange(xlabels)
```
Next, let's define the width of each bar and do the plotting. In
plot `p2`, we specify `bottom=jobYes` so that, when stacking,
`yes` is at the bottom and `No` is on top:
```
# Get width of each bar
width = 0.35
# Getting the plots
p1 = plt.bar(ind, jobYes, width)
p2 = plt.bar(ind, jobNo, width, bottom=jobYes)
```
Define the labels for the *Y* axis and the title of the plot:
```
# Getting the labels for the plots
plt.ylabel('Proportion of Jobs')
plt.title('Job')
```
The indexes for the *X* and *Y* axes are defined next. For the *X*
axis, the list of jobs is given, and, for the *Y* axis, the tick marks
run from `0` to `100` in increments
of `10` (0, 10, 20, 30, and so on):
```
# Defining the x label indexes and y label indexes
plt.xticks(ind, jobList)
plt.yticks(np.arange(0, 100, 10))
```
The last step is to define the legends and to rotate the axis labels
to `90` degrees. The plot is finally displayed:
```
# Defining the legends
plt.legend((p1[0], p2[0]), ('Yes', 'No'))
# To rotate the axis labels
plt.xticks(rotation=90)
plt.show()
```
Here is what a stacked bar chart looks like based on the preceding example:
Caption: Example of a stacked bar plot
Let's use these graphs in the following exercises and activities.
Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
The goal of this exercise is to define a hypothesis to check the propensity for an individual to purchase a term deposit plan against their age. We will be using a line graph for this exercise.
The following steps will help you to complete this exercise:
-   Let's first define our hypothesis on age and the propensity to buy term deposits. Our hypothesis is as follows: the propensity to buy term deposits is higher among elderly customers than among younger ones.
-   Import the pandas and altair packages:

```
import pandas as pd
import altair as alt
```

-   Next, load the dataset, just like you loaded the dataset in Exercise 3.01, Loading and Exploring the Data from the Dataset:

```
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")
```

Note

Steps 2-3 will be repeated in the following exercises for this lab.

We will be verifying how the purchased term deposits are distributed by age.

-   Next, we will count the number of records for each age group, using a combination of the `.groupby()`, `.agg()`, and `.reset_index()` methods from `pandas`.

Note

You will see further details of these methods in Lab 12, Feature Engineering.

```
filter_mask = bankData['y'] == 'yes'
bankSub1 = bankData[filter_mask]\
           .groupby('age')['y'].agg(agegrp='count')\
           .reset_index()
```

-   Now, plot a line chart using altair's `.Chart().mark_line().encode()` methods, defining the `x` and `y` variables, as shown in the following code snippet:

```
# Visualising the relationship using altair
alt.Chart(bankSub1).mark_line().encode(x='age', y='agegrp')
```

You should get the following output:

-   Group the data by age using the `groupby()` method and find the total number of customers under each age group using the `agg()` method:

```
# Getting another perspective
ageTot = bankData.groupby('age')['y']\
         .agg(ageTot='count').reset_index()
ageTot.head()
```

The output is as follows:

Caption: Customers per age group

-   Now, group the data by both age and propensity of purchase and find the total counts under each category of propensity, which are `yes` and `no`:

```
# Getting all the details in one place
ageProp = bankData.groupby(['age','y'])['y']\
          .agg(ageCat='count').reset_index()
ageProp.head()
```

The output is as follows:

Caption: Propensity by age group

-   Merge both of these DataFrames based on the `age` variable using the `pd.merge()` function, and then divide each category of propensity within each age group by the total customers in the respective age group to get the proportion of customers, as shown in the following code snippet:

```
# Merging both the data frames
ageComb = pd.merge(ageProp, ageTot, left_on=['age'], \
                   right_on=['age'])
ageComb['catProp'] = (ageComb.ageCat / ageComb.ageTot) * 100
ageComb.head()
```

The output is as follows:

-   Now, display the proportions by plotting both categories (yes and no) as separate plots. This can be achieved through a method within `altair` called `facet()`:

```
# Visualising the relationship using altair
alt.Chart(ageComb).mark_line()\
   .encode(x='age', y='catProp').facet(column='y')
```

This function makes as many plots as there are categories within the variable. Here, we pass the `'y'` variable, which holds the `yes` and `no` categories, to the `facet()` function, and we get two different plots: one for `yes` and another for `no`. You should get the following output:

Caption: Visualizing normalized relationships
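For reference, the same normalized proportions can be obtained without an explicit merge by using a grouped transform. This is a minimal alternative sketch, not part of the original exercise; it assumes `bankData` is loaded as above:

```
# Alternative: proportion of yes/no within each age group,
# using a grouped transform instead of pd.merge()
ageCounts = bankData.groupby(['age', 'y'])['y']\
            .agg(ageCat='count').reset_index()
ageCounts['catProp'] = ageCounts['ageCat'] \
    / ageCounts.groupby('age')['ageCat'].transform('sum') * 100
ageCounts.head()
```

Both routes produce the same `catProp` values; the merge-based version in the exercise makes each intermediate table explicit, which is easier to inspect.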
Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
You are working as a data scientist for a bank. You are provided with historical data from the management of the bank and are asked to try to formulate a hypothesis between employment status and the propensity to buy term deposits.
In this activity, we will verify the relationship between employment status and term deposit purchase propensity.
The steps are as follows:
-   Formulate the hypothesis between employment status and the propensity for term deposits. Let the hypothesis be as follows: high-paid employees prefer term deposits more than other categories of employees.

-   Open a Jupyter notebook similar to the one used in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan, and install and import the necessary libraries, such as `pandas` and `altair`.

-   From the banking DataFrame, `bankData`, find the distribution of employment status using the `.groupby()`, `.agg()`, and `.reset_index()` methods: group the data by employment status using `.groupby()` and find the total count of propensities for each employment status using `.agg()`.

-   Now, merge both DataFrames using the `pd.merge()` function and then create a new variable for the propensity proportion by dividing each propensity count by the total count for the respective employment status.

-   Plot the data and summarize your intuitions from the plot using `matplotlib`. Use a stacked bar chart for this activity (see the solution sketch after the expected output below).
Expected output: The final plot of the propensity to buy with respect to employment status will be similar to the following plot:
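In case you want to check your approach, here is a minimal solution sketch. It assumes `bankData` is loaded as in the earlier exercises; the names `jobTot`, `jobProp`, and `jobComb` are illustrative choices, not prescribed by the activity:

```
import pandas as pd
import matplotlib.pyplot as plt

# Total customers per employment status
jobTot = bankData.groupby('job')['y']\
         .agg(jobTot='count').reset_index()
# Propensity counts per employment status
jobProp = bankData.groupby(['job', 'y'])['y']\
          .agg(jobCat='count').reset_index()
# Merge and convert the counts to proportions
jobComb = pd.merge(jobProp, jobTot, on=['job'])
jobComb['catProp'] = (jobComb.jobCat / jobComb.jobTot) * 100

# Pivot so 'yes' and 'no' become columns, then stack the bars
pivot = jobComb.pivot(index='job', columns='y', values='catProp')
p1 = plt.bar(pivot.index, pivot['yes'], width=0.35)
p2 = plt.bar(pivot.index, pivot['no'], width=0.35, \
             bottom=pivot['yes'])
plt.ylabel('Proportion of customers')
plt.title('Propensity by employment status')
plt.legend((p1[0], p2[0]), ('Yes', 'No'))
plt.xticks(rotation=90)
plt.show()
```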
Exercise 3.03: Feature Engineering -- Exploration of Individual Features
In this exercise, we will explore the relationship between two variables, which are whether an individual owns a house and whether an individual has a loan, to the propensity for term deposit purchases by these individuals.
The following steps will help you to complete this exercise:
-   Open a new Jupyter notebook.

-   Import the `pandas` package:

```
import pandas as pd
```

-   Assign the link to the dataset to a variable called `file_url`:

```
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
```

-   Read the banking dataset using the `.read_csv()` function:

```
# Reading the banking data
bankData = pd.read_csv(file_url, sep=";")
```

-   Next, we will find a relationship between housing and the propensity for term deposits, as mentioned in the following code snippet:

```
# Relationship between housing and propensity for term deposits
bankData.groupby(['housing', 'y'])['y']\
        .agg(houseTot='count').reset_index()
```

You should get the following output:
-   Explore the `loan` variable to find its relationship with the propensity for a term deposit, as mentioned in the following code snippet:

```
"""
Relationship between having a loan
and propensity for term deposits
"""
bankData.groupby(['loan', 'y'])['y']\
        .agg(loanTot='count').reset_index()
```

Note

The triple quotes (`"""`) shown in the code snippet above are used to denote the start and end points of a multi-line code comment. This is an alternative to using the `#` symbol.

You should get the following output:

Caption: Loan versus term deposit propensity

In the case of loan portfolios, the propensity to buy term deposits
is higher for customers without loans:
4805 / (4805 + 33162) ≈ 12.7%, compared with 484 / (484 + 6760) ≈ 6.7% for customers with loans.
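These proportions can also be read off directly with `pd.crosstab`, which normalizes the counts within each row. This is a small supplementary sketch, assuming `bankData` is loaded as above:

```
# Proportion of 'yes'/'no' within each loan category;
# normalize='index' divides each row by its row total
pd.crosstab(bankData['loan'], bankData['y'], \
            normalize='index') * 100
```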
Housing and loans were categorical data and finding a relationship
was straightforward. However, bank balance data is numerical and to
analyze it, we need to have a different strategy. One common
strategy is to convert the continuous numerical data into ordinal
data and look at how the propensity varies across each category.
-   To convert numerical values into ordinal values, we first find the quantile values and take them as threshold values. The quantiles are obtained using the following code snippet:

```
# Taking the quantiles for 25%, 50% and 75% of the balance data
import numpy as np
np.quantile(bankData['balance'], [0.25, 0.5, 0.75])
```

You should get the following output:

Caption: Quantiles for bank balance data

-   Now, convert the numerical values of bank balances into categorical values, as mentioned in the following code snippet:

```
bankData['balanceClass'] = 'Quant1'
bankData.loc[(bankData['balance'] > 72) \
             & (bankData['balance'] < 448), \
             'balanceClass'] = 'Quant2'
bankData.loc[(bankData['balance'] > 448) \
             & (bankData['balance'] < 1428), \
             'balanceClass'] = 'Quant3'
bankData.loc[bankData['balance'] > 1428, \
             'balanceClass'] = 'Quant4'
bankData.head()
```

You should get the following output:

Caption: New features from bank balance data
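The same binning can be written more compactly with `pd.cut`. This is an optional alternative sketch, assuming the quantile thresholds printed above (72, 448, and 1428):

```
# Equivalent binning with pd.cut, using the quantile thresholds
bankData['balanceClass'] = pd.cut(\
    bankData['balance'], \
    bins=[-float('inf'), 72, 448, 1428, float('inf')], \
    labels=['Quant1', 'Quant2', 'Quant3', 'Quant4'])
```

Note that `pd.cut` includes the right edge of each bin by default, so values falling exactly on a threshold are assigned slightly differently than with the chained `.loc` masks above.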
-   Next, we need to find the propensity of term deposit purchases based on each quantile the customers fall into. This task is similar to what we did in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan:

```
# Calculating the customers under each quantile
balanceTot = bankData.groupby(['balanceClass'])['y']\
             .agg(balanceTot='count').reset_index()
balanceTot
```

You should get the following output:

Caption: Classification based on quantiles

-   Calculate the total number of customers categorized by quantile and propensity classification, as mentioned in the following code snippet:

```
"""
Calculating the total customers categorised as per
quantile and propensity classification
"""
balanceProp = bankData.groupby(['balanceClass', 'y'])['y']\
              .agg(balanceCat='count').reset_index()
balanceProp
```

You should get the following output:

Caption: Total number of customers categorized by quantile and propensity classification

-   Now, merge both DataFrames:

```
# Merging both the data frames
balanceComb = pd.merge(balanceProp, balanceTot, \
                       on=['balanceClass'])
balanceComb['catProp'] = (balanceComb.balanceCat \
                          / balanceComb.balanceTot) * 100
balanceComb
```

You should get the following output:

Caption: Propensity versus balance category
In the next exercise, we will use these intuitions to derive a new feature.
Exercise 3.04: Feature Engineering -- Creating New Features from Existing Ones
In this exercise, we will combine the individual variables we analyzed in Exercise 3.03, Feature Engineering -- Exploration of Individual Features to derive a new feature called an asset index. One methodology to create an asset index is by assigning weights based on the asset or liability of the customer.
For instance, a higher bank balance or home ownership has a positive bearing on the overall asset index and is therefore assigned a higher weight. In contrast, the presence of a loan is a liability and is therefore assigned a lower weight. Let's give a weight of 5 if the customer has a house and 1 in its absence. Similarly, we can give a weight of 1 if the customer has a loan and 5 in the case of no loans:
-   Open a new Jupyter notebook.

-   Import the pandas and numpy packages:

```
import pandas as pd
import numpy as np
```

-   Assign the link to the dataset to a variable called `file_url`:

```
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
```

-   Read the banking dataset using the `.read_csv()` function:

```
# Reading the banking data
bankData = pd.read_csv(file_url, sep=";")
```

-   The first step we will follow is to normalize the numerical variables. This is implemented using the following code snippet:

```
# Normalizing data
from sklearn import preprocessing
x = bankData[['balance']].values.astype(float)
```

-   As the bank balance data contains numerical values, we need to normalize it first. The purpose of normalization is to bring all of the variables that we are using to create the new feature onto a common scale. One effective method we can use here is `MinMaxScaler()`, which rescales all of the numerical data to a range between 0 and 1. The `MinMaxScaler` class is available within the `preprocessing` module in `sklearn`:

```
minmaxScaler = preprocessing.MinMaxScaler()
```

-   Transform the balance data by normalizing it with `minmaxScaler`:

```
bankData['balanceTran'] = minmaxScaler.fit_transform(x)
```

In this step, we created a new feature called `balanceTran` to store the normalized bank balance values.

-   Print the head of the data using the `.head()` function:

```
bankData.head()
```

You should get the following output:

Caption: Normalizing the bank balance data
-   After creating the normalized variable, add a small numerical constant of `0.00001` so as to eliminate the 0 values in the variable. This is mentioned in the following code snippet:

```
# Adding a small numerical constant to eliminate 0 values
bankData['balanceTran'] = bankData['balanceTran'] + 0.00001
```

-   Now, add two additional columns to introduce the transformed variables for loans and housing, as per the weighting approach discussed at the start of this exercise:

```
# Let us transform values for loan data
bankData['loanTran'] = 1
# Giving a weight of 5 if there is no loan
bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5
bankData.head()
```

You should get the following output:

-   Now, transform the values for the housing data, as mentioned here:

```
# Let us transform values for Housing data
bankData['houseTran'] = 5
```

-   Give a weight of `1` if the customer does not have a house and print the results, as mentioned in the following code snippet:

```
bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1
print(bankData.head())
```

You should get the following output:

-   Now, create a new variable, which is the product of all of the transformed variables:

```
"""
Let us now create the new variable which is
a product of all these
"""
bankData['assetIndex'] = bankData['balanceTran'] \
                         * bankData['loanTran'] \
                         * bankData['houseTran']
bankData.head()
```

You should get the following output:

Caption: Creating a composite index
-   Explore the propensity with respect to the composite index.

We observe the relationship between the asset index and the propensity for term deposit purchases. We adopt a similar strategy of converting the numerical values of the asset index into ordinal values by taking the quantiles and then mapping the quantiles to the propensity of term deposit purchases, as mentioned in Exercise 3.03, Feature Engineering -- Exploration of Individual Features:

```
# Finding the quantile
np.quantile(bankData['assetIndex'], [0.25, 0.5, 0.75])
```

You should get the following output:

Caption: Conversion of numerical values into ordinal values

-   Next, create quantiles from the `assetIndex` data, as mentioned in the following code snippet:

```
bankData['assetClass'] = 'Quant1'
bankData.loc[(bankData['assetIndex'] > 0.38) \
             & (bankData['assetIndex'] < 0.57), \
             'assetClass'] = 'Quant2'
bankData.loc[(bankData['assetIndex'] > 0.57) \
             & (bankData['assetIndex'] < 1.9), \
             'assetClass'] = 'Quant3'
bankData.loc[bankData['assetIndex'] > 1.9, \
             'assetClass'] = 'Quant4'
bankData.head()
```

You should get the following output:

Caption: Quantiles for the asset index
-   Calculate the total of each asset class and the category-wise counts, as mentioned in the following code snippet:

```
# Calculating total of each asset class
assetTot = bankData.groupby('assetClass')['y']\
           .agg(assetTot='count').reset_index()
# Calculating the category wise counts
assetProp = bankData.groupby(['assetClass', 'y'])['y']\
            .agg(assetCat='count').reset_index()
```

-   Next, merge both DataFrames:

```
# Merging both the data frames
assetComb = pd.merge(assetProp, assetTot, on=['assetClass'])
assetComb['catProp'] = (assetComb.assetCat \
                        / assetComb.assetTot) * 100
assetComb
```

You should get the following output:

Caption: Composite index relationship mapping
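To see the pattern at a glance, you can also plot the merged proportions, mirroring the faceted chart from Exercise 3.02. This is an optional sketch, assuming `assetComb` from the previous step; a bar mark suits the ordinal quantile classes better than a line:

```
import altair as alt

# Proportion of yes/no within each asset quantile
alt.Chart(assetComb).mark_bar()\
   .encode(x='assetClass', y='catProp').facet(column='y')
```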
A Quick Peek at Data Types and a Descriptive Summary
Looking at the data types such as categorical or numeric and then deriving summary statistics is a good way to take a quick peek into data before you do some of the downstream feature engineering steps. Let's take a look at an example from our dataset:
```
# Looking at Data types
print(bankData.dtypes)
# Looking at descriptive statistics
print(bankData.describe())
```
You should get the following output:
The following output is the descriptive summary statistics, which
display basic measures such as the mean,
standard deviation, count, and the
quantile values of the respective features:
Correlation Matrix and Visualization
Let's look at how data correlation can be generated and then visualized in the following exercise.
Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
In this exercise, we will be creating a correlation plot and analyzing the results of the bank dataset.
The following steps will help you to complete the exercise:
-   Open a new Jupyter notebook, import the `pandas` package, and load the banking data:

```
import pandas as pd
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")
```

-   Now, import the `set_option` function from `pandas`, as mentioned here:

```
from pandas import set_option
```

The `set_option` function is used to define the display options for many operations.

-   Next, create a variable that stores the numerical variables `'age'`, `'balance'`, `'day'`, `'duration'`, `'campaign'`, `'pdays'`, and `'previous'`, as mentioned in the following code snippet. A correlation plot can be generated only from numerical data, which is why the numerical data has to be extracted separately:

```
bankNumeric = bankData[['age','balance','day','duration',\
                        'campaign','pdays','previous']]
```

-   Now, use the `.corr()` function to find the correlation matrix for the dataset:

```
set_option('display.width', 150)
set_option('display.precision', 3)
bankCorr = bankNumeric.corr(method='pearson')
bankCorr
```

You should get the following output:

-   Now, plot the data:

```
from matplotlib import pyplot
corFig = pyplot.figure()
figAxis = corFig.add_subplot(111)
corAx = figAxis.matshow(bankCorr, vmin=-1, vmax=1)
corFig.colorbar(corAx)
pyplot.show()
```

You should get the following output:

Caption: Correlation plot
Skewness of Data
Let's take a look at the following example. Here, we use the
.skew() method to find the skewness in the data. For instance,
to find the skewness of the data in our bank-full.csv dataset,
we perform the following:

```
# Skewness of numeric attributes
bankNumeric.skew()
```
Note
This code refers to the bankNumeric data, so you should
ensure you are working in the same notebook as the previous exercise.
You should get the following output:
Histograms
Histograms are an effective way to plot the distribution of data and to identify skewness in data, if any.
```
# Histograms
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 2)
axs[0].hist(bankNumeric['age'])
axs[0].set_title('Distribution of age')
axs[1].hist(bankNumeric['balance'])
axs[1].set_title('Distribution of Balance')
# Ensure plots do not overlap
plt.tight_layout()
```
You should get the following output:
Density Plots
Density plots help in visualizing the distribution of data. A density
plot can be created using the kind='density' parameter:

```
from matplotlib import pyplot as plt
# Density plots
bankNumeric['age'].plot(kind='density', subplots=False, \
                        layout=(1, 1))
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Normalised age distribution')
plt.show()
```
You should get the following output:
Caption: Code showing the generation of a density plot
Other Feature Engineering Methods
Normalization and standard scaling are important feature engineering steps that are applied to the data before the downstream modeling steps. Let's look at both of these techniques:
```
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
from numpy import set_printoptions
scaling = StandardScaler().fit(bankNumeric)
rescaledNum = scaling.transform(bankNumeric)
set_printoptions(precision=3)
print(rescaledNum)
```
You should get the following output:
Caption: Output from standardizing the data
The following code uses the Normalizer data transformation technique:
```
# Normalizing Data (Length of 1)
from sklearn.preprocessing import Normalizer
normaliser = Normalizer().fit(bankNumeric)
normalisedNum = normaliser.transform(bankNumeric)
set_printoptions(precision=3)
print(normalisedNum)
```
You should get the following output:
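The two transforms are easy to confuse: `StandardScaler` works column-wise (each feature ends up with zero mean and unit variance), whereas `Normalizer` works row-wise (each sample is rescaled to unit length). Here is a tiny illustration on made-up data, for intuition only:

```
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
# Column-wise: each column now has mean 0 and unit variance
print(StandardScaler().fit_transform(toy))
# Row-wise: each row now has Euclidean length 1
print(Normalizer().fit_transform(toy))
```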
Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
In this exercise, we will build a logistic regression model, which will be used for predicting the propensity of term deposit purchases. This exercise will have three parts. The first part will be the preprocessing of the data, the second part will deal with the training process, and the last part will be spent on prediction, analysis of metrics, and deriving strategies for further improvement of the model.
You begin with data preprocessing.
In this part, we will first load the data, convert the ordinal data into dummy data, and then split the data into training and test sets for the subsequent training phase:
-   Open a Jupyter notebook, install the necessary packages, and load the data, as in previous exercises:

```
import pandas as pd
import altair as alt
file_url = 'https://raw.githubusercontent.com'\
           '/fenago/data-science'\
           '/master/Lab03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")
```

-   Now, load the library functions:

```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```

-   Now, find the data types:

```
bankData.dtypes
```

You should get the following output:

Caption: Data types

-   Convert the categorical data into dummy variables.

The value of each dummy variable is either 1 or 0, depending on whether that category is present in the example. Let's look at the code for doing that:

```
"""
Converting all the categorical variables to dummy variables
"""
bankCat = pd.get_dummies\
          (bankData[['job','marital',\
                     'education','default','housing',\
                     'loan','contact','month','poutcome']])
bankCat.shape
```

You should get the following output:

```
(45211, 44)
```

-   Now, separate the numerical variables:

```
bankNum = bankData[['age','balance','day','duration',\
                    'campaign','pdays','previous']]
bankNum.shape
```

You should get the following output:

```
(45211, 7)
```

-   Now, prepare the `X` and `Y` variables and print their shapes. The `X` variable is the concatenation of the transformed categorical variables and the separated numerical data:

```
# Preparing the X variables
X = pd.concat([bankCat, bankNum], axis=1)
print(X.shape)
# Preparing the Y variable
Y = bankData['y']
print(Y.shape)
X.head()
```

The output shown below is truncated:

-   Split the data into training and test sets:

```
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split\
                                   (X, Y, test_size=0.3, \
                                    random_state=123)
```

Now, the data is all prepared for the modeling task. Next, we begin with modeling. In this part, we will train the model using the training set we created in the earlier step. First, we call the logistic regression function and then fit the model with the training set data.
-   Define the `LogisticRegression` model:

```
bankModel = LogisticRegression()
bankModel.fit(X_train, y_train)
```

You should get the following output:

Caption: Parameters of the model that fits

-   Now that the model is created, use it to predict on the test set and then get the accuracy of the predictions:

```
pred = bankModel.predict(X_test)
print('Accuracy of Logistic regression model '\
      'prediction on test set: {:.2f}'\
      .format(bankModel.score(X_test, y_test)))
```

You should get the following output:

Caption: Prediction with the model

-   At first glance, an accuracy of 90% gives the impression that the model has done a decent job of approximating the data-generating process. But has it? Let's take a closer look at the details of the prediction by generating metrics for the model. We will use two metric-generating functions: the confusion matrix and the classification report:

```
# Confusion Matrix for the model
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)
```

You should get output in the following format; however, the values can vary as the modeling task involves variability:

Caption: Generation of the confusion matrix

-   Next, let's generate a `classification_report`:

```
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
```

You should get a similar output; however, with different values due to variability in the modeling process:
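One quick way to see why accuracy alone can mislead here is to check the class balance of the target: in this dataset, roughly 88% of customers did not buy a term deposit, so a model that always predicts `no` would already score close to 88% accuracy. A small supplementary check, assuming `bankData` and `y_test` from the steps above:

```
# How imbalanced is the target?
print(bankData['y'].value_counts(normalize=True))

# Accuracy of a trivial model that always predicts 'no'
baseline = (y_test == 'no').mean()
print('Always-no baseline accuracy: {:.2f}'.format(baseline))
```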
Activity 3.02: Model Iteration 2 -- Logistic Regression Model with Feature Engineered Variables
As the data scientist of the bank, you created a benchmark model to
predict which customers are likely to buy a term deposit. However,
management wants to improve the results you got in the benchmark model.
In Exercise 3.04, Feature Engineering -- Creating New Features from
Existing Ones, you discussed the business scenario with the marketing
and operations teams and created a new variable, assetIndex,
by feature engineering three raw variables. You are now fitting another
logistic regression model on the feature engineered variables and are
trying to improve the results.
In this activity, you will be feature engineering some of the variables to verify their effects on the predictions.
The steps are as follows:
-   Open the Jupyter notebook used for the feature engineering in Exercise 3.04, Feature Engineering -- Creating New Features from Existing Ones, and execute all of the steps from that exercise.

-   Create dummy variables for the categorical variables using the `pd.get_dummies()` function. Exclude the original raw variables, such as loan and housing, that were used to create the new variable, `assetIndex`.

-   Select the numerical variables, including the new feature engineered variable, `assetIndex`, that was created.

-   Transform some of the numerical variables by normalizing them using the `MinMaxScaler()` function.

-   Concatenate the numerical variables and categorical variables using the `pd.concat()` function and then create the `X` and `Y` variables.

-   Split the dataset using the `train_test_split()` function and then fit a new model using the `LogisticRegression()` model on the new features.

-   Analyze the results after generating the confusion matrix and classification report (a compact solution sketch follows the expected output below).
You should get the following output:
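If you get stuck, here is a compact solution sketch. It assumes the notebook state from Exercise 3.04 (so `bankData` already contains `assetIndex`) plus the imports from Exercise 3.06, and, for brevity, it skips the optional `MinMaxScaler()` normalization of the other numerical variables:

```
# Dummy variables, excluding loan and housing
# (both were consumed by assetIndex)
bankCat = pd.get_dummies\
          (bankData[['job','marital','education','default',\
                     'contact','month','poutcome']])
# Numerical variables, including the engineered assetIndex
bankNum = bankData[['age','balance','day','duration',\
                    'campaign','pdays','previous','assetIndex']]
# Concatenating and creating the X and Y variables
X = pd.concat([bankCat, bankNum], axis=1)
Y = bankData['y']
# Splitting, fitting, and evaluating the new model
X_train, X_test, y_train, y_test = train_test_split\
                                   (X, Y, test_size=0.3, \
                                    random_state=123)
bankModel2 = LogisticRegression()
bankModel2.fit(X_train, y_train)
pred = bankModel2.predict(X_test)
print(classification_report(y_test, pred))
```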
Summary
In this lab, we learned about binary classification using logistic regression from the perspective of solving a use case. Let's summarize our learnings in this lab. We were introduced to classification problems and specifically binary classification problems. We also looked at the classification problem from the perspective of predicting term deposit propensity through a business discovery process. In the business discovery process, we identified different business drivers that influence business outcomes.