<img align="right" src="./logo.png">
|
||
|
||
|
||
Lab 14. Dimensionality Reduction
|
||
============================
|
||
|
||
|
||
|
||
Overview
|
||
|
||
This lab introduces dimensionality reduction in data science. You will be using the Internet Advertisements dataset to analyze and evaluate different dimensionality reduction techniques. You will learn to analyze datasets with high dimensions and deal with the challenges posed by such datasets. As well as applying different dimensionality reduction techniques to large datasets, you will fit models based on those datasets and analyze their results. By the end of this lab, you will be able to deal with huge datasets in the real world.
|
||
|
||
|
||
|
||
Exercise 14.01: Loading and Cleaning the Dataset
|
||
------------------------------------------------
|
||
|
||
In this exercise, we will download the dataset, load it into our Jupyter notebook, and do some basic exploration, such as printing the dimensions of the dataset with the `.shape` attribute and summarizing its numerical features with the `.describe()` function, as well as cleaning the dataset.
|
||
|
||
|
||
The following steps will help you complete this exercise:
|
||
|
||
1. Open a new Jupyter notebook file.
|
||
|
||
2. Now, `import pandas` into your Jupyter notebook:
|
||
```
|
||
import pandas as pd
|
||
```
|
||
|
||
|
||
3. Next, set the path of the GitHub repository where the `ad.data` file is hosted, as shown in the following code snippet:
|
||
```
|
||
# Defining file name of the GitHub repository
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
```
|
||
|
||
|
||
4. Read the file using the `pd.read_csv()` function from the
|
||
pandas data frame:
|
||
|
||
```
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
adData.head()
|
||
```
|
||
|
||
|
||
After reading the file, the data frame is printed using the
|
||
`.head()` function.
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Loading data into the Jupyter notebook
|
||
|
||
5. Now, print the shape of the dataset, as shown in the following code
|
||
snippet:
|
||
|
||
```
|
||
# Printing the shape of the data
|
||
print(adData.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(3279, 1559)
|
||
```
|
||
|
||
|
||
From the shape, we can see that we have a large number of features,
|
||
`1559`.
|
||
|
||
6. Find the summary of the numerical features of the raw data using the
|
||
`.describe()` function in pandas, as shown in the
|
||
following code snippet:
|
||
|
||
```
|
||
# Summarizing the statistics of the numerical raw data
|
||
adData.describe()
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Summary statistics of the numerical raw data
|
||
|
||
|
||
7. Separate the dependent and independent variables from our dataset,
|
||
as shown in the following code snippet:
|
||
|
||
```
|
||
# Separate the dependent and independent variables
|
||
# Preparing the X variables
|
||
X = adData.loc[:,0:1557]
|
||
print(X.shape)
|
||
# Preparing the Y variable
|
||
Y = adData[1558]
|
||
print(Y.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(3279, 1558)
|
||
(3279, )
|
||
```
|
||
|
||
|
||
|
||
8. Print the first `15` examples of the independent
|
||
variables:
|
||
|
||
```
|
||
# Printing the head of the independent variables
|
||
X.head(15)
|
||
```
|
||
|
||
|
||
The output is as follows:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: First 15 examples of independent variables
|
||
|
||
|
||
9. Print the data types of the dataset:
|
||
|
||
```
|
||
# Printing the data types
|
||
print(X.dtypes)
|
||
```
|
||
|
||
|
||
We should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The data types in our dataset
|
||
|
||
|
||
10. Replace special characters with `NaN` values for the first three columns.

Replace the special characters in the first three columns, which are of the object type, with `NaN` values. `NaN` is an abbreviation for "not a number." Replacing special characters with `NaN` values makes it easier to impute the data later.
|
||
|
||
This is achieved through the following code snippet:
|
||
|
||
```
|
||
"""
|
||
Replacing special characters in first 3 columns
|
||
which are of type object
|
||
"""
|
||
for i in range(0,3):
|
||
X[i] = X[i].str.replace("?", 'nan')\
|
||
.values.astype(float)
|
||
print(X.head(15))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: After replacing special characters with NaN
|
||
|
||
11. Now, replace special characters for the integer features.
|
||
|
||
As in *Step 10*, let's also replace the special characters in the features of the `int64` data type, using the following code snippet:
|
||
|
||
```
|
||
"""
|
||
Replacing special characters in the remaining
|
||
columns which are of type integer
|
||
"""
|
||
for i in range(3,1557):
|
||
X[i] = X[i].replace("?", 'NaN').values.astype(float)
|
||
```
|
||
|
||
|
||
12. Now, impute the mean of each column for the `NaN` values.
|
||
|
||
Now that we have replaced special characters in the data with
|
||
`NaN` values, we can use the `fillna()` function
|
||
in pandas to replace the `NaN` values with the mean of the
|
||
column. This is executed using the following code snippet:
|
||
|
||
```
|
||
import numpy as np
|
||
# Impute the 'NaN' with mean of the values
|
||
for i in range(0,1557):
|
||
X[i] = X[i].fillna(X[i].mean())
|
||
print(X.head(15))
|
||
```
|
||
|
||
|
||
In the preceding code snippet, the `.mean()` function
|
||
calculates the mean of each column and then replaces the
|
||
`nan` values with the mean of the column.
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: NaN values replaced with the column means
|
||
|
||
13. Scale the dataset using the `MinMaxScaler()` function.

As in *Lab 3*, *Binary Classification*, scaling the data is useful in the modeling step. Let's scale the dataset using the `MinMaxScaler()` function, as learned in *Lab 3*, *Binary Classification*.
|
||
|
||
This is shown in the following code snippet:
|
||
|
||
```
|
||
# Scaling the data sets
|
||
# Import library function
|
||
from sklearn import preprocessing
|
||
# Creating the scaling function
|
||
minmaxScaler = preprocessing.MinMaxScaler()
|
||
# Transforming with the scaler function
|
||
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
|
||
# Printing the output
|
||
X_tran.head()
|
||
```
|
||
|
||
|
||
You should get the following output. Here, we have displayed the
|
||
first 24 columns:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Scaling the dataset using the MinMaxScaler() function
|
||
|
||
|
||
|
||
Creating a High-Dimensional Dataset
|
||
===================================
|
||
|
||
|
||
In the earlier section, we worked with a dataset that has around
|
||
`1,558` features. In order to demonstrate the challenges with
|
||
high-dimensional datasets, let\'s create an extremely high dimensional
|
||
dataset from the internet dataset that we already have.
|
||
|
||
This we will achieve by replicating the existing number of features
|
||
multiple times so that the dataset becomes really large. To replicate
|
||
the dataset, we will use a function called `np.tile()`, which
|
||
copies a data frame multiple times across the axes we want. We will also
|
||
calculate the time it takes for any activity using the
|
||
`time()` function.
|
||
|
||
Let\'s look at both these functions in action with a toy example.
|
||
|
||
You begin by importing the necessary library functions:
|
||
|
||
```
|
||
import pandas as pd
|
||
import numpy as np
|
||
```
|
||
|
||
Then, to create a dummy dataset, we will use a small array with two rows and three columns for this example. We use the `pd.np.array()` function to create it:
|
||
|
||
```
|
||
# Creating a simple data frame
|
||
df = pd.np.array([[1, 2, 3], [4, 5, 6]])
|
||
print(df.shape)
|
||
df
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Array for the sample dummy data frame
|
||
|
||
Next, you replicate the dummy data frame and this replication of the
|
||
columns is done using the `pd.np.tile()` function in the
|
||
following code snippet:
|
||
|
||
```
|
||
# Replicating the data frame and noting the time
|
||
import time
|
||
# Starting a timing function
|
||
t0=time.time()
|
||
Newdf = pd.DataFrame(pd.np.tile(df, (1, 5)))
|
||
print(Newdf.shape)
|
||
print(Newdf)
|
||
# Finding the end time
|
||
print("Total time:", round(time.time()-t0, 3), "s")
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Replication of the data frame
|
||
|
||
As we can see in the snippet, the `pd.np.tile()` function accepts two arguments. The first is the data frame, `df`, that we want to replicate. The second, `(1, 5)`, defines how many times to repeat the data along each axis: the rows are left as they are because of the `1`, and the columns are repeated `5` times because of the `5`. From the printed shape, we can see that the original data frame of shape `(2, 3)` has been transformed into a data frame of shape `(2, 15)`.
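If you also want to replicate the rows, you can change the first element of the tuple. The following minimal sketch (using `np.tile()` directly, for which `pd.np` is simply an alias) repeats the same toy array twice along the rows and three times along the columns:

```
import numpy as np
import pandas as pd

df = np.array([[1, 2, 3], [4, 5, 6]])
# Repeat twice along the rows and three times along the columns
bigger = pd.DataFrame(np.tile(df, (2, 3)))
print(bigger.shape)   # (4, 9)
```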
|
||
|
||
|
||
|
||
Activity 14.01: Fitting a Logistic Regression Model on a High-Dimensional Dataset
|
||
--------------------------------------------------------------------------------
|
||
|
||
You want to test the performance of your models when the dataset is
|
||
large. To do this, you are artificially augmenting the internet ads
|
||
dataset so that the dataset is 300 times bigger in dimension than the
|
||
original dataset. You will be fitting a logistic regression model on
|
||
this new dataset and then observe the results.
|
||
|
||
**Hint**: In this activity, we will use a notebook similar to *Exercise
|
||
14.01*, *Loading and Cleaning the Dataset*, and we will also be fitting
|
||
a logistic regression model as done in *Lab 3*, *Binary
|
||
Classification*.
|
||
|
||
|
||
|
||
The steps to complete this activity are as follows:
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Implement all steps from *Exercise 14.01*, *Loading and Cleaning the
|
||
Dataset*, until the normalization of data. Derive the transformed
|
||
independent `X_tran` variable.
|
||
|
||
3. Create a high-dimensional dataset by replicating the columns 300
|
||
times using the `pd.np.tile()` function. Print the shape
|
||
of the new dataset and observe the number of features in the new
|
||
dataset.
|
||
|
||
4. Split the dataset into train and test sets.
|
||
|
||
5. Fit a logistic regression model on the new dataset and note the time
|
||
it takes to fit the model.
|
||
|
||
**Expected Output**:
|
||
|
||
You should get output similar to the following after fitting the
|
||
logistic regression model on the new dataset:
|
||
|
||
```
|
||
Total training time: 23.86 s
|
||
```
|
||
|
||
|
||
6. Predict on the test set and print the classification report and
|
||
confusion matrix.
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Confusion matrix and the classification report results
|
||
|
||
|
||
|
||
|
||
We begin by defining the path to the "ads" dataset in the GitHub repository:
|
||
|
||
```
|
||
# Defining the file name from GitHub
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
```
|
||
Next, we simply load the data using pandas:
|
||
|
||
```
|
||
# import pandas as pd
|
||
# Loading the data using pandas
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
```
|
||
Now, suppose we create the high-dimensional dataset with a much larger scaling factor, such as `500`:
|
||
|
||
```
|
||
# Creating a high dimension dataset
|
||
X_hd = pd.DataFrame(pd.np.tile(adData, (1, 500)))
|
||
```
|
||
|
||
|
||
With a factor this large, the session might crash because all the RAM available to Jupyter is used up. The session might then restart, and you will lose all your variables. Hence, it is always good to be mindful of the resources available to you when working with large datasets.
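One way to avoid such crashes, sketched below, is to check the memory footprint of the replicated data frame before fitting anything. The `memory_usage()` call is standard pandas; the factor of `300` is the one the activity asks for, and the snippet assumes the transformed `X_tran` variable from *Exercise 14.01* is already in memory:

```
# Rough memory check before fitting a model on the replicated data
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 300)))
mem_gb = X_hd.memory_usage(deep=True).sum() / 1024 ** 3
print("Replicated dataset occupies roughly", round(mem_gb, 2), "GB of RAM")
```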
|
||
|
||
|
||
Strategies for Addressing High-Dimensional Datasets
|
||
===================================================
|
||
|
||
|
||

|
||
|
||
|
||
|
||
|
||
|
||
Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination
|
||
---------------------------------------------------------------------------
|
||
|
||
In this exercise, we will fit a logistic regression model after
|
||
eliminating features using the backward elimination technique to find
|
||
the accuracy of the model. We will be using the same ads dataset as
|
||
before, and we will be enhancing it with additional features for this
|
||
exercise.
|
||
|
||
The following steps will help you complete this exercise:
|
||
|
||
1. Open a new Jupyter notebook file.
|
||
|
||
2. Implement all the initial steps similar to *Exercise 14.01*,
|
||
*Loading and Cleaning the Dataset*, until scaling the dataset using
|
||
the `minmaxscaler()` function:
|
||
```
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
import pandas as pd
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
X = adData.loc[:,0:1557]
|
||
Y = adData[1558]
|
||
import numpy as np
|
||
for i in range(0,3):
|
||
X[i] = X[i].str.replace("?", 'NaN').values.astype(float)
|
||
for i in range(3,1557):
|
||
X[i] = X[i].replace("?", 'NaN').values.astype(float)
|
||
for i in range(0,1557):
|
||
X[i] = X[i].fillna(X[i].mean())
|
||
from sklearn import preprocessing
|
||
minmaxScaler = preprocessing.MinMaxScaler()
|
||
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
|
||
```
|
||
|
||
|
||
3. Next, create a high-dimensional dataset. We\'ll augment the dataset
|
||
artificially by a factor of `2`. The process of backward
|
||
feature elimination is a very compute-intensive process, and using
|
||
higher dimensions will involve a longer processing time. This is why
|
||
the augmenting factor has been kept at `2`. This is
|
||
implemented using the following code snippet:
|
||
|
||
```
|
||
# Creating a high dimension data set
|
||
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 2)))
|
||
print(X_hd.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(3279, 3116)
|
||
```
|
||
|
||
|
||
4. Define the backward elimination model. Backward elimination works by
|
||
providing two arguments to the `RFE()` function, which is
|
||
the model we want to try (logistic regression in our case) and the
|
||
number of features we want the dataset to be reduced to. This is
|
||
implemented as follows:
|
||
|
||
```
|
||
from sklearn.linear_model import LogisticRegression
|
||
from sklearn.feature_selection import RFE
|
||
# Defining the Classification function
|
||
backModel = LogisticRegression()
|
||
"""
|
||
Reducing dimensionality to 250 features for the
|
||
backward elimination model
|
||
"""
|
||
rfe = RFE(backModel, 250)
|
||
```
|
||
|
||
|
||
In this implementation, the number of features that we have given,
|
||
`250`, is identified through trial and error. The process
|
||
is to first assume an arbitrary number of features and then, based
|
||
on the final metrics, arrive at the most optimum number of features
|
||
for the model. In this implementation, our first assumption of
|
||
`250` implies that we want the backward elimination model
|
||
to start eliminating features until we get the best `250`
|
||
features.
|
||
|
||
5. Fit the backward elimination method to identify the best
|
||
`250` features.
|
||
|
||
We are now ready to fit the backward elimination method on the
|
||
higher-dimensional dataset. We will also note the time it takes for
|
||
backward elimination to work. This is implemented using the
|
||
following code snippet:
|
||
|
||
```
|
||
# Fitting the rfe for selecting the top 250 features
|
||
import time
|
||
t0 = time.time()
|
||
rfe = rfe.fit(X_hd, Y)
|
||
t1 = time.time()
|
||
print("Backward Elimination time:", \
|
||
round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
Fitting the backward elimination method is done using the
|
||
`.fit()` function. We give the independent and dependent
|
||
training sets.
|
||
|
||
Note
|
||
|
||
The backward elimination method is a compute-intensive process, and
|
||
therefore this process will take a lot of time to execute. The
|
||
larger the number of features, the longer it will take.
|
||
|
||
The time taken for backward elimination is printed at the end of the output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The time taken for the backward elimination process
|
||
|
||
You can see that the backward elimination process to find the best
|
||
`250` features has taken `230.35` seconds to
|
||
implement.
|
||
|
||
6. Display the features identified using the backward elimination
|
||
method. We can display the `250` features that were
|
||
identified using the backward elimination process using the
|
||
`get_support()` function. This is implemented as follows:
|
||
|
||
```
|
||
# Getting the indexes of the features used
|
||
rfe.get_support(indices = True)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The identified features being displayed
|
||
|
||
These are the best `250` features that were finally
|
||
selected using the backward elimination process from the entire
|
||
dataset.
|
||
|
||
7. Now, split the dataset into training and testing sets for modeling:
|
||
|
||
```
|
||
from sklearn.model_selection import train_test_split
|
||
# Splitting the data into train and test sets
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(X_hd, Y, test_size=0.3,\
|
||
random_state=123)
|
||
print('Training set shape',X_train.shape)
|
||
print('Test set shape',X_test.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Training set shape (2295, 3116)
|
||
Test set shape (984, 3116)
|
||
```
|
||
|
||
|
||
From the output, you see the shapes of both the training set and
|
||
testing sets.
|
||
|
||
8. Transform the train and test sets. In *step 5*, we identified the
|
||
top `250` features through backward elimination. Now we
|
||
need to reduce the train and test sets to those top `250`
|
||
features. This is done using the `.transform()` function.
|
||
This is implemented using the following code snippet:
|
||
|
||
```
|
||
# Transforming both train and test sets
|
||
X_train_tran = rfe.transform(X_train)
|
||
X_test_tran = rfe.transform(X_test)
|
||
print("Training set shape",X_train_tran.shape)
|
||
print("Test set shape",X_test_tran.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Training set shape (2295, 250)
|
||
Test set shape (984, 250)
|
||
```
|
||
|
||
|
||
We can see that both the training set and test sets have been
|
||
reduced to the `250` best features.
|
||
|
||
9. Fit a logistic regression model on the training set and note the
|
||
time:
|
||
|
||
```
|
||
# Fitting the logistic regression model
|
||
import time
|
||
# Defining the LogisticRegression function
|
||
RfeModel = LogisticRegression()
|
||
# Starting a timing function
|
||
t0=time.time()
|
||
# Fitting the model
|
||
RfeModel.fit(X_train_tran, y_train)
|
||
# Finding the end time
|
||
print("Total training time:", \
|
||
round(time.time()-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Total training time: 0.016 s
|
||
```
|
||
|
||
|
||
As expected, the total time it takes to fit a model on a reduced set of features is much lower than the time it took for the larger dataset in *Activity 14.01*, *Fitting a Logistic Regression Model on a High-Dimensional Dataset*, which was `23.86` seconds. This is a great improvement.
|
||
|
||
10. Now, predict on the test set and print the accuracy metrics, as
|
||
shown in the following code snippet:
|
||
|
||
```
|
||
# Predicting on the test set and getting the accuracy
|
||
pred = RfeModel.predict(X_test_tran)
|
||
print('Accuracy of Logistic regression model after '\
|
||
'backward elimination: {:.2f}'\
|
||
.format(RfeModel.score(X_test_tran, y_test)))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The achieved accuracy of the logistic regression model
|
||
|
||
You can see that the accuracy measure for this model has improved compared to the one we got for the model with higher dimensionality, which was `0.97` in *Activity 14.01*, *Fitting a Logistic Regression Model on a High-Dimensional Dataset*. This increase could be attributed to the identification of non-correlated features from the complete feature set, which could have boosted the performance of the model.
|
||
|
||
11. Print the confusion matrix:
|
||
|
||
```
|
||
from sklearn.metrics import confusion_matrix
|
||
confusionMatrix = confusion_matrix(y_test, pred)
|
||
print(confusionMatrix)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Confusion matrix
|
||
|
||
12. Printing the classification report:
|
||
|
||
```
|
||
from sklearn.metrics import classification_report
|
||
# Getting the Classification_report
|
||
print(classification_report(y_test, pred))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection
|
||
------------------------------------------------------------------------
|
||
|
||
In this exercise, we will fit a logistic regression model by selecting
|
||
the optimum features through forward feature selection and observing the
|
||
performance of the model. We will be using the same ads dataset as
|
||
before, and we will be enhancing it with additional features for this
|
||
exercise.
|
||
|
||
The following steps will help you complete this exercise:
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Implement all the initial steps similar to *Exercise 14.01*,
|
||
*Loading and Cleaning the Dataset*, up until scaling the dataset
|
||
using `MinMaxScaler()`:
|
||
```
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
import pandas as pd
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
X = adData.loc[:,0:1557]
|
||
Y = adData[1558]
|
||
import numpy as np
|
||
for i in range(0,3):
|
||
X[i] = X[i].str.replace("?", 'NaN')\
|
||
.values.astype(float)
|
||
for i in range(3,1557):
|
||
X[i] = X[i].replace("?", 'NaN').values.astype(float)
|
||
for i in range(0,1557):
|
||
X[i] = X[i].fillna(X[i].mean())
|
||
from sklearn import preprocessing
|
||
minmaxScaler = preprocessing.MinMaxScaler()
|
||
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
|
||
```
|
||
|
||
|
||
3. Create a high-dimensional dataset. Now, augment the dataset
|
||
artificially to a factor of `50`. Augmenting the dataset
|
||
to higher factors will result in the notebook crashing because of
|
||
lack of memory. This is implemented using the following code
|
||
snippet:
|
||
|
||
```
|
||
# Creating a high dimension dataset
|
||
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 50)))
|
||
print(X_hd.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(3279, 77900)
|
||
```
|
||
|
||
|
||
4. Split the high dimensional dataset into training and testing sets:
|
||
```
|
||
from sklearn.model_selection import train_test_split
|
||
# Splitting the data into train and test sets
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(X_hd, Y, test_size=0.3, \
|
||
random_state=123)
|
||
```
|
||
|
||
|
||
5. Now we define the threshold features. Once the train and test sets
|
||
are created, the next step is to import the feature selection
|
||
function, `SelectKBest`. The argument we give to this
|
||
function is the number of features we want. The features are
|
||
selected through experimentation and, as a first step, we assume a
|
||
threshold value. In this example, we assume a threshold value of
|
||
`250`. This is implemented using the following code
|
||
snippet:
|
||
```
|
||
from sklearn.feature_selection import SelectKBest
|
||
# feature extraction
|
||
feats = SelectKBest(k=250)
|
||
```
|
||
|
||
|
||
6. Iterate and get the best set of threshold features. Based on the
|
||
threshold set of features we defined, we have to fit the training
|
||
set and get the best set of threshold features. Fitting on the
|
||
training set is done using the `.fit()` function. We also
|
||
note the time it takes to find the best set of features. This is
|
||
executed using the following code snippet:
|
||
|
||
```
|
||
# Fitting the features for training set
|
||
import time
|
||
t0 = time.time()
|
||
fit = feats.fit(X_train, y_train)
|
||
t1 = time.time()
|
||
print("Forward selection fitting time:", \
|
||
round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get something similar to the following output:
|
||
|
||
```
|
||
Forward selection fitting time: 2.682 s
|
||
```
|
||
|
||
|
||
We can see that the forward selection method has taken around
|
||
`2.68` seconds, which is much lower than the backward
|
||
selection method.
|
||
|
||
7. Create new training and test sets. Once we have identified the best
|
||
set of features, we have to modify our training and test sets so
|
||
that they have only those selected features. This is accomplished
|
||
using the `.transform()` function:
|
||
```
|
||
# Creating new training set and test sets
|
||
features_train = fit.transform(X_train)
|
||
features_test = fit.transform(X_test)
|
||
```
|
||
|
||
|
||
8. Let\'s verify the shapes of the train and test sets before
|
||
transformation and after transformation:
|
||
|
||
```
|
||
"""
|
||
Printing the shape of training and test sets
|
||
before transformation
|
||
"""
|
||
print('Train shape before transformation',\
|
||
X_train.shape)
|
||
print('Test shape before transformation',\
|
||
X_test.shape)
|
||
"""
|
||
Printing the shape of training and test sets
|
||
after transformation
|
||
"""
|
||
print('Train shape after transformation',\
|
||
features_train.shape)
|
||
print('Test shape after transformation',\
|
||
features_test.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Shape of the training and testing datasets
|
||
|
||
You can see that both the training and test sets are reduced to
|
||
`250` features each.
|
||
|
||
9. Let\'s now fit a logistic regression model on the transformed
|
||
dataset and note the time it takes to fit the model:
|
||
```
|
||
# Fitting a Logistic Regression Model
|
||
from sklearn.linear_model import LogisticRegression
|
||
import time
|
||
t0 = time.time()
|
||
forwardModel = LogisticRegression()
|
||
forwardModel.fit(features_train, y_train)
|
||
t1 = time.time()
|
||
```
|
||
|
||
|
||
10. Print the total time:
|
||
|
||
```
|
||
print("Total training time:", round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Total training time: 0.035 s
|
||
```
|
||
|
||
|
||
You can see that the training time is much less than for the model that was fit in *Activity 14.01*, *Fitting a Logistic Regression Model on a High-Dimensional Dataset*, which took `23.86` seconds. This shorter time is attributed to the smaller number of features selected by the forward selection model.
|
||
|
||
11. Now, perform predictions on the test set and print the accuracy
|
||
metrics:
|
||
|
||
```
|
||
# Predicting with the forward model
|
||
pred = forwardModel.predict(features_test)
|
||
print('Accuracy of Logistic regression'\
|
||
' model prediction on test set: {:.2f}'
|
||
.format(forwardModel.score(features_test, y_test)))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Accuracy of Logistic regression model prediction on test set: 0.94
|
||
```
|
||
|
||
|
||
12. Print the confusion matrix:
|
||
|
||
```
|
||
from sklearn.metrics import confusion_matrix
|
||
confusionMatrix = confusion_matrix(y_test, pred)
|
||
print(confusionMatrix)
|
||
```
|
||
|
||
|
||
You should get something similar to the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Resulting confusion matrix
|
||
|
||
13. Print the classification report:
|
||
|
||
```
|
||
from sklearn.metrics import classification_report
|
||
# Getting the Classification_report
|
||
print(classification_report(y_test, pred))
|
||
```
|
||
|
||
|
||
You should get something similar to the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Resulting classification report
|
||
|
||
|
||
|
||
Principal Component Analysis (PCA)
|
||
----------------------------------
|
||
|
||
Let\'s look at the idea of PCA with an example.
|
||
|
||
We will create a sample dataset with 2 variables and 100 random data
|
||
points in each variable. Random data points are created using the
|
||
`rand()` function. This is implemented in the following code:
|
||
|
||
```
|
||
import numpy as np
|
||
# Setting the seed for reproducibility
|
||
seed = np.random.RandomState(123)
|
||
# Generating an array of random numbers
|
||
X = seed.rand(100,2)
|
||
# Printing the shape of the dataset
|
||
X.shape
|
||
```
|
||
|
||
The resulting output is: `(100, 2)`.
|
||
|
||
Note
|
||
|
||
A random state is defined using the `RandomState(123)`
|
||
function. This is defined to ensure that anyone who reproduces this
|
||
example gets the same output.
|
||
|
||
Let\'s visualize this data using `matplotlib`:
|
||
|
||
```
|
||
import matplotlib.pyplot as plt
|
||
%matplotlib inline
|
||
plt.scatter(X[:, 0], X[:, 1])
|
||
plt.axis('equal')
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(-0.04635361265714105,
|
||
1.0325632864350174,
|
||
-0.003996887112708292,
|
||
1.0429468329457663)
|
||
```
|
||

|
||
|
||
Caption: Visualization of the data
|
||
|
||
In the graph, we can see that the data is evenly spread out.
|
||
|
||
Let\'s now find the principal components for this dataset. We will
|
||
reduce this two-dimensional dataset into a one-dimensional dataset. In
|
||
other words, we will reduce the original dataset into one of its
|
||
principal components.
|
||
|
||
This is implemented in code as follows:
|
||
|
||
```
|
||
from sklearn.decomposition import PCA
|
||
# Defining one component
|
||
pca = PCA(n_components=1)
|
||
# Fitting the PCA function
|
||
pca.fit(X)
|
||
# Getting the new dataset
|
||
X_pca = pca.transform(X)
|
||
# Printing the shapes
|
||
print("Original data set: ", X.shape)
|
||
print("Data set after transformation:", X_pca.shape)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Original data set:  (100, 2)
Data set after transformation: (100, 1)
|
||
```
|
||
As we can see in the code, we first define the number of components
|
||
using the `n_components=1` argument. After this, the PCA
|
||
algorithm is fit on the input dataset. After fitting on the input data,
|
||
the initial dataset is transformed into a new dataset with only one
|
||
variable, which is its principal component.
|
||
|
||
The algorithm transforms the original dataset into its first principal
|
||
component by using an axis where the data has the largest variability.
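A quick way to see that axis, sketched below, is to inspect the fitted `pca` object: `components_` holds the direction of each principal component, and `explained_variance_ratio_` holds the share of the variance it captures (both are standard scikit-learn attributes):

```
# Direction of the first principal component and the variance it explains
print("Component direction:", pca.components_[0])
print("Variance explained:", pca.explained_variance_ratio_[0])
```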
|
||
|
||
To visualize this concept, let\'s reverse the transformation of the
|
||
`X_pca` dataset to its original form and then visualize this
|
||
data along with the original data. To reverse the transformation, we use
|
||
the `.inverse_transform()` function:
|
||
|
||
```
|
||
# Reversing the transformation and plotting
|
||
X_reverse = pca.inverse_transform(X_pca)
|
||
# Plotting the original data
|
||
plt.scatter(X[:, 0], X[:, 1], alpha=0.1)
|
||
# Plotting the reversed data
|
||
plt.scatter(X_reverse[:, 0], X_reverse[:, 1], alpha=0.9)
|
||
plt.axis('equal');
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Plot with reverse transformation
|
||
|
||
As we can see in the plot, the data points in orange represent an axis
|
||
with the highest variability. All the data points were projected to that
|
||
axis to generate the first principal component.
|
||
|
||
The data points that are generated when transforming into various
|
||
principal components will be very different from the original data
|
||
points before transformation. Each principal component will be in an
|
||
axis that is orthogonal (perpendicular) to the other principal
|
||
component. If a second principal component was generated for the
|
||
preceding example, the second principal component would be along an axis
|
||
indicated by the blue arrow in the graph. The way we pick the number of
|
||
principal components for model building is by selecting the number of
|
||
components that explains a certain threshold of variability.
|
||
|
||
For example, if there were originally 1,000 features and we reduced it
|
||
to 100 principal components, and then we find that out of the 100
|
||
principal components the first 75 components explain 90% of the
|
||
variability of data, we would pick those 75 components to build the
|
||
model. This process is called picking principal components with the
|
||
percentage of variance explained.
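A common way to automate that choice, sketched below with an assumed 90% target, is to search the cumulative explained-variance curve for the first component count that crosses the threshold:

```
import numpy as np

# Assumes 'pca' has been fit on the training data with all components kept
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.90)) + 1
print(n_keep, "components explain at least 90% of the variance")
```

Scikit-learn can also make this choice for you: passing a float between 0 and 1, such as `PCA(n_components=0.90)`, keeps just enough components to explain that fraction of the variance.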
|
||
|
||
Let\'s now see how to use PCA as a tool for dimensionality reduction in
|
||
our use case.
|
||
|
||
|
||
|
||
Exercise 14.04: Dimensionality Reduction Using PCA
|
||
--------------------------------------------------
|
||
|
||
In this exercise, we will fit a logistic regression model by selecting
|
||
the principal components that explain the maximum variability of the
|
||
data. We will also observe the performance of the feature selection and
|
||
model building process. We will be using the same ads dataset as before,
|
||
and we will be enhancing it with additional features for this exercise.
|
||
|
||
The following steps will help you complete this exercise:
|
||
|
||
1. Open a new Jupyter notebook file.
|
||
|
||
2. Implement the initial steps from *Exercise 14.01*, *Loading and
|
||
Cleaning the Dataset*, up until scaling the dataset using the
|
||
`minmaxscaler()` function:
|
||
```
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
import pandas as pd
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
X = adData.loc[:,0:1557]
|
||
Y = adData[1558]
|
||
import numpy as np
|
||
for i in range(0,3):
|
||
X[i] = X[i].str.replace("?", 'NaN').values.astype(float)
|
||
for i in range(3,1557):
|
||
X[i] = X[i].replace("?", 'NaN').values.astype(float)
|
||
for i in range(0,1557):
|
||
X[i] = X[i].fillna(X[i].mean())
|
||
from sklearn import preprocessing
|
||
minmaxScaler = preprocessing.MinMaxScaler()
|
||
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
|
||
```
|
||
|
||
|
||
3. Create a high-dimensional dataset. Let\'s now augment the dataset
|
||
artificially to a factor of 50. Augmenting the dataset to higher
|
||
factors will result in the notebook crashing because of a lack of
|
||
memory. This is implemented using the following code snippet:
|
||
|
||
```
|
||
# Creating a high dimension data set
|
||
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 50)))
|
||
print(X_hd.shape)
|
||
```
|
||
|
||
|
||
You should get the following output
|
||
|
||
```
|
||
(3279, 77900)
|
||
```
|
||
|
||
|
||
4. Let\'s split the high-dimensional dataset to training and test sets:
|
||
```
|
||
from sklearn.model_selection import train_test_split
|
||
# Splitting the data into train and test sets
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(X_hd, Y, test_size=0.3, \
|
||
random_state=123)
|
||
```
|
||
|
||
|
||
5. Let\'s now fit the PCA function on the training set. This is done
|
||
using the `.fit()` function, as shown in the following
|
||
snippet. We will also note the time it takes to fit the PCA model on
|
||
the dataset:
|
||
|
||
```
|
||
from sklearn.decomposition import PCA
|
||
import time
|
||
t0 = time.time()
|
||
pca = PCA().fit(X_train)
|
||
t1 = time.time()
|
||
print("PCA fitting time:", round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
PCA fitting time: 179.545 s
|
||
```
|
||
|
||
|
||
We can see that the time taken to fit the PCA function on the
|
||
dataset is less than the backward elimination model (230.35 seconds)
|
||
and higher than the forward selection method (2.682 seconds).
|
||
|
||
6. We will now determine the number of principal components by plotting
|
||
the cumulative variance explained by all the principal components.
|
||
The variance explained is determined by the
|
||
`pca.explained_variance_ratio_` method. This is plotted in
|
||
`matplotlib` using the following code snippet:
|
||
|
||
```
|
||
%matplotlib inline
|
||
import numpy as np
|
||
import matplotlib.pyplot as plt
|
||
plt.plot(np.cumsum(pca.explained_variance_ratio_))
|
||
plt.xlabel('Number of Principal Components')
|
||
plt.ylabel('Cumulative explained variance');
|
||
```
|
||
|
||
|
||
In the code, the `np.cumsum()` function is used to get the
|
||
cumulative variance of each principal component.
|
||
|
||
You will get the following plot as output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The variance graph
|
||
|
||
From the plot, we can see that the first `250` principal
|
||
components explain more than `90%` of the variance. Based
|
||
on this graph, we can decide how many principal components we want
|
||
to have depending on the variability it explains. Let\'s select
|
||
`250` components for fitting our model.
|
||
|
||
7. Now that we have identified that `250` components explain
|
||
a lot of the variability, let\'s refit the training set for
|
||
`250` components. This is described in the following code
|
||
snippet:
|
||
```
|
||
# Defining PCA with 250 components
|
||
pca = PCA(n_components=250)
|
||
# Fitting PCA on the training set
|
||
pca.fit(X_train)
|
||
```
|
||
|
||
|
||
8. We now transform the training and test sets with the `250` principal components:
|
||
```
|
||
# Transforming training set and test set
|
||
X_pca = pca.transform(X_train)
|
||
X_test_pca = pca.transform(X_test)
|
||
```
|
||
|
||
|
||
9. Let\'s verify the shapes of the train and test sets before
|
||
transformation and after transformation:
|
||
|
||
```
|
||
"""
|
||
Printing the shape of train and test sets before
|
||
and after transformation
|
||
"""
|
||
print("original shape of Training set: ", \
|
||
X_train.shape)
|
||
print("original shape of Test set: ", \
|
||
X_test.shape)
|
||
print("Transformed shape of training set:", \
|
||
X_pca.shape)
|
||
print("Transformed shape of test set:", \
|
||
X_test_pca.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Transformed and the original training and testing sets
|
||
|
||
You can see that both the training and test sets are reduced to
|
||
`250` features each.
|
||
|
||
10. Let\'s now fit the logistic regression model on the transformed
|
||
dataset and note the time it takes to fit the model:
|
||
```
|
||
# Fitting a Logistic Regression Model
|
||
from sklearn.linear_model import LogisticRegression
|
||
import time
|
||
pcaModel = LogisticRegression()
|
||
t0 = time.time()
|
||
pcaModel.fit(X_pca, y_train)
|
||
t1 = time.time()
|
||
```
|
||
|
||
|
||
11. Print the total time:
|
||
|
||
```
|
||
print("Total training time:", round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Total training time: 0.293 s
|
||
```
|
||
|
||
|
||
You can see that the training time is much lower than for the model that was fit in *Activity 14.01*, *Fitting a Logistic Regression Model on a High-Dimensional Dataset*, which took 23.86 seconds. The shorter time is attributed to the smaller number of features, `250`, selected by PCA.
|
||
|
||
12. Now, predict on the test set and print the accuracy metrics:
|
||
|
||
```
|
||
# Predicting with the pca model
|
||
pred = pcaModel.predict(X_test_pca)
|
||
print('Accuracy of Logistic regression model '\
|
||
'prediction on test set: {:.2f}'\
|
||
.format(pcaModel.score(X_test_pca, y_test)))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Accuracy of the logistic regression model
|
||
|
||
You can see that the accuracy level is better than the benchmark
|
||
model with all the features (`97%`) and the forward
|
||
selection model (`94%`).
|
||
|
||
13. Print the confusion matrix:
|
||
|
||
```
|
||
from sklearn.metrics import confusion_matrix
|
||
confusionMatrix = confusion_matrix(y_test, pred)
|
||
print(confusionMatrix)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Resulting confusion matrix
|
||
|
||
14. Print the classification report:
|
||
|
||
```
|
||
from sklearn.metrics import classification_report
|
||
# Getting the Classification_report
|
||
print(classification_report(y_test, pred))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
|
||
|
||
Independent Component Analysis (ICA)
|
||
------------------------------------
|
||
|
||
ICA is a technique of dimensionality reduction that conceptually follows
|
||
a similar path as PCA. Both ICA and PCA try to derive new sources of
|
||
data by linearly combining the original data.
|
||
|
||
|
||
Let\'s look at the implementation of ICA for our use case.
|
||
|
||
|
||
|
||
Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis
|
||
-----------------------------------------------------------------------------
|
||
|
||
In this exercise, we will fit a logistic regression model using the ICA
|
||
technique and observe the performance of the model. We will be using the
|
||
same ads dataset as before, and we will be enhancing it with additional
|
||
features for this exercise.
|
||
|
||
The following steps will help you complete this exercise:
|
||
|
||
1. Open a new Jupyter notebook file.
|
||
|
||
2. Implement all the steps from *Exercise 14.01*, *Loading and Cleaning
|
||
the Dataset*, up until scaling the dataset using
|
||
`MinMaxScaler()`:
|
||
```
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
import pandas as pd
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
X = adData.loc[:,0:1557]
|
||
Y = adData[1558]
|
||
import numpy as np
|
||
for i in range(0,3):
|
||
X[i] = X[i].str.replace("?", 'NaN')\
|
||
.values.astype(float)
|
||
for i in range(3,1557):
|
||
X[i] = X[i].replace("?", 'NaN')\
|
||
.values.astype(float)
|
||
for i in range(0,1557):
|
||
X[i] = X[i].fillna(X[i].mean())
|
||
from sklearn import preprocessing
|
||
minmaxScaler = preprocessing.MinMaxScaler()
|
||
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
|
||
```
|
||
|
||
|
||
3. Let\'s now augment the dataset artificially to a factor of
|
||
`50`. Augmenting the dataset to factors that are higher
|
||
than `50` will result in the notebook crashing because of
|
||
a lack of memory. This is implemented using the following
|
||
code snippet:
|
||
|
||
```
|
||
# Creating a high dimension data set
|
||
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 50)))
|
||
print(X_hd.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(3279, 77900)
|
||
```
|
||
|
||
|
||
4. Let\'s split the high-dimensional dataset into training and testing
|
||
sets:
|
||
```
|
||
from sklearn.model_selection import train_test_split
|
||
# Splitting the data into train and test sets
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(X_hd, Y, test_size=0.3,\
|
||
random_state=123)
|
||
```
|
||
|
||
|
||
5. Let\'s load the ICA function, `FastICA`, and then define
|
||
the number of components we require. We will use the same number of
|
||
components that we used for PCA:
|
||
```
|
||
# Defining the ICA with number of components
|
||
from sklearn.decomposition import FastICA
|
||
ICA = FastICA(n_components=250, random_state=123)
|
||
```
|
||
|
||
|
||
6. Once the ICA method is defined, we will fit the method on the
|
||
training set and also transform the training set to get a new
|
||
training set with the required number of components. We will also
|
||
note the time taken for fitting and transforming:
|
||
|
||
```
|
||
"""
|
||
Fitting the ICA method and transforming the
|
||
training set import time
|
||
"""
|
||
t0 = time.time()
|
||
X_ica=ICA.fit_transform(X_train)
|
||
t1 = time.time()
|
||
print("ICA fitting time:", round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
In the code, the `.fit()` function is used to fit on the
|
||
training set and the `transform()` method is used to get a
|
||
new training set with the required number of features.
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
ICA fitting time: 203.02 s
|
||
```
|
||
|
||
|
||
We can see that implementing ICA has taken much more time than PCA
|
||
(179.54 seconds).
|
||
|
||
7. We now transform the test set with the `250` components:
|
||
```
|
||
# Transforming the test set
|
||
X_test_ica=ICA.transform(X_test)
|
||
```
|
||
|
||
|
||
8. Let\'s verify the shapes of the train and test sets before
|
||
transformation and after transformation:
|
||
|
||
```
|
||
"""
|
||
Printing the shape of train and test sets
|
||
before and after transformation
|
||
"""
|
||
print("original shape of Training set: ", \
|
||
X_train.shape)
|
||
print("original shape of Test set: ", \
|
||
X_test.shape)
|
||
print("Transformed shape of training set:", \
|
||
X_ica.shape)
|
||
print("Transformed shape of test set:", \
|
||
X_test_ica.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Shape of the original and transformed datasets
|
||
|
||
You can see that both the training and test sets are reduced to
|
||
`250` features each.
|
||
|
||
9. Let\'s now fit the logistic regression model on the transformed
|
||
dataset and note the time it takes:
|
||
```
|
||
# Fitting a Logistic Regression Model
|
||
from sklearn.linear_model import LogisticRegression
|
||
import time
|
||
icaModel = LogisticRegression()
|
||
t0 = time.time()
|
||
icaModel.fit(X_ica, y_train)
|
||
t1 = time.time()
|
||
```
|
||
|
||
|
||
10. Print the total time:
|
||
|
||
```
|
||
print("Total training time:", round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Total training time: 0.054 s
|
||
```
|
||
|
||
|
||
11. Let\'s now predict on the test set and print the accuracy metrics:
|
||
|
||
```
|
||
# Predicting with the ica model
|
||
pred = icaModel.predict(X_test_ica)
|
||
print('Accuracy of Logistic regression model '\
|
||
'prediction on test set: {:.2f}'\
|
||
.format(icaModel.score(X_test_ica, y_test)))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Accuracy of Logistic regression model prediction on test set: 0.87
|
||
```
|
||
|
||
|
||
We can see that the ICA model has worse results than other models.
|
||
|
||
12. Print the confusion matrix:
|
||
|
||
```
|
||
from sklearn.metrics import confusion_matrix
|
||
confusionMatrix = confusion_matrix(y_test, pred)
|
||
print(confusionMatrix)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Resulting confusion matrix
|
||
|
||
|
||
13. Print the classification report:
|
||
|
||
```
|
||
from sklearn.metrics import classification_report
|
||
# Getting the Classification_report
|
||
print(classification_report(y_test, pred))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
|
||
|
||
Exercise 14.06: Dimensionality Reduction Using Factor Analysis
|
||
--------------------------------------------------------------
|
||
|
||
In this exercise, we will fit a logistic regression model after reducing
|
||
the original dimensions to some key factors and then observe the
|
||
performance of the model.
|
||
|
||
The following steps will help you complete this exercise:
|
||
|
||
1. Open a new Jupyter notebook file.
|
||
|
||
2. Implement the same initial steps from *Exercise 14.01*, *Loading and
|
||
Cleaning the Dataset*, up until scaling the dataset using the
|
||
`minmaxscaler()` function:
|
||
```
|
||
filename = 'https://raw.githubusercontent.com'\
|
||
'/fenago/data-science'\
|
||
'/master/Lab14/Dataset/ad.data'
|
||
import pandas as pd
|
||
adData = pd.read_csv(filename,sep=",",header = None,\
|
||
error_bad_lines=False)
|
||
X = adData.loc[:,0:1557]
|
||
Y = adData[1558]
|
||
import numpy as np
|
||
for i in range(0,3):
|
||
X[i] = X[i].str.replace("?", 'NaN')\
|
||
.values.astype(float)
|
||
for i in range(3,1557):
|
||
X[i] = X[i].replace("?", 'NaN')\
|
||
.values.astype(float)
|
||
for i in range(0,1557):
|
||
X[i] = X[i].fillna(X[i].mean())
|
||
from sklearn import preprocessing
|
||
minmaxScaler = preprocessing.MinMaxScaler()
|
||
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
|
||
```
|
||
|
||
|
||
3. Let\'s now augment the dataset artificially to a factor of
|
||
`50`. Augmenting the dataset to factors that are higher
|
||
than `50` will result in the notebook crashing because of
|
||
a lack of memory. This is implemented using the following
|
||
code snippet:
|
||
|
||
```
|
||
# Creating a high dimension data set
|
||
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 50)))
|
||
print(X_hd.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(3279, 77900)
|
||
```
|
||
|
||
|
||
4. Let\'s split the high-dimensional dataset into train and test sets:
|
||
```
|
||
from sklearn.model_selection import train_test_split
|
||
# Splitting the data into train and test sets
|
||
X_train, X_test, y_train, y_test = train_test_split\
|
||
(X_hd, Y, test_size=0.3,\
|
||
random_state=123)
|
||
```
|
||
|
||
|
||
5. An important step in factor analysis is defining the number of
|
||
factors in a dataset. This step is achieved through experimentation.
|
||
In our case, we will arbitrarily assume that there are
|
||
`20` factors. This is implemented as follows:
|
||
|
||
```
|
||
# Defining the number of factors
|
||
from sklearn.decomposition import FactorAnalysis
|
||
fa = FactorAnalysis(n_components = 20,\
|
||
random_state=123)
|
||
```
|
||
|
||
|
||
The number of factors is defined through the
|
||
`n_components` argument. We also define a random state for
|
||
reproducibility.
|
||
|
||
6. Once the factor method is defined, we will fit the method on the
|
||
training set and also transform the training set to get a new
|
||
training set with the required number of factors. We will also note
|
||
the time it takes to fit the required number of factors:
|
||
|
||
```
|
||
"""
|
||
Fitting the Factor analysis method and
|
||
transforming the training set
|
||
"""
|
||
import time
|
||
t0 = time.time()
|
||
X_fac=fa.fit_transform(X_train)
|
||
t1 = time.time()
|
||
print("Factor analysis fitting time:", \
|
||
round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
In the code, the `.fit()` function is used to fit on the
|
||
training set, and the `transform()` method is used to get
|
||
a new training set with the required number of factors.
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Factor analysis fitting time: 130.688 s
|
||
```
|
||
|
||
|
||
Factor analysis is also a compute-intensive method. This is the
|
||
reason that only 20 factors were selected. We can see that it has
|
||
taken `130.688` seconds for `20` factors.
|
||
|
||
7. We now transform the test set with the same number of factors:
|
||
```
|
||
# Transforming the test set
|
||
X_test_fac=fa.transform(X_test)
|
||
```
|
||
|
||
|
||
8. Let\'s verify the shapes of the train and test sets before
|
||
transformation and after transformation:
|
||
|
||
```
|
||
"""
|
||
Printing the shape of train and test sets
|
||
before and after transformation
|
||
"""
|
||
print("original shape of Training set: ", \
|
||
X_train.shape)
|
||
print("original shape of Test set: ", \
|
||
X_test.shape)
|
||
print("Transformed shape of training set:", \
|
||
X_fac.shape)
|
||
print("Transformed shape of test set:", \
|
||
X_test_fac.shape)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Original and transformed dataset values
|
||
|
||
You can see that both the training and test sets have been reduced
|
||
to `20` factors each.
|
||
|
||
9. Let\'s now fit the logistic regression model on the transformed
|
||
dataset and note the time it takes to fit the model:
|
||
```
|
||
# Fitting a Logistic Regression Model
|
||
from sklearn.linear_model import LogisticRegression
|
||
import time
|
||
facModel = LogisticRegression()
|
||
t0 = time.time()
|
||
facModel.fit(X_fac, y_train)
|
||
t1 = time.time()
|
||
```
|
||
|
||
|
||
10. Print the total time:
|
||
|
||
```
|
||
print("Total training time:", round(t1-t0, 3), "s")
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Total training time: 0.028 s
|
||
```
|
||
|
||
|
||
We can see that the time it has taken to fit the logistic regression
|
||
model is comparable with other methods.
|
||
|
||
11. Let\'s now predict on the test set and print the accuracy metrics:
|
||
|
||
```
|
||
# Predicting with the factor analysis model
|
||
pred = facModel.predict(X_test_fac)
|
||
print('Accuracy of Logistic regression '\
|
||
'model prediction on test set: {:.2f}'
|
||
.format(facModel.score(X_test_fac, y_test)))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
Accuracy of Logistic regression model prediction on test set: 0.92
|
||
```
|
||
|
||
|
||
We can see that the factor model has better results than the ICA
|
||
model, but worse results than the other models.
|
||
|
||
12. Print the confusion matrix:
|
||
|
||
```
|
||
from sklearn.metrics import confusion_matrix
|
||
confusionMatrix = confusion_matrix(y_test, pred)
|
||
print(confusionMatrix)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Resulting confusion matrix
|
||
|
||
We can see that the factor model has done a better job at
|
||
classifying the ads than the ICA model. However, there is still a
|
||
high number of false positives.
|
||
|
||
13. Print the classification report:
|
||
|
||
```
|
||
from sklearn.metrics import classification_report
|
||
# Getting the Classification_report
|
||
print(classification_report(y_test, pred))
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
|
||
Comparing Different Dimensionality Reduction Techniques
|
||
=======================================================
|
||
|
||
|
||
Now that we have learned different dimensionality reduction techniques,
|
||
let\'s apply all of these techniques to a new dataset that we will
|
||
create from the existing ads dataset.
|
||
|
||
We will randomly sample some data points from a known distribution and
|
||
then add these random samples to the existing dataset to create a new
|
||
dataset. Let\'s carry out an experiment to see how a new dataset can be
|
||
created from an existing dataset.
|
||
|
||
We import the necessary libraries:
|
||
|
||
```
|
||
import pandas as pd
|
||
import numpy as np
|
||
```
|
||
Next, we create a dummy dataset.

We will use a small array with two rows and three columns for this example. We use the `pd.np.array()` function to create it:
|
||
|
||
```
|
||
# Creating a simple data frame
|
||
df = pd.np.array([[1, 2, 3], [4, 5, 6]])
|
||
print(df.shape)
|
||
df
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
|
||
|
||
By assuming a mean and standard deviation, we will be able to draw
|
||
samples from a normal distribution using the
|
||
`np.random.normal()` Python function. The arguments that we
|
||
have to give for this function are the mean, the standard deviation, and
|
||
the shape of the new dataset.
|
||
|
||
Let\'s see how this is implemented in code:
|
||
|
||
```
|
||
# Defining the mean and standard deviation
|
||
mu, sigma = 0, 0.1
|
||
# Generating random sample
|
||
noise = np.random.normal(mu, sigma, [2,3])
|
||
noise.shape
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||
```
|
||
(2, 3)
|
||
```
|
||
As we can see, we give the mean (`mu`), standard deviation
|
||
(`sigma`), and the shape of the data frame `[2,3]`
|
||
to generate the new random samples.
|
||
|
||
Print the sampled data frame:
|
||
|
||
```
|
||
# Sampled data frame
|
||
noise
|
||
```
|
||
|
||
You will get something like the following output:
|
||
|
||
```
|
||
array([[-0.07175021, -0.21135372, 0.10258917],
|
||
[ 0.03737542, 0.00045449, -0.04866098]])
|
||
```
|
||
|
||
The next step is to add the original data frame and the sampled data
|
||
frame to get the new dataset:
|
||
|
||
```
|
||
# Creating a new data set by adding sampled data frame
|
||
df_new = df + noise
|
||
df_new
|
||
```
|
||
|
||
You should get something like the following output:
|
||
|
||
```
|
||
array([[0.92824979, 1.78864628, 3.10258917],
|
||
[4.03737542, 5.00045449, 5.95133902]])
|
||
```
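Applied to the ads data, the same recipe looks roughly like the sketch below. It assumes the transformed `X_tran` variable from *Exercise 14.01* is in memory, and the replication factor of 2 and the noise parameters simply mirror the activity that follows:

```
# Replicate the columns twice, then add Gaussian noise of the same shape
X_hd = pd.DataFrame(pd.np.tile(X_tran, (1, 2)))
noise = np.random.normal(0, 0.1, X_hd.shape)
X_new = X_hd + noise
print(X_new.shape)
```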
|
||
Having seen how to create a new dataset, let\'s use this knowledge in
|
||
the next activity.
|
||
|
||
|
||
|
||
Activity 14.02: Comparison of Dimensionality Reduction Techniques on the Enhanced Ads Dataset
|
||
---------------------------------------------------------------------------------------------
|
||
|
||
You have learned different dimensionality reduction techniques. You want
|
||
to determine which is the best technique among them for a dataset you
|
||
will create.
|
||
|
||
**Hint**: In this activity, we will use the different techniques that
|
||
you have used in all the exercises so far. You will also create a new
|
||
dataset as we did in the previous section.
|
||
|
||
The steps to complete this activity are as follows:
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Normalize the original ads data and derive the transformed
|
||
independent variable, `X_tran`.
|
||
|
||
3. Create a high-dimensional dataset by replicating the columns twice
|
||
using the `pd.np.tile()` function.
|
||
|
||
4. Create random samples from a normal distribution with mean = 0 and
|
||
standard deviation = 0.1. Make the new dataset with the same shape
|
||
as the high-dimensional dataset created in *step 3*.
|
||
|
||
5. Add the high dimensional dataset and the random samples to get the
|
||
new dataset.
|
||
|
||
6. Split the dataset into train and test sets.
|
||
|
||
7. Implement backward elimination with the following steps:
|
||
|
||
Implement the backward elimination step using the `RFE()`
|
||
function.
|
||
|
||
Use logistic regression as the model and select the best
|
||
`300` features.
|
||
|
||
Fit the `RFE()` function on the training set and measure
|
||
the time it takes to fit the RFE model on the training set.
|
||
|
||
Transform the train and test sets with the RFE model.
|
||
|
||
Fit a logistic regression model on the transformed training set.
|
||
|
||
Predict on the test set and print the accuracy score, confusion
|
||
matrix, and classification report.
|
||
|
||
8. Implement the forward selection technique with the following steps:
|
||
|
||
Define the number of features using the `SelectKBest()`
|
||
function. Select the best `300` features.
|
||
|
||
Fit the forward selection on the training set using the
|
||
`.fit()` function and note the time taken for the fit.
|
||
|
||
Transform both the training and test sets using the
|
||
`.transform()` function.
|
||
|
||
Fit a logistic regression model on the transformed training set.
|
||
|
||
Predict on the transformed test set and print the accuracy,
|
||
confusion matrix, and classification report.
|
||
|
||
9. Implement PCA:
|
||
|
||
Define the principal components using the `PCA()`
|
||
function. Use 300 components.
|
||
|
||
Fit `PCA()` on the training set. Note the time.
|
||
|
||
Transform both the training set and test set to get the respective
|
||
number of components for these datasets using the
|
||
`.transform()` function.
|
||
|
||
Fit a logistic regression model on the transformed training set.
|
||
|
||
Predict on the transformed test set and print the accuracy,
|
||
confusion matrix, and classification report.
|
||
|
||
10. Implement ICA:
|
||
|
||
Define independent components using the `FastICA()`
|
||
function using `300` components.
|
||
|
||
Fit the independent components on the training set and transform the
|
||
training set. Note the time for the implementation.
|
||
|
||
Transform the test set to get the respective number of components
|
||
for these datasets using the `.transform()` function.
|
||
|
||
Fit a logistic regression model on the transformed training set.
|
||
|
||
Predict on the transformed test set and print the accuracy,
|
||
confusion matrix, and classification report.
|
||
|
||
11. Implement factor analysis:
|
||
|
||
Define the number of factors using the `FactorAnalysis()`
|
||
function and `30` factors.
|
||
|
||
Fit the factors on the training set and transform the training set.
|
||
Note the time for the implementation.
|
||
|
||
Transform the test set to get the respective number of components
|
||
for these datasets using the `.transform()` function.
|
||
|
||
Fit a logistic regression model on the transformed training set.
|
||
|
||
Predict on the transformed test set and print the accuracy,
|
||
confusion matrix, and classification report.
|
||
|
||
12. Compare the outputs of all the methods.
|
||
|
||
**Expected Output**:
|
||
|
||
An example summary table of the results is as follows:
|
||
|
||

|
||
|
||
Caption: Summary output of all the reduction techniques
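When you tabulate your own results, a small helper such as the sketch below keeps the comparison fair by timing and scoring every technique in the same way; the function name and arguments here are illustrative, not part of the reference solution:

```
import time
from sklearn.linear_model import LogisticRegression

def evaluate(name, X_tr, X_te, y_tr, y_te):
    """Fit logistic regression on already-reduced data and report time and accuracy."""
    model = LogisticRegression()
    t0 = time.time()
    model.fit(X_tr, y_tr)
    elapsed = round(time.time() - t0, 3)
    accuracy = round(model.score(X_te, y_te), 2)
    print(name, "- training time:", elapsed, "s, test accuracy:", accuracy)
    return elapsed, accuracy
```

You would call it once per technique, passing in the train and test sets produced by that technique's transform step, and collect the printed values into a summary table like the one above.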
|
||
|
||
|
||
|
||
Summary
|
||
=======
|
||
|
||
|
||
In this lab, we have learned about various techniques for dimensionality reduction. Let's summarize what we have learned.

At the beginning of the lab, we were introduced to the scalability challenges inherent in some modern-day datasets. To explore these challenges, we downloaded the Internet Advertisements dataset and completed an activity in which we witnessed the scalability issues posed by a large dataset: we artificially created a large dataset and fit a logistic regression model to it.
|