Files
fenago 39139f2b0e added
2021-02-09 03:17:43 +05:00

657 lines
20 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<img align="right" src="./logo.png">
Lab 2. Regression
=============
Overview
This lab is an introduction to linear regression analysis and its
application to practical problem-solving in data science. You will learn
how to use Python, a versatile programming language, to carry out
regression analysis and examine the results. The use of the logarithm
function to transform inherently non-linear relationships between
variables and to enable the application of the linear regression method
of analysis will also be introduced.
Exercise 2.01: Loading and Preparing the Data for Analysis
----------------------------------------------------------
In this exercise, we will learn how to load Python modules, and the
dataset we need for analysis, into our Python session and prepare the
data for analysis.
The following steps will help you to complete this exercise:
1. Open a new Jupyter notebook file.
2. Load the necessary Python modules by entering the following code
snippet into a single Jupyter notebook cell. Press the **Shift** and
**Enter** keys together to run the block of code:
```
%matplotlib inline
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import pandas as pd
import numpy as np
import patsy
from statsmodels.graphics.correlation import plot_corr
from sklearn.model_selection import train_test_split
plt.style.use('seaborn')
```
The `plot_corr` and `train_test_split` functions
are imported from the `statsmodels.graphics.correlation`
and `sklearn.model_selection` modules respectively. The
last statement is used to set the aesthetic look of the graphs that
`matplotlib` generates to the type displayed by the
`seaborn` module.
3. Next, load the `Boston.CSV` file and assign the variable
name `rawBostonData` to it by running the following code:
```
rawBostonData = pd.read_csv\
('https://raw.githubusercontent.com/'\
'fenago/data-science'\
'/master/Lab02/'\
'Dataset/Boston.csv')
```
4. Inspect the first five records in the DataFrame:
```
rawBostonData.head()
```
You should get the following output:
![](./images/B15019_02_06.jpg)
Caption: First five rows of the dataset
5. Check for missing values (*null* values) in the DataFrame and then
drop them in order to get a clean dataset Use the pandas method
`dropna()` to find and remove these missing values:
```
rawBostonData = rawBostonData.dropna()
```
6. Check for duplicate records in the DataFrame and then drop them in
order to get a clean dataset. Use the `drop_duplicates()`
method from pandas:
```
rawBostonData = rawBostonData.drop_duplicates()
```
7. List the column names of the DataFrame so that you can examine the
fields in your dataset, and modify the names, if necessary, to names
that are meaningful:
```
list(rawBostonData.columns)
```
You should get the following output:
![](./images/B15019_02_07.jpg)
Caption: Listing all the column names
8. Rename the DataFrame columns so that they are meaningful. Be mindful
to match the column names exactly as leaving out even white spaces
in the name strings will result in an error. For example, this
string, `ZN`, has a white space before and after and it is
different from `ZN`. After renaming, print the head of the
new DataFrame as follows:
```
renamedBostonData = rawBostonData.rename\
(columns = {\
'CRIM':'crimeRatePerCapita',\
' ZN ':'landOver25K_sqft',\
'INDUS ':'non-retailLandProptn',\
'CHAS':'riverDummy',\
'NOX':'nitrixOxide_pp10m',\
'RM':'AvgNo.RoomsPerDwelling',\
'AGE':'ProptnOwnerOccupied',\
'DIS':'weightedDist',\
'RAD':'radialHighwaysAccess',\
'TAX':'propTaxRate_per10K',\
'PTRATIO':'pupilTeacherRatio',\
'LSTAT':'pctLowerStatus',\
'MEDV':'medianValue_Ks'})
renamedBostonData.head()
```
You should get the following output:
![](./images/B15019_02_08.jpg)
Caption: DataFrames being renamed
9. Inspect the data types of the columns in your DataFrame using the
`.info()` function:
```
renamedBostonData.info()
```
You should get the following output:
![](./images/B15019_02_09.jpg)
Caption: The different data types in the dataset
10. Now, calculate basic statistics for the numeric columns in the
DataFrame:
```
renamedBostonData.describe(include=[np.number]).T
```
You should get the following output:
![](./images/B15019_02_10.jpg)
Caption: Basic statistics of the numeric column
11. Divide the DataFrame into training and test sets, as shown in the
following code snippet:
```
X = renamedBostonData.drop('crimeRatePerCapita', axis = 1)
y = renamedBostonData[['crimeRatePerCapita']]
seed = 10
test_data_size = 0.3
X_train, X_test, \
y_train, y_test = train_test_split(X, y, \
test_size = test_data_size, \
random_state = seed)
train_data = pd.concat([X_train, y_train], axis = 1)
test_data = pd.concat([X_test, y_test], axis = 1)
```
12. Calculate and plot a correlation matrix for the
`train_data` set:
```
corrMatrix = train_data.corr(method = 'pearson')
xnames=list(train_data.columns)
ynames=list(train_data.columns)
plot_corr(corrMatrix, xnames=xnames, ynames=ynames,\
title=None, normcolor=False, cmap='RdYlBu_r')
```
You should get the following output:
![](./images/B15019_02_11.jpg)
Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
---------------------------------------------------------------------------
Scatter graphs fitted with a regression line are a quick way by which a
data scientist can visualize a possible correlation between a dependent
variable and an independent variable.
The goal of the exercise is to use this technique to investigate any
linear relationship that may exist between crime rate per capita and the
median value of owner-occupied homes in towns in the city of Boston.
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook file and execute the steps up to and
including *Step 11* from *Exercise 2.01*, *Loading and Preparing the
Data for Analysis*. This is shown in the code blocks below, starting
with the import statements:
```
%matplotlib inline
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import pandas as pd
import numpy as np
import patsy
from statsmodels.graphics.correlation import plot_corr
from sklearn.model_selection import train_test_split
plt.style.use('seaborn')
```
Loading and preprocessing the data:
```
rawBostonData = pd.read_csv\
('https://raw.githubusercontent.com/'\
'fenago/data-science'\
'/master/Lab02/'
'Dataset/Boston.csv')
rawBostonData = rawBostonData.dropna()
rawBostonData = rawBostonData.drop_duplicates()
renamedBostonData = rawBostonData.rename\
(columns = {\
'CRIM':'crimeRatePerCapita',\
' ZN ':'landOver25K_sqft',\
'INDUS ':'non-retailLandProptn',\
'CHAS':'riverDummy',\
'NOX':'nitrixOxide_pp10m',\
'RM':'AvgNo.RoomsPerDwelling',\
'AGE':'ProptnOwnerOccupied',\
'DIS':'weightedDist',\
'RAD':'radialHighwaysAccess',\
'TAX':'propTaxRate_per10K',\
'PTRATIO':'pupilTeacherRatio',\
'LSTAT':'pctLowerStatus',\
'MEDV':'medianValue_Ks'})
```
Setting up the test and train data:
```
X = renamedBostonData.drop('crimeRatePerCapita', axis = 1)
y = renamedBostonData[['crimeRatePerCapita']]
seed = 10
test_data_size = 0.3
X_train, X_test, y_train, y_test = train_test_split\
(X, y, \
test_size = test_data_size,\
random_state = seed)
train_data = pd.concat([X_train, y_train], axis = 1)
test_data = pd.concat([X_test, y_test], axis = 1)
```
2. Now use the `subplots` function in `matplotlib`
to define a canvas (assigned the variable name `fig` in
the following code) and a graph object (assigned the variable name
`ax` in the following code) in Python. You can set the
size of the graph by setting the `figsize` (width =
`10`, height = `6`) argument of the function:
```
fig, ax = plt.subplots(figsize=(10, 6))
```
Do not execute the code yet.
3. Use the `seaborn` function `regplot` to create
the scatter plot. Do not execute this code cell yet; we will add
more code to style the plot in the next step:
```
sns.regplot(x='medianValue_Ks', y='crimeRatePerCapita', \
ci=None, data=train_data, ax=ax, color='k', \
scatter_kws={"s": 20,"color": "royalblue", \
"alpha":1})
```
The function accepts arguments for the independent variable
(`x`), the dependent variable (`y`), the
confidence interval of the regression parameters (`ci`),
which takes values from 0 to 100, the DataFrame that has
`x` and `y` (`data`), a matplotlib
graph object (`ax`), and others to control the aesthetics
of the points on the graph. (In this case, the confidence interval
is set to `None` -- we will see more on confidence
intervals later in the lab.)
4. In the same cell as step 3, set the `x` and `y`
labels, the `fontsize` and `name` labels, the
`x` and `y` limits, and the `tick`
parameters of the matplotlib graph object (`ax`). Also,
set the layout of the canvas to `tight`:
```
ax.set_ylabel('Crime rate per Capita', fontsize=15, \
fontname='DejaVu Sans')
ax.set_xlabel("Median value of owner-occupied homes "\
"in $1000's", fontsize=15, \
fontname='DejaVu Sans')
ax.set_xlim(left=None, right=None)
ax.set_ylim(bottom=None, top=30)
ax.tick_params(axis='both', which='major', labelsize=12)
fig.tight_layout()
```
Now execute the cell. You should get the following output:
![](./images/B15019_02_13.jpg)
Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
------------------------------------------------------------------------
In this exercise, we will use the logarithm function to transform
variables and investigate whether this helps provide a better fit of the
regression line to the data. We will also look at how to use confidence
intervals by including a 95% confidence interval of the regression
coefficients on the plot.
The following steps will help you to complete this exercise:
1. Open a new Jupyter notebook file and execute all the steps up to *Step
11* from *Exercise 2.01*, *Loading and Preparing the Data for
Analysis*.
2. Use the `subplots` function in `matplotlib` to
define a canvas and a graph object in Python:
```
fig, ax = plt.subplots(figsize=(10, 6))
```
Do not execute this code yet.
3. In the same code cell, use the logarithm function in
`numpy` (`np.log`) to transform the dependent
variable (`y`). This essentially creates a new variable,
`log(y)`:
```
y = np.log(train_data['crimeRatePerCapita'])
```
Do not execute this code yet.
4. Use the seaborn `regplot` function to create the scatter
plot. Set the `regplot` function confidence interval
argument (`ci`) to `95%`. This will calculate a
`95%` confidence interval for the regression coefficients
and have them plotted on the graph as a shaded area along the
regression line.
Note
A confidence interval gives an estimated range that is likely to
contain the true value that you\'re looking for. So, a
`95%` confidence interval indicates we can be
`95%` certain that the true regression coefficients lie in
that shaded area.
Parse the `y` argument with the new variable we defined in
the preceding step. The `x` argument is the original
variable from the DataFrame without any transformation. Continue in
the same code cell. Do not execute this cell yet; we will add in
more styling code in the next step.
```
sns.regplot(x='medianValue_Ks', y=y, ci=95, \
data=train_data, ax=ax, color='k', \
scatter_kws={"s": 20,"color": "royalblue", \
"alpha":1})
```
5. Continuing in the same cell, set the `x` and `y`
labels, the `fontsize` and `name` labels, the
`x` and `y` limits, and the `tick`
parameters of the `matplotlib` graph object
(`ax`). Also, set the layout of the canvas to
`tight`:
```
ax.set_ylabel('log of Crime rate per Capita', \
fontsize=15, fontname='DejaVu Sans')
ax.set_xlabel("Median value of owner-occupied homes "\
"in $1000's", fontsize=15, \
fontname='DejaVu Sans')
ax.set_xlim(left=None, right=None)
ax.set_ylim(bottom=None, top=None)
ax.tick_params(axis='both', which='major', labelsize=12)
fig.tight_layout()
```
Now execute this cell. The output is as follows:
![](./images/B15019_02_14.jpg)
Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API
-----------------------------------------------------------------------------------------
In this exercise, we examine a simple linear regression model where the
crime rate per capita is the dependent variable and the median value of
owner-occupied homes is the independent variable. We use the statsmodels
formula API to create a linear regression model for Python to analyze.
The following steps will help you complete this exercise:
1. Open a new Jupyter notebook file and import the required packages.
```
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
```
2. Execute *Step 2* to *11* from *Exercise 2.01*, *Loading and
Preparing the Data for Analysis*.
3. Define a linear regression model and assign it to a variable named
`linearModel`:
```
linearModel = smf.ols\
(formula='crimeRatePerCapita ~ medianValue_Ks',\
data=train_data)
```
As you can see, we call the `ols` function of the
statsmodels API and set its formula argument by defining a
`patsy` formula string that uses the tilde (`~`)
symbol to relate the dependent variable to the independent variable.
Tell the function where to find the variables named, in the string,
by assigning the data argument of the `ols` function to
the DataFrame that contains your variables (`train_data`).
4. Call the .`fit` method of the model instance and assign
the results of the method to a `linearModelResult`
variable, as shown in the following code snippet:
```
linearModelResult = linearModel.fit()
```
5. Print a summary of the results stored the
`linearModelResult` variable by running the following
code:
```
print(linearModelResult.summary())
```
You should get the following output:
![Caption: A summary of the simple linear regression analysis](./images/B15019_02_15.jpg)
Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels Formula API
---------------------------------------------------------------------------
You have seen how to use the statsmodels API to fit a linear regression
model. In this activity, you are asked to fit a log-linear model. Your
model should represent the relationship between the log-transformed
dependent variable (log of crime rate per capita) and the median value
of owner-occupied homes.
The steps to complete this activity are as follows:
1. Define a linear regression model and assign it to a variable.
Remember to use the `log` function to transform the
dependent variable in the formula string.
2. Call the `fit` method of the log-linear model instance and
assign the results of the method to a variable.
3. Print a summary of the results and analyze the output.
Your output should look like the following figure:
![](./images/B15019_02_17.jpg)
Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels Formula API
-------------------------------------------------------------------------------------------
In this exercise, we will be using the plus operator (`+`) in
the `patsy` formula string to define a linear regression model
that includes more than one independent variable.
To complete this activity, run the code in the following steps in your
Jupyter notebook:
1. Open a new Jupyter notebook file and import the required packages.
```
import statsmodels.formula.api as smf
import pandas as pd
from sklearn.model_selection import train_test_split
```
2. Execute *Step 2* to *11* from *Exercise 2.01*, *Loading and
Preparing the Data for Analysis*.
3. Use the plus operator (`+`) of the Patsy formula language
to define a linear model that regresses
`crimeRatePerCapita` on `pctLowerStatus`,
`radialHighwaysAccess`, `medianValue_Ks`, and
`nitrixOxide_pp10m` and assign it to a variable named
`multiLinearModel`. Use the Python line continuation
symbol (`\`) to continue your code on a new line should
you run out of space:
```
multiLinearModel = smf.ols\
(formula = 'crimeRatePerCapita \
~ pctLowerStatus \
+ radialHighwaysAccess \
+ medianValue_Ks \
+ nitrixOxide_pp10m', \
data=train_data)
```
4. Call the `fit` method of the model instance and assign the
results of the method to a variable:
```
multiLinearModResult = multiLinearModel.fit()
```
5. Print a summary of the results stored the variable created in *Step
3*:
```
print(multiLinearModResult.summary())
```
The output is as follows:
![](./images/B15019_02_18.jpg)
Activity 2.02: Fitting a Multiple Log-Linear Regression Model
-------------------------------------------------------------
A log-linear regression model you developed earlier was able to explain
about 24% of the variability in the transformed crime rate per capita
variable (see the values in *Figure 2.17*). You are now asked to develop
a log-linear multiple regression model that will likely explain 80% or
more of the variability in the transformed dependent variable. You
should use independent variables from the Boston Housing dataset that
have a correlation coefficient of 0.4 or more.
You are also encouraged to include the interaction of these variables to
order two in your model. You should produce graphs and data that show
that your model satisfies the assumptions of linear regression.
The steps are as follows:
1. Define a linear regression model and assign it to a variable.
Remember to use the `log` function to transform the
dependent variable in the formula string, and also include more than
one independent variable in your analysis.
2. Call the `fit` method of the model instance and assign the
results of the method to a new variable.
3. Print a summary of the results and analyze your model.
Your output should appear as shown:
![](./images/B15019_02_19.jpg)
Summary
=======
This lab introduced the topic of linear regression analysis using
Python. We learned that regression analysis, in general, is a supervised
machine learning or data science problem. We learned about the
fundamentals of linear regression analysis, including the ideas behind
the method of least squares. We also learned about how to use the pandas
Python module to load and prepare data for exploration and analysis.