Lab 2. Regression
Overview
This lab is an introduction to linear regression analysis and its application to practical problem-solving in data science. You will learn how to use Python, a versatile programming language, to carry out regression analysis and examine the results. You will also see how the logarithm function can transform an inherently non-linear relationship between variables so that the linear regression method of analysis can be applied.
Exercise 2.01: Loading and Preparing the Data for Analysis
In this exercise, we will learn how to load Python modules, and the dataset we need for analysis, into our Python session and prepare the data for analysis.
The following steps will help you to complete this exercise:
1. Open a new Jupyter notebook file.

2. Load the necessary Python modules by entering the following code snippet into a single Jupyter notebook cell. Press the Shift and Enter keys together to run the block of code:

   ```python
   %matplotlib inline
   import matplotlib as mpl
   import seaborn as sns
   import matplotlib.pyplot as plt
   import statsmodels.formula.api as smf
   import statsmodels.graphics.api as smg
   import pandas as pd
   import numpy as np
   import patsy
   from statsmodels.graphics.correlation import plot_corr
   from sklearn.model_selection import train_test_split
   plt.style.use('seaborn')
   ```

   The `plot_corr` and `train_test_split` functions are imported from the `statsmodels.graphics.correlation` and `sklearn.model_selection` modules respectively. The last statement sets the aesthetic look of the graphs that `matplotlib` generates to the style of the `seaborn` module.
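Note that recent matplotlib releases renamed the bundled seaborn styles, so `plt.style.use('seaborn')` may raise an error on newer installations. A defensive sketch, assuming the renamed `'seaborn-v0_8'` style that newer matplotlib versions ship (check `plt.style.available` on your machine):

```python
import matplotlib.pyplot as plt

# Pick the first seaborn-flavoured style this matplotlib install knows about;
# newer releases renamed 'seaborn' to 'seaborn-v0_8'.
for name in ('seaborn', 'seaborn-v0_8'):
    if name in plt.style.available:
        plt.style.use(name)
        break
```

If neither name is available, the loop simply leaves the default style in place, which does not affect any of the analysis below.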
3. Next, load the `Boston.csv` file and assign it to the variable name `rawBostonData` by running the following code:

   ```python
   rawBostonData = pd.read_csv(
       'https://raw.githubusercontent.com/'
       'fenago/data-science'
       '/master/Lab02/'
       'Dataset/Boston.csv')
   ```
4. Inspect the first five records in the DataFrame:

   ```python
   rawBostonData.head()
   ```

   You should get the following output:

   Caption: First five rows of the dataset
5. Check for missing values (null values) in the DataFrame and then drop them in order to get a clean dataset. Use the pandas method `dropna()` to find and remove these missing values:

   ```python
   rawBostonData = rawBostonData.dropna()
   ```
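To actually see how many missing values there are before dropping them, you can count the nulls per column with `isnull().sum()`. A minimal sketch on a toy DataFrame (in the exercise you would call these methods on `rawBostonData`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for rawBostonData with one missing entry.
df = pd.DataFrame({'CRIM': [0.02, np.nan, 0.09],
                   'MEDV': [24.0, 21.6, 34.7]})

print(df.isnull().sum())  # per-column count of nulls: CRIM has 1
df = df.dropna()          # drop any row containing a null
print(len(df))            # 2 rows remain
```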
6. Check for duplicate records in the DataFrame and then drop them in order to get a clean dataset. Use the `drop_duplicates()` method from pandas:

   ```python
   rawBostonData = rawBostonData.drop_duplicates()
   ```
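Similarly, `duplicated()` lets you count the duplicate rows before removing them; a short sketch on a toy frame:

```python
import pandas as pd

# Toy frame in which the first two rows are identical.
df = pd.DataFrame({'CRIM': [0.02, 0.02, 0.09],
                   'MEDV': [24.0, 24.0, 34.7]})

print(df.duplicated().sum())   # 1: later copies of a row are flagged
df = df.drop_duplicates()
print(len(df))                 # 2 unique rows remain
```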
7. List the column names of the DataFrame so that you can examine the fields in your dataset, and modify the names, if necessary, to names that are meaningful:

   ```python
   list(rawBostonData.columns)
   ```

   You should get the following output:

   Caption: Listing all the column names
8. Rename the DataFrame columns so that they are meaningful. Be mindful to match the column names exactly, as leaving out even the white spaces in the name strings will result in an error. For example, the string `' ZN '` has a white space before and after the name, which makes it different from `'ZN'`. After renaming, print the head of the new DataFrame as follows:

   ```python
   renamedBostonData = rawBostonData.rename(
       columns={'CRIM': 'crimeRatePerCapita',
                ' ZN ': 'landOver25K_sqft',
                'INDUS ': 'non-retailLandProptn',
                'CHAS': 'riverDummy',
                'NOX': 'nitrixOxide_pp10m',
                'RM': 'AvgNo.RoomsPerDwelling',
                'AGE': 'ProptnOwnerOccupied',
                'DIS': 'weightedDist',
                'RAD': 'radialHighwaysAccess',
                'TAX': 'propTaxRate_per10K',
                'PTRATIO': 'pupilTeacherRatio',
                'LSTAT': 'pctLowerStatus',
                'MEDV': 'medianValue_Ks'})
   renamedBostonData.head()
   ```

   You should get the following output:

   Caption: DataFrames being renamed
9. Inspect the data types of the columns in your DataFrame using the `.info()` method:

   ```python
   renamedBostonData.info()
   ```

   You should get the following output:

   Caption: The different data types in the dataset
10. Now, calculate basic statistics for the numeric columns in the DataFrame:

    ```python
    renamedBostonData.describe(include=[np.number]).T
    ```

    You should get the following output:

    Caption: Basic statistics of the numeric columns
11. Divide the DataFrame into training and test sets, as shown in the following code snippet:

    ```python
    X = renamedBostonData.drop('crimeRatePerCapita', axis=1)
    y = renamedBostonData[['crimeRatePerCapita']]
    seed = 10
    test_data_size = 0.3
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_data_size, random_state=seed)
    train_data = pd.concat([X_train, y_train], axis=1)
    test_data = pd.concat([X_test, y_test], axis=1)
    ```
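With `test_size=0.3`, roughly 70% of the rows go to training and 30% to test, and `random_state` makes the split reproducible. A quick sketch on a ten-row toy frame reusing the exercise's target column name:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten-row toy stand-in for renamedBostonData.
df = pd.DataFrame({'crimeRatePerCapita': np.arange(10.0),
                   'medianValue_Ks': np.arange(10.0) * 2})
X = df.drop('crimeRatePerCapita', axis=1)
y = df[['crimeRatePerCapita']]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

print(len(X_train), len(X_test))  # 7 3
# Rerunning with the same random_state always produces the same rows
# in each split, so results in later exercises are repeatable.
```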
12. Calculate and plot a correlation matrix for the `train_data` set:

    ```python
    corrMatrix = train_data.corr(method='pearson')
    xnames = list(train_data.columns)
    ynames = list(train_data.columns)
    plot_corr(corrMatrix, xnames=xnames, ynames=ynames,
              title=None, normcolor=False, cmap='RdYlBu_r')
    ```

    You should get the following output:
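Beyond the heatmap, it is often handy to rank how strongly each column correlates with the target variable; a ranking like this is what guides the variable choice in Activity 2.02. A sketch on synthetic data (in the exercise you would inspect the `crimeRatePerCapita` column of `corrMatrix` instead):

```python
import numpy as np
import pandas as pd

# Synthetic frame: 'strong' is nearly a linear function of 'target',
# while 'noise' is unrelated.
rng = np.random.RandomState(0)
x = rng.randn(200)
df = pd.DataFrame({'target': x,
                   'strong': 2 * x + 0.1 * rng.randn(200),
                   'noise': rng.randn(200)})

corr = df.corr(method='pearson')
# Absolute correlation of every other column with 'target', strongest first.
print(corr['target'].drop('target').abs().sort_values(ascending=False))
```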
Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
Scatter graphs fitted with a regression line are a quick way by which a data scientist can visualize a possible correlation between a dependent variable and an independent variable.
The goal of the exercise is to use this technique to investigate any linear relationship that may exist between crime rate per capita and the median value of owner-occupied homes in towns in the city of Boston.
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook file and execute the steps up to and including Step 11 from Exercise 2.01, Loading and Preparing the Data for Analysis. This is shown in the code blocks below, starting with the import statements:

   ```python
   %matplotlib inline
   import matplotlib as mpl
   import seaborn as sns
   import matplotlib.pyplot as plt
   import statsmodels.formula.api as smf
   import statsmodels.graphics.api as smg
   import pandas as pd
   import numpy as np
   import patsy
   from statsmodels.graphics.correlation import plot_corr
   from sklearn.model_selection import train_test_split
   plt.style.use('seaborn')
   ```

   Loading and preprocessing the data:

   ```python
   rawBostonData = pd.read_csv(
       'https://raw.githubusercontent.com/'
       'fenago/data-science'
       '/master/Lab02/'
       'Dataset/Boston.csv')
   rawBostonData = rawBostonData.dropna()
   rawBostonData = rawBostonData.drop_duplicates()
   renamedBostonData = rawBostonData.rename(
       columns={'CRIM': 'crimeRatePerCapita',
                ' ZN ': 'landOver25K_sqft',
                'INDUS ': 'non-retailLandProptn',
                'CHAS': 'riverDummy',
                'NOX': 'nitrixOxide_pp10m',
                'RM': 'AvgNo.RoomsPerDwelling',
                'AGE': 'ProptnOwnerOccupied',
                'DIS': 'weightedDist',
                'RAD': 'radialHighwaysAccess',
                'TAX': 'propTaxRate_per10K',
                'PTRATIO': 'pupilTeacherRatio',
                'LSTAT': 'pctLowerStatus',
                'MEDV': 'medianValue_Ks'})
   ```

   Setting up the training and test data:

   ```python
   X = renamedBostonData.drop('crimeRatePerCapita', axis=1)
   y = renamedBostonData[['crimeRatePerCapita']]
   seed = 10
   test_data_size = 0.3
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=test_data_size, random_state=seed)
   train_data = pd.concat([X_train, y_train], axis=1)
   test_data = pd.concat([X_test, y_test], axis=1)
   ```
2. Now use the `subplots` function in `matplotlib` to define a canvas (assigned the variable name `fig` in the following code) and a graph object (assigned the variable name `ax` in the following code) in Python. You can set the size of the graph by setting the `figsize` (width = 10, height = 6) argument of the function:

   ```python
   fig, ax = plt.subplots(figsize=(10, 6))
   ```

   Do not execute the code yet.
3. Use the `seaborn` function `regplot` to create the scatter plot. Do not execute this code cell yet; we will add more code to style the plot in the next step:

   ```python
   sns.regplot(x='medianValue_Ks', y='crimeRatePerCapita',
               ci=None, data=train_data, ax=ax, color='k',
               scatter_kws={"s": 20, "color": "royalblue",
                            "alpha": 1})
   ```

   The function accepts arguments for the independent variable (`x`), the dependent variable (`y`), the confidence interval of the regression parameters (`ci`), which takes values from 0 to 100, the DataFrame that has `x` and `y` (`data`), a matplotlib graph object (`ax`), and others to control the aesthetics of the points on the graph. (In this case, the confidence interval is set to `None` -- we will see more on confidence intervals later in the lab.)
4. In the same cell as Step 3, set the `x` and `y` labels, the `fontsize` and `fontname` of the labels, the `x` and `y` limits, and the `tick` parameters of the matplotlib graph object (`ax`). Also, set the layout of the canvas to `tight`:

   ```python
   ax.set_ylabel('Crime rate per Capita', fontsize=15,
                 fontname='DejaVu Sans')
   ax.set_xlabel("Median value of owner-occupied homes "
                 "in $1000's", fontsize=15,
                 fontname='DejaVu Sans')
   ax.set_xlim(left=None, right=None)
   ax.set_ylim(bottom=None, top=30)
   ax.tick_params(axis='both', which='major', labelsize=12)
   fig.tight_layout()
   ```

   Now execute the cell. You should get the following output:
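Under the hood, the line `regplot` draws is an ordinary least squares fit. A rough, self-contained equivalent with plain matplotlib and `np.polyfit`, on synthetic data standing in for the two Boston columns:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')           # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-ins for medianValue_Ks (x) and crimeRatePerCapita (y).
rng = np.random.RandomState(0)
x = rng.uniform(5, 50, 100)
y = -0.5 * x + 30 + 2 * rng.randn(100)

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x, y, s=20, color='royalblue')
grid = np.linspace(x.min(), x.max(), 100)
ax.plot(grid, slope * grid + intercept, color='k')
```

`regplot` adds conveniences on top of this (confidence bands, styling keywords), but the fitted line itself is the same least-squares estimate.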
Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
In this exercise, we will use the logarithm function to transform variables and investigate whether this helps provide a better fit of the regression line to the data. We will also look at how to use confidence intervals by including a 95% confidence interval of the regression coefficients on the plot.
The following steps will help you to complete this exercise:
1. Open a new Jupyter notebook file and execute all the steps up to Step 11 from Exercise 2.01, Loading and Preparing the Data for Analysis.

2. Use the `subplots` function in `matplotlib` to define a canvas and a graph object in Python:

   ```python
   fig, ax = plt.subplots(figsize=(10, 6))
   ```

   Do not execute this code yet.
3. In the same code cell, use the logarithm function in `numpy` (`np.log`) to transform the dependent variable (`y`). This essentially creates a new variable, `log(y)`:

   ```python
   y = np.log(train_data['crimeRatePerCapita'])
   ```

   Do not execute this code yet.
4. Use the seaborn `regplot` function to create the scatter plot. Set the `regplot` function's confidence interval argument (`ci`) to `95`. This will calculate a 95% confidence interval for the regression coefficients and have it plotted on the graph as a shaded area along the regression line.

   Note

   A confidence interval gives an estimated range that is likely to contain the true value that you're looking for. So, a 95% confidence interval indicates we can be 95% certain that the true regression line lies in that shaded area.

   Pass the `y` argument the new variable we defined in the preceding step. The `x` argument is the original variable from the DataFrame without any transformation. Continue in the same code cell. Do not execute this cell yet; we will add in more styling code in the next step.

   ```python
   sns.regplot(x='medianValue_Ks', y=y, ci=95,
               data=train_data, ax=ax, color='k',
               scatter_kws={"s": 20, "color": "royalblue",
                            "alpha": 1})
   ```
5. Continuing in the same cell, set the `x` and `y` labels, the `fontsize` and `fontname` of the labels, the `x` and `y` limits, and the `tick` parameters of the `matplotlib` graph object (`ax`). Also, set the layout of the canvas to `tight`:

   ```python
   ax.set_ylabel('log of Crime rate per Capita',
                 fontsize=15, fontname='DejaVu Sans')
   ax.set_xlabel("Median value of owner-occupied homes "
                 "in $1000's", fontsize=15,
                 fontname='DejaVu Sans')
   ax.set_xlim(left=None, right=None)
   ax.set_ylim(bottom=None, top=None)
   ax.tick_params(axis='both', which='major', labelsize=12)
   fig.tight_layout()
   ```

   Now execute this cell. The output is as follows:
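Why the log transform helps: the crime rate is strictly positive and heavily right-skewed, and the logarithm turns multiplicative gaps into additive ones, pulling extreme values in toward the rest of the data. A tiny numeric sketch:

```python
import numpy as np

# Values spanning four orders of magnitude, like a right-skewed rate.
vals = np.array([0.01, 0.1, 1.0, 10.0, 100.0])

logged = np.log(vals)
print(logged)
# Consecutive differences are all equal to log(10): the factor-of-ten
# jumps become evenly spaced steps on the log scale.
print(np.diff(logged))
```

Note that `np.log` is only defined for positive values, which is why this transformation suits a rate variable like `crimeRatePerCapita`.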
Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API
In this exercise, we examine a simple linear regression model where the crime rate per capita is the dependent variable and the median value of owner-occupied homes is the independent variable. We use the statsmodels formula API to create a linear regression model for Python to analyze.
The following steps will help you complete this exercise:
1. Open a new Jupyter notebook file and import the required packages:

   ```python
   import pandas as pd
   import statsmodels.formula.api as smf
   from sklearn.model_selection import train_test_split
   ```

2. Execute Steps 2 to 11 from Exercise 2.01, Loading and Preparing the Data for Analysis.
3. Define a linear regression model and assign it to a variable named `linearModel`:

   ```python
   linearModel = smf.ols(
       formula='crimeRatePerCapita ~ medianValue_Ks',
       data=train_data)
   ```

   As you can see, we call the `ols` function of the statsmodels API and set its formula argument by defining a `patsy` formula string that uses the tilde (`~`) symbol to relate the dependent variable to the independent variable. Tell the function where to find the variables named in the string by assigning the `data` argument of the `ols` function to the DataFrame that contains your variables (`train_data`).
4. Call the `.fit` method of the model instance and assign the results of the method to a `linearModelResult` variable, as shown in the following code snippet:

   ```python
   linearModelResult = linearModel.fit()
   ```
5. Print a summary of the results stored in the `linearModelResult` variable by running the following code:

   ```python
   print(linearModelResult.summary())
   ```

   You should get the following output:
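The quantities in the printed summary are also available programmatically on the results object through the standard statsmodels results API. A sketch on a small synthetic frame reusing the exercise's column names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: crime rate falls with median home value, plus noise.
rng = np.random.RandomState(0)
mv = rng.uniform(5, 50, 200)
df = pd.DataFrame({'medianValue_Ks': mv,
                   'crimeRatePerCapita': 25 - 0.4 * mv + rng.randn(200)})

result = smf.ols('crimeRatePerCapita ~ medianValue_Ks', data=df).fit()

print(result.params['medianValue_Ks'])  # fitted slope, close to -0.4
print(result.rsquared)                  # share of variance explained
print(result.conf_int())                # 95% confidence intervals
```

Accessing `params`, `rsquared`, and `conf_int()` directly is handy when you want to compare several models without reading each full summary table.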
Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels Formula API
You have seen how to use the statsmodels API to fit a linear regression model. In this activity, you are asked to fit a log-linear model. Your model should represent the relationship between the log-transformed dependent variable (log of crime rate per capita) and the median value of owner-occupied homes.
The steps to complete this activity are as follows:
1. Define a linear regression model and assign it to a variable. Remember to use the `log` function to transform the dependent variable in the formula string.

2. Call the `fit` method of the log-linear model instance and assign the results of the method to a variable.

3. Print a summary of the results and analyze the output.
Your output should look like the following figure:
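One way to write the model this activity asks for: `np.log` can be applied to the dependent variable directly inside the patsy formula string. A sketch on synthetic data so it is self-contained (in the activity you would pass `train_data` instead):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, strictly positive dependent variable so np.log is well defined.
rng = np.random.RandomState(1)
mv = rng.uniform(5, 50, 200)
df = pd.DataFrame({'medianValue_Ks': mv,
                   'crimeRatePerCapita': np.exp(2 - 0.1 * mv
                                                + 0.2 * rng.randn(200))})

# The transformation lives inside the formula itself.
logLinearModel = smf.ols('np.log(crimeRatePerCapita) ~ medianValue_Ks',
                         data=df)
print(logLinearModel.fit().params['medianValue_Ks'])  # close to -0.1
```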
Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels Formula API
In this exercise, we will be using the plus operator (`+`) in the patsy formula string to define a linear regression model that includes more than one independent variable.

To complete this exercise, run the code in the following steps in your Jupyter notebook:
1. Open a new Jupyter notebook file and import the required packages:

   ```python
   import statsmodels.formula.api as smf
   import pandas as pd
   from sklearn.model_selection import train_test_split
   ```

2. Execute Steps 2 to 11 from Exercise 2.01, Loading and Preparing the Data for Analysis.
3. Use the plus operator (`+`) of the patsy formula language to define a linear model that regresses `crimeRatePerCapita` on `pctLowerStatus`, `radialHighwaysAccess`, `medianValue_Ks`, and `nitrixOxide_pp10m`, and assign it to a variable named `multiLinearModel`. Use the Python line continuation symbol (`\`) to continue your code on a new line should you run out of space:

   ```python
   multiLinearModel = smf.ols(
       formula='crimeRatePerCapita \
               ~ pctLowerStatus \
               + radialHighwaysAccess \
               + medianValue_Ks \
               + nitrixOxide_pp10m',
       data=train_data)
   ```
Call the
fitmethod of the model instance and assign the results of the method to a variable:multiLinearModResult = multiLinearModel.fit() -
5. Print a summary of the results stored in the variable created in Step 4:

   ```python
   print(multiLinearModResult.summary())
   ```

The output is as follows:
Activity 2.02: Fitting a Multiple Log-Linear Regression Model
A log-linear regression model you developed earlier was able to explain about 24% of the variability in the transformed crime rate per capita variable (see the values in Figure 2.17). You are now asked to develop a log-linear multiple regression model that will likely explain 80% or more of the variability in the transformed dependent variable. You should use independent variables from the Boston Housing dataset that have a correlation coefficient of 0.4 or more.
You are also encouraged to include the interactions of these variables up to order two in your model. You should produce graphs and data showing that your model satisfies the assumptions of linear regression.
The steps are as follows:
1. Define a linear regression model and assign it to a variable. Remember to use the `log` function to transform the dependent variable in the formula string, and also include more than one independent variable in your analysis.

2. Call the `fit` method of the model instance and assign the results of the method to a new variable.

3. Print a summary of the results and analyze your model.

Your output should appear as shown:
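The "interactions up to order two" in this activity can be written with the patsy `**` operator: `(a + b + c)**2` expands to the main effects plus every pairwise interaction. A sketch using `patsy.dmatrix` on a toy frame with placeholder column names `a`, `b`, and `c` to show the expansion:

```python
import numpy as np
import pandas as pd
import patsy

rng = np.random.RandomState(0)
df = pd.DataFrame({'a': rng.randn(5), 'b': rng.randn(5),
                   'c': rng.randn(5)})

# (a + b + c)**2 expands to the main effects a, b, c plus the
# pairwise interaction terms a:b, a:c, and b:c.
design = patsy.dmatrix('(a + b + c)**2', data=df)
print(design.design_info.column_names)
```

The same `**2` expression can be dropped into the right-hand side of an `smf.ols` formula, with your chosen Boston Housing columns in place of `a`, `b`, and `c`.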
Summary
This lab introduced the topic of linear regression analysis using Python. We learned that regression analysis, in general, is a supervised machine learning or data science problem. We learned about the fundamentals of linear regression analysis, including the ideas behind the method of least squares. We also learned about how to use the pandas Python module to load and prepare data for exploration and analysis.