Introduction to Data Science in Python
Overview
This very first lab will introduce you to the field of data science and walk you through an overview of Python's core concepts and their application in the world of data science.
Numeric Variables
Let's use an integer variable called var1 that will take
the value 8 and another one called var2 with the
value 160.88, and add them together with the +
operator, as shown here:
var1 = 8
var2 = 160.88
var1 + var2
You should get the following output:
Caption: Output of the addition of two variables
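As a small aside, adding an integer to a floating-point number produces a floating-point result. The sketch below reuses var1 and var2 from the text; the name total is introduced only for illustration.

```python
var1 = 8         # integer
var2 = 160.88    # floating-point number

# Python converts the integer to a float before adding
total = var1 + var2

print(type(var1).__name__)   # int
print(type(var2).__name__)   # float
print(type(total).__name__)  # float
```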
Text Variables
Another useful type of variable is the string, which
contains textual information. You can create a string variable by
enclosing text in single or double quotes, as shown in the
following example:
var3 = 'Hello, '
var4 = 'World'
In order to display the content of a variable, you can call the
print() function:
print(var3)
print(var4)
You should get the following output:
Caption: Printing the two text variables
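Python also provides f-strings (formatted string literals), which let you
embed the value of a variable directly inside a string. For instance, if
we want to print Text: before the values of var3 and var4, we will write
the following code: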
print(f"Text: {var3} {var4}!")
You should get the following output:
Caption: Printing with f-strings
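To make the f-string behavior more concrete, here is a minimal sketch that reuses the variables defined earlier in this lab. The names message and text are introduced only for this illustration.

```python
var1 = 8
var2 = 160.88
var3 = 'Hello, '
var4 = 'World'

# Any expression can go inside the braces, not only a variable name.
# The ':.2f' format specifier keeps two decimal places.
message = f"Sum: {var1 + var2:.2f}"
print(message)  # Sum: 168.88

# var3 already ends with a space, so the space between the two
# placeholders results in two consecutive spaces in the output.
text = f"Text: {var3} {var4}!"
print(text)
```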
You can also perform some text-related transformations with string
variables, such as capitalizing or replacing characters. For instance,
you can concatenate the two variables together with the +
operator:
var3 + var4
You should get the following output:
Caption: Concatenation of the two text variables
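The capitalizing and replacing transformations mentioned above can be sketched with built-in string methods. This is only an illustration; the names greeting, upper, replaced, and capitalized are not part of the lab.

```python
var3 = 'Hello, '
var4 = 'World'

greeting = var3 + var4                            # concatenation
upper = greeting.upper()                          # every letter in upper case
replaced = greeting.replace('World', 'Python')    # swap one substring for another
capitalized = 'data science'.capitalize()         # first letter in upper case

print(upper)        # HELLO, WORLD
print(replaced)     # Hello, Python
print(capitalized)  # Data science

# Strings are immutable: each method returns a new string
# and leaves the original unchanged.
print(greeting)     # Hello, World
```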
Python List
Another very useful type of variable is the list. It is a collection of
items that can be changed (you can add, update, or remove items). To
declare a list, you will need to use square brackets, [],
like this:
var5 = ['I', 'love', 'data', 'science']
print(var5)
You should get the following output:
Caption: List containing only string items
A list can have different item types, so you can mix numerical and text variables in it:
var6 = ['Fenago', 15019, 2020, 'Data Science']
print(var6)
An item in a list can be accessed by its index (its position in the list). To access the first (index 0) and third elements (index 2) of a list, you do the following:
print(var6[0])
print(var6[2])
If you want to get the first three items (index 0 to 2), you should do as follows:
print(var6[0:3])
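A short sketch of how indexing and slicing behave, using the same var6 list. Note that the end index of a slice is excluded, which is why 0:3 returns the items at indices 0, 1, and 2. The variable names below are illustrative only.

```python
var6 = ['Fenago', 15019, 2020, 'Data Science']

last_item = var6[-1]     # negative indices count from the end
first_three = var6[0:3]  # indices 0, 1, 2 (index 3 is excluded)
from_second = var6[1:]   # omit the end index to go to the end of the list

print(last_item)    # Data Science
print(first_three)  # ['Fenago', 15019, 2020]
print(from_second)  # [15019, 2020, 'Data Science']
```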
You can also iterate through every item of a list using a
for loop. If you want to print every item of the
var6 list, you should do this:
for item in var6:
    print(item)
You should get the following output:
You can add an item at the end of the list using the
.append() method:
var6.append('Python')
print(var6)
To delete an item from the list, you use the .remove()
method:
var6.remove(15019)
print(var6)
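Two details of .remove() are worth a quick sketch: it deletes only the first matching item, and it raises a ValueError when the value is not in the list. The list name items and the flag removed are made up for this example.

```python
items = ['Fenago', 15019, 2020, 'Data Science']

# Add the same value twice, then remove it once
items.append('Python')
items.append('Python')
items.remove('Python')   # only the first match is removed
print(items)             # ['Fenago', 15019, 2020, 'Data Science', 'Python']

# Removing a value that is not present raises ValueError
try:
    items.remove('R')
except ValueError:
    removed = False
else:
    removed = True
print(removed)           # False
```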
Python Dictionary
To define a dictionary in Python, you
will use curly brackets, {}, and specify the keys and values
separated by :, as shown here:
var7 = {'Topic': 'Data Science', 'Language': 'Python'}
print(var7)
You should get the following output:
Caption: Output of var7
To access a specific value, you need to provide the corresponding key
name. For instance, if you want to get the value Python, you
do this:
var7['Language']
You should get the following output:
Note
Each key in a dictionary needs to be unique. Values, on the other hand, can be repeated across keys.
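The following sketch illustrates the note above: assigning to an existing key overwrites its value rather than creating a second entry. The dictionary named other is hypothetical and only shows that repeated values are allowed.

```python
var7 = {'Topic': 'Data Science', 'Language': 'Python'}

# Assigning to an existing key replaces the value; no duplicate key is created
var7['Language'] = 'R'
print(var7)       # {'Topic': 'Data Science', 'Language': 'R'}
print(len(var7))  # 2

# The same value may appear under different keys
other = {'first': 'Python', 'second': 'Python'}
print(len(other))  # 2
```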
Python provides a method to access all the key names from a dictionary,
.keys(), which is used as shown in the following code
snippet:
var7.keys()
You should get the following output:
Caption: List of key names
There is also a method called .values(), which is used to
access all the values of a dictionary:
var7.values()
You should get the following output:
Caption: List of values
You can iterate through all items from a dictionary using a
for loop and the .items() method, as shown in
the following code snippet:
for key, value in var7.items():
    print(key)
    print(value)
You should get the following output:
You can add a new element to a dictionary by assigning a value to a new key name, like this:
var7['Publisher'] = 'Fenago'
print(var7)
You can delete an item from a dictionary with the del
command:
del var7['Publisher']
print(var7)
You should get the following output:
Caption: Output of a dictionary after removing an item
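Once a key has been deleted, accessing it with square brackets raises a KeyError. A minimal sketch of two safer ways to look up a key follows; the names has_topic and publisher, and the fallback value 'unknown', are assumptions made for this example.

```python
var7 = {'Topic': 'Data Science', 'Language': 'Python'}

# The 'in' operator checks whether a key exists
has_topic = 'Topic' in var7
print(has_topic)   # True

# .get() returns a fallback value instead of raising KeyError
publisher = var7.get('Publisher', 'unknown')
print(publisher)   # unknown
```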
Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
In this exercise, we will create a dictionary using Python that will contain a collection of different machine learning algorithms that will be covered in this course.
The following steps will help you complete the exercise:
- Open a new Jupyter notebook.
- Create a list called `algorithm` that will contain the following elements: `Linear Regression`, `Logistic Regression`, `RandomForest`, and `a3c`:
    algorithm = ['Linear Regression', 'Logistic Regression', \
                 'RandomForest', 'a3c']
  Note
  The code snippet shown above uses a backslash (`\`) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash and treat the code on the next line as a direct continuation of the current line.
- Now, create a list called `learning` that will contain the following elements: `Supervised`, `Supervised`, `Supervised`, and `Reinforcement`:
    learning = ['Supervised', 'Supervised', 'Supervised', \
                'Reinforcement']
- Create a list called `algorithm_type` that will contain the following elements: `Regression`, `Classification`, `Regression or Classification`, and `Game AI`:
    algorithm_type = ['Regression', 'Classification', \
                      'Regression or Classification', 'Game AI']
- Add an item called `k-means` into the `algorithm` list using the `.append()` method:
    algorithm.append('k-means')
- Display the content of `algorithm` using the `print()` function:
    print(algorithm)
  You should get the following output:
  Caption: Output of 'algorithm'
  From the preceding output, we can see that we added the `k-means` item to the list.
- Now, add the `Unsupervised` item into the `learning` list using the `.append()` method:
    learning.append('Unsupervised')
- Display the content of `learning` using the `print()` function:
    print(learning)
  You should get the following output:
  Caption: Output of 'learning'
  From the preceding output, we can see that we added the `Unsupervised` item into the list.
- Add the `Clustering` item into the `algorithm_type` list using the `.append()` method:
    algorithm_type.append('Clustering')
- Display the content of `algorithm_type` using the `print()` function:
    print(algorithm_type)
  You should get the following output:
  Caption: Output of 'algorithm_type'
  From the preceding output, we can see that we added the `Clustering` item into the list.
- Create an empty dictionary called `machine_learning` using curly brackets, `{}`:
    machine_learning = {}
- Create a new item in `machine_learning` with the key as `algorithm` and the value as all the items from the `algorithm` list:
    machine_learning['algorithm'] = algorithm
- Display the content of `machine_learning` using the `print()` function:
    print(machine_learning)
  You should get the following output:
  Caption: Output of 'machine_learning'
  From the preceding output, we notice that we have created a dictionary from the `algorithm` list.
- Create a new item in `machine_learning` with the key as `learning` and the value as all the items from the `learning` list:
    machine_learning['learning'] = learning
- Now, create a new item in `machine_learning` with the key as `algorithm_type` and the value as all the items from the `algorithm_type` list:
    machine_learning['algorithm_type'] = algorithm_type
- Display the content of `machine_learning` using the `print()` function:
    print(machine_learning)
  You should get the following output:
  Caption: Output of 'machine_learning'
- Remove the `a3c` item from the `algorithm` key using the `.remove()` method:
    machine_learning['algorithm'].remove('a3c')
- Display the content of the `algorithm` item from the `machine_learning` dictionary using the `print()` function:
    print(machine_learning['algorithm'])
  You should get the following output:
  Caption: Output of 'algorithm' from 'machine_learning'
- Remove the `Reinforcement` item from the `learning` key using the `.remove()` method:
    machine_learning['learning'].remove('Reinforcement')
- Remove the `Game AI` item from the `algorithm_type` key using the `.remove()` method:
    machine_learning['algorithm_type'].remove('Game AI')
- Display the content of `machine_learning` using the `print()` function:
    print(machine_learning)
  You should get the following output:
  Caption: Output of 'machine_learning'
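One consequence of the exercise above is easy to miss: storing a list in a dictionary does not copy it, so removing an item through the dictionary also changes the original list. The sketch below shows this under the same variable names as the exercise; the independent dictionary is a hypothetical addition that stores a copy instead.

```python
algorithm = ['Linear Regression', 'Logistic Regression',
             'RandomForest', 'a3c']

machine_learning = {}
machine_learning['algorithm'] = algorithm   # stores a reference, not a copy

machine_learning['algorithm'].remove('a3c')
print(algorithm)  # ['Linear Regression', 'Logistic Regression', 'RandomForest']
print(machine_learning['algorithm'] is algorithm)  # True

# Hypothetical variant: list(...) makes a copy, so later changes
# to the dictionary value do not affect the original list.
independent = {'algorithm': list(algorithm)}
independent['algorithm'].append('k-means')
print(len(algorithm))  # 3
```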
Python for Data Science
Python offers a large number of packages for data science. In this section, we will present two of the most popular ones:
pandas and scikit-learn.
The pandas Package
The pandas package provides a large number of APIs for
manipulating data structures. The two main data structures defined in
the pandas package are DataFrame and
Series.
DataFrame and Series
Caption: Components of a DataFrame
In pandas, a DataFrame is represented by the DataFrame
class. A pandas DataFrame is composed of pandas
Series, which are 1-dimensional arrays. A pandas Series is
basically a single column in a DataFrame.
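To make the relationship between a DataFrame and a Series concrete, here is a minimal sketch that builds a DataFrame from a dictionary of lists and then selects one column. It assumes pandas is installed; the variable names data, df, and column are chosen for illustration.

```python
import pandas as pd

# Each key becomes a column name and each list becomes a column
data = {
    'algorithm': ['Linear Regression', 'Logistic Regression',
                  'RandomForest', 'k-means'],
    'learning': ['Supervised', 'Supervised', 'Supervised', 'Unsupervised'],
    'type': ['Regression', 'Classification',
             'Regression or Classification', 'Clustering'],
}
df = pd.DataFrame(data)

# Selecting a single column returns a Series
column = df['algorithm']

print(type(df).__name__)      # DataFrame
print(type(column).__name__)  # Series
print(df.shape)               # (4, 3): 4 rows, 3 columns
```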
CSV Files
CSV files use the comma character (,) to separate columns
and a newline to start each new row. The previous example of a DataFrame would
look like this in a CSV file:
algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
In Python, you need to first import the packages you require before
being able to use them. To do so, you will have to use the
import command. You can create an alias of each imported
package using the as keyword. It is quite common to import
the pandas package with the alias pd:
import pandas as pd
pandas provides a .read_csv() method to easily
load a CSV file directly into a DataFrame. You just need to provide the
path or the URL to the CSV file, as shown below.
Note
Watch out for the slashes in the string below. Remember that the
backslashes ( \ ) are used to split the code across multiple
lines, while the forward slashes ( / ) are part of the URL.
pd.read_csv('https://raw.githubusercontent.com/fenago'\
'/data-science/master/Lab01/'\
'Dataset/csv_example.csv')
You should get the following output:
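If you want to experiment without network access, .read_csv() also accepts a file-like object. The sketch below is an offline variant, not the lab's own example: it wraps the CSV text shown earlier in io.StringIO and loads it. The names csv_text and df are assumptions.

```python
import io
import pandas as pd

# The same CSV content as shown above, held in a string
csv_text = """algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
"""

# io.StringIO makes the string behave like an open text file
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (4, 3)
print(list(df.columns))  # ['algorithm', 'learning', 'type']
```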
Excel Spreadsheets
Excel is a Microsoft tool and is very popular in the industry. It has
its own internal structure for recording additional information, such as
the data type of each cell or even Excel formulas. There is a specific
method in pandas to load Excel spreadsheets called
.read_excel():
pd.read_excel('https://github.com/fenago'\
'/data-science/blob/master'\
'/Lab01/Dataset/excel_example.xlsx?raw=true')
You should get the following output:
Caption: DataFrame after loading an Excel spreadsheet
JSON
JSON is a very popular file format, mainly used for transferring data from web APIs. Its structure is very similar to that of a Python dictionary with key-value pairs. The example DataFrame we used before would look like this in JSON format:
{
"algorithm":{
"0":"Linear Regression",
"1":"Logistic Regression",
"2":"RandomForest",
"3":"k-means"
},
"learning":{
"0":"Supervised",
"1":"Supervised",
"2":"Supervised",
"3":"Unsupervised"
},
"type":{
"0":"Regression",
"1":"Classification",
"2":"Regression or Classification",
"3":"Clustering"
}
}
As you may have guessed, there is a pandas method for
reading JSON data as well, and it is called .read_json():
pd.read_json('https://raw.githubusercontent.com/fenago'\
'/data-science/master/Lab01'\
'/Dataset/json_example.json')
You should get the following output:
Caption: DataFrame after loading JSON data
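The column-oriented JSON structure shown above is also what pandas produces by default when a DataFrame is exported with .to_json(). The following round-trip sketch assumes pandas is installed and uses hypothetical variable names (json_text, parsed, restored); it works entirely in memory rather than loading the lab's file.

```python
import io
import json
import pandas as pd

df = pd.DataFrame({
    'algorithm': ['Linear Regression', 'Logistic Regression',
                  'RandomForest', 'k-means'],
    'learning': ['Supervised', 'Supervised', 'Supervised', 'Unsupervised'],
    'type': ['Regression', 'Classification',
             'Regression or Classification', 'Clustering'],
})

# Export: each column maps row labels ("0", "1", ...) to values
json_text = df.to_json()
parsed = json.loads(json_text)
print(parsed['algorithm']['0'])  # Linear Regression

# Import: read the JSON text back into a DataFrame
restored = pd.read_json(io.StringIO(json_text))
print(restored.shape)            # (4, 3)
```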
Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
In this exercise, we will practice loading different data formats, such as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use is the Top 10 Postcodes for the First Home Owner Grants dataset (this is a grant provided by the Australian government to help first-time real estate buyers). It lists the 10 postcodes (also known as zip codes) with the highest number of First Home Owner grants.
In this dataset, you will find the number of First Home Owner grant applications for each postcode and the corresponding suburb.
The following steps will help you complete the exercise:
- Open a new Jupyter notebook.
- Import the pandas package, as shown in the following code snippet:
    import pandas as pd
- Create a new variable called `csv_url` containing the URL to the raw CSV file:
    csv_url = 'https://raw.githubusercontent.com/fenago'\
              '/data-science/master/Lab01'\
              '/Dataset/overall_topten_2012-2013.csv'
- Load the CSV file into a DataFrame using the pandas `.read_csv()` method. The first row of this CSV file contains the name of the file, which you can see if you open the file directly. You will need to exclude this row by using the `skiprows=1` parameter. Save the result in a variable called `csv_df` and print it:
    csv_df = pd.read_csv(csv_url, skiprows=1)
    csv_df
  You should get the following output:
  Caption: The DataFrame after loading the CSV file
- Create a new variable called `tsv_url` containing the URL to the raw TSV file:
    tsv_url = 'https://raw.githubusercontent.com/fenago'\
              '/data-science/master/Lab01'\
              '/Dataset/overall_topten_2012-2013.tsv'
  Note
  A TSV file is similar to a CSV file but instead of using the comma character (`,`) as a separator, it uses the tab character (`\t`).
- Load the TSV file into a DataFrame using the pandas `.read_csv()` method and specify the `skiprows=1` and `sep='\t'` parameters. Save the result in a variable called `tsv_df` and print it:
    tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t')
    tsv_df
  You should get the following output:
  Caption: The DataFrame after loading the TSV file
- Create a new variable called `xlsx_url` containing the URL to the raw Excel spreadsheet:
    xlsx_url = 'https://github.com/fenago'\
               '/data-science/blob/master/'\
               'Lab01/Dataset'\
               '/overall_topten_2012-2013.xlsx?raw=true'
- Load the Excel spreadsheet into a DataFrame using the pandas `.read_excel()` method. Save the result in a variable called `xlsx_df` and print it:
    xlsx_df = pd.read_excel(xlsx_url)
    xlsx_df
  You should get the following output:
  By default, `.read_excel()` loads the first sheet of an Excel spreadsheet. In this example, the data we're looking for is actually stored in the second sheet.
- Load the Excel spreadsheet into a DataFrame using the pandas `.read_excel()` method and specify the `skiprows=1` and `sheet_name=1` parameters. (Note that the `sheet_name` parameter is zero-indexed, so `sheet_name=0` returns the first sheet, while `sheet_name=1` returns the second sheet.) Save the result in a variable called `xlsx_df1` and print it:
    xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1)
    xlsx_df1
  You should get the following output:
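The effect of the skiprows and sep parameters used in this exercise can be tried offline with a small in-memory file. The content below is made up for illustration (the title row, column names, and values are hypothetical and are not the real dataset); only the two parameters mirror the exercise.

```python
import io
import pandas as pd

# Hypothetical TSV content: a title row, then a header row, then data rows
tsv_text = (
    "Example title row\n"
    "Postcode\tSuburb\tApplications\n"
    "1000\tExampleville\t10\n"
    "2000\tSampletown\t7\n"
)

# skiprows=1 drops the title row; sep='\t' splits columns on tabs
tsv_df = pd.read_csv(io.StringIO(tsv_text), skiprows=1, sep='\t')

print(list(tsv_df.columns))  # ['Postcode', 'Suburb', 'Applications']
print(tsv_df.shape)          # (2, 3)
```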
The sklearn API
sklearn groups algorithms by family. For instance,
RandomForest and GradientBoosting are part of
the ensemble module. In order to make use of an algorithm,
you will need to import it first like this:
from sklearn.ensemble import RandomForestClassifier
The first step is to instantiate the model. It is recommended to at
least set the random_state hyperparameter so that you get reproducible
results every time you run the same code:
rf_model = RandomForestClassifier(random_state=1)
The second step is to train the model with some data. In this example,
we will use a simple dataset that classifies 178 instances of Italian
wines into 3 categories based on 13 features. This dataset is part of
the few examples that sklearn provides within its API. We
need to load the data first:
from sklearn.datasets import load_wine
features, target = load_wine(return_X_y=True)
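As a quick sanity check, the dimensions described above can be confirmed directly on the loaded arrays. This sketch assumes scikit-learn and NumPy are installed; the name classes is introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

features, target = load_wine(return_X_y=True)

# 178 rows (wines) and 13 columns (features)
print(features.shape)   # (178, 13)

# One label per wine, drawn from 3 categories
classes = np.unique(target)
print(target.shape)     # (178,)
print(len(classes))     # 3
```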
Then, use the .fit() method to train the model, providing
the features and the target variable as input:
rf_model.fit(features, target)
You should get the following output:
Caption: Logs of the trained Random Forest model
In the preceding output, we can see a Random Forest model with the default hyperparameters. You will be introduced to some of them in Lab 4, Multiclass Classification with RandomForest.
Once trained, we can use the .predict() method to predict
the target for one or more observations. Here we will use the same data
as for the training step:
preds = rf_model.predict(features)
preds
You should get the following output:
Caption: Predictions of the trained Random Forest model
Finally, we want to assess the model's performance by comparing its predictions to the actual values of the target variable. There are a lot of different metrics that can be used for assessing model performance, and you will learn more about them later in this course. For now, though, we will just use a metric called accuracy. This metric calculates the ratio of correct predictions to the total number of observations:
from sklearn.metrics import accuracy_score
accuracy_score(target, preds)
You should get the following output:
Caption: Accuracy of the trained Random Forest model
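To illustrate what the accuracy metric computes, here is a plain-Python sketch of the same ratio. It is not sklearn's implementation; the accuracy function and the small label lists are invented for this example.

```python
def accuracy(actual, predicted):
    """Return the share of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Made-up labels: 4 of the 5 predictions are correct
example_target = [0, 1, 2, 1, 0]
example_preds = [0, 1, 1, 1, 0]

score = accuracy(example_target, example_preds)
print(score)  # 0.8
```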
Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
In this exercise, we will build a machine learning classifier using
RandomForest from sklearn to predict whether the
breast cancer of a patient is malignant (harmful) or benign (not
harmful).
The following steps will help you complete the exercise:
- Open a new Jupyter notebook.
- Import the `load_breast_cancer` function from `sklearn.datasets`:
    from sklearn.datasets import load_breast_cancer
- Load the dataset from the `load_breast_cancer` function with the `return_X_y=True` parameter to return the features and response variable only:
    features, target = load_breast_cancer(return_X_y=True)
- Print the `features` variable:
    print(features)
  You should get the following output:
  Caption: Output of the variable features
  The preceding output shows the values of the features. (You can learn more about the features from the link given previously.)
- Print the `target` variable:
    print(target)
  You should get the following output:
  Caption: Output of the variable target
  The preceding output shows the values of the target variable. There are two classes shown for each instance in the dataset. These classes are `0` and `1`, representing whether the cancer is malignant or benign.
- Import the `RandomForestClassifier` class from `sklearn.ensemble`:
    from sklearn.ensemble import RandomForestClassifier
- Create a new variable called `seed`, which will take the value `888` (chosen arbitrarily):
    seed = 888
- Instantiate `RandomForestClassifier` with the `random_state=seed` parameter and save it into a variable called `rf_model`:
    rf_model = RandomForestClassifier(random_state=seed)
- Train the model with the `.fit()` method with `features` and `target` as parameters:
    rf_model.fit(features, target)
  You should get the following output:
  Caption: Logs of RandomForestClassifier
- Make predictions with the trained model using the `.predict()` method and `features` as a parameter and save the results into a variable called `preds`:
    preds = rf_model.predict(features)
- Print the `preds` variable:
    print(preds)
  You should get the following output:
  Caption: Predictions of the Random Forest model
  The preceding output shows the predictions for the training set. You can compare this with the actual target variable values shown in *Figure 1.48*.
- Import the `accuracy_score` function from `sklearn.metrics`:
    from sklearn.metrics import accuracy_score
- Calculate `accuracy_score()` with `target` and `preds` as parameters:
    accuracy_score(target, preds)
  You should get the following output:
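The meaning of the `0` and `1` labels used in this exercise can be checked on the dataset object itself. The sketch below calls load_breast_cancer() without return_X_y so that the metadata is available; it assumes scikit-learn is installed, and the names dataset and class_names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Without return_X_y=True, the function returns an object that also
# carries metadata such as the class names.
dataset = load_breast_cancer()

# The position in target_names is the label: index 0, then index 1
class_names = [str(name) for name in dataset.target_names]
print(class_names)         # ['malignant', 'benign']

# Number of observations and number of features
print(dataset.data.shape)  # (569, 30)
```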
Activity 1.01: Train a Spam Detector Algorithm
You are working for an email service provider and have been tasked with training an algorithm that recognizes whether an email is spam or not from a given dataset and checking its performance.
In this dataset, the authors have already created 57 different features based on some statistics for relevant keywords in order to classify whether an email is spam or not.
The following steps will help you to complete this activity:
- Import the required libraries.
- Load the dataset using `pd.read_csv()`.
- Extract the response variable using `.pop()` from `pandas`. This method will extract the column provided as a parameter from the DataFrame. You can then assign it a variable name, for example, `target = df.pop('class')`.
- Instantiate `RandomForestClassifier`.
- Train a Random Forest model to predict the outcome with `.fit()`.
- Predict the outcomes from the input data using `.predict()`.
- Calculate the accuracy score using `accuracy_score`.
  The output will be similar to the following:
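For the extraction step in the list above, here is a minimal sketch of how .pop() behaves on a DataFrame. The tiny DataFrame is made up (the feature_a and feature_b columns and all values are hypothetical, not the spam dataset); only the 'class' column name comes from the activity text.

```python
import pandas as pd

# Hypothetical data: two feature columns and a 'class' column
df = pd.DataFrame({
    'feature_a': [0.0, 1.2, 0.0],
    'feature_b': [0.1, 0.9, 0.0],
    'class': [0, 1, 0],
})

# .pop() returns the column as a Series and removes it from the DataFrame
target = df.pop('class')

print(list(df.columns))  # ['feature_a', 'feature_b']
print(target.tolist())   # [0, 1, 0]
```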
Summary
This lab provided you with an overview of what data science is in general. We also learned the different types of machine learning algorithms, including supervised and unsupervised, as well as regression and classification. We had a quick introduction to Python and how to manipulate the main data structures (lists and dictionaries) that will be used in this course.
Then we walked you through what a DataFrame is and how to create one by loading data from different file formats using the famous pandas package. Finally, we learned how to use the sklearn package to train a machine learning model and make predictions with it.