mirror of
https://github.com/fenago/data-science.git
synced 2026-05-12 20:44:29 +00:00
1062 lines
23 KiB
Markdown
1062 lines
23 KiB
Markdown
<img align="right" src="./logo.png">
|
||
|
||
|
||
Lab 1. Introduction to Data Science in Python
|
||
=========================================
|
||
|
||
|
||
|
||
Overview
|
||
|
||
This very first lab will introduce you to the field of data science
|
||
and walk you through an overview of Python\'s core concepts and their
|
||
application in the world of data science.
|
||
|
||
|
||
|
||
### Numeric Variables
|
||
|
||
|
||
```
|
||
var1 = 8
|
||
var2 = 160.88
|
||
var1 + var2
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Output of the addition of two variables
|
||
|
||
|
||
|
||
### Text Variables
|
||
|
||
|
||
```
|
||
var3 = 'Hello, '
|
||
var4 = 'World'
|
||
```
|
||
|
||
In order to display the content of a variable, you can call the
|
||
`print()` function:
|
||
|
||
```
|
||
print(var3)
|
||
print(var4)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
|
||
|
||
For instance, if we want to print `Text:` before the values of
|
||
`var3` and `var4`, we will write the following code:
|
||
|
||
```
|
||
print(f"Text: {var3} {var4}!")
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Printing with f-strings
|
||
|
||
|
||
You can concatenate the two variables together with the `+`
|
||
operator:
|
||
|
||
```
|
||
var3 + var4
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Concatenation of the two text variables
|
||
|
||
|
||
|
||
### Python List
|
||
|
||
|
||
```
|
||
var5 = ['I', 'love', 'data', 'science']
|
||
print(var5)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: List containing only string items
|
||
|
||
A list can have different item types, so you can mix numerical and text
|
||
variables in it:
|
||
|
||
```
|
||
var6 = ['Fenago', 15019, 2020, 'Data Science']
|
||
print(var6)
|
||
```
|
||
|
||
|
||
An item in a list can be accessed by its index (its position in the
|
||
list). To access the first (index 0) and third elements (index 2) of a
|
||
list, you do the following:
|
||
|
||
```
|
||
print(var6[0])
|
||
print(var6[2])
|
||
```
|
||
|
||
|
||
If you want to get the first three items (index 0 to 2), you should do as follows:
|
||
|
||
```
|
||
print(var6[0:3])
|
||
```
|
||
|
||
You can also iterate through every item of a list using a
|
||
`for` loop. If you want to print every item of the
|
||
`var6` list, you should do this:
|
||
|
||
```
|
||
for item in var6:
|
||
print(item)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||
|
||
|
||
You can add an item at the end of the list using the
|
||
`.append()` method:
|
||
|
||
```
|
||
var6.append('Python')
|
||
print(var6)
|
||
```
|
||
|
||
|
||
|
||
To delete an item from the list, you use the `.remove()`
|
||
method:
|
||
|
||
```
|
||
var6.remove(15019)
|
||
print(var6)
|
||
```
|
||
|
||
|
||
### Python Dictionary
|
||
|
||
To define a dictionary in Python, you
|
||
will use curly brackets, `{}`, and specify the keys and values
|
||
separated by `:`, as shown here:
|
||
|
||
```
|
||
var7 = {'Topic': 'Data Science', 'Language': 'Python'}
|
||
print(var7)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Output of var7
|
||
|
||
To access a specific value, you need to provide the corresponding key
|
||
name. For instance, if you want to get the value `Python`, you
|
||
do this:
|
||
|
||
```
|
||
var7['Language']
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
|
||
Python provides a method to access all the key names from a dictionary,
|
||
`.keys()`
|
||
|
||
```
|
||
var7.keys()
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: List of key names
|
||
|
||
There is also a method called `.values()`, which is used to
|
||
access all the values of a dictionary:
|
||
|
||
```
|
||
var7.values()
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: List of values
|
||
|
||
You can iterate through all items from a dictionary using a
|
||
`for` loop and the `.items()` method, as shown in
|
||
the following code snippet:
|
||
|
||
```
|
||
for key, value in var7.items():
|
||
print(key)
|
||
print(value)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
|
||
You can add a new element in a dictionary by providing the key name like
|
||
this:
|
||
|
||
```
|
||
var7['Publisher'] = 'Fenago'
|
||
print(var7)
|
||
```
|
||
|
||
|
||
You can delete an item from a dictionary with the `del`
|
||
command:
|
||
|
||
```
|
||
del var7['Publisher']
|
||
print(var7)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Output of a dictionary after removing an item
|
||
|
||
|
||
|
||
|
||
|
||
Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
|
||
----------------------------------------------------------------------------------
|
||
|
||
In this exercise, we will create a dictionary using Python that will
|
||
contain a collection of different machine learning algorithms that will
|
||
be covered in this course.
|
||
|
||
The following steps will help you complete the exercise:
|
||
|
||
|
||
1. Open on a new Jupyter notebook.
|
||
|
||
2. Create a list called `algorithm` that will contain the
|
||
following elements: `Linear Regression`,
|
||
`Logistic Regression`, `RandomForest`, and
|
||
`a3c`:
|
||
|
||
```
|
||
algorithm = ['Linear Regression', 'Logistic Regression', \
|
||
'RandomForest', 'a3c']
|
||
```
|
||
|
||
|
||
Note
|
||
|
||
The code snippet shown above uses a backslash ( `\` ) to
|
||
split the logic across multiple lines. When the code is executed,
|
||
Python will ignore the backslash, and treat the code on the next
|
||
line as a direct continuation of the current line.
|
||
|
||
3. Now, create a list called `learning` that will contain the
|
||
following elements: `Supervised`, `Supervised`,
|
||
`Supervised`, and `Reinforcement`:
|
||
```
|
||
learning = ['Supervised', 'Supervised', 'Supervised', \
|
||
'Reinforcement']
|
||
```
|
||
|
||
|
||
4. Create a list called `algorithm_type` that will contain
|
||
the following elements: `Regression`,
|
||
`Classification`,
|
||
`Regression or Classification`, and `Game AI`:
|
||
```
|
||
algorithm_type = ['Regression', 'Classification', \
|
||
'Regression or Classification', 'Game AI']
|
||
```
|
||
|
||
|
||
5. Add an item called `k-means` into the
|
||
`algorithm` list using the `.append()` method:
|
||
```
|
||
algorithm.append('k-means')
|
||
```
|
||
|
||
|
||
6. Display the content of `algorithm` using the
|
||
`print()` function:
|
||
|
||
```
|
||
print(algorithm)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of 'algorithm'
|
||
|
||
From the preceding output, we can see that we added the
|
||
`k-means` item to the list.
|
||
|
||
7. Now, add the `Unsupervised` item into the
|
||
`learning` list using the `.append()` method:
|
||
```
|
||
learning.append('Unsupervised')
|
||
```
|
||
|
||
|
||
8. Display the content of `learning` using the
|
||
`print()` function:
|
||
|
||
```
|
||
print(learning)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of 'learning'
|
||
|
||
From the preceding output, we can see that we added the
|
||
`Unsupervised` item into the list.
|
||
|
||
9. Add the `Clustering` item into the
|
||
`algorithm_type` list using the `.append()`
|
||
method:
|
||
```
|
||
algorithm_type.append('Clustering')
|
||
```
|
||
|
||
|
||
10. Display the content of `algorithm_type` using the
|
||
`print()` function:
|
||
|
||
```
|
||
print(algorithm_type)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of \'algorithm\_type\'
|
||
|
||
From the preceding output, we can see that we added the
|
||
`Clustering` item into the list.
|
||
|
||
11. Create an empty dictionary called `machine_learning` using
|
||
curly brackets, `{}`:
|
||
```
|
||
machine_learning = {}
|
||
```
|
||
|
||
|
||
12. Create a new item in `machine_learning` with the key as
|
||
`algorithm` and the value as all the items from the
|
||
`algorithm` list:
|
||
```
|
||
machine_learning['algorithm'] = algorithm
|
||
```
|
||
|
||
|
||
13. Display the content of `machine_learning` using the
|
||
`print()` function.
|
||
|
||
```
|
||
print(machine_learning)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of machine_learning
|
||
|
||
From the preceding output, we notice that we have created a
|
||
dictionary from the `algorithm` list.
|
||
|
||
14. Create a new item in `machine_learning` with the key as
|
||
`learning` and the value as all the items from the
|
||
`learning` list:
|
||
```
|
||
machine_learning['learning'] = learning
|
||
```
|
||
|
||
|
||
15. Now, create a new item in `machine_learning` with the key
|
||
as `algorithm_type` and the value as all the items from
|
||
the algorithm\_type list:
|
||
```
|
||
machine_learning['algorithm_type'] = algorithm_type
|
||
```
|
||
|
||
|
||
16. Display the content of `machine_learning` using the
|
||
`print()` function.
|
||
|
||
```
|
||
print(machine_learning)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of machine_learning
|
||
|
||
17. Remove the `a3c` item from the `algorithm` key
|
||
using the `.remove()` method:
|
||
```
|
||
machine_learning['algorithm'].remove('a3c')
|
||
```
|
||
|
||
|
||
18. Display the content of the `algorithm` item from the
|
||
`machine_learning` dictionary using the
|
||
`print()` function:
|
||
|
||
```
|
||
print(machine_learning['algorithm'])
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of 'algorithm' from machine_learning
|
||
|
||
19. Remove the `Reinforcement` item from the
|
||
`learning` key using the `.remove()` method:
|
||
```
|
||
machine_learning['learning'].remove('Reinforcement')
|
||
```
|
||
|
||
|
||
20. Remove the `Game AI` item from the
|
||
`algorithm_type` key using the `.remove()`
|
||
method:
|
||
```
|
||
machine_learning['algorithm_type'].remove('Game AI')
|
||
```
|
||
|
||
|
||
21. Display the content of `machine_learning` using the
|
||
`print()` function:
|
||
|
||
```
|
||
print(machine_learning)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of machine_learning
|
||
|
||
|
||
|
||
Python for Data Science
|
||
=======================
|
||
|
||
|
||
In this section, we will present to you two of the most popular ones:
|
||
`pandas` and `scikit-learn`.
|
||
|
||
|
||
|
||
The pandas Package
|
||
------------------
|
||
|
||
The pandas package provides an incredible amount of APIs for
|
||
manipulating data structures. The two main data structures defined in
|
||
the `pandas` package are `DataFrame` and
|
||
`Series`.
|
||
|
||
|
||
|
||
### CSV Files
|
||
|
||
The previous example of a DataFrame would look like this in a CSV file:
|
||
|
||
```
|
||
algorithm,learning,type
|
||
Linear Regression,Supervised,Regression
|
||
Logistic Regression,Supervised,Classification
|
||
RandomForest,Supervised,Regression or Classification
|
||
k-means,Unsupervised,Clustering
|
||
```
|
||
|
||
In Python, you need to first import the packages you require before
|
||
being able to use them. To do so, you will have to use the
|
||
`import` command. You can create an alias of each imported
|
||
package using the `as` keyword. It is quite common to import
|
||
the `pandas` package with the alias `pd`:
|
||
|
||
```
|
||
import pandas as pd
|
||
```
|
||
|
||
|
||
`pandas` provides a `.read_csv()` method to easily
|
||
load a CSV file directly into a DataFrame. You just need to provide the
|
||
path or the URL to the CSV file, as shown below.
|
||
|
||
|
||
```
|
||
pd.read_csv('https://raw.githubusercontent.com/fenago'\
|
||
'/data-science/master/Lab01/'\
|
||
'Dataset/csv_example.csv')
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
|
||
|
||
|
||
### Excel Spreadsheets
|
||
|
||
There is a specific method in `pandas` to load Excel spreadsheets called
|
||
`.read_excel()`:
|
||
|
||
```
|
||
pd.read_excel('https://github.com/fenago'\
|
||
'/data-science/blob/master'\
|
||
'/Lab01/Dataset/excel_example.xlsx?raw=true')
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Dataframe after loading an Excel spreadsheet
|
||
|
||
|
||
|
||
### JSON
|
||
|
||
The example DataFrame we used before
|
||
would look like this in JSON format:
|
||
|
||
```
|
||
{
|
||
"algorithm":{
|
||
"0":"Linear Regression",
|
||
"1":"Logistic Regression",
|
||
"2":"RandomForest",
|
||
"3":"k-means"
|
||
},
|
||
"learning":{
|
||
"0":"Supervised",
|
||
"1":"Supervised",
|
||
"2":"Supervised",
|
||
"3":"Unsupervised"
|
||
},
|
||
"type":{
|
||
"0":"Regression",
|
||
"1":"Classification",
|
||
"2":"Regression or Classification",
|
||
"3":"Clustering"
|
||
}
|
||
}
|
||
```
|
||
As you may have guessed, there is a `pandas` method for
|
||
reading JSON data as well, and it is called `.read_json()`:
|
||
|
||
```
|
||
pd.read_json('https://raw.githubusercontent.com/fenago'\
|
||
'/data-science/master/Lab01'\
|
||
'/Dataset/json_example.json')
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Dataframe after loading JSON data
|
||
|
||
|
||
|
||
Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
|
||
------------------------------------------------------------------------
|
||
|
||
In this exercise, we will practice loading different data formats, such
|
||
as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use
|
||
is the Top 10 Postcodes for the First Home Owner Grants dataset (this is
|
||
a grant provided by the Australian government to help first-time real
|
||
estate buyers). It lists the 10 postcodes (also known as zip codes) with
|
||
the highest number of First Home Owner grants.
|
||
|
||
In this dataset, you will find the number of First Home Owner grant
|
||
applications for each postcode and the corresponding suburb.
|
||
|
||
|
||
The following steps will help you complete the exercise:
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Import the pandas package, as shown in the following code snippet:
|
||
```
|
||
import pandas as pd
|
||
```
|
||
|
||
|
||
3. Create a new variable called `csv_url` containing the URL
|
||
to the raw CSV file:
|
||
```
|
||
csv_url = 'https://raw.githubusercontent.com/fenago'\
|
||
'/data-science/master/Lab01'\
|
||
'/Dataset/overall_topten_2012-2013.csv'
|
||
```
|
||
|
||
|
||
4. Load the CSV file into a DataFrame using the pandas
|
||
`.read_csv()` method. The first row of this CSV file
|
||
contains the name of the file, which you can see if you open the
|
||
file directly. You will need to exclude this row by using the
|
||
`skiprows=1` parameter. Save the result in a variable
|
||
called `csv_df` and print it:
|
||
|
||
```
|
||
csv_df = pd.read_csv(csv_url, skiprows=1)
|
||
csv_df
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The DataFrame after loading the CSV file
|
||
|
||
5. Create a new variable called `tsv_url` containing the URL
|
||
to the raw TSV file:
|
||
|
||
```
|
||
tsv_url = 'https://raw.githubusercontent.com/fenago'\
|
||
'/data-science/master/Lab01'\
|
||
'/Dataset/overall_topten_2012-2013.tsv'
|
||
```
|
||
|
||
|
||
Note
|
||
|
||
A TSV file is similar to a CSV file but instead of using the comma
|
||
character (`,`) as a separator, it uses the tab character
|
||
(`\t`).
|
||
|
||
6. Load the TSV file into a DataFrame using the pandas
|
||
.`read_csv()` method and specify the
|
||
`skiprows=1` and `sep='\t'` parameters. Save the
|
||
result in a variable called `tsv_df` and print it:
|
||
|
||
```
|
||
tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t')
|
||
tsv_df
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: The DataFrame after loading the TSV file
|
||
|
||
7. Create a new variable called `xlsx_url` containing the URL
|
||
to the raw Excel spreadsheet:
|
||
```
|
||
xlsx_url = 'https://github.com/fenago'\
|
||
'/data-science/blob/master/'\
|
||
'Lab01/Dataset'\
|
||
'/overall_topten_2012-2013.xlsx?raw=true'
|
||
```
|
||
|
||
|
||
8. Load the Excel spreadsheet into a DataFrame using the pandas
|
||
`.read_excel()` method. Save the result in a variable
|
||
called `xlsx_df` and print it:
|
||
|
||
```
|
||
xlsx_df = pd.read_excel(xlsx_url)
|
||
xlsx_df
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
By default, `.read_excel()` loads the first sheet of an
|
||
Excel spreadsheet. In this example, the data we\'re looking for is
|
||
actually stored in the second sheet.
|
||
|
||
9. Load the Excel spreadsheet into a Dataframe using the pandas
|
||
`.read_excel()` method and specify the
|
||
`skiprows=1` and `sheet_name=1` parameters.
|
||
(Note that the `sheet_name` parameter is zero-indexed, so
|
||
`sheet_name=0` returns the first sheet, while
|
||
`sheet_name=1` returns the second sheet.) Save the result
|
||
in a variable called `xlsx_df1` and print it:
|
||
|
||
```
|
||
xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1)
|
||
xlsx_df1
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
### The sklearn API
|
||
|
||
|
||
`sklearn` groups algorithms by family. For instance,
|
||
`RandomForest` and `GradientBoosting` are part of
|
||
the `ensemble` module. In order to make use of an algorithm,
|
||
you will need to import it first like this:
|
||
|
||
```
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
```
|
||
|
||
|
||
It is recommended to at least set the `random_state`
|
||
hyperparameter in order to get reproducible results every time that you
|
||
have to run the same code:
|
||
|
||
```
|
||
rf_model = RandomForestClassifier(random_state=1)
|
||
```
|
||
|
||
The second step is to train the model with some data. In this example,
|
||
we will use a simple dataset that classifies 178 instances of Italian
|
||
wines into 3 categories based on 13 features. This dataset is part of
|
||
the few examples that `sklearn` provides within its API. We
|
||
need to load the data first:
|
||
|
||
```
|
||
from sklearn.datasets import load_wine
|
||
features, target = load_wine(return_X_y=True)
|
||
```
|
||
|
||
Then using the `.fit()` method to train the model, you will
|
||
provide the features and the target variable as input:
|
||
|
||
```
|
||
rf_model.fit(features, target)
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Logs of the trained Random Forest model
|
||
|
||
|
||
Once trained, we can use the `.predict()` method to predict
|
||
the target for one or more observations. Here we will use the same data
|
||
as for the training step:
|
||
|
||
```
|
||
preds = rf_model.predict(features)
|
||
preds
|
||
```
|
||
|
||
You should get the following output:
|
||
|
||

|
||
|
||
Caption: Predictions of the trained Random Forest model
|
||
|
||
|
||
|
||
Finally, we want to assess the model\'s performance by comparing its
|
||
predictions to the actual values of the target variable. There are a lot
|
||
of different metrics that can be used for assessing model performance,
|
||
and you will learn more about them later in this course. For now, though,
|
||
we will just use a metric called **accuracy**. This metric calculates
|
||
the ratio of correct predictions to the total number of observations:
|
||
|
||
```
|
||
from sklearn.metrics import accuracy_score
|
||
accuracy_score(target, preds)
|
||
```
|
||
|
||
You should get the following output
|
||
|
||

|
||
|
||
Caption: Accuracy of the trained Random Forest model
|
||
|
||
|
||
|
||
Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
|
||
--------------------------------------------------------------------
|
||
|
||
In this exercise, we will build a machine learning classifier using
|
||
`RandomForest` from `sklearn` to predict whether the
|
||
breast cancer of a patient is malignant (harmful) or benign (not
|
||
harmful).
|
||
|
||
|
||
The following steps will help you complete the exercise:
|
||
|
||
1. Open a new Jupyter notebook.
|
||
|
||
2. Import the `load_breast_cancer` function from
|
||
`sklearn.datasets`:
|
||
```
|
||
from sklearn.datasets import load_breast_cancer
|
||
```
|
||
|
||
|
||
3. Load the dataset from the `load_breast_cancer` function
|
||
with the `return_X_y=True` parameter to return the
|
||
features and response variable only:
|
||
```
|
||
features, target = load_breast_cancer(return_X_y=True)
|
||
```
|
||
|
||
|
||
4. Print the variable features:
|
||
|
||
```
|
||
print(features)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of the variable features
|
||
|
||
The preceding output shows the values of the features. (You can
|
||
learn more about the features from the link given previously.)
|
||
|
||
5. Print the `target` variable:
|
||
|
||
```
|
||
print(target)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Output of the variable target
|
||
|
||
The preceding output shows the values of the target variable. There
|
||
are two classes shown for each instance in the dataset. These
|
||
classes are `0` and `1`, representing whether
|
||
the cancer is malignant or benign.
|
||
|
||
6. Import the `RandomForestClassifier` class from
|
||
`sklearn.ensemble`:
|
||
```
|
||
from sklearn.ensemble import RandomForestClassifier
|
||
```
|
||
|
||
|
||
7. Create a new variable called `seed`, which will take the
|
||
value `888` (chosen arbitrarily):
|
||
```
|
||
seed = 888
|
||
```
|
||
|
||
|
||
8. Instantiate `RandomForestClassifier` with the
|
||
`random_state=seed` parameter and save it into a variable
|
||
called `rf_model`:
|
||
```
|
||
rf_model = RandomForestClassifier(random_state=seed)
|
||
```
|
||
|
||
|
||
9. Train the model with the `.fit()` method with
|
||
`features` and `target` as parameters:
|
||
|
||
```
|
||
rf_model.fit(features, target)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Logs of RandomForestClassifier
|
||
|
||
10. Make predictions with the trained model using the
|
||
`.predict()` method and `features` as a
|
||
parameter and save the results into a variable called
|
||
`preds`:
|
||
```
|
||
preds = rf_model.predict(features)
|
||
```
|
||
|
||
|
||
11. Print the `preds` variable:
|
||
|
||
```
|
||
print(preds)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
Caption: Predictions of the Random Forest model
|
||
|
||
The preceding output shows the predictions for the training set. You
|
||
can compare this with the actual target variable values shown in
|
||
*Figure 1.48*.
|
||
|
||
12. Import the `accuracy_score` method from
|
||
`sklearn.metrics`:
|
||
```
|
||
from sklearn.metrics import accuracy_score
|
||
```
|
||
|
||
|
||
13. Calculate `accuracy_score()` with `target` and
|
||
`preds` as parameters:
|
||
|
||
```
|
||
accuracy_score(target, preds)
|
||
```
|
||
|
||
|
||
You should get the following output:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
Activity 1.01: Train a Spam Detector Algorithm
|
||
----------------------------------------------
|
||
|
||
You are working for an email service provider and have been tasked with
|
||
training an algorithm that recognizes whether an email is spam or not
|
||
from a given dataset and checking its performance.
|
||
|
||
In this dataset, the authors have already created 57 different features
|
||
based on some statistics for relevant keywords in order to classify
|
||
whether an email is spam or not.
|
||
|
||
|
||
The following steps will help you to complete this activity:
|
||
|
||
1. Import the required libraries.
|
||
|
||
2. Load the dataset using `.pd.read_csv()`.
|
||
|
||
3. Extract the response variable using .`pop()` from
|
||
`pandas`. This method will extract the column provided as
|
||
a parameter from the DataFrame. You can then assign it a variable
|
||
name, for example, `target = df.pop('class')`.
|
||
|
||
4. Instantiate `RandomForestClassifier`.
|
||
|
||
5. Train a Random Forest model to predict the outcome with
|
||
.`fit()`.
|
||
|
||
6. Predict the outcomes from the input data using
|
||
`.predict()`.
|
||
|
||
7. Calculate the accuracy score using `accuracy_score`.
|
||
|
||
The output will be similar to the following:
|
||
|
||
|
||

|
||
|
||
|
||
|
||
Summary
|
||
=======
|
||
|
||
|
||
This lab provided you with an overview of what data science is in
|
||
general. We also learned the different types of machine learning
|
||
algorithms, including supervised and unsupervised, as well as regression
|
||
and classification. We had a quick introduction to Python and how to
|
||
manipulate the main data structures (lists and dictionaries) that will
|
||
be used in this course.
|
||
|
||
Then we walked you through what a DataFrame is and how to create one by
|
||
loading data from different file formats using the famous pandas
|
||
package. Finally, we learned how to use the sklearn package to train a
|
||
machine learning model and make predictions with it.
|