Files
fenago 39139f2b0e added
2021-02-09 03:17:43 +05:00

1062 lines
23 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<img align="right" src="./logo.png">
Lab 1. Introduction to Data Science in Python
=========================================
Overview
This very first lab will introduce you to the field of data science
and walk you through an overview of Python\'s core concepts and their
application in the world of data science.
### Numeric Variables
```
var1 = 8
var2 = 160.88
var1 + var2
```
You should get the following output:
![](./images/B15019_01_03.jpg)
Caption: Output of the addition of two variables
### Text Variables
```
var3 = 'Hello, '
var4 = 'World'
```
In order to display the content of a variable, you can call the
`print()` function:
```
print(var3)
print(var4)
```
You should get the following output:
![](./images/B15019_01_04.jpg)
For instance, if we want to print `Text:` before the values of
`var3` and `var4`, we will write the following code:
```
print(f"Text: {var3} {var4}!")
```
You should get the following output:
![](./images/B15019_01_05.jpg)
Caption: Printing with f-strings
You can concatenate the two variables together with the `+`
operator:
```
var3 + var4
```
You should get the following output:
![](./images/B15019_01_06.jpg)
Caption: Concatenation of the two text variables
### Python List
```
var5 = ['I', 'love', 'data', 'science']
print(var5)
```
You should get the following output:
![](./images/B15019_01_07.jpg)
Caption: List containing only string items
A list can have different item types, so you can mix numerical and text
variables in it:
```
var6 = ['Fenago', 15019, 2020, 'Data Science']
print(var6)
```
An item in a list can be accessed by its index (its position in the
list). To access the first (index 0) and third elements (index 2) of a
list, you do the following:
```
print(var6[0])
print(var6[2])
```
If you want to get the first three items (index 0 to 2), you should do as follows:
```
print(var6[0:3])
```
You can also iterate through every item of a list using a
`for` loop. If you want to print every item of the
`var6` list, you should do this:
```
for item in var6:
print(item)
```
You should get the following output:
You can add an item at the end of the list using the
`.append()` method:
```
var6.append('Python')
print(var6)
```
To delete an item from the list, you use the `.remove()`
method:
```
var6.remove(15019)
print(var6)
```
### Python Dictionary
To define a dictionary in Python, you
will use curly brackets, `{}`, and specify the keys and values
separated by `:`, as shown here:
```
var7 = {'Topic': 'Data Science', 'Language': 'Python'}
print(var7)
```
You should get the following output:
![](./images/B15019_01_14.jpg)
Caption: Output of var7
To access a specific value, you need to provide the corresponding key
name. For instance, if you want to get the value `Python`, you
do this:
```
var7['Language']
```
You should get the following output:
![](./images/B15019_01_15.jpg)
Python provides a method to access all the key names from a dictionary,
`.keys()`
```
var7.keys()
```
You should get the following output:
![](./images/B15019_01_16.jpg)
Caption: List of key names
There is also a method called `.values()`, which is used to
access all the values of a dictionary:
```
var7.values()
```
You should get the following output:
![](./images/B15019_01_17.jpg)
Caption: List of values
You can iterate through all items from a dictionary using a
`for` loop and the `.items()` method, as shown in
the following code snippet:
```
for key, value in var7.items():
print(key)
print(value)
```
You should get the following output:
![](./images/B15019_01_18.jpg)
You can add a new element in a dictionary by providing the key name like
this:
```
var7['Publisher'] = 'Fenago'
print(var7)
```
You can delete an item from a dictionary with the `del`
command:
```
del var7['Publisher']
print(var7)
```
You should get the following output:
![](./images/B15019_01_20.jpg)
Caption: Output of a dictionary after removing an item
Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
----------------------------------------------------------------------------------
In this exercise, we will create a dictionary using Python that will
contain a collection of different machine learning algorithms that will
be covered in this course.
The following steps will help you complete the exercise:
1. Open on a new Jupyter notebook.
2. Create a list called `algorithm` that will contain the
following elements: `Linear Regression`,
`Logistic Regression`, `RandomForest`, and
`a3c`:
```
algorithm = ['Linear Regression', 'Logistic Regression', \
'RandomForest', 'a3c']
```
Note
The code snippet shown above uses a backslash ( `\` ) to
split the logic across multiple lines. When the code is executed,
Python will ignore the backslash, and treat the code on the next
line as a direct continuation of the current line.
3. Now, create a list called `learning` that will contain the
following elements: `Supervised`, `Supervised`,
`Supervised`, and `Reinforcement`:
```
learning = ['Supervised', 'Supervised', 'Supervised', \
'Reinforcement']
```
4. Create a list called `algorithm_type` that will contain
the following elements: `Regression`,
`Classification`,
`Regression or Classification`, and `Game AI`:
```
algorithm_type = ['Regression', 'Classification', \
'Regression or Classification', 'Game AI']
```
5. Add an item called `k-means` into the
`algorithm` list using the `.append()` method:
```
algorithm.append('k-means')
```
6. Display the content of `algorithm` using the
`print()` function:
```
print(algorithm)
```
You should get the following output:
![](./images/B15019_01_21.jpg)
Caption: Output of 'algorithm'
From the preceding output, we can see that we added the
`k-means` item to the list.
7. Now, add the `Unsupervised` item into the
`learning` list using the `.append()` method:
```
learning.append('Unsupervised')
```
8. Display the content of `learning` using the
`print()` function:
```
print(learning)
```
You should get the following output:
![](./images/B15019_01_22.jpg)
Caption: Output of 'learning'
From the preceding output, we can see that we added the
`Unsupervised` item into the list.
9. Add the `Clustering` item into the
`algorithm_type` list using the `.append()`
method:
```
algorithm_type.append('Clustering')
```
10. Display the content of `algorithm_type` using the
`print()` function:
```
print(algorithm_type)
```
You should get the following output:
![](./images/B15019_01_23.jpg)
Caption: Output of \'algorithm\_type\'
From the preceding output, we can see that we added the
`Clustering` item into the list.
11. Create an empty dictionary called `machine_learning` using
curly brackets, `{}`:
```
machine_learning = {}
```
12. Create a new item in `machine_learning` with the key as
`algorithm` and the value as all the items from the
`algorithm` list:
```
machine_learning['algorithm'] = algorithm
```
13. Display the content of `machine_learning` using the
`print()` function.
```
print(machine_learning)
```
You should get the following output:
![](./images/B15019_01_24.jpg)
Caption: Output of machine_learning
From the preceding output, we notice that we have created a
dictionary from the `algorithm` list.
14. Create a new item in `machine_learning` with the key as
`learning` and the value as all the items from the
`learning` list:
```
machine_learning['learning'] = learning
```
15. Now, create a new item in `machine_learning` with the key
as `algorithm_type` and the value as all the items from
the algorithm\_type list:
```
machine_learning['algorithm_type'] = algorithm_type
```
16. Display the content of `machine_learning` using the
`print()` function.
```
print(machine_learning)
```
You should get the following output:
![](./images/B15019_01_25.jpg)
Caption: Output of machine_learning
17. Remove the `a3c` item from the `algorithm` key
using the `.remove()` method:
```
machine_learning['algorithm'].remove('a3c')
```
18. Display the content of the `algorithm` item from the
`machine_learning` dictionary using the
`print()` function:
```
print(machine_learning['algorithm'])
```
You should get the following output:
![](./images/B15019_01_26.jpg)
Caption: Output of 'algorithm' from machine_learning
19. Remove the `Reinforcement` item from the
`learning` key using the `.remove()` method:
```
machine_learning['learning'].remove('Reinforcement')
```
20. Remove the `Game AI` item from the
`algorithm_type` key using the `.remove()`
method:
```
machine_learning['algorithm_type'].remove('Game AI')
```
21. Display the content of `machine_learning` using the
`print()` function:
```
print(machine_learning)
```
You should get the following output:
![](./images/B15019_01_27.jpg)
Caption: Output of machine_learning
Python for Data Science
=======================
In this section, we will present to you two of the most popular ones:
`pandas` and `scikit-learn`.
The pandas Package
------------------
The pandas package provides an incredible amount of APIs for
manipulating data structures. The two main data structures defined in
the `pandas` package are `DataFrame` and
`Series`.
### CSV Files
The previous example of a DataFrame would look like this in a CSV file:
```
algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
```
In Python, you need to first import the packages you require before
being able to use them. To do so, you will have to use the
`import` command. You can create an alias of each imported
package using the `as` keyword. It is quite common to import
the `pandas` package with the alias `pd`:
```
import pandas as pd
```
`pandas` provides a `.read_csv()` method to easily
load a CSV file directly into a DataFrame. You just need to provide the
path or the URL to the CSV file, as shown below.
```
pd.read_csv('https://raw.githubusercontent.com/fenago'\
'/data-science/master/Lab01/'\
'Dataset/csv_example.csv')
```
You should get the following output:
![](./images/B15019_01_29.jpg)
### Excel Spreadsheets
There is a specific method in `pandas` to load Excel spreadsheets called
`.read_excel()`:
```
pd.read_excel('https://github.com/fenago'\
'/data-science/blob/master'\
'/Lab01/Dataset/excel_example.xlsx?raw=true')
```
You should get the following output:
![](./images/B15019_01_31.jpg)
Caption: Dataframe after loading an Excel spreadsheet
### JSON
The example DataFrame we used before
would look like this in JSON format:
```
{
"algorithm":{
"0":"Linear Regression",
"1":"Logistic Regression",
"2":"RandomForest",
"3":"k-means"
},
"learning":{
"0":"Supervised",
"1":"Supervised",
"2":"Supervised",
"3":"Unsupervised"
},
"type":{
"0":"Regression",
"1":"Classification",
"2":"Regression or Classification",
"3":"Clustering"
}
}
```
As you may have guessed, there is a `pandas` method for
reading JSON data as well, and it is called `.read_json()`:
```
pd.read_json('https://raw.githubusercontent.com/fenago'\
'/data-science/master/Lab01'\
'/Dataset/json_example.json')
```
You should get the following output:
![](./images/B15019_01_32.jpg)
Caption: Dataframe after loading JSON data
Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
------------------------------------------------------------------------
In this exercise, we will practice loading different data formats, such
as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use
is the Top 10 Postcodes for the First Home Owner Grants dataset (this is
a grant provided by the Australian government to help first-time real
estate buyers). It lists the 10 postcodes (also known as zip codes) with
the highest number of First Home Owner grants.
In this dataset, you will find the number of First Home Owner grant
applications for each postcode and the corresponding suburb.
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook.
2. Import the pandas package, as shown in the following code snippet:
```
import pandas as pd
```
3. Create a new variable called `csv_url` containing the URL
to the raw CSV file:
```
csv_url = 'https://raw.githubusercontent.com/fenago'\
'/data-science/master/Lab01'\
'/Dataset/overall_topten_2012-2013.csv'
```
4. Load the CSV file into a DataFrame using the pandas
`.read_csv()` method. The first row of this CSV file
contains the name of the file, which you can see if you open the
file directly. You will need to exclude this row by using the
`skiprows=1` parameter. Save the result in a variable
called `csv_df` and print it:
```
csv_df = pd.read_csv(csv_url, skiprows=1)
csv_df
```
You should get the following output:
![](./images/B15019_01_33.jpg)
Caption: The DataFrame after loading the CSV file
5. Create a new variable called `tsv_url` containing the URL
to the raw TSV file:
```
tsv_url = 'https://raw.githubusercontent.com/fenago'\
'/data-science/master/Lab01'\
'/Dataset/overall_topten_2012-2013.tsv'
```
Note
A TSV file is similar to a CSV file but instead of using the comma
character (`,`) as a separator, it uses the tab character
(`\t`).
6. Load the TSV file into a DataFrame using the pandas
.`read_csv()` method and specify the
`skiprows=1` and `sep='\t'` parameters. Save the
result in a variable called `tsv_df` and print it:
```
tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t')
tsv_df
```
You should get the following output:
![](./images/B15019_01_34.jpg)
Caption: The DataFrame after loading the TSV file
7. Create a new variable called `xlsx_url` containing the URL
to the raw Excel spreadsheet:
```
xlsx_url = 'https://github.com/fenago'\
'/data-science/blob/master/'\
'Lab01/Dataset'\
'/overall_topten_2012-2013.xlsx?raw=true'
```
8. Load the Excel spreadsheet into a DataFrame using the pandas
`.read_excel()` method. Save the result in a variable
called `xlsx_df` and print it:
```
xlsx_df = pd.read_excel(xlsx_url)
xlsx_df
```
You should get the following output:
![](./images/B15019_01_35.jpg)
By default, `.read_excel()` loads the first sheet of an
Excel spreadsheet. In this example, the data we\'re looking for is
actually stored in the second sheet.
9. Load the Excel spreadsheet into a Dataframe using the pandas
`.read_excel()` method and specify the
`skiprows=1` and `sheet_name=1` parameters.
(Note that the `sheet_name` parameter is zero-indexed, so
`sheet_name=0` returns the first sheet, while
`sheet_name=1` returns the second sheet.) Save the result
in a variable called `xlsx_df1` and print it:
```
xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1)
xlsx_df1
```
You should get the following output:
![](./images/B15019_01_36.jpg)
### The sklearn API
`sklearn` groups algorithms by family. For instance,
`RandomForest` and `GradientBoosting` are part of
the `ensemble` module. In order to make use of an algorithm,
you will need to import it first like this:
```
from sklearn.ensemble import RandomForestClassifier
```
It is recommended to at least set the `random_state`
hyperparameter in order to get reproducible results every time that you
have to run the same code:
```
rf_model = RandomForestClassifier(random_state=1)
```
The second step is to train the model with some data. In this example,
we will use a simple dataset that classifies 178 instances of Italian
wines into 3 categories based on 13 features. This dataset is part of
the few examples that `sklearn` provides within its API. We
need to load the data first:
```
from sklearn.datasets import load_wine
features, target = load_wine(return_X_y=True)
```
Then using the `.fit()` method to train the model, you will
provide the features and the target variable as input:
```
rf_model.fit(features, target)
```
You should get the following output:
![](./images/B15019_01_44.jpg)
Caption: Logs of the trained Random Forest model
Once trained, we can use the `.predict()` method to predict
the target for one or more observations. Here we will use the same data
as for the training step:
```
preds = rf_model.predict(features)
preds
```
You should get the following output:
![](./images/B15019_01_45.jpg)
Caption: Predictions of the trained Random Forest model
Finally, we want to assess the model\'s performance by comparing its
predictions to the actual values of the target variable. There are a lot
of different metrics that can be used for assessing model performance,
and you will learn more about them later in this course. For now, though,
we will just use a metric called **accuracy**. This metric calculates
the ratio of correct predictions to the total number of observations:
```
from sklearn.metrics import accuracy_score
accuracy_score(target, preds)
```
You should get the following output
![](./images/B15019_01_46.jpg)
Caption: Accuracy of the trained Random Forest model
Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
--------------------------------------------------------------------
In this exercise, we will build a machine learning classifier using
`RandomForest` from `sklearn` to predict whether the
breast cancer of a patient is malignant (harmful) or benign (not
harmful).
The following steps will help you complete the exercise:
1. Open a new Jupyter notebook.
2. Import the `load_breast_cancer` function from
`sklearn.datasets`:
```
from sklearn.datasets import load_breast_cancer
```
3. Load the dataset from the `load_breast_cancer` function
with the `return_X_y=True` parameter to return the
features and response variable only:
```
features, target = load_breast_cancer(return_X_y=True)
```
4. Print the variable features:
```
print(features)
```
You should get the following output:
![](./images/B15019_01_47.jpg)
Caption: Output of the variable features
The preceding output shows the values of the features. (You can
learn more about the features from the link given previously.)
5. Print the `target` variable:
```
print(target)
```
You should get the following output:
![](./images/B15019_01_48.jpg)
Caption: Output of the variable target
The preceding output shows the values of the target variable. There
are two classes shown for each instance in the dataset. These
classes are `0` and `1`, representing whether
the cancer is malignant or benign.
6. Import the `RandomForestClassifier` class from
`sklearn.ensemble`:
```
from sklearn.ensemble import RandomForestClassifier
```
7. Create a new variable called `seed`, which will take the
value `888` (chosen arbitrarily):
```
seed = 888
```
8. Instantiate `RandomForestClassifier` with the
`random_state=seed` parameter and save it into a variable
called `rf_model`:
```
rf_model = RandomForestClassifier(random_state=seed)
```
9. Train the model with the `.fit()` method with
`features` and `target` as parameters:
```
rf_model.fit(features, target)
```
You should get the following output:
![](./images/B15019_01_49.jpg)
Caption: Logs of RandomForestClassifier
10. Make predictions with the trained model using the
`.predict()` method and `features` as a
parameter and save the results into a variable called
`preds`:
```
preds = rf_model.predict(features)
```
11. Print the `preds` variable:
```
print(preds)
```
You should get the following output:
![](./images/B15019_01_50.jpg)
Caption: Predictions of the Random Forest model
The preceding output shows the predictions for the training set. You
can compare this with the actual target variable values shown in
*Figure 1.48*.
12. Import the `accuracy_score` method from
`sklearn.metrics`:
```
from sklearn.metrics import accuracy_score
```
13. Calculate `accuracy_score()` with `target` and
`preds` as parameters:
```
accuracy_score(target, preds)
```
You should get the following output:
![](./images/B15019_01_51.jpg)
Activity 1.01: Train a Spam Detector Algorithm
----------------------------------------------
You are working for an email service provider and have been tasked with
training an algorithm that recognizes whether an email is spam or not
from a given dataset and checking its performance.
In this dataset, the authors have already created 57 different features
based on some statistics for relevant keywords in order to classify
whether an email is spam or not.
The following steps will help you to complete this activity:
1. Import the required libraries.
2. Load the dataset using `.pd.read_csv()`.
3. Extract the response variable using .`pop()` from
`pandas`. This method will extract the column provided as
a parameter from the DataFrame. You can then assign it a variable
name, for example, `target = df.pop('class')`.
4. Instantiate `RandomForestClassifier`.
5. Train a Random Forest model to predict the outcome with
.`fit()`.
6. Predict the outcomes from the input data using
`.predict()`.
7. Calculate the accuracy score using `accuracy_score`.
The output will be similar to the following:
![](./images/B15019_01_52.jpg)
Summary
=======
This lab provided you with an overview of what data science is in
general. We also learned the different types of machine learning
algorithms, including supervised and unsupervised, as well as regression
and classification. We had a quick introduction to Python and how to
manipulate the main data structures (lists and dictionaries) that will
be used in this course.
Then we walked you through what a DataFrame is and how to create one by
loading data from different file formats using the famous pandas
package. Finally, we learned how to use the sklearn package to train a
machine learning model and make predictions with it.