<img align="right" src="./logo.png">


Lab 1. Introduction to Data Science in Python
=========================================


Overview

This first lab introduces you to the field of data science and gives
you an overview of Python's core concepts and their application to
data science.


### Numeric Variables


```
var1 = 8
var2 = 160.88
var1 + var2
```

You should get the following output:

![](./images/B15019_01_03.jpg)

Caption: Output of the addition of two variables
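As a small sketch of what happens in this addition: `var1` is an integer and `var2` is a float, and adding the two gives a float. The `total` name below is introduced only for illustration.

```python
var1 = 8        # an integer (int)
var2 = 160.88   # a decimal number (float)

total = var1 + var2   # Python converts the int to a float before adding

print(type(var1))       # <class 'int'>
print(type(var2))       # <class 'float'>
print(type(total))      # <class 'float'>
print(round(total, 2))  # rounded to two decimals for display
```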


### Text Variables


```
var3 = 'Hello, '
var4 = 'World'
```

In order to display the content of a variable, you can call the
`print()` function:

```
print(var3)
print(var4)
```

You should get the following output:

![](./images/B15019_01_04.jpg)


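You can also embed variables in a string with an f-string: prefix the
string with `f` and wrap each variable name in curly brackets. For
instance, to print `Text:` before the values of `var3` and `var4`,
write the following code: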

```
print(f"Text: {var3} {var4}!")
```

You should get the following output:

![](./images/B15019_01_05.jpg)

Caption: Printing with f-strings
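The sketch below shows two further details of f-strings. Since `var3` already ends with a space, the fields can be placed next to each other, and a format specifier after a colon controls how a number is displayed. The names `greeting` and `rounded` are illustrative only.

```python
var2 = 160.88
var3 = 'Hello, '
var4 = 'World'

# var3 already ends with a space, so no extra space is needed between the fields
greeting = f"Text: {var3}{var4}!"
print(greeting)  # Text: Hello, World!

# A format specifier after the colon controls how a value is rendered
rounded = f"Value: {var2:.1f}"
print(rounded)  # Value: 160.9
```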


You can concatenate the two variables together with the `+`
operator:

```
var3 + var4
```

You should get the following output:

![](./images/B15019_01_06.jpg)

Caption: Concatenation of the two text variables


### Python List


```
var5 = ['I', 'love', 'data', 'science']
print(var5)
```

You should get the following output:

![](./images/B15019_01_07.jpg)

Caption: List containing only string items

A list can have different item types, so you can mix numerical and text
variables in it:

```
var6 = ['Fenago', 15019, 2020, 'Data Science']
print(var6)
```


An item in a list can be accessed by its index (its position in the
list). To access the first (index 0) and third elements (index 2) of a
list, you do the following:

```
print(var6[0])
print(var6[2])
```


If you want to get the first three items (index 0 to 2), you should do as follows:

```
print(var6[0:3])
```
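Note that the stop index of a slice is exclusive, which is why `var6[0:3]` returns the items at indexes 0, 1, and 2. The sketch below shows a few related forms of indexing and slicing; the variable names on the left are illustrative only.

```python
var6 = ['Fenago', 15019, 2020, 'Data Science']

last_item = var6[-1]     # negative indexes count from the end of the list
first_two = var6[:2]     # omitting the start means "from index 0"
from_second = var6[1:]   # omitting the stop means "to the end of the list"

print(last_item)
print(first_two)
print(from_second)
print(len(var6))         # number of items in the list
```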

You can also iterate through every item of a list using a
`for` loop. If you want to print every item of the
`var6` list, you should do this:

```
for item in var6:
    print(item)
```

Each item is printed on its own line.


You can add an item at the end of the list using the
`.append()` method:

```
var6.append('Python')
print(var6)
```


To delete an item from the list, you use the `.remove()`
method:

```
var6.remove(15019)
print(var6)
```
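The `.remove()` method deletes the first item equal to the value you pass in, and raises a `ValueError` if no such item exists. A minimal sketch of both cases, using `'R'` as an arbitrary value that is not in the list:

```python
var6 = ['Fenago', 15019, 2020, 'Data Science', 'Python']

var6.remove(15019)   # removes the first item equal to 15019
print(var6)

# Removing a value that is not in the list raises ValueError
try:
    var6.remove('R')
except ValueError:
    print("'R' is not in the list")

# Checking membership first avoids the exception
if 2020 in var6:
    var6.remove(2020)
print(var6)
```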


### Python Dictionary

To define a dictionary in Python, you
will use curly brackets, `{}`, and specify the keys and values
separated by `:`, as shown here:

```
var7 = {'Topic': 'Data Science', 'Language': 'Python'}
print(var7)
```


You should get the following output:

![](./images/B15019_01_14.jpg)

Caption: Output of var7

To access a specific value, you need to provide the corresponding key
name. For instance, if you want to get the value `Python`, you
do this:

```
var7['Language']
```

You should get the following output:

![](./images/B15019_01_15.jpg)
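Accessing a key that does not exist with square brackets raises a `KeyError`. The sketch below shows the `.get()` method as a more forgiving alternative; the `'Publisher'` key and the `'n/a'` default are chosen only for illustration.

```python
var7 = {'Topic': 'Data Science', 'Language': 'Python'}

language = var7['Language']              # raises KeyError if the key is missing
publisher = var7.get('Publisher')        # returns None instead of raising
fallback = var7.get('Publisher', 'n/a')  # returns the supplied default

print(language)
print(publisher)
print(fallback)
print('Topic' in var7)   # membership tests look at the keys
```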


Python provides a method, `.keys()`, to access all the key names of a
dictionary:

```
var7.keys()
```

You should get the following output:

![](./images/B15019_01_16.jpg)

Caption: List of key names

There is also a method called `.values()`, which is used to
access all the values of a dictionary:

```
var7.values()
```

You should get the following output:

![](./images/B15019_01_17.jpg)

Caption: List of values

You can iterate through all items from a dictionary using a
`for` loop and the `.items()` method, as shown in
the following code snippet:

```
for key, value in var7.items():
    print(key)
    print(value)
```

You should get the following output:

![](./images/B15019_01_18.jpg)


You can add a new element in a dictionary by providing the key name like
this:

```
var7['Publisher'] = 'Fenago'
print(var7)
```


You can delete an item from a dictionary with the `del`
command:

```
del var7['Publisher']
print(var7)
```

You should get the following output:

![](./images/B15019_01_20.jpg)

Caption: Output of a dictionary after removing an item
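Like square-bracket access, `del` raises a `KeyError` when the key is missing. As an alternative sketch, the dictionary `.pop()` method removes a key and returns its value, and accepts a default for the missing-key case. The names `removed` and `missing` are illustrative.

```python
var7 = {'Topic': 'Data Science', 'Language': 'Python', 'Publisher': 'Fenago'}

removed = var7.pop('Publisher')        # removes the key and returns its value
print(removed)
print(var7)

missing = var7.pop('Publisher', None)  # key is already gone; the default is returned
print(missing)
```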


Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
----------------------------------------------------------------------------------

In this exercise, we will create a dictionary using Python that will
contain a collection of different machine learning algorithms that will
be covered in this course.

The following steps will help you complete the exercise:


1.  Open a new Jupyter notebook.

2.  Create a list called `algorithm` that will contain the
    following elements: `Linear Regression`,
    `Logistic Regression`, `RandomForest`, and
    `a3c`:

    ```
    algorithm = ['Linear Regression', 'Logistic Regression', \
                 'RandomForest', 'a3c']
    ```


    Note

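    The code snippet shown above uses a backslash ( `\` ) to
    split the statement across multiple lines. When the code is executed,
    Python ignores the backslash and treats the code on the next
    line as a direct continuation of the current line. Inside brackets
    such as `[]`, the backslash is optional because Python already
    continues the statement until the closing bracket.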

3.  Now, create a list called `learning` that will contain the
    following elements: `Supervised`, `Supervised`,
    `Supervised`, and `Reinforcement`:
    ```
    learning = ['Supervised', 'Supervised', 'Supervised', \
                'Reinforcement']
    ```


4.  Create a list called `algorithm_type` that will contain
    the following elements: `Regression`,
    `Classification`,
    `Regression or Classification`, and `Game AI`:
    ```
    algorithm_type = ['Regression', 'Classification', \
                      'Regression or Classification', 'Game AI']
    ```


5.  Add an item called `k-means` into the
    `algorithm` list using the `.append()` method:
    ```
    algorithm.append('k-means')
    ```


6.  Display the content of `algorithm` using the
    `print()` function:

    ```
    print(algorithm)
    ```


    You should get the following output:

    
![](./images/B15019_01_21.jpg)


    Caption: Output of 'algorithm'

    From the preceding output, we can see that we added the
    `k-means` item to the list.

7.  Now, add the `Unsupervised` item into the
    `learning` list using the `.append()` method:
    ```
    learning.append('Unsupervised')
    ```


8.  Display the content of `learning` using the
    `print()` function:

    ```
    print(learning)
    ```


    You should get the following output:

    
![](./images/B15019_01_22.jpg)


    Caption: Output of 'learning'

    From the preceding output, we can see that we added the
    `Unsupervised` item into the list.

9.  Add the `Clustering` item into the
    `algorithm_type` list using the `.append()`
    method:
    ```
    algorithm_type.append('Clustering')
    ```


10. Display the content of `algorithm_type` using the
    `print()` function:

    ```
    print(algorithm_type)
    ```


    You should get the following output:

    
![](./images/B15019_01_23.jpg)


    Caption: Output of 'algorithm_type'

    From the preceding output, we can see that we added the
    `Clustering` item into the list.

11. Create an empty dictionary called `machine_learning` using
    curly brackets, `{}`:
    ```
    machine_learning = {}
    ```


12. Create a new item in `machine_learning` with the key as
    `algorithm` and the value as all the items from the
    `algorithm` list:
    ```
    machine_learning['algorithm'] = algorithm
    ```


13. Display the content of `machine_learning` using the
    `print()` function.

    ```
    print(machine_learning)
    ```


    You should get the following output:

    
![](./images/B15019_01_24.jpg)


    Caption: Output of machine_learning

    From the preceding output, we notice that we have created a
    dictionary from the `algorithm` list.

14. Create a new item in `machine_learning` with the key as
    `learning` and the value as all the items from the
    `learning` list:
    ```
    machine_learning['learning'] = learning
    ```


15. Now, create a new item in `machine_learning` with the key
    as `algorithm_type` and the value as all the items from
    the `algorithm_type` list:
    ```
    machine_learning['algorithm_type'] = algorithm_type
    ```


16. Display the content of `machine_learning` using the
    `print()` function.

    ```
    print(machine_learning)
    ```


    You should get the following output:

    
![](./images/B15019_01_25.jpg)


    Caption: Output of machine_learning

17. Remove the `a3c` item from the `algorithm` key
    using the `.remove()` method:
    ```
    machine_learning['algorithm'].remove('a3c')
    ```


18. Display the content of the `algorithm` item from the
    `machine_learning` dictionary using the
    `print()` function:

    ```
    print(machine_learning['algorithm'])
    ```


    You should get the following output:

    
![](./images/B15019_01_26.jpg)


    Caption: Output of 'algorithm' from machine_learning

19. Remove the `Reinforcement` item from the
    `learning` key using the `.remove()` method:
    ```
    machine_learning['learning'].remove('Reinforcement')
    ```


20. Remove the `Game AI` item from the
    `algorithm_type` key using the `.remove()`
    method:
    ```
    machine_learning['algorithm_type'].remove('Game AI')
    ```


21. Display the content of `machine_learning` using the
    `print()` function:

    ```
    print(machine_learning)
    ```


    You should get the following output:

    
![](./images/B15019_01_27.jpg)


Caption: Output of machine_learning
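One detail of this exercise is worth a closer look: assigning a list as a dictionary value stores a reference to that list, not a copy. Removing `a3c` through `machine_learning['algorithm']` therefore also changes the original `algorithm` list. The sketch below illustrates this on a shortened version of the exercise; the `independent` dictionary is a hypothetical name used to show how a copy behaves.

```python
algorithm = ['Linear Regression', 'Logistic Regression', 'RandomForest', 'a3c']

machine_learning = {}
machine_learning['algorithm'] = algorithm   # stores a reference, not a copy

machine_learning['algorithm'].remove('a3c')
print(algorithm)                                   # 'a3c' is gone here as well
print(machine_learning['algorithm'] is algorithm)  # both names point to one list

# Storing a copy keeps the dictionary independent of the original list
independent = {'algorithm': list(algorithm)}
algorithm.append('k-means')
print(independent['algorithm'])                    # unchanged by the append
```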


Python for Data Science
=======================


In this section, we will present two of the most popular Python
packages for data science: `pandas` and `scikit-learn`.


The pandas Package
------------------

The pandas package provides a large number of APIs for
manipulating data structures. The two main data structures defined in
the `pandas` package are `DataFrame` and
`Series`.
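As a minimal sketch of how the two structures relate, the code below builds a DataFrame from a dictionary of lists like the one from Exercise 1.01 and then selects one column, which comes back as a Series. It assumes `pandas` is installed and imported under its usual alias, `pd`.

```python
import pandas as pd

# Build a DataFrame from a dictionary of equal-length lists
machine_learning = {
    'algorithm': ['Linear Regression', 'Logistic Regression',
                  'RandomForest', 'k-means'],
    'learning': ['Supervised', 'Supervised', 'Supervised', 'Unsupervised'],
    'algorithm_type': ['Regression', 'Classification',
                       'Regression or Classification', 'Clustering'],
}
df = pd.DataFrame(machine_learning)
print(df.shape)                  # (number of rows, number of columns)

# Selecting a single column returns a Series
algorithm_column = df['algorithm']
print(type(algorithm_column))
print(algorithm_column[0])       # first value of the column
```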


### CSV Files

The algorithms from Exercise 1.01, arranged as a DataFrame, would look like this in a CSV file:

```
algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
```

In Python, you need to first import the packages you require before
being able to use them. To do so, you will have to use the
`import` command. You can create an alias of each imported
package using the `as` keyword. It is quite common to import
the `pandas` package with the alias `pd`:

```
import pandas as pd
```


`pandas` provides a `.read_csv()` method to easily
load a CSV file directly into a DataFrame. You just need to provide the
path or the URL to the CSV file, as shown below.


```
pd.read_csv('https://raw.githubusercontent.com/fenago'\
            '/data-science/master/Lab01/'\
            'Dataset/csv_example.csv')
```

You should get the following output:

![](./images/B15019_01_29.jpg)
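If you want to try `.read_csv()` without a network connection, one option is to pass it a file-like object instead of a path or URL. The sketch below wraps the CSV text shown earlier in `io.StringIO`; the `csv_text` name is illustrative.

```python
import io
import pandas as pd

csv_text = """algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
"""

# read_csv() accepts a path, a URL, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())   # column names come from the first line
print(df.head(2))            # first two rows
```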


### Excel Spreadsheets

There is a specific method in `pandas` to load Excel spreadsheets called
`.read_excel()`:

```
pd.read_excel('https://github.com/fenago'\
              '/data-science/blob/master'\
              '/Lab01/Dataset/excel_example.xlsx?raw=true')
```

You should get the following output:

![](./images/B15019_01_31.jpg)

Caption: DataFrame after loading an Excel spreadsheet


### JSON

The example DataFrame we used before
would look like this in JSON format:

```
{
  "algorithm":{
     "0":"Linear Regression",
     "1":"Logistic Regression",
     "2":"RandomForest",
     "3":"k-means"
  },
  "learning":{
     "0":"Supervised",
     "1":"Supervised",
     "2":"Supervised",
     "3":"Unsupervised"
  },
  "type":{
     "0":"Regression",
     "1":"Classification",
     "2":"Regression or Classification",
     "3":"Clustering"
  }
}
```

As you may have guessed, there is a `pandas` method for
reading JSON data as well, and it is called `.read_json()`:

```
pd.read_json('https://raw.githubusercontent.com/fenago'\
             '/data-science/master/Lab01'\
             '/Dataset/json_example.json')
```

You should get the following output:

![](./images/B15019_01_32.jpg)

Caption: DataFrame after loading JSON data
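To see how the JSON layout maps onto a DataFrame, the sketch below loads the same structure from an in-memory string. The outer keys become column names and the inner keys (`"0"` to `"3"`) become the row index. Wrapping the text in `io.StringIO` is a choice made here so that the example runs without a file or URL.

```python
import io
import pandas as pd

json_text = """
{
  "algorithm": {"0": "Linear Regression", "1": "Logistic Regression",
                "2": "RandomForest", "3": "k-means"},
  "learning": {"0": "Supervised", "1": "Supervised",
               "2": "Supervised", "3": "Unsupervised"},
  "type": {"0": "Regression", "1": "Classification",
           "2": "Regression or Classification", "3": "Clustering"}
}
"""

# The outer keys become column names and the inner keys become the row index
df = pd.read_json(io.StringIO(json_text))
print(df)
```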


Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
------------------------------------------------------------------------

In this exercise, we will practice loading different data formats, such
as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use
is the Top 10 Postcodes for the First Home Owner Grants dataset (this is
a grant provided by the Australian government to help first-time real
estate buyers). It lists the 10 postcodes (also known as zip codes) with
the highest number of First Home Owner grants.

In this dataset, you will find the number of First Home Owner grant
applications for each postcode and the corresponding suburb.


The following steps will help you complete the exercise:

1.  Open a new Jupyter notebook.

2.  Import the pandas package, as shown in the following code snippet:
    ```
    import pandas as pd
    ```


3.  Create a new variable called `csv_url` containing the URL
    to the raw CSV file:
    ```
    csv_url = 'https://raw.githubusercontent.com/fenago'\
              '/data-science/master/Lab01'\
              '/Dataset/overall_topten_2012-2013.csv'
    ```


4.  Load the CSV file into a DataFrame using the pandas
    `.read_csv()` method. The first row of this CSV file
    contains the name of the file, which you can see if you open the
    file directly. You will need to exclude this row by using the
    `skiprows=1` parameter. Save the result in a variable
    called `csv_df` and print it:

    ```
    csv_df = pd.read_csv(csv_url, skiprows=1)
    csv_df
    ```


    You should get the following output:

    
![](./images/B15019_01_33.jpg)


    Caption: The DataFrame after loading the CSV file

5.  Create a new variable called `tsv_url` containing the URL
    to the raw TSV file:

    ```
    tsv_url = 'https://raw.githubusercontent.com/fenago'\
              '/data-science/master/Lab01'\
              '/Dataset/overall_topten_2012-2013.tsv'
    ```


    Note

    A TSV file is similar to a CSV file but instead of using the comma
    character (`,`) as a separator, it uses the tab character
    (`\t`).

6.  Load the TSV file into a DataFrame using the pandas
    `.read_csv()` method and specify the
    `skiprows=1` and `sep='\t'` parameters. Save the
    result in a variable called `tsv_df` and print it:

    ```
    tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t')
    tsv_df
    ```


    You should get the following output:

    
![](./images/B15019_01_34.jpg)


    Caption: The DataFrame after loading the TSV file

7.  Create a new variable called `xlsx_url` containing the URL
    to the raw Excel spreadsheet:
    ```
    xlsx_url = 'https://github.com/fenago'\
               '/data-science/blob/master/'\
               'Lab01/Dataset'\
               '/overall_topten_2012-2013.xlsx?raw=true'
    ```


8.  Load the Excel spreadsheet into a DataFrame using the pandas
    `.read_excel()` method. Save the result in a variable
    called `xlsx_df` and print it:

    ```
    xlsx_df = pd.read_excel(xlsx_url)
    xlsx_df
    ```


    You should get the following output:

    
![](./images/B15019_01_35.jpg)


    By default, `.read_excel()` loads the first sheet of an
    Excel spreadsheet. In this example, the data we're looking for is
    actually stored in the second sheet.

9.  Load the Excel spreadsheet into a DataFrame using the pandas
    `.read_excel()` method and specify the
    `skiprows=1` and `sheet_name=1` parameters.
    (Note that the `sheet_name` parameter is zero-indexed, so
    `sheet_name=0` returns the first sheet, while
    `sheet_name=1` returns the second sheet.) Save the result
    in a variable called `xlsx_df1` and print it:

    ```
    xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1)
    xlsx_df1
    ```


    You should get the following output:

    
![](./images/B15019_01_36.jpg)
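The effect of `skiprows` and `sep` can also be tried on a small in-memory example. The sketch below uses made-up, tab-separated content with a title line above the header row; the column names and values are hypothetical and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical tab-separated content with a title line above the header row
tsv_text = (
    "example_file_title\n"
    "postcode\tsuburb\tapplications\n"
    "1000\tSuburb A\t10\n"
    "2000\tSuburb B\t20\n"
)

# Without skiprows, the title line would be read as the header
df = pd.read_csv(io.StringIO(tsv_text), skiprows=1, sep='\t')
print(df.columns.tolist())
print(df.dtypes)       # numeric columns are detected automatically
```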


### The sklearn API


`sklearn` groups algorithms by family. For instance,
`RandomForest` and `GradientBoosting` are part of
the `ensemble` module. In order to make use of an algorithm,
you will need to import it first like this:

```
from sklearn.ensemble import RandomForestClassifier
```


The first step is to instantiate the model with the hyperparameters you
want. It is recommended to at least set the `random_state`
hyperparameter so that you get reproducible results every time you
run the same code:

```
rf_model = RandomForestClassifier(random_state=1)
```

The second step is to train the model with some data. In this example,
we will use a simple dataset that classifies 178 instances of Italian
wines into 3 categories based on 13 features. This dataset is part of
the few examples that `sklearn` provides within its API. We
need to load the data first:

```
from sklearn.datasets import load_wine
features, target = load_wine(return_X_y=True)
```

Then, use the `.fit()` method to train the model, providing
the features and the target variable as input:

```
rf_model.fit(features, target)
```

You should get the following output:

![](./images/B15019_01_44.jpg)

Caption: Logs of the trained Random Forest model


Once trained, we can use the `.predict()` method to predict
the target for one or more observations. Here we will use the same data
as for the training step:

```
preds = rf_model.predict(features)
preds
```

You should get the following output:

![](./images/B15019_01_45.jpg)

Caption: Predictions of the trained Random Forest model


Finally, we want to assess the model's performance by comparing its
predictions to the actual values of the target variable. There are a lot
of different metrics that can be used for assessing model performance,
and you will learn more about them later in this course. For now, though,
we will just use a metric called **accuracy**. This metric calculates
the ratio of correct predictions to the total number of observations:

```
from sklearn.metrics import accuracy_score
accuracy_score(target, preds)
```

You should get the following output:

![](./images/B15019_01_46.jpg)

Caption: Accuracy of the trained Random Forest model
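To make the definition of accuracy concrete, the sketch below computes the ratio by hand on a few made-up labels and compares it with `accuracy_score`. The label values and the names `correct` and `manual_accuracy` are for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
target = np.array([0, 1, 2, 2, 1])
preds = np.array([0, 1, 1, 2, 1])

# Element-wise comparison gives True where the prediction is correct
correct = preds == target
manual_accuracy = correct.sum() / len(target)   # 4 correct out of 5

print(manual_accuracy)
print(accuracy_score(target, preds))            # same ratio
```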


Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
--------------------------------------------------------------------

In this exercise, we will build a machine learning classifier using
`RandomForest` from `sklearn` to predict whether the
breast cancer of a patient is malignant (harmful) or benign (not
harmful).


The following steps will help you complete the exercise:

1.  Open a new Jupyter notebook.

2.  Import the `load_breast_cancer` function from
    `sklearn.datasets`:
    ```
    from sklearn.datasets import load_breast_cancer
    ```


3.  Load the dataset from the `load_breast_cancer` function
    with the `return_X_y=True` parameter to return the
    features and response variable only:
    ```
    features, target = load_breast_cancer(return_X_y=True)
    ```


4.  Print the variable features:

    ```
    print(features)
    ```


    You should get the following output:

    
![](./images/B15019_01_47.jpg)


    Caption: Output of the variable features

    The preceding output shows the values of the features. (You can
    learn more about the features from the link given previously.)

5.  Print the `target` variable:

    ```
    print(target)
    ```


    You should get the following output:

    
![](./images/B15019_01_48.jpg)


    Caption: Output of the variable target

    The preceding output shows the values of the target variable. Each
    instance in the dataset belongs to one of two classes, `0` or
    `1`, representing whether the cancer is malignant (`0`) or
    benign (`1`).

6.  Import the `RandomForestClassifier` class from
    `sklearn.ensemble`:
    ```
    from sklearn.ensemble import RandomForestClassifier
    ```


7.  Create a new variable called `seed`, which will take the
    value `888` (chosen arbitrarily):
    ```
    seed = 888
    ```


8.  Instantiate `RandomForestClassifier` with the
    `random_state=seed` parameter and save it into a variable
    called `rf_model`:
    ```
    rf_model = RandomForestClassifier(random_state=seed)
    ```


9.  Train the model with the `.fit()` method with
    `features` and `target` as parameters:

    ```
    rf_model.fit(features, target)
    ```


    You should get the following output:

    
![](./images/B15019_01_49.jpg)


    Caption: Logs of RandomForestClassifier

10. Make predictions with the trained model using the
    `.predict()` method and `features` as a
    parameter and save the results into a variable called
    `preds`:
    ```
    preds = rf_model.predict(features)
    ```


11. Print the `preds` variable:

    ```
    print(preds)
    ```


    You should get the following output:

    
![](./images/B15019_01_50.jpg)


    Caption: Predictions of the Random Forest model

    The preceding output shows the predictions for the training set. You
    can compare this with the actual target variable values shown in
    *Figure 1.48*.

12. Import the `accuracy_score` function from
    `sklearn.metrics`:
    ```
    from sklearn.metrics import accuracy_score
    ```


13. Calculate `accuracy_score()` with `target` and
    `preds` as parameters:

    ```
    accuracy_score(target, preds)
    ```


    You should get the following output:

    
![](./images/B15019_01_51.jpg)
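Because the model in this exercise is scored on the same data it was trained on, the resulting accuracy tends to be optimistic. The sketch below goes slightly beyond the exercise and holds back part of the data for evaluation with `train_test_split` from `sklearn.model_selection`. The 30% test size and the `X_train`, `X_test`, `y_train`, and `y_test` names are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

features, target = load_breast_cancer(return_X_y=True)
seed = 888

# Keep 30% of the rows aside; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=seed)

rf_model = RandomForestClassifier(random_state=seed)
rf_model.fit(X_train, y_train)

# Compare accuracy on the data used for training and on the held-out data
train_accuracy = accuracy_score(y_train, rf_model.predict(X_train))
test_accuracy = accuracy_score(y_test, rf_model.predict(X_test))
print(train_accuracy)
print(test_accuracy)
```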


Activity 1.01: Train a Spam Detector Algorithm
----------------------------------------------

You are working for an email service provider and have been tasked with
training an algorithm that recognizes whether an email is spam or not
from a given dataset and checking its performance.

In this dataset, the authors have already created 57 different features
based on some statistics for relevant keywords in order to classify
whether an email is spam or not.


The following steps will help you to complete this activity:

1.  Import the required libraries.

2.  Load the dataset using `pd.read_csv()`.

3.  Extract the response variable using `.pop()` from
    `pandas`. This method will extract the column provided as
    a parameter from the DataFrame. You can then assign it a variable
    name, for example, `target = df.pop('class')`.

4.  Instantiate `RandomForestClassifier`.

5.  Train a Random Forest model to predict the outcome with
    `.fit()`.

6.  Predict the outcomes from the input data using
    `.predict()`.

7.  Calculate the accuracy score using `accuracy_score`.

    The output will be similar to the following:

    
![](./images/B15019_01_52.jpg)
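The behavior of `.pop()` in step 3 can be sketched on a tiny, made-up DataFrame. The feature columns and values below are hypothetical stand-ins, not the real spam dataset; only the `class` column name comes from the activity.

```python
import pandas as pd

# Made-up stand-in for the real dataset: two feature columns and a 'class' column
df = pd.DataFrame({
    'feature_a': [0.1, 0.0, 0.7, 0.3],
    'feature_b': [1, 0, 5, 2],
    'class': [0, 0, 1, 1],
})

# pop() removes the column from df in place and returns it as a Series
target = df.pop('class')

print(df.columns.tolist())   # only the feature columns remain
print(target.tolist())
```

After this call, `df` holds only the features and `target` holds the response variable, which is the shape of input that `.fit()` expects.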


Summary
=======


This lab provided you with an overview of what data science is in
general. We also learned the different types of machine learning
algorithms, including supervised and unsupervised, as well as regression
and classification. We had a quick introduction to Python and how to
manipulate the main data structures (lists and dictionaries) that will
be used in this course.

Then we walked you through what a DataFrame is and how to create one by
loading data from different file formats using the famous pandas
package. Finally, we learned how to use the sklearn package to train a
machine learning model and make predictions with it.