

<img align="right" src="./logo.png">


Lab 5. Performing Your First Cluster Analysis
=========================================


Overview

This lab will introduce you to unsupervised learning tasks, where
algorithms have to automatically learn patterns from data by themselves
as no target variables are defined beforehand. We will focus
specifically on the k-means algorithm, and see how to standardize and
process data for use in cluster analysis.


Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset
---------------------------------------------------------------------------

In this exercise, we will be using k-means clustering on the ATO dataset
and observing the different clusters that the dataset divides itself
into, after which we will conclude by analyzing the output:

1.  Open a new Jupyter notebook.

2.  Next, load the required Python packages: `pandas` and
    `KMeans` from `sklearn.cluster`.

    We will be using Python's `import` statement:

    Note

    You can create short aliases for the packages you will be calling
    quite often in your script by using the `as` keyword, as shown in
    the following code snippet.

    ```
    import pandas as pd
    from sklearn.cluster import KMeans
    ```


    Note

    We will be looking into `KMeans` (from
    `sklearn.cluster`), which you have used in the code here,
    later in the lab for a more detailed explanation of it.

3.  Next, create a variable containing the link to the file. We will
    call this variable `file_url`:

    ```
    file_url = 'https://raw.githubusercontent.com'\
               '/fenago/data-science'\
               '/master/Lab05/DataSet/taxstats2015.csv'
    ```


4.  Use the `usecols` parameter to subset only the columns we
    need rather than loading the entire dataset. We just need to provide
    a list of the column names we are interested in, which are mentioned
    in the following code snippet:

    ```
    df = pd.read_csv(file_url, \
                     usecols=['Postcode', \
                              'Average net tax', \
                              'Average total deductions'])
    ```


    Now we have loaded the data into a `pandas` DataFrame.

5.  Next, let's display the first five rows of this DataFrame using the
    `.head()` method:

    ```
    df.head()
    ```


    You should get the following output:


![](./images/B15019_05_04.jpg)


    Caption: The first five rows of the ATO DataFrame

6.  Now, to output the last 5 rows, we use `.tail()`:

    ```
    df.tail()
    ```


    You should get the following output:


![](./images/B15019_05_05.jpg)


7.  Instantiate k-means with a random state of `42` and save
    it into a variable called `kmeans`:
    ```
    kmeans = KMeans(random_state=42)
    ```


8.  Now feed k-means with our training data. To do so, we need to get
    only the variables (or columns) used for fitting the model. In our
    case, the variables are `'Average net tax'` and
    `'Average total deductions'`, and they are saved in a new
    variable called `X`:
    ```
    X = df[['Average net tax', 'Average total deductions']]
    ```


9.  Now fit `kmeans` with this training data:

    ```
    kmeans.fit(X)
    ```


    You should get the following output:


![](./images/B15019_05_06.jpg)


10. See which cluster each data point belongs to by using the
    `.predict()` method:

    ```
    y_preds = kmeans.predict(X)
    y_preds
    ```


    You should get the following output:


![](./images/B15019_05_07.jpg)


    Note

    To check which version of `sklearn` is installed, run
    `import sklearn` followed by `sklearn.__version__`.

11. Now, add these predictions into the original DataFrame and take a
    look at the first five postcodes:

    ```
    df['cluster'] = y_preds
    df.head()
    ```


    Note

    The predictions from the sklearn `predict()` method are in
    the exact same order as the input data. So, the first prediction
    will correspond to the first row of your DataFrame.

    You should get the following output:


![](./images/B15019_05_08.jpg)


Caption: Cluster number assigned to the first five postcodes
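
The fit-then-predict pattern used in this exercise can be tried on a handful of made-up points. The following is a minimal sketch, not part of the ATO analysis: the array `X_toy` and its values are invented for illustration, and `n_clusters` and `n_init` are set explicitly so that the result does not depend on scikit-learn's defaults. It also shows that `KMeans()` looks for eight clusters when `n_clusters` is not given, which is what happened in the exercise above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six invented 2-D points forming two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Ask for two clusters explicitly; n_init=10 restarts guard against a bad start
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_kmeans.fit(X_toy)

# One label per input row, returned in the same order as the rows of X_toy
toy_labels = toy_kmeans.predict(X_toy)
print(toy_labels)

# When n_clusters is omitted, scikit-learn falls back to its default of 8
default_k = KMeans(random_state=42).n_clusters
print(default_k)
```

The label numbers themselves carry no meaning; only the grouping of the rows does.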


Interpreting k-means Results
============================

To create a pivot table similar to an Excel one, we will be using the
`pivot_table()` method from `pandas`. Run the code below in the same notebook as you used for the previous exercise.

```
import numpy as np
df.pivot_table(values=['Average net tax', \
                       'Average total deductions'], \
               index='cluster', aggfunc=np.mean)
```

Note

We pass the `numpy` implementation of `mean()` as the aggregation
function. Recent versions of pandas recommend passing the string
`'mean'` instead, which gives the same result.

![](./images/B15019_05_09.jpg)

Caption: Output of the pivot\_table function
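
As a small illustration of what the pivot table computes, the sketch below builds a toy DataFrame (the values are invented, not taken from the ATO dataset) and checks that `pivot_table()` with a mean aggregation gives the same per-cluster averages as `groupby().mean()`. It passes the string `'mean'` as `aggfunc`, which pandas accepts in place of `np.mean`.

```python
import pandas as pd

# Invented values standing in for the two ATO columns and the cluster labels
toy = pd.DataFrame({
    'Average net tax': [100.0, 200.0, 1000.0, 3000.0],
    'Average total deductions': [10.0, 30.0, 500.0, 700.0],
    'cluster': [0, 0, 1, 1],
})
cols = ['Average net tax', 'Average total deductions']

# Average of each column within each cluster, via pivot_table
by_pivot = toy.pivot_table(values=cols, index='cluster', aggfunc='mean')

# The same averages via an explicit group-by
by_group = toy.groupby('cluster')[cols].mean()

print(by_pivot)
```

Each row of the result describes the "average member" of one cluster, which is what makes the table useful for interpreting the groups.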


You may have heard of different visualization packages, such as
`matplotlib`, `seaborn`, and `bokeh`, but
in this lab, we will be using the `altair` package because
it is quite simple to use (its API is very similar to
`sklearn`). Let\'s import it first:

```
import altair as alt
```

Then, we will instantiate a `Chart()` object with our
DataFrame and save it into a variable called `chart`:

```
chart = alt.Chart(df)
```
Now we will specify the type of graph we want, a scatter plot, with the
`.mark_circle()` method and will save it into a new variable
called `scatter_plot`:

```
scatter_plot = chart.mark_circle()
```
Finally, we need to configure our scatter plot by specifying the names
of the columns that will be our `x`- and `y`-axes on
the graph. We also tell the scatter plot to color each point according
to its cluster value with the `color` option:

```
scatter_plot.encode(x='Average net tax', \
                    y='Average total deductions', \
                    color='cluster:N')
```


You should get the following output:

![](./images/B15019_05_10.jpg)

Caption: Scatter plot of the clusters


Let's say we want to add a tooltip that will display the postcode and
the assigned cluster, along with the values of the two columns of
interest. With `altair`, we just need to add a parameter called
`tooltip` in the `encode()` method with a list of the
corresponding column names and call the `interactive()` method
just after, as seen in the following code snippet:

```
scatter_plot.encode(x='Average net tax', \
                    y='Average total deductions', \
                    color='cluster:N', \
                    tooltip=['Postcode', \
                             'cluster', 'Average net tax', \
                             'Average total deductions'])\
                    .interactive()
```

You should get the following output:

![](./images/B15019_05_11.jpg)

Caption: Interactive scatter plot of the clusters with tooltip


Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses
------------------------------------------------------------------------------

In this exercise, we will learn how to perform clustering analysis with
k-means and visualize its results based on postcode values sorted by
business income and expenses. The following steps will help you complete
this exercise:

1.  Open a new Jupyter notebook for this exercise.

2.  Now `import` the required packages (`pandas`,
    `sklearn`, `altair`, and `numpy`):
    ```
    import pandas as pd
    from sklearn.cluster import KMeans
    import altair as alt
    import numpy as np
    ```


3.  Assign the link to the ATO dataset to a variable called
    `file_url`:
    ```
    file_url = 'https://raw.githubusercontent.com'\
               '/fenago/data-science'\
               '/master/Lab05/DataSet/taxstats2015.csv'
    ```


4.  Using the `read_csv()` function from the pandas package, load
    the dataset with only the following columns by passing them to the
    `usecols` parameter: `'Postcode'`,
    `'Average total business income'`, and
    `'Average total business expenses'`:
    ```
    df = pd.read_csv(file_url, \
                     usecols=['Postcode', \
                              'Average total business income', \
                              'Average total business expenses'])
    ```


5.  Display the last 10 rows from the ATO dataset using the
    `.tail()` method from pandas:

    ```
    df.tail(10)
    ```


    You should get the following output:


![](./images/B15019_05_12.jpg)


    Caption: The last 10 rows of the ATO dataset

6.  Extract the `'Average total business income'` and
    `'Average total business expenses'` columns using the
    following pandas column subsetting syntax:
    `dataframe_name[<list_of_columns>]`. Then, save them into
    a new variable called `X`:
    ```
    X = df[['Average total business income', \
            'Average total business expenses']]
    ```


7.  Now fit `kmeans` with this new variable using a value of
    `8` for the `random_state` hyperparameter:

    ```
    kmeans = KMeans(random_state=8)
    kmeans.fit(X)
    ```


    You should get the following output:


![](./images/B15019_05_13.jpg)


    Caption: Summary of the fitted kmeans and its hyperparameters

8.  Using the `predict` method from the `sklearn`
    package, predict the clustering assignment from the input variable,
    `(X)`, save the results into a new variable called
    `y_preds`, and display the last `10`
    predictions:

    ```
    y_preds = kmeans.predict(X)
    y_preds[-10:]
    ```


    You should get the following output:


![](./images/B15019_05_14.jpg)


9.  Save the predicted clusters back to the DataFrame by creating a new
    column called `'cluster'` and print the last
    `10` rows of the DataFrame using the `.tail()`
    method from the `pandas` package:

    ```
    df['cluster'] = y_preds
    df.tail(10)
    ```


    You should get the following output:


![](./images/B15019_05_15.jpg)


10. Generate a pivot table with the averages of the two columns for each
    cluster value using the `pivot_table` method from the
    `pandas` package with the following parameters:

    Provide the names of the columns to be aggregated,
    `'Average total business income'`
    and `'Average total business expenses'`, to the `values`
    parameter.

    Provide the name of the column to be grouped, `'cluster'`,
    to the `index` parameter.

    Use the `mean` function from NumPy (`np`) as the
    aggregation function for the `aggfunc` parameter:

    ```
    df.pivot_table(values=['Average total business income', \
                           'Average total business expenses'], \
                   index='cluster', aggfunc=np.mean)
    ```


    You should get the following output:


![](./images/B15019_05_16.jpg)


    Caption: Output of the pivot\_table function

11. Now let\'s plot the clusters using an interactive scatter plot.
    First, use `Chart()` and `mark_circle()` from
    the `altair` package to instantiate a scatter plot graph:
    ```
    scatter_plot = alt.Chart(df).mark_circle()
    ```


12. Use the `encode` and `interactive` methods from
    `altair` to specify the display of the scatter plot and
    its interactivity options with the following parameters:

    Provide the name of the `'Average total business income'`
    column to the `x` parameter (the x-axis).

    Provide the name of the
    `'Average total business expenses'` column to the
    `y` parameter (the y-axis).

    Provide `'cluster:N'` (the `cluster` column, with `:N` marking it
    as nominal, that is, categorical) to the `color` parameter
    (providing a different color for each group).

    Provide these column names -- `'Postcode'`,
    `'cluster'`, `'Average total business income'`,
    and `'Average total business expenses'` -- to the
    `tooltip` parameter (this being the information
    displayed by the tooltip):

    ```
    scatter_plot.encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color='cluster:N', tooltip = ['Postcode', \
                                                      'cluster', \
                        'Average total business income', \
                        'Average total business expenses'])\
                        .interactive()
    ```


    You should get the following output:


![](./images/B15019_05_17.jpg)


Caption: Interactive scatter plot of the clusters


Choosing the Number of Clusters
===============================


Note

Open the notebook you were using for *Exercise 5.01*, *Performing Your
First Clustering Analysis on the ATO Dataset*, execute the code you
already entered, and then continue at the end of the notebook with the
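following code.

First, create an empty DataFrame called `clusters`, add a
`'cluster_range'` column containing the numbers of clusters to try
(`1` to `9`), and create an empty list called `inertia`:
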
```
clusters = pd.DataFrame()
clusters['cluster_range'] = range(1, 10)
inertia = []
```
Next, we will create a `for` loop that will iterate over the
range, fit a k-means model with the specified number of
`clusters`, extract the `inertia` value, and store
it in our list, as in the following code snippet:

```
for k in clusters['cluster_range']:
    kmeans = KMeans(n_clusters=k, random_state=8).fit(X)
    inertia.append(kmeans.inertia_)
```
Now we can use our list of `inertia` values in the
`clusters` DataFrame:

```
clusters['inertia'] = inertia
clusters
```

You should get the following output:

![](./images/B15019_05_18.jpg)

Caption: Dataframe containing inertia values for our clusters
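
The `inertia_` attribute of a fitted `KMeans` model is documented by scikit-learn as the sum of squared distances from each sample to its closest cluster center. The pure-Python sketch below computes that quantity by hand for a few invented points. The helper name `inertia_for` and the centroid positions are assumptions made for illustration; a real k-means run would find the centroids itself.

```python
def inertia_for(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for px, py in points:
        # Squared Euclidean distance to every centroid; keep the smallest
        total += min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids)
    return total

# Four invented points forming two tight pairs
points = [(1, 1), (1, 3), (9, 9), (9, 11)]

# One centroid at the overall mean versus one centroid per pair
inertia_k1 = inertia_for(points, [(5, 6)])
inertia_k2 = inertia_for(points, [(1, 2), (9, 10)])

print(inertia_k1)  # 132.0
print(inertia_k2)  # 4.0
```

Inertia generally keeps falling as clusters are added, so the lowest value is not the target; what matters is the point where the decrease levels off.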

Then, we need to plot a line chart using `altair` with the
`mark_line()` method. We will specify the
`'cluster_range'` column as our x-axis and
`'inertia'` as our y-axis, as in the following code snippet:

```
alt.Chart(clusters).mark_line()\
                   .encode(x='cluster_range', y='inertia')
```

You should get the following output:

![](./images/B15019_05_19.jpg)

Caption: Plotting the Elbow method

Note

You don't have to save each of the `altair` objects in a
separate variable; you can just chain the methods one after the other
with "`.`".


Now let's retrain our `KMeans` model with `n_clusters=3` and
plot the clusters as shown in the following code snippet:

```
kmeans = KMeans(random_state=42, n_clusters=3)
kmeans.fit(X)
df['cluster2'] = kmeans.predict(X)
scatter_plot.encode(x='Average net tax', \
                    y='Average total deductions', \
                    color='cluster2:N', \
                    tooltip=['Postcode', 'cluster2', \
                             'Average net tax', \
                             'Average total deductions'])\
                    .interactive()
```

You should get the following output:

![](./images/B15019_05_20.jpg)


Exercise 5.03: Finding the Optimal Number of Clusters
-----------------------------------------------------

In this exercise, we will apply the Elbow method to the same data as in
*Exercise 5.02*, *Clustering Australian Postcodes by Business Income and
Expenses*, to find the optimal number of clusters, before fitting a
k-means model:

1.  Open a new Jupyter notebook for this exercise.

2.  Now `import` the required packages (`pandas`,
    `sklearn`, and `altair`):

    ```
    import pandas as pd
    from sklearn.cluster import KMeans
    import altair as alt
    ```


    Next, we will load the dataset and select the same columns as in
    *Exercise 5.02*, *Clustering Australian Postcodes by Business Income
    and Expenses*, and print the first five rows.

3.  Assign the link to the ATO dataset to a variable called
    `file_url`:
    ```
    file_url = 'https://raw.githubusercontent.com'\
               '/fenago/data-science'\
               '/master/Lab05/DataSet/taxstats2015.csv'
    ```


4.  Using the `read_csv()` function from the pandas package,
    load the dataset with only the following columns by passing them
    to the `usecols` parameter: `'Postcode'`,
    `'Average total business income'`, and
    `'Average total business expenses'`:
    ```
    df = pd.read_csv(file_url, \
                     usecols=['Postcode', \
                              'Average total business income', \
                              'Average total business expenses'])
    ```


5.  Display the first five rows of the DataFrame with the
    `.head()` method from the pandas package:

    ```
    df.head()
    ```


    You should get the following output:


![](./images/B15019_05_21.jpg)


    Caption: The first five rows of the ATO DataFrame

6.  Assign the `'Average total business income'` and
    `'Average total business expenses'` columns to a new
    variable called `X`:
    ```
    X = df[['Average total business income', \
            'Average total business expenses']]
    ```


7.  Create an empty pandas DataFrame called `clusters` and an
    empty list called `inertia`:

    ```
    clusters = pd.DataFrame()
    inertia = []
    ```


    Now, use the `range` function to generate the range of cluster
    numbers, from `1` to `14` (the upper bound of `range` is
    excluded), and assign it to a new column called
    `'cluster_range'` in the `clusters`
    DataFrame:

    ```
    clusters['cluster_range'] = range(1, 15)
    ```


8.  Create a `for` loop to go through each cluster number and
    fit a k-means model accordingly, then append each model's
    `inertia_` attribute to the `inertia` list:
    ```
    for k in clusters['cluster_range']:
        kmeans = KMeans(n_clusters=k).fit(X)
        inertia.append(kmeans.inertia_)
    ```


9.  Assign the `inertia` list to a new column called
    `'inertia'` from the `clusters` DataFrame and
    display its content:

    ```
    clusters['inertia'] = inertia
    clusters
    ```


    You should get the following output:


![](./images/B15019_05_22.jpg)


    Caption: DataFrame containing the inertia value for each number of clusters

10. Now use `mark_line()` and `encode()` from the
    `altair` package to plot the Elbow graph with
    `'cluster_range'` as the x-axis and `'inertia'`
    as the y-axis:

    ```
    alt.Chart(clusters).mark_line()\
       .encode(alt.X('cluster_range'), alt.Y('inertia'))
    ```


    You should get the following output:


![](./images/B15019_05_23.jpg)


    Caption: Plotting the Elbow method

11. Looking at the Elbow plot, identify the optimal number of clusters,
    and assign this value to a variable called
    `optim_cluster`:
    ```
    optim_cluster = 4
    ```


12. Train a k-means model with this number of clusters and a
    `random_state` value of `42` using the
    `fit` method from `sklearn`:
    ```
    kmeans = KMeans(random_state=42, n_clusters=optim_cluster)
    kmeans.fit(X)
    ```


13. Now, using the `predict` method from `sklearn`,
    get the predicted assigned cluster for each data point contained in
    the `X` variable and save the results into a new column
    called `'cluster2'` from the `df` DataFrame:
    ```
    df['cluster2'] = kmeans.predict(X)
    ```


14. Display the first five rows of the `df` DataFrame using
    the `head` method from the `pandas` package:

    ```
    df.head()
    ```


    You should get the following output:


![](./images/B15019_05_24.jpg)


    Caption: The first five rows with the cluster predictions

15. Now create the scatter plot using the `mark_circle()` and
    `encode()` methods from the `altair` package.
    Also, to add interactivity, use the `tooltip` parameter
    and the `interactive()` method from the `altair`
    package as shown in the following code snippet:

    ```
    alt.Chart(df).mark_circle()\
                 .encode\
                  (x='Average total business income', \
                   y='Average total business expenses', \
                   color='cluster2:N', \
                   tooltip=['Postcode', 'cluster2', \
                            'Average total business income',\
                            'Average total business expenses'])\
                 .interactive()
    ```


    You should get the following output:


![](./images/B15019_05_25.jpg)
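
Reading the elbow off a chart is a judgment call. As a rough aid, the sketch below prints how much the inertia falls, in relative terms, each time one more cluster is added. The inertia values are invented for illustration and are not the ones from this exercise, and the helper name `relative_drops` is hypothetical. This is one simple heuristic, not a rule built into `sklearn`.

```python
def relative_drops(inertias):
    """Fraction by which inertia falls when going from k to k + 1 clusters."""
    return [(prev - curr) / prev for prev, curr in zip(inertias, inertias[1:])]

# Invented inertia values for k = 1, 2, 3, 4, 5
toy_inertia = [1000.0, 400.0, 200.0, 180.0, 171.0]

# The first drop corresponds to moving from k=1 to k=2, hence start=2
for k, drop in enumerate(relative_drops(toy_inertia), start=2):
    print(f"k={k}: inertia fell by {drop:.0%}")
```

With these made-up numbers, the gain shrinks sharply after three clusters, which is where the elbow would sit.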


Initializing Clusters
=====================


Let\'s try this out on our ATO dataset by having a look at the following
example.

Note

Open the notebook you were using for *Exercise 5.01*, *Performing Your
First Clustering Analysis on the ATO Dataset,* and earlier examples.
Execute the code you already entered, and then continue at the end of
the notebook with the following code.

First, let's run k-means only once (`n_init=1`) using random initialization:

```
kmeans = KMeans(random_state=14, n_clusters=3, \
                init='random', n_init=1)
kmeans.fit(X)
```
As usual, we want to visualize our clusters with a scatter plot, as
defined in the following code snippet:

```
df['cluster3'] = kmeans.predict(X)
alt.Chart(df).mark_circle()\
             .encode(x='Average net tax', \
                     y='Average total deductions', \
                     color='cluster3:N', \
                     tooltip=['Postcode', 'cluster3', \
                              'Average net tax', \
                              'Average total deductions']) \
             .interactive()
```

You should get the following output:

![](./images/B15019_05_27.jpg)

Caption: Clustering results with n\_init as 1 and init as random

Overall, the result is very close to that of our previous run. It is
worth noticing that the boundaries between the clusters are slightly
different.

Now let's try with five runs (using the `n_init`
hyperparameter) and k-means++ initialization (using the `init`
hyperparameter):

```
kmeans = KMeans(random_state=14, n_clusters=3, \
                init='k-means++', n_init=5)
kmeans.fit(X)
df['cluster4'] = kmeans.predict(X)
alt.Chart(df).mark_circle()\
             .encode(x='Average net tax', \
                     y='Average total deductions', \
                     color='cluster4:N', \
                     tooltip=['Postcode', 'cluster4', \
                              'Average net tax', \
                              'Average total deductions'])\
             .interactive()
```

You should get the following output:

![](./images/B15019_05_28.jpg)

Caption: Clustering results with n\_init as 5 and init as k-means++

Here, the results are very close to those of the original run, which
used 10 initializations. This means that we didn't need so many runs for k-means to
converge and could have saved some time with a lower number.
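
According to the scikit-learn documentation, when `n_init` is greater than 1, the algorithm is run that many times with different starting centroids and the run with the lowest inertia is kept. The sketch below mimics this by hand on invented data: it fits five single-initialization models with different seeds and keeps the best one. The data, the seeds, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: three blobs of 30 points each around different centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
X_toy = np.vstack([c + rng.normal(scale=1.0, size=(30, 2)) for c in centers])

# Five independent runs, each with a single random initialization
runs = [KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X_toy)
        for seed in range(5)]
inertias = [run.inertia_ for run in runs]

# Keeping the run with the lowest inertia is what a larger n_init does for us
best_run = min(runs, key=lambda run: run.inertia_)
print(inertias)
print(best_run.inertia_)
```

If all five inertia values come out nearly identical, extra initializations add little for that dataset; if they vary a lot, a larger `n_init` is worth the extra time.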


Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome
--------------------------------------------------------------------------------------

In this exercise, we will use the same data as in *Exercise 5.02*,
*Clustering Australian Postcodes by Business Income and Expenses*, and
try different values for the `init` and `n_init`
hyperparameters and see how they affect the final clustering result:

1.  Open a new Jupyter notebook.

2.  Import the required packages, which are `pandas`,
    `sklearn`, and `altair`:
    ```
    import pandas as pd
    from sklearn.cluster import KMeans
    import altair as alt
    ```


3.  Assign the link to the ATO dataset to a variable called
    `file_url`:
    ```
    file_url = 'https://raw.githubusercontent.com'\
               '/fenago/data-science'\
               '/master/Lab05/DataSet/taxstats2015.csv'
    ```


4.  Load the dataset and select the same columns as in *Exercise 5.02*,
    *Clustering Australian Postcodes by Business Income and Expenses*,
    and *Exercise 5.03*, *Finding the Optimal Number of Clusters*, using
    the `read_csv()` method from the `pandas`
    package:
    ```
    df = pd.read_csv(file_url, \
                     usecols=['Postcode', \
                              'Average total business income', \
                              'Average total business expenses'])
    ```


5.  Assign the `'Average total business income'` and
    `'Average total business expenses'` columns to a new
    variable called `X`:
    ```
    X = df[['Average total business income', \
            'Average total business expenses']]
    ```


6.  Fit a k-means model with `n_init` equal to `1`
    and a random `init`:
    ```
    kmeans = KMeans(random_state=1, n_clusters=4, \
                    init='random', n_init=1)
    kmeans.fit(X)
    ```


7.  Using the `predict` method from the `sklearn`
    package, predict the clustering assignment from the input variable,
    `(X)`, and save the results into a new column called
    `'cluster3'` in the DataFrame:
    ```
    df['cluster3'] = kmeans.predict(X)
    ```


8.  Plot the clusters using an interactive scatter plot. First, use
    `Chart()` and `mark_circle()` from the
    `altair` package to instantiate a scatter plot graph, as
    shown in the following code snippet:
    ```
    scatter_plot = alt.Chart(df).mark_circle()
    ```


9.  Use the `encode` and `interactive` methods from
    `altair` to specify the display of the scatter plot and
    its interactivity options with the following parameters:

    Provide the name of the `'Average total business income'`
    column to the `x` parameter (x-axis).

    Provide the name of the
    `'Average total business expenses'` column to the
    `y` parameter (y-axis).

    Provide the name of the `'cluster3:N'` column to the
    `color` parameter (which defines the different colors for
    each group).

    Provide these column names -- `'Postcode'`,
    `'cluster3'`, `'Average total business income'`,
    and `'Average total business expenses'` -- to the
    `tooltip` parameter:

    ```
    scatter_plot.encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color='cluster3:N', \
                        tooltip=['Postcode', 'cluster3', \
                                 'Average total business income', \
                                 'Average total business expenses'])\
                       .interactive()
    ```


    You should get the following output:


![](./images/B15019_05_29.jpg)


    Caption: Clustering results with n\_init as 1 and init as random

10. Repeat *Steps 6* to *9* but with different k-means hyperparameters,
    `n_init=10` and random `init`, as shown in the
    following code snippet:

    ```
    kmeans = KMeans(random_state=1, n_clusters=4, \
                    init='random', n_init=10)
    kmeans.fit(X)
    df['cluster4'] = kmeans.predict(X)
    scatter_plot = alt.Chart(df).mark_circle()
    scatter_plot.encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color='cluster4:N',
                        tooltip=['Postcode', 'cluster4', \
                                 'Average total business income', \
                                 'Average total business expenses'])\
                       .interactive()
    ```


    You should get the following output:


![](./images/B15019_05_30.jpg)


    Caption: Clustering results with n\_init as 10 and init as
    random

11. Again, repeat *Steps 6* to *9* but with different k-means
    hyperparameters -- `n_init=100` and random
    `init`:

    ```
    kmeans = KMeans(random_state=1, n_clusters=4, \
                    init='random', n_init=100)
    kmeans.fit(X)
    df['cluster5'] = kmeans.predict(X)
    scatter_plot = alt.Chart(df).mark_circle()
    scatter_plot.encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color='cluster5:N', \
                        tooltip=['Postcode', 'cluster5', \
                        'Average total business income', \
                        'Average total business expenses'])\
                .interactive()
    ```


    You should get the following output:

![](./images/B15019_05_31.jpg)

Caption: Clustering results with n\_init as 100 and init as random
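
Cluster numbers are arbitrary labels, so cluster `0` in one run may be called cluster `2` in another even when the groups are identical. One way to compare two runs, sketched below with invented labels rather than the exercise's results, is to cross-tabulate the two label columns with `pd.crosstab()`: if every row and every column of the table has a single non-zero cell, the two runs produced the same grouping.

```python
import pandas as pd

# Invented labels from two hypothetical runs on the same six rows
toy = pd.DataFrame({
    'cluster3': [0, 0, 1, 1, 2, 2],
    'cluster4': [2, 2, 0, 0, 1, 1],  # same groups, different numbering
})

# Rows: labels from the first run; columns: labels from the second run
table = pd.crosstab(toy['cluster3'], toy['cluster4'])
print(table)

# A one-to-one match between the two sets of labels means identical groups
one_per_row = ((table > 0).sum(axis=1) == 1).all()
one_per_col = ((table > 0).sum(axis=0) == 1).all()
same_grouping = bool(one_per_row and one_per_col)
print(same_grouping)
```

Off-pattern cells in the table show how many rows moved between clusters from one run to the other.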


Calculating the Distance to the Centroid
========================================


Note

Open the notebook you were using for *Exercise 5.01*, *Performing Your
First Clustering Analysis on the ATO Dataset*, and earlier examples.
Execute the code you already entered, and then continue at the end of
the notebook with the following code.

First, extract the first two observations of `X` as NumPy arrays and
print them:

```
x = X.iloc[0].values
y = X.iloc[1].values
print(x)
print(y)
```

You should get the following output:

![](./images/B15019_05_33.jpg)


The coordinates for `x` are `(27555, 2071)` and the
coordinates for `y` are `(28142, 3804)`. To get the squared Euclidean
distance between them, we calculate the squared difference between the
two data points along each axis and sum the results:

```
squared_euclidean = (x[0] - y[0])**2 + (x[1] - y[1])**2
print(squared_euclidean)
```

You should get the following output:

```
3347858
```
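
Written out for two points $x = (x_1, x_2)$ and $y = (y_1, y_2)$, the calculation above is the squared Euclidean distance. The symbols $x_1, x_2, y_1, y_2$ and $d^2$ are notation chosen here for illustration; the numbers are the two observations printed earlier:

$$
d^2(x, y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 = (27555 - 28142)^2 + (2071 - 3804)^2 = 344569 + 3003289 = 3347858
$$

Taking the square root of this value would give the ordinary Euclidean distance.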


Let\'s see how we can plot the centroids in an example.

First, we fit a k-means model as shown in the following code snippet:

```
kmeans = KMeans(random_state=42, n_clusters=3, \
                init='k-means++', n_init=5)
kmeans.fit(X)
df['cluster6'] = kmeans.predict(X)
```
Now extract the `centroids` into a DataFrame and print them:

```
centroids = kmeans.cluster_centers_
centroids = pd.DataFrame(centroids, \
                         columns=['Average net tax', \
                                  'Average total deductions'])
print(centroids)
```

You should get the following output:

![](./images/B15019_05_34.jpg)

Caption: Coordinates of the three centroids

We will plot the usual scatter plot but will assign it to a variable
called `chart1`:

```
chart1 = alt.Chart(df).mark_circle()\
            .encode(x='Average net tax', \
                    y='Average total deductions', \
                    color='cluster6:N', \
                    tooltip=['Postcode', 'cluster6', \
                             'Average net tax', \
                             'Average total deductions'])\
                   .interactive()
chart1
```

You should get the following output:

![](./images/B15019_05_35.jpg)

Caption: Scatter plot of the clusters

Now, create a second scatter plot, called `chart2`, only for the
centroids:

```
chart2 = alt.Chart(centroids).mark_circle(size=100)\
            .encode(x='Average net tax', \
                    y='Average total deductions', \
                    color=alt.value('black'), \
                    tooltip=['Average net tax', \
                             'Average total deductions'])\
                   .interactive()
chart2
```

You should get the following output:

![](./images/B15019_05_36.jpg)

Caption: Scatter plot of the centroids

And now we combine the two charts, which is extremely easy with
`altair`:

```
chart1 + chart2
```

You should get the following output:

![](./images/B15019_05_37.jpg)

Caption: Scatter plot of the clusters and their centroids

Now we can easily see which centroids the observations are closest to.
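
The same check can be done numerically. The sketch below is a minimal pure-Python version of the assignment step, which is also what `predict()` does for a fitted k-means model: each observation gets the index of the centroid at the smallest squared Euclidean distance. The helper `closest_centroid` and all values are hypothetical, not the centroids fitted above.

```python
def closest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2
                 for cx, cy in centroids]
    return distances.index(min(distances))

# Invented centroids and observations
toy_centroids = [(0, 0), (10, 10), (20, 0)]
toy_points = [(1, 2), (9, 8), (18, 1), (11, 3)]

# One centroid index per observation, in the same order as toy_points
assignments = [closest_centroid(p, toy_centroids) for p in toy_points]
print(assignments)  # [0, 1, 2, 1]
```

Comparing squared distances is enough here, because taking the square root would not change which centroid is closest.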


Exercise 5.05: Finding the Closest Centroids in Our Dataset
-----------------------------------------------------------

In this exercise, we will be coding the first iteration of k-means in
order to assign data points to their closest cluster centroids. The
following steps will help you complete the exercise:

1.  Open a new Jupyter notebook.

2.  Now `import` the required packages, which are
    `pandas`, `sklearn`, and `altair`:
    ```
    import pandas as pd
    from sklearn.cluster import KMeans
    import altair as alt
    ```


3.  Load the dataset and select the same columns as in *Exercise 5.02*,
    *Clustering Australian Postcodes by Business Income and Expenses*,
    using the `read_csv()` method from the `pandas`
    package:
    ```
    file_url = 'https://raw.githubusercontent.com/'\
               'fenago/data-science/'\
               'master/Lab05/DataSet/taxstats2015.csv'
    df = pd.read_csv(file_url, \
                     usecols=['Postcode', \
                              'Average total business income', \
                              'Average total business expenses'])
    ```


4.  Assign the `'Average total business income'` and
    `'Average total business expenses'` columns to a new
    variable called `X`:
    ```
    X = df[['Average total business income', \
            'Average total business expenses']]
    ```


5.  Now, calculate the minimum and maximum values of the
    `'Average total business income'` and
    `'Average total business expenses'` variables using the
    `min()` and `max()` methods, as shown in
    the following code snippet:
    ```
    business_income_min = df['Average total business income'].min()
    business_income_max = df['Average total business income'].max()
    business_expenses_min = df['Average total business expenses']\
                            .min()
    business_expenses_max = df['Average total business expenses']\
                            .max()
    ```


6.  Print the values of these four variables, which are the minimum and
    maximum values of the two variables:

    ```
    print(business_income_min)
    print(business_income_max)
    print(business_expenses_min)
    print(business_expenses_max)
    ```


    You should get the following output:

    ```
    0
    876324
    0
    884659
    ```


7.  Now import the `random` package and use the
    `seed()` method to set a seed of `42`, as shown
    in the following code snippet:
    ```
    import random
    random.seed(42)
    ```


8.  Create an empty pandas DataFrame and assign it to a variable called
    `centroids`:
    ```
    centroids = pd.DataFrame()
    ```


9.  Generate four random values using the `sample()` method
    from the `random` package with possible values between the
    minimum and maximum values of the
    `'Average total business income'` column using
    `range()` and store the results in a new column called
    `'Average total business income'` in the
    `centroids` DataFrame:
    ```
    centroids\
    ['Average total business income'] = random.sample\
                                        (range\
                                        (business_income_min, \
                                         business_income_max), 4)
    ```


10. Repeat the same process to generate `4` random values for
    `'Average total business expenses'`:
    ```
    centroids\
    ['Average total business expenses'] = random.sample\
                                          (range\
                                          (business_expenses_min,\
                                           business_expenses_max), 4)
    ```


11. Create a new column called `'cluster'` in the
    `centroids` DataFrame using the DataFrame's
    `.index` attribute and print this
    DataFrame:

    ```
    centroids['cluster'] = centroids.index
    centroids
    ```


    You should get the following output:


![](./images/B15019_05_38.jpg)


    Caption: Coordinates of the four random centroids

12. Create a scatter plot with the `altair` package to display
    the first five rows of the `df` DataFrame and save it in a
    variable called `chart1`:
    ```
    chart1 = alt.Chart(df.head()).mark_circle()\
                .encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color=alt.value('orange'), \
                        tooltip=['Postcode', \
                                 'Average total business income', \
                                 'Average total business expenses'])\
                       .interactive()
    ```


13. Now create a second scatter plot using the `altair`
    package to display the centroids and save it in a variable called
    `'chart2'`:
    ```
    chart2 = alt.Chart(centroids).mark_circle(size=100)\
                .encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color=alt.value('black'), \
                        tooltip=['cluster', \
                                 'Average total business income',\
                                 'Average total business expenses'])\
                       .interactive()
    ```


14. Display the two charts together using the altair syntax:
    `<chart> + <chart>`:

    ```
    chart1 + chart2
    ```


    You should get the following output:


![](./images/B15019_05_39.jpg)


    Caption: Scatter plot of the random centroids and the first five
    observations

15. Define a function that will calculate the
    `squared_euclidean` distance and return its value. This
    function will take the `x` and `y` coordinates
    of a data point and a centroid:
    ```
    def squared_euclidean(data_x, data_y, \
                          centroid_x, centroid_y):
        return (data_x - centroid_x)**2 + (data_y - centroid_y)**2
    ```


16. Using the `.at` accessor from the pandas package, extract
    the first row's `x` and `y` coordinates and
    save them in two variables called `data_x` and
    `data_y`:
    ```
    data_x = df.at[0, 'Average total business income']
    data_y = df.at[0, 'Average total business expenses']
    ```


17. Using a `for` loop or list comprehension, calculate the
    `squared_euclidean` distance of the first observation
    (using its `data_x` and `data_y` coordinates)
    against the `4` different centroids contained in
    `centroids`, save the result in a variable called
    `distances`, and display it:

    ```
    distances = [squared_euclidean\
                 (data_x, data_y, centroids.at\
                  [i, 'Average total business income'], \
                  centroids.at[i, \
                  'Average total business expenses']) \
                  for i in range(4)]
    distances
    ```


    You should get the following output:

    ```
    [215601466600, 10063365460, 34245932020, 326873037866]
    ```


18. Use the `index` method from the list containing the
    `squared_euclidean` distances to find the cluster with the
    shortest distance, as shown in the following code snippet:
    ```
    cluster_index = distances.index(min(distances))
    ```


19. Save the `cluster` index in a column called
    `'cluster'` of the `df` DataFrame for the
    first observation using the `.at` accessor from the pandas
    package:
    ```
    df.at[0, 'cluster'] = cluster_index
    ```


20. Display the first five rows of `df` using the
    `head()` method from the `pandas` package:

    ```
    df.head()
    ```


    You should get the following output:


![](./images/B15019_05_40.jpg)


21. Repeat *Steps 16* to *19* for the next `4` rows to
    calculate their distances from the centroids and find the cluster
    with the smallest distance value:

    ```
    distances = [squared_euclidean\
                 (df.at[1, 'Average total business income'], \
                  df.at[1, 'Average total business expenses'], \
                  centroids.at[i, 'Average total business income'],\
                  centroids.at[i, \
                               'Average total business expenses'])\
                 for i in range(4)]
    df.at[1, 'cluster'] = distances.index(min(distances))
    distances = [squared_euclidean\
                 (df.at[2, 'Average total business income'], \
                  df.at[2, 'Average total business expenses'], \
                  centroids.at[i, 'Average total business income'],\
                  centroids.at[i, \
                               'Average total business expenses'])\
                 for i in range(4)]
    df.at[2, 'cluster'] = distances.index(min(distances))
    distances = [squared_euclidean\
                 (df.at[3, 'Average total business income'], \
                  df.at[3, 'Average total business expenses'], \
                  centroids.at[i, 'Average total business income'],\
                  centroids.at[i, \
                               'Average total business expenses'])\
                 for i in range(4)]
    df.at[3, 'cluster'] = distances.index(min(distances))
    distances = [squared_euclidean\
                 (df.at[4, 'Average total business income'], \
                  df.at[4, 'Average total business expenses'], \
                  centroids.at[i, \
                  'Average total business income'], \
                  centroids.at[i, \
                  'Average total business expenses']) \
                 for i in range(4)]
    df.at[4, 'cluster'] = distances.index(min(distances))
    df.head()
    ```


    You should get the following output:


![](./images/B15019_05_41.jpg)


    Caption: The first five rows of the ATO DataFrame and their
    assigned clusters

22. Finally, plot the centroids and the first `5` rows of the
    dataset using the `altair` package as in *Steps 12* to
    *13*:

    ```
    chart1 = alt.Chart(df.head()).mark_circle()\
                .encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color='cluster:N', \
                        tooltip=['Postcode', 'cluster', \
                                 'Average total business income', \
                                 'Average total business expenses'])\
                       .interactive()
    chart2 = alt.Chart(centroids).mark_circle(size=100)\
                .encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color=alt.value('black'), \
                        tooltip=['cluster', \
                                 'Average total business income',\
                                 'Average total business expenses'])\
                       .interactive()
    chart1 + chart2
    ```


    You should get the following output:

![Caption: Scatter plot of the random centroids and the first five](./images/B15019_05_42.jpg)
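
Steps 16 to 21 assign one row at a time. As a minimal sketch, assuming NumPy is available, the same assignment can be computed for every row at once with broadcasting. The `points` and `centers` arrays below are made-up toy values rather than rows from the ATO dataset, and `closest_centroid` is a hypothetical helper name:

```python
import numpy as np

def closest_centroid(points, centers):
    """Return, for each point, the index of the centroid with the
    smallest squared Euclidean distance."""
    # diff has shape (n_points, n_centers, n_features)
    diff = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Sum the squared differences over the feature axis
    sq_dist = (diff ** 2).sum(axis=2)
    # argmin over the centroid axis picks the nearest centroid
    return sq_dist.argmin(axis=1)

# Toy data (hypothetical values)
points = np.array([[0, 0], [10, 10], [9, 1], [1, 8]])
centers = np.array([[1, 1], [9, 9], [10, 0], [0, 10]])
labels = closest_centroid(points, centers)
print(labels)
```

With the DataFrames from this exercise, one option would be to pass `X.to_numpy()` and the two coordinate columns of `centroids` converted with `to_numpy()`, then store the returned array in `df['cluster']`.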


**Note:** Open the notebook you were using for *Exercise 5.01*, *Performing Your
First Clustering Analysis on the ATO Dataset*, and earlier examples.
Execute the code you already entered, and then continue at the end of
the notebook with the following code.

First, we import the relevant class and instantiate an object:

```
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
```

Then, we fit it to our dataset:

```
min_max_scaler.fit(X)
```

You should get the following output:

![](./images/B15019_05_44.jpg)

Caption: Min-max scaling summary

And finally, call the `transform()` method to standardize the
data:

```
X_min_max = min_max_scaler.transform(X)
X_min_max
```

You should get the following output:

![](./images/B15019_05_45.jpg)

Caption: Min-max-scaled data

Now we print the minimum and maximum values of the min-max-scaled data
for both axes:

```
X_min_max[:,0].min(), X_min_max[:,0].max(), \
X_min_max[:,1].min(), X_min_max[:,1].max()
```

You should get the following output:

![](./images/B15019_05_46.jpg)

Caption: Minimum and maximum values of the min-max-scaled data
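
To make the transformation concrete, here is a minimal sketch of the min-max formula, `(x - min) / (max - min)`, applied column by column with NumPy. This is the calculation `MinMaxScaler` performs with its default `feature_range` of `(0, 1)`. The `X_toy` array and the other variable names are hypothetical and are not part of the ATO dataset:

```python
import numpy as np

# Toy feature matrix (hypothetical values): two columns on different scales
X_toy = np.array([[0.0, 100.0],
                  [5.0, 300.0],
                  [10.0, 500.0]])

col_min = X_toy.min(axis=0)   # per-column minimum
col_max = X_toy.max(axis=0)   # per-column maximum

# Min-max scaling: (x - min) / (max - min), applied column by column
X_toy_min_max = (X_toy - col_min) / (col_max - col_min)
print(X_toy_min_max)
```

After scaling, every column has a minimum of 0 and a maximum of 1, whatever its original range was.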


To apply **z-score** with `sklearn`, first, we have to import the
relevant `StandardScaler` class and instantiate an object:

```
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
```

This time, instead of calling `fit()` and then
`transform()`, we use the `fit_transform()` method:

```
X_scaled = standard_scaler.fit_transform(X)
X_scaled
```

You should get the following output:

![](./images/B15019_05_48.jpg)

Caption: Z-score-standardized data

Now we'll look at the minimum and maximum values for each axis:

```
X_scaled[:,0].min(), X_scaled[:,0].max(), \
X_scaled[:,1].min(), X_scaled[:,1].max()
```

You should get the following output:

![](./images/B15019_05_49.jpg)

Caption: Minimum and maximum values of the z-score-standardized data

The value ranges for both axes are much smaller now. Their maximum
values are around 9 and 18, that is, roughly 9 and 18 standard deviations
above the mean, which indicates that there are some extreme outliers in
the data.
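
The link between a large z-score and an outlier can be checked by hand. The following sketch computes z-scores with NumPy as `(x - mean) / std`, using the population standard deviation, which is also what `StandardScaler` uses. The `values` array is a made-up column containing one extreme value, not data from the ATO dataset:

```python
import numpy as np

# Toy column (hypothetical values): 99 ordinary values and one extreme value
values = np.array([1.0] * 99 + [101.0])

mean = values.mean()
std = values.std()          # population standard deviation (ddof=0)
z_scores = (values - mean) / std

# The extreme value ends up many standard deviations above the mean
print(z_scores.min(), z_scores.max())
```

Here the single extreme value gets a z-score close to 10, while all the other values sit just below 0, which mirrors the pattern seen in the standardized ATO data.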

Now we fit a k-means model on the z-score-standardized data and plot
the results as a scatter plot with the following code snippet:

```
kmeans = KMeans(random_state=42, n_clusters=3, \
                init='k-means++', n_init=5)
kmeans.fit(X_scaled)
df['cluster7'] = kmeans.predict(X_scaled)
alt.Chart(df).mark_circle()\
             .encode(x='Average net tax', \
                     y='Average total deductions', \
                     color='cluster7:N', \
                     tooltip=['Postcode', 'cluster7', \
                              'Average net tax', \
                              'Average total deductions'])\
                    .interactive()
```

You should get the following output:

![](./images/B15019_05_50.jpg)

Caption: Scatter plot of the standardized data
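
Because the model was trained on standardized data, the coordinates stored in `cluster_centers_` are expressed in standardized units. A minimal sketch of how they could be mapped back to the original units with the scaler's `inverse_transform()` method is shown below. The toy array and the names `X_toy`, `scaler`, and `model` are assumptions made for this illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two well-separated groups
X_toy = np.array([[1.0, 100.0], [3.0, 300.0],
                  [101.0, 9100.0], [103.0, 9300.0]])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

model = KMeans(random_state=42, n_clusters=2, init='k-means++', n_init=5)
model.fit(X_toy_scaled)

# Centroids are in standardized units; map them back to the original units
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(centers_original)
```

The same idea should apply to the `standard_scaler` and `kmeans` objects used above, which makes the centroids easier to interpret as actual dollar amounts.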


Exercise 5.06: Standardizing the Data from Our Dataset
------------------------------------------------------

In this final exercise, we will standardize the data using both min-max
scaling and the z-score, fit a k-means model on each version, and
compare the impact of the two methods on the resulting clusters:

1.  Open a new Jupyter notebook.

2.  Now import the required `pandas`, `sklearn`, and
    `altair` packages:
    ```
    import pandas as pd
    from sklearn.cluster import KMeans
    import altair as alt
    ```


3.  Load the dataset and select the same columns as in *Exercise 5.02*,
    *Clustering Australian Postcodes by Business Income and Expenses*,
    using the `read_csv()` method from the `pandas`
    package:
    ```
    file_url = 'https://raw.githubusercontent.com'\
               '/fenago/data-science'\
               '/master/Lab05/DataSet/taxstats2015.csv'
    df = pd.read_csv(file_url, \
                     usecols=['Postcode', \
                              'Average total business income', \
                              'Average total business expenses'])
    ```


4.  Assign the `'Average total business income'` and
    `'Average total business expenses'` columns to a new
    variable called `X`:
    ```
    X = df[['Average total business income', \
            'Average total business expenses']]
    ```


5.  Import the `MinMaxScaler` and `StandardScaler`
    classes from `sklearn`:
    ```
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import StandardScaler
    ```


6.  Instantiate and fit `MinMaxScaler` with the data:

    ```
    min_max_scaler = MinMaxScaler()
    min_max_scaler.fit(X)
    ```


    You should get the following output:


![](./images/B15019_05_51.jpg)


    Caption: Summary of the min-max scaler

7.  Perform the min-max scaling transformation and save the data into a
    new variable called `X_min_max`:

    ```
    X_min_max = min_max_scaler.transform(X)
    X_min_max
    ```


    You should get the following output:


![](./images/B15019_05_52.jpg)


    Caption: Min-max-scaled data

8.  Fit a k-means model on the scaled data with the following
    hyperparameters: `random_state=1`,
    `n_clusters=4, init='k-means++', n_init=5`, as shown in
    the following code snippet:
    ```
    kmeans = KMeans(random_state=1, n_clusters=4, \
                    init='k-means++', n_init=5)
    kmeans.fit(X_min_max)
    ```


9.  Assign the k-means predictions for each row of `X_min_max` to a
    new column called `'cluster8'` in the `df`
    DataFrame:
    ```
    df['cluster8'] = kmeans.predict(X_min_max)
    ```


10. Plot the k-means results into a scatter plot using the
    `altair` package:

    ```
    scatter_plot = alt.Chart(df).mark_circle()
    scatter_plot.encode(x='Average total business income', \
                        y='Average total business expenses',\
                        color='cluster8:N',\
                        tooltip=['Postcode', 'cluster8', \
                                 'Average total business income',\
                                 'Average total business expenses'])\
                       .interactive()
    ```


    You should get the following output:


![](./images/B15019_05_53.jpg)


    Caption: Scatter plot of k-means results using the
    min-max-scaled data

11. Re-train the k-means model but on the z-score-standardized data with
    the same hyperparameter values,
    `random_state=1, n_clusters=4, init='k-means++', n_init=5`:
    ```
    standard_scaler = StandardScaler()
    X_scaled = standard_scaler.fit_transform(X)
    kmeans = KMeans(random_state=1, n_clusters=4, \
                    init='k-means++', n_init=5)
    kmeans.fit(X_scaled)
    ```


12. Assign the k-means predictions for each row of `X_scaled`
    to a new column called `'cluster9'` in the `df`
    DataFrame:
    ```
    df['cluster9'] = kmeans.predict(X_scaled)
    ```


13. Plot the k-means results in a scatter plot using the
    `altair` package:

    ```
    scatter_plot = alt.Chart(df).mark_circle()
    scatter_plot.encode(x='Average total business income', \
                        y='Average total business expenses', \
                        color='cluster9:N', \
                        tooltip=['Postcode', 'cluster9', \
                                 'Average total business income',\
                                 'Average total business expenses'])\
                       .interactive()
    ```


    You should get the following output:


![](./images/B15019_05_54.jpg)
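
Besides comparing the two scatter plots visually, the two sets of cluster assignments can be compared with a contingency table. Cluster numbers are arbitrary, so cluster `0` in one model does not have to correspond to cluster `0` in the other; a cross-tabulation shows which groups overlap. The sketch below uses `pd.crosstab()` on made-up labels that stand in for the `'cluster8'` and `'cluster9'` columns:

```python
import pandas as pd

# Toy cluster labels (hypothetical values) standing in for the
# 'cluster8' (min-max) and 'cluster9' (z-score) columns
labels = pd.DataFrame({'cluster8': [0, 0, 1, 1, 2, 2],
                       'cluster9': [1, 1, 0, 0, 2, 0]})

# Rows: min-max clusters; columns: z-score clusters; cells: row counts
comparison = pd.crosstab(labels['cluster8'], labels['cluster9'])
print(comparison)
```

In the exercise, the equivalent call would be `pd.crosstab(df['cluster8'], df['cluster9'])`. A table with one dominant cell per row suggests that both scaling methods lead to essentially the same grouping.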


Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means
-----------------------------------------------------------------------------

You are working for an international bank. The credit department is
reviewing its offerings and wants to get a better understanding of its
current customers. You have been tasked with performing customer
segmentation analysis. You will perform cluster analysis with k-means to
identify groups of similar customers.

The following steps will help you complete this activity:

1.  Download the dataset and load it into Python.

2.  Read the CSV file using the `read_csv()` method.

    Note

    This dataset is in the `.dat` file format. You can still
    load the file using `read_csv()` but you will need to
    specify the following parameters:
    `header=None`, `sep='\s\s+'`, and `prefix='X'`
    (the `prefix` parameter is only available in pandas versions
    before 2.0).

3.  You will be using the fourth and tenth columns (`X3` and
    `X9`). Extract these.

4.  Perform data standardization by instantiating a
    `StandardScaler` object.

5.  Analyze and define the optimal number of clusters.

6.  Fit a k-means algorithm with the number of clusters you've defined.

7.  Create a scatter plot of the clusters.


You should get something similar to the following output:

![](./images/B15019_05_55.jpg)

Caption: Scatter plot of the four clusters found
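
As a hint for the loading step, the sketch below reads a small whitespace-separated text with `read_csv()` and then names the columns with `add_prefix()`, which is one possible alternative where the `prefix` parameter is not available (it was removed in pandas 2.0). The `raw` string is a made-up stand-in for the `.dat` file, and the variable names are hypothetical:

```python
import io
import pandas as pd

# Toy stand-in for a .dat file (hypothetical values): no header row,
# columns separated by two or more whitespace characters
raw = "1  6  4\n2  48  2\n4  12  4\n"

# A regular-expression separator needs the Python parsing engine
data = pd.read_csv(io.StringIO(raw), header=None,
                   sep=r'\s\s+', engine='python')

# Rename the integer column labels 0, 1, 2 to 'X0', 'X1', 'X2'
data = data.add_prefix('X')
print(data.columns.tolist())
```

For the activity itself, the `io.StringIO(raw)` argument would be replaced by the path or URL of the downloaded dataset.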


Summary
=======

In this lab, we covered the core concepts behind k-means, such as
centroids and squared Euclidean distance. We went through the main
k-means hyperparameters: `init` (initialization method),
`n_init` (number of initialization runs),
`n_clusters` (number of clusters), and
`random_state` (specified seed). We also discussed the
importance of choosing the optimal number of clusters, initializing
centroids properly, and standardizing data. You have learned how to use
the following Python packages: `pandas`, `altair`, and
`sklearn` (in particular, its `KMeans` class).

Next, you will see how we can assess the performance of these models and
what tools can be used to make them even better.