Feature Engineering
Overview
By the end of this lab, you will be able to merge multiple datasets
together; bin categorical and numerical variables; perform aggregation
on data; and manipulate dates using pandas.
This lab will introduce you to some of the key techniques for creating new variables on an existing dataset.
Merging Datasets
First, we need to import the Online Retail dataset into a
pandas DataFrame:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab12/Dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
df.head()
You should get the following output.
Caption: First five rows of the Online Retail dataset
Next, we are going to load all the public holidays in the UK into
another pandas DataFrame. From Lab 10, Analyzing a
Dataset, we know the records of this dataset only cover the years 2010
and 2011. So, we are going to extract the public holidays for those two
years, but we need to do so in two separate steps, as the API provided
by date.nager.at serves a single year at a time.
Let's focus on 2010 first:
uk_holidays_2010 = pd.read_csv\
('https://date.nager.at/PublicHoliday/'\
'Country/GB/2010/CSV')
We can print its shape to see how many rows and columns it has:
uk_holidays_2010.shape
You should get the following output.
(13, 8)
We can see there were 13 public holidays in that year and
there are 8 different columns.
Let's print the first five rows of this DataFrame:
uk_holidays_2010.head()
You should get the following output:
Caption: First five rows of the UK 2010 public holidays DataFrame
Now that we have the list of public holidays for 2010, let's extract the ones for 2011:
uk_holidays_2011 = pd.read_csv\
('https://date.nager.at/PublicHoliday/'\
'Country/GB/2011/CSV')
uk_holidays_2011.shape
You should get the following output.
(15, 8)
There were 15 public holidays in 2011. Now we need to
combine the records of these two DataFrames. We will use the
.append() method from pandas and assign the
results into a new DataFrame:
uk_holidays = uk_holidays_2010.append(uk_holidays_2011)
Let's check we have the right number of rows after appending the two DataFrames:
uk_holidays.shape
You should get the following output:
(28, 8)
We got 28 records, which corresponds to the total number
of public holidays in 2010 and 2011 (13 + 15).
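Note that the .append() method was deprecated and then removed in pandas 2.0. If the line above raises an AttributeError on a recent pandas version, pd.concat() produces the same result:
# Equivalent to .append() on pandas 2.0+, where DataFrame.append was removed
uk_holidays = pd.concat([uk_holidays_2010, uk_holidays_2011])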
In order to merge two DataFrames together, we need to have at least one
common column between them, meaning the two DataFrames should have at
least one column that contains the same type of information. In our
example, we are going to merge this DataFrame using the Date
column with the Online Retail DataFrame on the InvoiceDate
column. We can see that the data format of these two columns is
different: one is a date (yyyy-mm-dd) and the other is a
datetime (yyyy-mm-dd hh:mm:ss).
So, we need to transform the InvoiceDate column into date
format (yyyy-mm-dd). One way to do it (we will see another
one later in this lab) is to transform this column into text and
then extract the first 10 characters for each cell using the
.str.slice() method.
For example, the date 2010-12-01 08:26:00 will first be converted into a
string and then we will keep only the first 10 characters, which will be
2010-12-01. We are going to save these results into a new column called
InvoiceDay:
df['InvoiceDay'] = df['InvoiceDate'].astype(str)\
.str.slice(stop=10)
df.head()
The output is as follows:
Caption: First five rows after creating InvoiceDay
Now InvoiceDay from the Online Retail DataFrame and
Date from the UK public holidays DataFrame contain similar
information, so we can merge these two DataFrames together using
.merge() from pandas.
There are multiple ways to join two tables together:
- The left join
- The right join
- The inner join
- The outer join
The Left Join
The left join will keep all the rows from the first DataFrame, which is the Online Retail dataset (the left-hand side) and join it to the matching rows from the second DataFrame, which is the UK Public Holidays dataset (the right-hand side), as shown in Figure 12.04:
Caption: Venn diagram for left join
To perform a left join, we need to specify to the .merge() method the following parameters:
- how='left' for a left join
- left_on='InvoiceDay' to specify the column used for merging from the left-hand side (here, the InvoiceDay column from the Online Retail DataFrame)
- right_on='Date' to specify the column used for merging from the right-hand side (here, the Date column from the UK Public Holidays DataFrame)
These parameters are combined as shown in the following code snippet:
df_left = pd.merge(df, uk_holidays, left_on='InvoiceDay', \
right_on='Date', how='left')
df_left.shape
You should get the following output:
(541909, 17)
We got the exact same number of rows as the original Online Retail DataFrame, which is expected for a left join. Let's have a look at the first five rows:
df_left.head()
You should get the following output:
Caption: First five rows of the left-merged DataFrame
We can see that the eight columns from the public holidays DataFrame
have been merged onto the original one. If no row from
the second DataFrame (in this case, the public holidays one) was matched,
pandas fills all the cells with missing values
(NaT or NaN), as shown in Figure 12.05.
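As a quick sanity check (a hedged sketch, assuming the Date column from the holidays DataFrame was kept after merging, as in the output above), we can count how many retail rows found no matching holiday:
# Rows with NaN in Date had no matching public holiday
df_left['Date'].isna().sum()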
The Right Join
The right join is similar to the left join except it will keep all the rows from the second DataFrame (the right-hand side) and try to match them with the first one (the left-hand side), as shown in Figure 12.06:
Caption: Venn diagram for right join
We just need to specify the following parameters:
- how='right' in the .merge() method to perform this type of join.
- The exact same columns for merging as in the previous example, which are InvoiceDay for the Online Retail DataFrame and Date for the UK Public Holidays one.
These parameters are combined as shown in the following code snippet:
df_right = df.merge(uk_holidays, left_on='InvoiceDay', \
right_on='Date', how='right')
df_right.shape
You should get the following output:
(9602, 17)
We can see there are fewer rows as a result of the right join, but not the same number as in the Public Holidays DataFrame. This is because multiple rows from the Online Retail DataFrame can match a single date in the public holidays one.
For instance, looking at the first rows of the merged DataFrame, we can see there were multiple purchases on January 4, 2011, so all of them have been matched with the corresponding public holiday. Have a look at the following code snippet:
df_right.head()
You should get the following output:
Caption: First five rows of the right-merged DataFrame
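To verify this one-to-many behavior, we can count how many retail rows were matched to each holiday date; a minimal sketch using the merged DataFrame from above:
# Number of Online Retail rows matched to each public holiday date
df_right['Date'].value_counts().head()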
There are two other types of merging: inner and outer.
An inner join will only keep the rows that match between the two tables:
Caption: Venn diagram for inner join
You just need to specify the how='inner' parameter in the
.merge() method.
These parameters are combined as shown in the following code snippet:
df_inner = df.merge(uk_holidays, left_on='InvoiceDay', \
right_on='Date', how='inner')
df_inner.shape
You should get the following output:
(9579, 17)
We can see there are only 9,579 observations that happened during a public holiday in the UK.
The outer join will keep all rows from both tables (matched and unmatched), as shown in Figure 12.09:
Caption: Venn diagram for outer join
As you may have guessed, you just need to specify the
how='outer' parameter in the .merge() method:
df_outer = df.merge(uk_holidays, left_on='InvoiceDay', \
right_on='Date', how='outer')
df_outer.shape
You should get the following output:
(541932, 17)
Before merging two tables, it is extremely important for you to know what your focus is. If your objective is to expand the number of features from an original dataset by adding the columns from another one, then you will probably use a left or right join. But be aware you may end up with more observations due to potentially multiple matches between the two tables. On the other hand, if you are interested in knowing which observations matched or didn't match between the two tables, you will either use an inner or outer join.
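If knowing which rows matched is your focus, note that .merge() also accepts an indicator parameter: with indicator=True, pandas adds a _merge column flagging each row as left_only, right_only, or both. A minimal sketch reusing the columns from the examples above:
# The _merge column shows the origin of each row after an outer join
df_ind = df.merge(uk_holidays, left_on='InvoiceDay', \
                  right_on='Date', how='outer', indicator=True)
df_ind['_merge'].value_counts()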
Exercise 12.01: Merging the ATO Dataset with the Postcode Data
In this exercise, we will merge the ATO dataset (28 columns) with the Postcode dataset (150 columns) to get a richer dataset with an increased number of columns.
The following steps will help you complete the exercise:
1. Open up a new Jupyter notebook.
2. Now, begin with the import of the pandas package:
import pandas as pd
3. Assign the link to the ATO dataset to a variable called file_url:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab12/Dataset/taxstats2015.csv'
4. Using the .read_csv() method from the pandas package, load the dataset into a new DataFrame called df:
df = pd.read_csv(file_url)
5. Display the dimensions of this DataFrame using the .shape attribute:
df.shape
You should get the following output:
(2473, 28)
The ATO dataset contains 2473 rows and 28 columns.
6. Display the first five rows of the ATO DataFrame using the .head() method:
df.head()
You should get the following output:
Caption: First five rows of the ATO dataset
Both DataFrames have a column called `Postcode` containing
postcodes, so we will use it to merge them together.
Note
Postcode is the name used in Australia for zip code. It is an
identifier for postal areas.
We are interested in learning more about each of these postcodes.
Let's make sure they are all unique in this dataset.
7. Display the number of unique values of the Postcode variable using the .nunique() method:
df['Postcode'].nunique()
You should get the following output:
2473
There are 2473 unique values in this column and the DataFrame has 2473 rows, so we can be sure that the Postcode variable contains only unique values.
8. Now, assign the link to the second Postcode dataset to a variable called postcode_url:
postcode_url = 'https://github.com/fenago/'\
               'data-science/blob/'\
               'master/Lab12/Dataset/'\
               'taxstats2016individual06taxablestatusstate'\
               'territorypostcodetaxableincome%20(2).xlsx?'\
               'raw=true'
9. Load the second Postcode dataset into a new DataFrame called postcode_df using the .read_excel() method. We will only load the Individuals Table 6B sheet, as this is where the data is located, so we need to provide this name to the sheet_name parameter. Also, the header row (containing the names of the variables) in this spreadsheet is located on the third row, so we need to specify this to the header parameter.
Note
Don't forget that indexing starts at 0 in Python, so the third row has index 2.
Have a look at the following code snippet:
postcode_df = pd.read_excel(postcode_url, \
                            sheet_name='Individuals Table 6B', \
                            header=2)
10. Print the dimensions of postcode_df using the .shape attribute:
postcode_df.shape
You should get the following output:
(2567, 150)
This DataFrame contains 2567 rows and 150 columns. By merging it with the ATO dataset, we will get additional information for each postcode.
11. Print the first five rows of postcode_df using the .head() method:
postcode_df.head()
You should get the following output:
Caption: First five rows of the Postcode dataset
We can see that the second column contains the postcode values, and
this is the one we will use to merge with the ATO dataset. Let's
check whether they are unique.
12. Print the number of unique values in this column using the .nunique() method, as shown in the following code snippet:
postcode_df['Postcode'].nunique()
You should get the following output:
2567
There are 2567 unique values, and this corresponds exactly to the number of rows of this DataFrame, so we're absolutely sure this column contains unique values. This also means that, after merging the two tables, there will be only one-to-one matches. We won't have a case where multiple rows from one of the datasets match a single row of the other one. For instance, postcode 2029 from the ATO DataFrame will have exactly one match in the second Postcode DataFrame.
13. Perform a left join on the two DataFrames using the .merge() method and save the results into a new DataFrame called merged_df. Specify the how='left' and on='Postcode' parameters:
merged_df = pd.merge(df, postcode_df, \
                     how='left', on='Postcode')
14. Print the dimensions of the new merged DataFrame using the .shape attribute:
merged_df.shape
You should get the following output:
(2473, 177)
We got exactly 2473 rows after merging, which is what we expect, as we used a left join and there was a one-to-one match on the Postcode column from both original DataFrames. Also, we now have 177 columns, which is the objective of this exercise. But before concluding, we want to see whether there are any postcodes that didn't match between the two datasets. To do so, we will look at one column from the right-hand side DataFrame (the Postcode dataset) and see whether there are any missing values.
15. Print the total number of missing values in the 'State/ Territory1' column by combining the .isna() and .sum() methods:
merged_df['State/ Territory1'].isna().sum()
You should get the following output:
4
There are four postcodes from the ATO dataset that didn't match the Postcode dataset.
Let's see which ones they are.
16. Print the missing postcodes using the .loc() method, as shown in the following code snippet:
merged_df.loc[merged_df['State/ Territory1'].isna(), \
              'Postcode']
You should get the following output:
Caption: List of unmatched postcodes
The missing postcodes from the Postcode dataset are 3010,
4462, 6068, and 6758. In a real
project, you would have to get in touch with your stakeholders or the
data team to see if you are able to get this data.
We have successfully merged the two datasets of interest and have
expanded the number of features from 28 to 177.
We now have a much richer dataset and will be able to perform a more
detailed analysis of it.
In the next topic, you will be introduced to binning variables.
Binning Variables
As mentioned earlier, feature engineering is not only about getting information not present in a dataset. Quite often, you will have to create new features from existing ones. One example of this is consolidating values from an existing column to a new list of values.
For instance, you may have a very high number of unique values for some of the categorical columns in your dataset, let's say over 1,000 values for each variable. This is actually quite a lot of information that will require extra computation power for an algorithm to process and learn the patterns from. This can have a significant impact on the project cost if you are using cloud computing services or on the delivery time of the project.
One possible solution is to not use these columns and drop them, but in that case, you may lose some very important and critical information for the business. Another solution is to create a more consolidated version of these columns by reducing the number of unique values to a smaller number, let's say 100. This would drastically speed up the training process for the algorithm without losing too much information. This kind of transformation is called binning and, traditionally, it refers to numerical variables, but the same logic can be applied to categorical variables as well.
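For example, a simple way to bin a high-cardinality categorical column is to keep only the most frequent values and replace everything else with a catch-all label. Here is a minimal, hedged sketch, where df and the 'category' column are hypothetical names used only for illustration:
# Keep the 100 most frequent categories and bin the rest into 'Other'
top_categories = df['category'].value_counts().nlargest(100).index
df['category_bin'] = df['category']\
                     .where(df['category'].isin(top_categories), 'Other')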
Let's see how we can achieve this on the Online Retail dataset. First, we need to load the data:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab12/Dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
In Lab 10, Analyzing a Dataset, we learned that the
Country column contains 38 unique
values:
df['Country'].unique()
You should get the following output:
Caption: List of unique values for the Country column
We are going to group some of the countries together into regions such as Asia, the Middle East, and America. We will leave the European countries as is.
First, let's create a new column called Country_bin by
copying the Country column:
df['Country_bin'] = df['Country']
Then, we are going to create a list called asian_countries
containing the names of the Asian countries from the list of unique
values of the Country column:
asian_countries = ['Japan', 'Hong Kong', 'Singapore']
And finally, using the .loc() and .isin()
methods from pandas, we are going to change the value of
Country_bin to Asia for all of the countries
that are present in the asian_countries list:
df.loc[df['Country'].isin(asian_countries), \
'Country_bin'] = 'Asia'
Now, if we print the list of unique values for this new column, we will
see the three Asian countries (Japan, Hong Kong,
and Singapore) have been replaced by the value
Asia:
df['Country_bin'].unique()
You should get the following output:
Caption: List of unique values for the Country_bin column after binning Asian countries
Let's perform the same process for Middle Eastern countries:
m_east_countries = ['Israel', 'Bahrain', 'Lebanon', \
'United Arab Emirates', 'Saudi Arabia']
df.loc[df['Country'].isin(m_east_countries), \
'Country_bin'] = 'Middle East'
df['Country_bin'].unique()
You should get the following output:
Caption: List of unique values for the Country_bin column after binning Middle Eastern countries
Finally, let's group all countries from North and South America together:
american_countries = ['Canada', 'Brazil', 'USA']
df.loc[df['Country'].isin(american_countries), \
'Country_bin'] = 'America'
df['Country_bin'].unique()
You should get the following output:
Caption: List of unique values for the Country_bin column after binning countries from North and South America
Let's check the new number of unique values:
df['Country_bin'].nunique()
You should get the following output:
30
30 is the number of unique values in the
Country_bin column. So, we have reduced the number of unique
values in this column from 38 to 30.
We just saw how to group categorical values together, but the same process can be applied to numerical values as well. For instance, it is quite common to group people's ages into bins such as 20s (20 to 29 years old), 30s (30 to 39), and so on.
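As a quick illustration, here is a hedged sketch of that idea on a small, made-up Series of ages, using the pd.cut() method we will also use in the next exercise:
# Bin made-up ages into decades; each bin is left-open, right-closed
ages = pd.Series([23, 35, 47, 31, 28])
pd.cut(ages, bins=[20, 30, 40, 50], labels=['20s', '30s', '40s'])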
Have a look at Exercise 12.02, Binning the YearBuilt variable from the AMES Housing dataset.
Exercise 12.02: Binning the YearBuilt Variable from the AMES Housing Dataset
In this exercise, we will create a new feature by binning an existing
numerical column in order to reduce the number of unique values from
112 to 15.
Note
The dataset we will be using in this exercise is the Ames Housing dataset. This dataset contains the list of residential home sales in the city of Ames, Iowa between 2010 and 2016.
1. Open up a new Jupyter notebook.
2. Import the pandas and altair packages:
import pandas as pd
import altair as alt
3. Assign the link to the dataset to a variable called file_url:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab12/Dataset/ames_iowa_housing.csv'
4. Using the .read_csv() method from the pandas package, load the dataset into a new DataFrame called df:
df = pd.read_csv(file_url)
5. Display the first five rows using the .head() method:
df.head()
You should get the following output:
Caption: First five rows of the AMES housing DataFrame
6. Display the number of unique values in the YearBuilt column using .nunique():
df['YearBuilt'].nunique()
You should get the following output:
112
There are 112 unique values in the YearBuilt column.
7. Print a scatter plot using altair to visualize the number of records per construction year. Specify YearBuilt:O as the x-axis and count() as the y-axis in the .encode() method:
alt.Chart(df).mark_circle().encode(alt.X('YearBuilt:O'),\
                                   y='count()')
You should get the following output:
There weren't many properties sold in some of the years, so you
can group the years into decades (groups of 10 years).
8. Create a list called year_built containing all the unique values in the YearBuilt column:
year_built = df['YearBuilt'].unique()
9. Create another list that will compute the decade for each year in year_built. Use a list comprehension to loop through each year and apply the formula year - (year % 10). For example, applied to the year 2015, this formula gives 2015 - (2015 % 10), which is 2015 - 5, that is, 2010.
Note
% is the modulo operator; year % 10 returns the last digit of each year.
Have a look at the following code snippet:
decade_list = [year - (year % 10) for year in year_built]
10. Create a sorted list of unique values from decade_list and save the result into a new variable called decade_built. To do so, transform decade_list into a set (this will exclude all duplicates) and then use the sorted() function, as shown in the following code snippet:
decade_built = sorted(set(decade_list))
11. Print the values of decade_built:
decade_built
You should get the following output:
Caption: List of decades
Now we have the list of decades we are going to bin the
`YearBuilt` column with.
12. Create a new column on the df DataFrame called DecadeBuilt that will bin each value from YearBuilt into a decade. You will use the .cut() method from pandas and specify the bins=decade_built parameter:
df['DecadeBuilt'] = pd.cut(df['YearBuilt'], \
                           bins=decade_built)
13. Print the first five rows of the DataFrame, but only the 'YearBuilt' and 'DecadeBuilt' columns:
df[['YearBuilt', 'DecadeBuilt']].head()
You should get the following output:
Caption: First five rows of the YearBuilt and DecadeBuilt columns
Manipulating Dates
In Lab 10, Analyzing a Dataset, you were introduced to the
concept of data types in pandas. At that time, we mainly
focused on numerical and categorical variables, but there is another
important one: datetime. Let's have a look again at the
type of each column of the Online Retail dataset:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab12/Dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
df.dtypes
You should get the following output:
Caption: Data types for the variables in the Online Retail dataset
We can see that pandas automatically detected that
InvoiceDate is of type datetime. But for some
other datasets, it may not recognize dates properly. In this case, you
will have to manually convert them using the .to_datetime()
method:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
Once the column is converted to datetime, pandas provides a
lot of attributes and methods for extracting time-related information.
For instance, if you want to get the year of a date, you use the
.dt.year attribute:
df['InvoiceDate'].dt.year
You should get the following output:
Caption: Extracted year for each row for the InvoiceDate column
As you may have guessed, there are attributes for extracting the month
and day of a date: .dt.month and .dt.day
respectively. You can get the day of the week from a date using the
.dt.dayofweek attribute:
df['InvoiceDate'].dt.dayofweek
You should get the following output.
Caption: Extracted day of the week for each row for the InvoiceDate column
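If you prefer weekday names over numbers, pandas also provides the .dt.day_name() method, which returns labels such as Monday or Wednesday:
df['InvoiceDate'].dt.day_name()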
With datetime columns, you can also perform some mathematical
operations. For instance, we can add 3 days to each date
using pandas' time-series offset object,
pd.tseries.offsets.Day(3):
df['InvoiceDate'] + pd.tseries.offsets.Day(3)
You should get the following output:
Caption: InvoiceDate column offset by three days
You can also offset dates by business days using
pd.tseries.offsets.BusinessDay(). For instance, if we want
to get the previous business day, we do:
df['InvoiceDate'] + pd.tseries.offsets.BusinessDay(-1)
You should get the following output:
Caption: InvoiceDate column offset by -1 business day
Another interesting date manipulation is snapping each date to a
specific time frequency. For instance, if you want to get the first day
of the month for each date, you can convert the column to a monthly
period and back to a timestamp (note that pd.Timedelta() does not
accept a month unit, as months have variable lengths):
df['InvoiceDate'].dt.to_period('M').dt.to_timestamp()
You should get the following output:
Caption: InvoiceDate column transformed to the start of the month
As you have seen in this section, the pandas package
provides a lot of different APIs for manipulating dates. You have
learned how to use a few of the most popular ones. You can now explore
the other ones on your own.
Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints
In this exercise, we will learn how to extract time-related information
from two existing date columns using pandas in order to
create six new columns:
Note
The dataset we will be using in this exercise is the Financial Services Customer Complaints dataset.
1. Open up a new Jupyter notebook.
2. Import the pandas package:
import pandas as pd
3. Assign the link to the dataset to a variable called file_url:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab12/Dataset/Consumer_Complaints.csv'
4. Use the .read_csv() method from the pandas package and load the dataset into a new DataFrame called df:
df = pd.read_csv(file_url)
5. Display the first five rows using the .head() method:
df.head()
You should get the following output:
Caption: First five rows of the Customer Complaint DataFrame
6. Print out the data types of each column using the .dtypes attribute:
df.dtypes
You should get the following output:
Caption: Data types for the Customer Complaint DataFrame
The `Date received` and `Date sent to company`
columns haven't been recognized as datetime, so we need to
convert them manually.
7. Convert the Date received and Date sent to company columns to datetime using the pd.to_datetime() method:
df['Date received'] = pd.to_datetime(df['Date received'])
df['Date sent to company'] = pd.to_datetime\
                             (df['Date sent to company'])
8. Print out the data types of each column using the .dtypes attribute:
df.dtypes
You should get the following output:
Caption: Data types for the Customer Complaint DataFrame after
conversion
These two columns now have the right data type. Let's create
some new features from these two dates.
9. Create a new column called YearReceived, which will contain the year of each date from the Date received column, using the .dt.year attribute:
df['YearReceived'] = df['Date received'].dt.year
10. Create a new column called MonthReceived, which will contain the month of each date, using the .dt.month attribute:
df['MonthReceived'] = df['Date received'].dt.month
11. Create a new column called DayReceived, which will contain the day of the month for each date, using the .dt.day attribute:
df['DayReceived'] = df['Date received'].dt.day
12. Create a new column called DowReceived, which will contain the day of the week for each date, using the .dt.dayofweek attribute:
df['DowReceived'] = df['Date received'].dt.dayofweek
13. Display the first five rows using the .head() method:
df.head()
You should get the following output:
Caption: First five rows of the Customer Complaint DataFrame
after creating four new features
We can see we have successfully created four new features:
`YearReceived`, `MonthReceived`,
`DayReceived`, and `DowReceived`. Now let's
create another feature that will indicate whether the date fell on a
weekend.
14. Create a new column called IsWeekendReceived, which will contain binary values indicating whether DowReceived is greater than or equal to 5 (0 corresponds to Monday; 5 and 6 correspond to Saturday and Sunday, respectively):
df['IsWeekendReceived'] = df['DowReceived'] >= 5
15. Display the first five rows using the .head() method:
df.head()
You should get the following output:
Caption: First five rows of the Customer Complaint DataFrame
after creating the weekend feature
We have created a new feature stating whether each complaint was
received during a weekend. Now we will engineer a new
feature with the number of days between
`Date sent to company` and `Date received`.
16. Create a new column called RoutingDays, which will contain the difference between Date sent to company and Date received:
df['RoutingDays'] = df['Date sent to company'] \
                    - df['Date received']
17. Print out the data type of the new 'RoutingDays' column using the .dtype attribute:
df['RoutingDays'].dtype
You should get the following output:
Caption: Data type of the RoutingDays column
The result of subtracting two datetime columns is a new timedelta
column (`dtype('<m8[ns]')`, which is a specific timedelta
type from the `numpy` package). We need to convert this
data type into an `int` to get the number of days between
these two dates.
18. Transform the RoutingDays column using the .dt.days attribute:
df['RoutingDays'] = df['RoutingDays'].dt.days
19. Display the first five rows using the .head() method:
df.head()
You should get the following output:
Caption: First five rows of the Customer Complaint DataFrame after creating RoutingDays
Performing Data Aggregation
In pandas, it is quite easy to perform data aggregation. We
just need to combine the following methods successively:
.groupby() and .agg().
We will need to specify the list of columns that will be grouped
together to the .groupby() method. If you are familiar with
pivot tables in Excel, this corresponds to the Rows field.
The .agg() method expects a dictionary with the name of a
column as a key and the aggregation function as a value such as
{'column_name': 'aggregation_function'}. In an Excel pivot
table, the aggregated column is referred to as values.
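Before applying this to a real dataset, here is a minimal, self-contained sketch of the pattern on a tiny, made-up DataFrame (the shop and amount columns are hypothetical):
# Total amount per shop: groupby defines the rows, agg defines the values
toy = pd.DataFrame({'shop': ['A', 'A', 'B'], \
                    'amount': [10, 20, 5]})
toy.groupby('shop').agg({'amount': 'sum'})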
Let's see how to do it on the Online Retail dataset. First, we need to import the data:
import pandas as pd
file_url = 'https://github.com/fenago/'\
'data-science/blob/'\
'master/Lab12/Dataset/'\
'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
Let's calculate the total quantity of items sold for each country. We
will specify the Country column as the grouping column:
df.groupby('Country').agg({'Quantity': 'sum'})
You should get the following output:
Caption: Sum of Quantity per Country (truncated)
This result gives the total volume of items sold per country. We
can see that Australia sold almost four times as many items as
Belgium. This level of information may be too high-level, and we may want
more granular detail. Let's perform the same aggregation, but this
time we will group on two columns: Country and
StockCode. We just need to provide the names of these
columns as a list to the .groupby() method:
df.groupby(['Country', 'StockCode']).agg({'Quantity': 'sum'})
You should get the following output:
Caption: Sum of Quantity per Country and StockCode
We can see how many units of each item have been sold in each country. We
can note that Australia sold the same quantity of products 20675,
20676, and 20677 (216 each). This
may indicate that these products are always sold together.
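Note that the dictionary passed to .agg() also accepts a list of functions per column, so several statistics can be computed in one call; a quick sketch:
# Total and average quantity per country in a single aggregation
df.groupby('Country').agg({'Quantity': ['sum', 'mean']})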
We can add one more layer of information and get the number of items
sold for each country, product, and date. To do so, we first
need to create a new feature that extracts the date component of
InvoiceDate (we just learned how to do this in the previous
section):
df['Invoice_Date'] = df['InvoiceDate'].dt.date
Then, we can add this new column in the .groupby() method:
df.groupby(['Country', 'StockCode', \
'Invoice_Date']).agg({'Quantity': 'sum'})
You should get the following output:
Caption: Sum of Quantity per Country, StockCode, and Invoice_Date
We have generated a new DataFrame with the total quantity of items sold
per country, item ID, and date. We can see the item with
StockCode 15036 was quite popular on 2011-05-17
in Australia: 600 items were sold. On
the other hand, only 6 items of StockCode
20665 were sold on 2011-03-24
in Australia.
We can now merge this additional information back into the original
DataFrame. But before that, there is an additional data transformation
step required: reset the column index. The pandas package
creates a multi-level index after data aggregation by default. You can
think of it as though the column names were stored in multiple rows
instead of one only. To change it back to a single level, you need to
call the .reset_index() method:
df_agg = df.groupby(['Country', 'StockCode', 'Invoice_Date'])\
.agg({'Quantity': 'sum'}).reset_index()
df_agg.head()
You should get the following output:
Caption: DataFrame containing data aggregation information
Now we can merge this new DataFrame into the original one using the
.merge() method we saw earlier in this lab:
df_merged = pd.merge(df, df_agg, how='left', \
on = ['Country', 'StockCode', \
'Invoice_Date'])
df_merged
You should get the following output:
Caption: Merged DataFrame (truncated)
We can see there are two columns called Quantity_x and
Quantity_y instead of Quantity.
The reason is that, after merging, there were two different columns with
the exact same name (Quantity), so by default, pandas added
a suffix to differentiate them.
We can fix this situation either by replacing the name of one of those
two columns before merging or we can replace both of them after merging.
To replace column names, we can use the .rename() method
from pandas by providing a dictionary with the old name as
the key and the new name as the value, such as
{'old_name': 'new_name'}.
Let's replace the column names after merging with Quantity
and DailyQuantity:
df_merged.rename(columns={"Quantity_x": "Quantity", \
"Quantity_y": "DailyQuantity"}, \
inplace=True)
df_merged
You should get the following output:
Caption: Renamed DataFrame (truncated)
Now we can create a new feature that calculates the ratio between the quantity sold in each transaction and the total daily quantity sold in the corresponding country:
df_merged['QuantityRatio'] = df_merged['Quantity'] \
/ df_merged['DailyQuantity']
df_merged
You should get the following output:
Caption: Final DataFrame with new QuantityRatio feature
Exercise 12.04: Feature Engineering Using Data Aggregation on the AMES Housing Dataset
In this exercise, we will create new features using data aggregation.
First, we'll calculate the maximum SalePrice and
LotArea for each neighborhood and by YrSold.
Then, we will add this information back to the dataset, and finally, we
will calculate the ratio of each property sold with these two maximum
values:
Note
The dataset we will be using in this exercise is the Ames Housing dataset.
1. Open up a new Jupyter notebook.
2. Import the pandas package:
import pandas as pd
3. Assign the link to the dataset to a variable called file_url:
file_url = 'https://raw.githubusercontent.com/'\
           'fenago/data-science/'\
           'master/Lab12/Dataset/ames_iowa_housing.csv'
4. Using the .read_csv() method from the pandas package, load the dataset into a new DataFrame called df:
df = pd.read_csv(file_url)
5. Perform data aggregation to find the maximum SalePrice for each Neighborhood and YrSold using the .groupby().agg() methods, and save the results in a new DataFrame called df_agg:
df_agg = df.groupby(['Neighborhood', 'YrSold'])\
           .agg({'SalePrice': 'max'}).reset_index()
6. Rename the df_agg columns to Neighborhood, YrSold, and SalePriceMax:
df_agg.columns = ['Neighborhood', 'YrSold', 'SalePriceMax']
7. Print out the first five rows of df_agg:
df_agg.head()
You should get the following output:
Caption: First five rows of the aggregated DataFrame
8. Merge the original DataFrame, df, with df_agg using a left join (how='left') on the Neighborhood and YrSold columns using the .merge() method, and save the results into a new DataFrame called df_new:
df_new = pd.merge(df, df_agg, how='left', \
                  on=['Neighborhood', 'YrSold'])
9. Print out the first five rows of df_new:
df_new.head()
You should get the following output:
Caption: First five rows of df_new
Note that we are displaying the last eight columns of the output.
10. Create a new column called SalePriceRatio by dividing SalePrice by SalePriceMax:
df_new['SalePriceRatio'] = df_new['SalePrice'] \
                           / df_new['SalePriceMax']
11. Print out the first five rows of df_new:
df_new.head()
You should get the following output:
Caption: First five rows of df_new after feature engineering
Note that we are displaying the last eight columns of the output.
12. Perform data aggregation to find the maximum LotArea for each Neighborhood and YrSold using the .groupby().agg() methods, and save the results in a new DataFrame called df_agg2:
df_agg2 = df.groupby(['Neighborhood', 'YrSold'])\
            .agg({'LotArea': 'max'}).reset_index()
13. Rename the columns of df_agg2 to Neighborhood, YrSold, and LotAreaMax, and print out the first five rows:
df_agg2.columns = ['Neighborhood', 'YrSold', 'LotAreaMax']
df_agg2.head()
You should get the following output:
Caption: First five rows of the aggregated DataFrame
14. Merge the df_new DataFrame with df_agg2 using a left join (how='left') on the Neighborhood and YrSold columns using the .merge() method, and save the results into a new DataFrame called df_final:
df_final = pd.merge(df_new, df_agg2, how='left', \
                    on=['Neighborhood', 'YrSold'])
15. Create a new column called LotAreaRatio by dividing LotArea by LotAreaMax:
df_final['LotAreaRatio'] = df_final['LotArea'] \
                           / df_final['LotAreaMax']
16. Print out the first five rows of df_final for the following columns: Id, Neighborhood, YrSold, SalePrice, SalePriceMax, SalePriceRatio, LotArea, LotAreaMax, LotAreaRatio:
df_final[['Id', 'Neighborhood', 'YrSold', 'SalePrice', \
          'SalePriceMax', 'SalePriceRatio', 'LotArea', \
          'LotAreaMax', 'LotAreaRatio']].head()
You should get the following output:
Activity 12.01: Feature Engineering on a Financial Dataset
You are working for a major bank in the Czech Republic and you have been tasked to analyze the transactions of existing customers. The data team has extracted all the tables from their database they think will be useful for you to analyze the dataset. You will need to consolidate the data from those tables into a single DataFrame and create new features in order to get an enriched dataset from which you will be able to perform an in-depth analysis of customers' banking transactions.
You will be using only the following four tables:
- account: The characteristics of a customer's bank account for a given branch
- client: Personal information related to the bank's customers
- disp: A table that links an account to a customer
- trans: A list of all historical transactions by account
The following steps will help you complete this activity:
1. Download and load the different tables from this dataset into Python.
2. Analyze each table with the .shape and .head() methods.
3. Based on the analysis from Step 2, find the common/similar column(s) between tables that will be used for merging.
4. Merge the four tables together using pd.merge().
5. Rename the column names after merging with .rename().
6. Check there is no duplication after merging with .duplicated() and .sum().
7. Transform the data types of the date columns using .to_datetime().
8. Create two separate features from birth_number to get the date of birth and sex of each customer (one possible decoding approach is sketched after this list).
Note
This is the rule used for encoding birthday and sex in this column: the number is in the YYMMDD format for men and in the YYMM+50DD format for women, where YYMMDD is the date of birth.
9. Fix data quality issues with .isna().
10. Create a new feature that calculates customers' ages when they opened an account, using date operations.
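For step 8, here is a minimal, hedged sketch of one possible decoding, assuming the client table has been loaded into a DataFrame called client, that birth_number is a six-digit integer following the rule in the note above, and that all birth years fall in the 1900s (the names and the approach are illustrative, not the only solution):
# Zero-pad to six digits so the YY, MM, and DD parts line up
birth_str = client['birth_number'].astype(str).str.zfill(6)
month_raw = birth_str.str.slice(2, 4).astype(int)
# Women have 50 added to the month part
client['sex'] = month_raw.gt(50).map({True: 'F', False: 'M'})
true_month = month_raw.where(month_raw <= 50, month_raw - 50)
client['birth_date'] = pd.to_datetime\
    ('19' + birth_str.str.slice(0, 2) + '-' \
     + true_month.astype(str).str.zfill(2) + '-' \
     + birth_str.str.slice(4, 6))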
Expected output:
Caption: Expected output with the merged rows
Summary
We first learned how to analyze a dataset and get a very good
understanding of its data using data summarization and data
visualization. This is very useful for finding out what the limitations
of a dataset are and identifying data quality issues. We saw how to
handle and fix some of the most frequent issues (duplicate rows, type
conversion, value replacement, and missing values) using
pandas' APIs. Finally, we went through several feature engineering techniques.
The next lab opens a new part of this course that presents data science use cases end to end. Lab 13, Imbalanced Datasets, will walk you through an example of an imbalanced dataset and how to deal with such a situation.