fenago
2021-02-07 15:37:17 +05:00
parent 08de0ce85d
commit 6d1e72567a
9 changed files with 29 additions and 384 deletions
+6 -217
@@ -22,208 +22,6 @@ formats using `pandas`. You will also have had your first
taste of training a model using scikit-learn.
Introduction
============
Welcome to the fascinating world of data science! We are sure you must
be pretty excited to start your journey and learn interesting and
exciting techniques and algorithms. This is exactly what this book is
intended for.
But before diving into it, let\'s define what data science is: it is a
combination of multiple disciplines, including business, statistics, and
programming, that intends to extract meaningful insights from data by
running controlled experiments similar to scientific research.
The objective of any data science project is to derive valuable
knowledge for the business from data in order to make better decisions.
It is the responsibility of data scientists to define the goals to be
achieved for a project. This requires business knowledge and expertise.
In this book, you will be exposed to some examples of data science tasks
from real-world datasets.
Statistics is a mathematical field used for analyzing and finding
patterns from data. A lot of the newest and most advanced techniques
still rely on core statistical approaches. This book will present to you
the basic techniques required to understand the concepts we will be
covering.
With an exponential increase in data generation, more computational
power is required for processing it efficiently. This is the reason why
programming is a required skill for data scientists. You may wonder why
we chose Python for this Workshop. That\'s because Python is one of the
most popular programming languages for data science. It is extremely
easy to learn how to code in Python thanks to its simple and easily
readable syntax. It also has an incredible number of packages available
to anyone for free, such as pandas, scikit-learn, TensorFlow, and
PyTorch. Its community is expanding at an incredible rate, adding more
and more new functionalities and improving its performance and
reliability. It\'s no wonder companies such as Facebook, Airbnb, and
Google are using it as one of their main stacks. No prior knowledge of
Python is required for this book. If you do have some experience with
Python or other programming languages, then this will be an advantage,
but all concepts will be fully explained, so don\'t worry if you are new
to programming.
Application of Data Science
===========================
As mentioned in the introduction, data science is a multidisciplinary
approach to analyzing and identifying complex patterns and extracting
valuable insights from data. Running a data science project usually
involves multiple steps, including the following:
1. Defining the business problem to be solved
2. Collecting or extracting existing data
3. Analyzing, visualizing, and preparing data
4. Training a model to spot patterns in data and make predictions
5. Assessing a model\'s performance and making improvements
6. Communicating and presenting findings and gained insights
7. Deploying and maintaining a model
As its name implies, data science projects require data, but it is
actually more important to have defined a clear business problem to
solve first. If it\'s not framed correctly, a project may lead to
incorrect results as you may have used the wrong information, not
prepared the data properly, or led a model to learn the wrong patterns.
So, it is absolutely critical to properly define the scope and objective
of a data science project with your stakeholders.
There are a lot of data science applications in real-world situations or
in business environments. For example, healthcare providers may train a
model for predicting a medical outcome or its severity based on medical
measurements, or a high school may want to predict which students are at
risk of dropping out within a year\'s time based on their historical
grades and past behaviors. Corporations may be interested to know the
likelihood of a customer buying a certain product based on his or her
past purchases. They may also need to better understand which customers
are more likely to stop using existing services and churn. These are
examples where data science can be used to achieve a clearly defined
goal, such as increasing the number of patients detected with a heart
condition at an early stage or reducing the number of customers
canceling their subscriptions after six months. That sounds exciting,
right? Soon enough, you will be working on such interesting projects.
What Is Machine Learning?
-------------------------
When we mention data science, we usually think about machine learning,
and some people may not understand the difference between them. Machine
learning is the field of building algorithms that can learn patterns by
themselves without being programmed explicitly. So machine learning is a
family of techniques that can be used at the modeling stage of a data
science project.
Machine learning is composed of three different types of learning:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
### Supervised Learning
Supervised learning refers to a type of task where an algorithm is
trained to learn patterns based on prior knowledge. That means this kind
of learning requires the labeling of the outcome (also called the
response variable, dependent variable, or target variable) to be
predicted beforehand. For instance, if you want to train a model that
will predict whether a customer will cancel their subscription, you will
need a dataset with a column (or variable) that already contains the
churn outcome (cancel or not cancel) for past or existing customers.
This outcome has to be labeled by someone prior to the training of a
model. If this dataset contains 5,000 observations, then all of them
need to have the outcome populated. The objective of the model is
to learn the relationship between this outcome column and the other
features (also called independent variables or predictor variables).
Following is an example of such a dataset:
![](./images/B15019_01_01.jpg)
Caption: Example of customer churn dataset
The `Cancel` column is the response variable. This is the
column you are interested in, and you want the model to predict
accurately the outcome for new input data (in this case, new customers).
All the other columns are the predictor variables.
The model, after being trained, may find the following pattern: a
customer is more likely to cancel their subscription once they have been
subscribed for more than 12 months and their average monthly spend is
over `$50`. So, if a new customer has gone through 15 months of
subscription and is spending \$85 per month, the model will predict that
this customer will cancel their contract in the future.
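
The pattern described above can be sketched as a hand-written rule. This is only an illustration: a trained model learns such thresholds from the data rather than having them hard-coded, and the function name `predict_cancel` and its parameter names are hypothetical.

```python
def predict_cancel(months_subscribed, avg_monthly_spend):
    """Hand-written stand-in for the pattern a trained model might learn."""
    # Thresholds taken from the example in the text: more than 12 months
    # of subscription and an average monthly spend above $50.
    return months_subscribed > 12 and avg_monthly_spend > 50


# New customer from the text: 15 months of subscription, $85 per month.
print(predict_cancel(15, 85))  # True
# Hypothetical customer below both thresholds.
print(predict_cancel(6, 30))  # False
```

A real classifier would return the same kind of yes/no answer, but it would derive the rule from the labeled `Cancel` column instead of from fixed numbers.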
When the response variable contains a limited number of possible values
(or classes), it is a classification problem (you will learn more about
this in *Lab 3, Binary Classification*, and *Lab 4, Multiclass
Classification with RandomForest*). The model will learn how to predict
the right class given the values of the independent variables. The churn
example we just mentioned is a classification problem as the response
variable can only take two different values: `yes` or
`no`.
On the other hand, if the response variable can have a value from an
infinite number of possibilities, it is called a regression problem.
An example of a regression problem is where you are trying to predict
the exact number of mobile phones produced every day for some
manufacturing plants. This value can potentially range from 0 to an
infinite number (or a number big enough to have a large range of
potential values), as shown in *Figure 1.2*.
![](./images/B15019_01_02.jpg)
Caption: Example of a mobile phone production dataset
In the preceding figure, you can see that the values for
`Daily output` can take any value from `15000` to
more than `50000`. This is a regression problem, which we will
look at in *Lab 2, Regression*.
### Unsupervised Learning
Unsupervised learning is a type of algorithm that doesn\'t require any
response variables at all. In this case, the model will learn patterns
from the data by itself. You may ask what kind of pattern it can find if
there is no target specified beforehand.
This type of algorithm usually can detect similarities between variables
or records, so it will try to group those that are very close to each
other. This kind of algorithm can be used for clustering (grouping
records) or dimensionality reduction (reducing the number of variables).
Clustering is very popular for performing customer segmentation, where
the algorithm will look to group customers with similar behaviors
together from the data. *Lab 5*, *Performing Your First Cluster
Analysis*, will walk you through an example of clustering analysis.
### Reinforcement Learning
Reinforcement learning is another type of algorithm that learns how to
act in a specific environment based on the feedback it receives. You may
have seen some videos where algorithms are trained to play Atari games
by themselves. Reinforcement learning techniques are being used to teach
the agent how to act in the game based on the rewards or penalties it
receives from the game.
For instance, in the game Pong, the agent will learn to not let the ball
drop after multiple rounds of training in which it receives high
penalties every time the ball drops.
Note
Reinforcement learning algorithms are out of scope and will not be
covered in this book.
Overview of Python
@@ -243,7 +41,7 @@ Types of Variable
In Python, you can handle and manipulate different types of variables.
Each has its own specificities and benefits. We will not go through
every single one of them but rather focus on the main ones that you will
have to use in this book. For each of the following code examples, you
have to use in this course. For each of the following code examples, you
can run the code in Google Colab to view the given output.
@@ -301,15 +99,6 @@ You should get the following output:
Caption: Printing the two text variables
Python also provides a feature called f-strings for printing text
together with the values of defined variables. It is very handy when you
want to print results with additional text to make them more readable
and easier to interpret. It is also quite common to use f-strings to
print logs. You will need to add `f` before the quotes (single or
double) to specify that the text is an f-string. Then you can add an
existing variable inside the quotes and display the text with the value
of this variable. You need to wrap the variable in curly brackets,
`{}`.
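
As a minimal sketch of the syntax, the snippet below builds an f-string from two variables. The variable names and values are hypothetical and are not the ones used elsewhere in this lab.

```python
# Hypothetical values, chosen only to illustrate the syntax.
model_name = "Random Forest"
accuracy = 0.9

# Prefix the string with f and wrap each variable in curly brackets.
message = f"Model: {model_name}, accuracy: {accuracy}"
print(message)  # Model: Random Forest, accuracy: 0.9
```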
For instance, if we want to print `Text:` before the values of
`var3` and `var4`, we will write the following code:
@@ -528,13 +317,13 @@ Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorith
In this exercise, we will create a dictionary using Python that will
contain a collection of different machine learning algorithms that will
be covered in this book.
be covered in this course.
The following steps will help you complete the exercise:
Note
Every exercise and activity in this book is to be executed on Google
Every exercise and activity in this course is to be executed on Google
Colab.
1. Open a new Colab notebook.
@@ -1131,7 +920,7 @@ Caption: Predictions of the trained Random Forest model
Finally, we want to assess the model\'s performance by comparing its
predictions to the actual values of the target variable. There are a lot
of different metrics that can be used for assessing model performance,
and you will learn more about them later in this book. For now, though,
and you will learn more about them later in this course. For now, though,
we will just use a metric called **accuracy**. This metric calculates
the ratio of correct predictions to the total number of observations:
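
To make the ratio concrete, here is a small sketch in plain Python. The helper name `accuracy` and the sample labels are hypothetical; scikit-learn offers `accuracy_score` in `sklearn.metrics` for the same calculation.

```python
def accuracy(predictions, actuals):
    """Return the share of predictions that match the actual values."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    # Count the positions where the prediction equals the actual value.
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


# Hypothetical predictions and true labels for five observations.
preds = ["yes", "no", "no", "yes", "no"]
actual = ["yes", "no", "yes", "yes", "no"]
print(accuracy(preds, actual))  # 0.8
```

Four of the five predictions match the actual values, which gives an accuracy of 0.8.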
@@ -1350,7 +1139,7 @@ general. We also learned the different types of machine learning
algorithms, including supervised and unsupervised, as well as regression
and classification. We had a quick introduction to Python and how to
manipulate the main data structures (lists and dictionaries) that will
be used in this book.
be used in this course.
Then we walked you through what a DataFrame is and how to create one by
loading data from different file formats using the famous pandas
@@ -1358,7 +1147,7 @@ package. Finally, we learned how to use the sklearn package to train a
machine learning model and make predictions with it.
This was just a quick glimpse into the fascinating world of data
science. In this book, you will learn much more and discover new
science. In this course, you will learn much more and discover new
techniques for handling data science projects from end to end.
The next lab will show you how to perform a regression task on a
+1 -1
@@ -1743,7 +1743,7 @@ handle and fix some of the most frequent issues (duplicate rows, type
conversion, value replacement, and missing values) using
`pandas`\' APIs. Finally, we went through several feature engineering techniques.
The next lab opens a new part of this book that presents data
The next lab opens a new part of this course that presents data
science use cases end to end. *Lab 13*, *Imbalanced Datasets*, will
walk you through an example of an imbalanced dataset and how to deal
with such a situation.
+3 -3
@@ -55,7 +55,7 @@ objectives of data science, a flexible programming language that
effectively combines interactivity with computing power and speed is
necessary. This is where the Python programming language meets the needs
of data science and, as mentioned in *Lab 1*, *Introduction to Data
Science in Python*, we will be using Python in this book.
Science in Python*, we will be using Python in this course.
The need to develop models to make predictions and to gain insights for
decision-making cuts across many industries. Data science is, therefore,
@@ -275,7 +275,7 @@ variable of interest. What happens then is that the transformed variable
tends to have a linear relationship with the other untransformed
variables, enabling the use of linear regression to fit the data. This
will be illustrated in practice on the dataset being analyzed later in
the exercises of the book.
the exercises of the course.
@@ -559,7 +559,7 @@ The following steps will help you to complete this exercise:
model that learns the relationships between the variables), and the
rest of the data to test your model (that is, to see how well your
new model can make predictions when given new data). You will use
train-test splits throughout this book, and the concept will be
train-test splits throughout this course, and the concept will be
explained in more detail in *Lab 7, The Generalization Of
Machine Learning Models*.
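
The idea of a train-test split can be sketched in plain Python as follows. This is an illustrative stand-in, not the code used in the exercise: the helper `split_train_test` and its parameters are hypothetical, and in practice scikit-learn's `train_test_split` from `sklearn.model_selection` does this job.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = list(rows)  # copy so the caller's data is left untouched
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return rows[n_test:], rows[:n_test]


data = list(range(10))  # hypothetical dataset of 10 observations
train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters because many datasets are stored in a meaningful order, and a model tested only on the last rows may be evaluated on an unrepresentative sample.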
-87
@@ -19,54 +19,6 @@ regression function and analyze classification metrics and formulate
action plans for the improvement of the model.
Introduction
============
In previous labs, where an introduction to machine learning was
covered, you were introduced to two broad categories of machine
learning: supervised learning and unsupervised learning. Supervised
learning can be further divided into two types of problem cases,
regression and classification. In the last lab, we covered
regression problems. In this lab, we will peek into the world of
classification problems.
Take a look at the following *Figure 3.1*:
![](./images/B15019_03_01.jpg)
Caption: Overview of machine learning algorithms
Classification problems are the most prevalent use cases you will
encounter in the real world. Unlike regression problems, where a real
numbered value is predicted, classification problems deal with
associating an example to a category. Classification use cases will take
forms such as the following:
- Predicting whether a customer will buy the recommended product
- Identifying whether a credit transaction is fraudulent
- Determining whether a patient has a disease
- Analyzing images of animals and predicting whether the image is of a
dog, cat, or panda
- Analyzing text reviews and capturing the underlying emotion such as
happiness, anger, sorrow, or sarcasm
If you observe the preceding examples, there is a subtle difference
between the first three and the last two. The first three revolve around
binary decisions:
- Customers can either buy the product or not.
- Credit card transactions can be fraudulent or legitimate.
- Patients can be diagnosed as positive or negative for a disease.
Use cases that align with the preceding three genres where a binary
decision is made are called binary classification problems. Unlike the
first three, the last two associate an example with multiple classes or
categories. Such problems are called multiclass classification problems.
This lab will deal with binary classification problems. Multiclass
classification will be covered next in *Lab 4*, *Multiclass
Classification with RandomForest*.
Understanding the Business Context
==================================
@@ -91,45 +43,6 @@ buy term deposits.
Business Discovery
------------------
The first process when embarking on a data science problem like the
preceding is the business discovery process. This entails understanding
various drivers influencing the business problem. Getting to know the
business drivers is important as it will help in formulating hypotheses
about the business problem, which can be verified during the
**exploratory data analysis** (**EDA**). The verification of hypotheses
will help in formulating intuitions for feature engineering, which will
be critical for the veracity of the models that we build.
Let\'s understand this process in detail from the context of our use
case. The problem statement is to identify those customers who have a
propensity to buy term deposits. As you might be aware, term deposits
are bank instruments where your money will be locked for a certain
period, assuring higher interest rates than saving accounts or
interest-bearing checking accounts. From an investment propensity
perspective, term deposits are generally popular among risk-averse
customers. Equipped with the business context, let\'s look at some
questions on business factors influencing a propensity to buy term
deposits:
- Would age be a factor, with more propensity shown by the elderly?
- Is there any relationship between employment status and the
propensity to buy term deposits?
- Would the asset portfolio of a customer---that is, house, loan, or
higher bank balance---influence the propensity to buy?
- Will demographics such as marital status and education influence the
propensity to buy term deposits? If so, how are demographics
correlated to a propensity to buy?
Formulating questions on the business context is critical as this will
help in arriving at various trails that we can take when we do
exploratory analysis. We will deal with that in the next section. First,
let\'s explore the data related to the preceding business problem.
Exercise 3.01: Loading and Exploring the Data from the Dataset
--------------------------------------------------------------
-11
@@ -1965,17 +1965,6 @@ You should get the following output:
Caption: Scatter plot of the standardized data
k-means results are very different from the standardized data. Now we
can see that there are two main clusters (blue and red) and their
boundaries are not straight vertical lines anymore but diagonal. So,
k-means is actually taking into consideration both axes now. The orange
cluster contains much fewer data points compared to previous iterations,
and it seems it is grouping all the extreme outliers with high values
together. If your project was about detecting anomalies, you would have
found a way here to easily separate outliers from \"normal\"
observations.
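
The reason standardization changes the clusters is that k-means relies on distances, so a variable with a much larger scale dominates the result. The sketch below standardizes two columns with plain Python; the column names and values are hypothetical, and the helper `standardize` is a stand-in for what a scaler such as scikit-learn's `StandardScaler` does.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to have a mean of 0 and a standard deviation of 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical columns on very different scales.
income = [20000, 40000, 60000, 80000]
age = [25, 35, 45, 55]

z_income = standardize(income)
z_age = standardize(age)

# After standardization both columns span the same range,
# so neither axis dominates a distance calculation.
print([round(z, 2) for z in z_income])  # [-1.34, -0.45, 0.45, 1.34]
print([round(z, 2) for z in z_age])  # [-1.34, -0.45, 0.45, 1.34]
```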
Exercise 5.06: Standardizing the Data from Our Dataset
------------------------------------------------------
+1 -1
@@ -290,7 +290,7 @@ that give rise to different loss functions. Two of these are:
- Manhattan distance
- Euclidean distance
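
The two distances listed above can be sketched in a few lines of Python. The function names and the sample points are hypothetical and chosen only for illustration.

```python
import math


def manhattan_distance(a, b):
    """Sum of the absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the sum of squared differences (straight-line distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


p, q = (0, 0), (3, 4)
print(manhattan_distance(p, q))  # 7
print(euclidean_distance(p, q))  # 5.0
```

Because the Euclidean distance squares each difference, it penalizes one large difference more heavily than several small ones, while the Manhattan distance treats every unit of difference equally.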
There are various loss functions for regression, but in this book, we
There are various loss functions for regression, but in this course, we
will be looking at two of the commonly used loss functions for
regression, which are:
+1 -1
@@ -1125,7 +1125,7 @@ Distributions of continuous random variables are a bit more challenging
in that we cannot calculate an exact `P(X=x)` directly because
`X` lies on a continuum. We can, however, use integration to
approximate probabilities between a range of values, but this is beyond
the scope of this book. The relationship between `X` and
the scope of this course. The relationship between `X` and
probability is described using a probability density function,
`P(X)`. Perhaps the most well-known continuous distribution is
the normal distribution, which visually takes the form of a bell.
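
As a small illustration of the difference between a density and a probability, the sketch below uses `NormalDist` from Python's standard `statistics` module. The choice of a standard normal distribution (mean 0, standard deviation 1) is an assumption made for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# The density at a single point is not a probability.
density_at_zero = standard_normal.pdf(0)

# The probability of falling within a range is the area under the
# density curve, obtained here as a difference of cumulative values.
prob_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(round(density_at_zero, 4))  # 0.3989
print(round(prob_within_one_sd, 4))  # 0.6827
```

The cumulative distribution function performs the integration, which is why a probability can be read off for a range of values but not for a single exact value.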
-46
@@ -15,52 +15,6 @@ importance. You will use a partial dependence plot to analyze single
variables and make use of the lime package for local interpretation.
Introduction
============
In the previous lab, you saw how to find the optimal hyperparameters
of some of the most popular machine learning algorithms in order to get
better predictive performance (that is, more accurate predictions).
Machine learning algorithms are often referred to as black boxes: we
can only see the inputs and outputs, while the inner workings of the
algorithm are quite opaque, so people don\'t know what is happening
inside.
With each day that passes, we can sense the elevated need for more
transparency in machine learning models. In the last few years, we have
seen some cases where algorithms have been accused of discriminating
against groups of people. For instance, a few years ago, a
not-for-profit news organization called ProPublica highlighted bias in
the COMPAS algorithm, built by the Northpointe company. The objective of
the algorithm is to assess the likelihood of re-offending for a
criminal. It was shown that the algorithm was predicting a higher level
of risk for specific groups of people based on their demographics rather
than other features. This example highlighted the importance of
interpreting the results of your model and its logic properly and
clearly.
Luckily, some machine learning algorithms provide methods to understand
the parameters they learned for a given task and dataset. There are also
some functions that are model-agnostic and can help us to better
understand the predictions made. So, there are different techniques that
are either model-specific or model-agnostic for interpreting a model.
These techniques can also differ in their scope. In the literature, we
either have a global or local interpretation. A global interpretation
means we are looking at the variables for all observations from a
dataset and we want to understand which features have the biggest
overall influence on the target variable. For instance, if you are
predicting customer churn for a telco company, you may find the most
important features for your model are customer usage and the average
monthly amount paid. Local interpretation, on the other hand, focuses
only on a single observation and analyzes the impact of the different
variables. We will look at a single specific case and see what led the
model to make its final prediction. For example, you will look at a
specific customer who is predicted to churn and will discover that they
usually buy the new iPhone model every year, in September.
In this lab, we will go through some techniques on how to interpret
your models or their results.
+17 -17
@@ -26,35 +26,35 @@ Gain expert guidance on how to successfully develop machine learning models in P
Labs for this course are available at endpoints shared below. Update `<host-ip>` with the lab environment DNS.
1. ##### Introduction to Data Science in Python
* http://<host-ip>/lab/workspaces/lab1_
* http://<host-ip>/lab/workspaces/lab1_Introduction
2. ##### Regression
* http://<host-ip>/lab/workspaces/lab2_
* http://<host-ip>/lab/workspaces/lab2_Regression
3. ##### Binary Classification
* http://<host-ip>/lab/workspaces/lab3_
* http://<host-ip>/lab/workspaces/lab3_Classification
4. ##### Multiclass Classification with RandomForest
* http://<host-ip>/lab/workspaces/lab4_
* http://<host-ip>/lab/workspaces/lab4_RandomForest
5. ##### Performing Your First Cluster Analysis
* http://<host-ip>/lab/workspaces/lab5_
* http://<host-ip>/lab/workspaces/lab5_Analysis
6. ##### How to Assess Performance
* http://<host-ip>/lab/workspaces/lab6_
* http://<host-ip>/lab/workspaces/lab6_Performance
7. ##### The Generalization of Machine Learning Models
* http://<host-ip>/lab/workspaces/lab7_
* http://<host-ip>/lab/workspaces/lab7_Models
8. ##### Hyperparameter Tuning
* http://<host-ip>/lab/workspaces/lab8_
* http://<host-ip>/lab/workspaces/lab8_Hyperparameter
9. ##### Interpreting a Machine Learning Model
* http://<host-ip>/lab/workspaces/lab9_
* http://<host-ip>/lab/workspaces/lab9_ML
10. ##### Analyzing a Dataset
* http://<host-ip>/lab/workspaces/lab10_
* http://<host-ip>/lab/workspaces/lab10_Dataset
11. ##### Data Preparation
* http://<host-ip>/lab/workspaces/lab11_
* http://<host-ip>/lab/workspaces/lab11_Data
12. ##### Feature Engineering
* http://<host-ip>/lab/workspaces/lab12_
* http://<host-ip>/lab/workspaces/lab12_Feature
13. ##### Imbalanced Datasets
* http://<host-ip>/lab/workspaces/lab13_
* http://<host-ip>/lab/workspaces/lab13_Imbalanced
14. ##### Dimensionality Reduction
* http://<host-ip>/lab/workspaces/lab14_
* http://<host-ip>/lab/workspaces/lab14_Dimensionality
15. ##### Ensemble Learning
* http://<host-ip>/lab/workspaces/lab15_
* http://<host-ip>/lab/workspaces/lab15_Ensemble
### About
@@ -62,6 +62,6 @@ Where there's data, there's insight. With so much data being generated, ther
The course begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You'll learn to select a relevant metric and even assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you'll get hands-on with approaches such as grid search and random search.
Next, you'll learn dimensionality reduction techniques to easily handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. In a bid to help you automatically create new features that improve your model, the book demonstrates how to use the automated feature engineering tool. You'll also understand how to use the orchestration and scheduling workflow to deploy machine learning models in batch.
Next, you'll learn dimensionality reduction techniques to easily handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. In a bid to help you automatically create new features that improve your model, the course demonstrates how to use the automated feature engineering tool. You'll also understand how to use the orchestration and scheduling workflow to deploy machine learning models in batch.
By the end of this book, you'll have the skills to start working on data science projects confidently.
By the end of this course, you'll have the skills to start working on data science projects confidently.