added

2026-05-29 12:50:51 +00:00 · 2021-02-07 15:37:17 +05:00
parent 08de0ce85d
commit 6d1e72567a
9 changed files with 29 additions and 384 deletions
@@ -22,208 +22,6 @@ formats using `pandas`. You will also have had your first
 taste of training a model using scikit-learn.


-Introduction
-============
-
-
-Welcome to the fascinating world of data science! We are sure you must
-be pretty excited to start your journey and learn interesting and
-exciting techniques and algorithms. This is exactly what this book is
-intended for.
-
-But before diving into it, let\'s define what data science is: it is a
-combination of multiple disciplines, including business, statistics, and
-programming, that intends to extract meaningful insights from data by
-running controlled experiments similar to scientific research.
-
-The objective of any data science project is to derive valuable
-knowledge for the business from data in order to make better decisions.
-It is the responsibility of data scientists to define the goals to be
-achieved for a project. This requires business knowledge and expertise.
-In this book, you will be exposed to some examples of data science tasks
-from real-world datasets.
-
-Statistics is a mathematical field used for analyzing and finding
-patterns from data. A lot of the newest and most advanced techniques
-still rely on core statistical approaches. This book will present to you
-the basic techniques required to understand the concepts we will be
-covering.
-
-With an exponential increase in data generation, more computational
-power is required for processing it efficiently. This is the reason why
-programming is a required skill for data scientists. You may wonder why
-we chose Python for this Workshop. That\'s because Python is one of the
-most popular programming languages for data science. It is extremely
-easy to learn how to code in Python thanks to its simple and easily
-readable syntax. It also has an incredible number of packages available
-to anyone for free, such as pandas, scikit-learn, TensorFlow, and
-PyTorch. Its community is expanding at an incredible rate, adding more
-and more new functionalities and improving its performance and
-reliability. It\'s no wonder companies such as Facebook, Airbnb, and
-Google are using it as one of their main stacks. No prior knowledge of
-Python is required for this book. If you do have some experience with
-Python or other programming languages, then this will be an advantage,
-but all concepts will be fully explained, so don\'t worry if you are new
-to programming.
-
-
-Application of Data Science
-===========================
-
-
-As mentioned in the introduction, data science is a multidisciplinary
-approach to analyzing and identifying complex patterns and extracting
-valuable insights from data. Running a data science project usually
-involves multiple steps, including the following:
-
-1.  Defining the business problem to be solved
-2.  Collecting or extracting existing data
-3.  Analyzing, visualizing, and preparing data
-4.  Training a model to spot patterns in data and make predictions
-5.  Assessing a model\'s performance and making improvements
-6.  Communicating and presenting findings and gained insights
-7.  Deploying and maintaining a model
-
-As its name implies, data science projects require data, but it is
-actually more important to have defined a clear business problem to
-solve first. If it\'s not framed correctly, a project may lead to
-incorrect results as you may have used the wrong information, not
-prepared the data properly, or led a model to learn the wrong patterns.
-So, it is absolutely critical to properly define the scope and objective
-of a data science project with your stakeholders.
-
-There are a lot of data science applications in real-world situations or
-in business environments. For example, healthcare providers may train a
-model for predicting a medical outcome or its severity based on medical
-measurements, or a high school may want to predict which students are at
-risk of dropping out within a year\'s time based on their historical
-grades and past behaviors. Corporations may be interested to know the
-likelihood of a customer buying a certain product based on his or her
-past purchases. They may also need to better understand which customers
-are more likely to stop using existing services and churn. These are
-examples where data science can be used to achieve a clearly defined
-goal, such as increasing the number of patients detected with a heart
-condition at an early stage or reducing the number of customers
-canceling their subscriptions after six months. That sounds exciting,
-right? Soon enough, you will be working on such interesting projects.
-
-
-
-What Is Machine Learning?
-------------------------
-
-When we mention data science, we usually think about machine learning,
-and some people may not understand the difference between them. Machine
-learning is the field of building algorithms that can learn patterns by
-themselves without being programmed explicitly. So machine learning is a
-family of techniques that can be used at the modeling stage of a data
-science project.
-
-Machine learning is composed of three different types of learning:
-
- Supervised learning
- Unsupervised learning
- Reinforcement learning
-
-
-
-### Supervised Learning
-
-Supervised learning refers to a type of task where an algorithm is
-trained to learn patterns based on prior knowledge. That means this kind
-of learning requires the labeling of the outcome (also called the
-response variable, dependent variable, or target variable) to be
-predicted beforehand. For instance, if you want to train a model that
-will predict whether a customer will cancel their subscription, you will
-need a dataset with a column (or variable) that already contains the
-churn outcome (cancel or not cancel) for past or existing customers.
-This outcome has to be labeled by someone prior to the training of a
-model. If this dataset contains 5,000 observations, then all of them
-need to have the outcome being populated. The objective of the model is
-to learn the relationship between this outcome column and the other
-features (also called independent variables or predictor variables).
-Following is an example of such a dataset:
-
-![](./images/B15019_01_01.jpg)
-
-Caption: Example of customer churn dataset
-
-The `Cancel` column is the response variable. This is the
-column you are interested in, and you want the model to predict
-accurately the outcome for new input data (in this case, new customers).
-All the other columns are the predictor variables.
-
-The model, after being trained, may find the following pattern: a
-customer is more likely to cancel their subscription after 12 months and
-if their average monthly spent is over `$50`. So, if a new
-customer has gone through 15 months of subscription and is spending \$85
-per month, the model will predict this customer will cancel their
-contract in the future.
-
-When the response variable contains a limited number of possible values
-(or classes), it is a classification problem (you will learn more about
-this in *Lab 3, Binary Classification*, and *Lab 4, Multiclass
-Classification with RandomForest*). The model will learn how to predict
-the right class given the values of the independent variables. The churn
-example we just mentioned is a classification problem as the response
-variable can only take two different values: `yes` or
-`no`.
-
-On the other hand, if the response variable can have a value from an
-infinite number of possibilities, it is called a regression problem.
-
-An example of a regression problem is where you are trying to predict
-the exact number of mobile phones produced every day for some
-manufacturing plants. This value can potentially range from 0 to an
-infinite number (or a number big enough to have a large range of
-potential values), as shown in *Figure 1.2*.
-
-![](./images/B15019_01_02.jpg)
-
-Caption: Example of a mobile phone production dataset
-
-In the preceding figure, you can see that the values for
-`Daily output` can take any value from `15000` to
-more than `50000`. This is a regression problem, which we will
-look at in *Lab 2, Regression*.
-
-
-
-### Unsupervised Learning
-
-Unsupervised learning is a type of algorithm that doesn\'t require any
-response variables at all. In this case, the model will learn patterns
-from the data by itself. You may ask what kind of pattern it can find if
-there is no target specified beforehand.
-
-This type of algorithm usually can detect similarities between variables
-or records, so it will try to group those that are very close to each
-other. This kind of algorithm can be used for clustering (grouping
-records) or dimensionality reduction (reducing the number of variables).
-Clustering is very popular for performing customer segmentation, where
-the algorithm will look to group customers with similar behaviors
-together from the data. *Lab 5*, *Performing Your First Cluster
-Analysis*, will walk you through an example of clustering analysis.
-
-
-
-### Reinforcement Learning
-
-Reinforcement learning is another type of algorithm that learns how to
-act in a specific environment based on the feedback it receives. You may
-have seen some videos where algorithms are trained to play Atari games
-by themselves. Reinforcement learning techniques are being used to teach
-the agent how to act in the game based on the rewards or penalties it
-receives from the game.
-
-For instance, in the game Pong, the agent will learn to not let the ball
-drop after multiple rounds of training in which it receives high
-penalties every time the ball drops.
-
-Note
-
-Reinforcement learning algorithms are out of scope and will not be
-covered in this book.


 Overview of Python
@@ -243,7 +41,7 @@ Types of Variable
 In Python, you can handle and manipulate different types of variables.
 Each has its own specificities and benefits. We will not go through
 every single one of them but rather focus on the main ones that you will
-have to use in this book. For each of the following code examples, you
+have to use in this course. For each of the following code examples, you
 can run the code in Google Colab to view the given output.


@@ -301,15 +99,6 @@ You should get the following output:

 Caption: Printing the two text variables

-Python also provides an interface called f-strings for printing text
-with the value of defined variables. It is very handy when you want to
-print results with additional text to make it more readable and
-interpret results. It is also quite common to use f-strings to print
-logs. You will need to add `f` before the quotes (or double
-quotes) to specify that the text will be an f-string. Then you can add
-an existing variable inside the quotes and display the text with the
-value of this variable. You need to wrap the variable with curly
-brackets, `{}`.

 For instance, if we want to print `Text:` before the values of
 `var3` and `var4`, we will write the following code:
@@ -528,13 +317,13 @@ Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorith

 In this exercise, we will create a dictionary using Python that will
 contain a collection of different machine learning algorithms that will
-be covered in this book.
+be covered in this course.

 The following steps will help you complete the exercise:

 Note

-Every exercise and activity in this book is to be executed on Google
+Every exercise and activity in this course is to be executed on Google
 Colab.

 1.  Open on a new Colab notebook.
@@ -1131,7 +920,7 @@ Caption: Predictions of the trained Random Forest model
 Finally, we want to assess the model\'s performance by comparing its
 predictions to the actual values of the target variable. There are a lot
 of different metrics that can be used for assessing model performance,
-and you will learn more about them later in this book. For now, though,
+and you will learn more about them later in this course. For now, though,
 we will just use a metric called **accuracy**. This metric calculates
 the ratio of correct predictions to the total number of observations:

@@ -1350,7 +1139,7 @@ general. We also learned the different types of machine learning
 algorithms, including supervised and unsupervised, as well as regression
 and classification. We had a quick introduction to Python and how to
 manipulate the main data structures (lists and dictionaries) that will
-be used in this book.
+be used in this course.

 Then we walked you through what a DataFrame is and how to create one by
 loading data from different file formats using the famous pandas
@@ -1358,7 +1147,7 @@ package. Finally, we learned how to use the sklearn package to train a
 machine learning model and make predictions with it.

 This was just a quick glimpse into the fascinating world of data
-science. In this book, you will learn much more and discover new
+science. In this course, you will learn much more and discover new
 techniques for handling data science projects from end to end.

 The next lab will show you how to perform a regression task on a
@@ -1743,7 +1743,7 @@ handle and fix some of the most frequent issues (duplicate rows, type
 conversion, value replacement, and missing values) using
 `pandas`\' APIs. Finally, we went through several feature engineering techniques.

-The next lab opens a new part of this book that presents data
+The next lab opens a new part of this course that presents data
 science use cases end to end. *Lab 13*, *Imbalanced Datasets*, will
 walk you through an example of an imbalanced dataset and how to deal
 with such a situation.
@@ -55,7 +55,7 @@ objectives of data science, a flexible programming language that
 effectively combines interactivity with computing power and speed is
 necessary. This is where the Python programming language meets the needs
 of data science and, as mentioned in *Lab 1*, *Introduction to Data
-Science in Python*, we will be using Python in this book.
+Science in Python*, we will be using Python in this course.

 The need to develop models to make predictions and to gain insights for
 decisionmaking cuts across many industries. Data science is, therefore,
@@ -275,7 +275,7 @@ variable of interest. What happens then is that the transformed variable
 tends to have a linear relationship with the other untransformed
 variables, enabling the use of linear regression to fit the data. This
 will be illustrated in practice on the dataset being analyzed later in
-the exercises of the book.
+the exercises of the course.



@@ -559,7 +559,7 @@ The following steps will help you to complete this exercise:
    model that learns the relationships between the variables), and the
    rest of the data to test your model (that is, to see how well your
    new model can make predictions when given new data). You will use
-    train-test splits throughout this book, and the concept will be
+    train-test splits throughout this course, and the concept will be
    explained in more detail in *Lab 7, The Generalization Of
    Machine Learning Models*.

@@ -19,54 +19,6 @@ regression function and analyze classification metrics and formulate
 action plans for the improvement of the model.


-Introduction
-============
-
-
-In previous labs, where an introduction to machine learning was
-covered, you were introduced to two broad categories of machine
-learning; supervised learning and unsupervised learning. Supervised
-learning can be further divided into two types of problem cases,
-regression and classification. In the last lab, we covered
-regression problems. In this lab, we will peek into the world of
-classification problems.
-
-Take a look at the following *Figure 3.1*:
-
-![](./images/B15019_03_01.jpg)
-
-Caption: Overview of machine learning algorithms
-
-Classification problems are the most prevalent use cases you will
-encounter in the real world. Unlike regression problems, where a real
-numbered value is predicted, classification problems deal with
-associating an example to a category. Classification use cases will take
-forms such as the following:
-
- Predicting whether a customer will buy the recommended product
- Identifying whether a credit transaction is fraudulent
- Determining whether a patient has a disease
- Analyzing images of animals and predicting whether the image is of a
-    dog, cat, or panda
- Analyzing text reviews and capturing the underlying emotion such as
-    happiness, anger, sorrow, or sarcasm
-
-If you observe the preceding examples, there is a subtle difference
-between the first three and the last two. The first three revolve around
-binary decisions:
-
- Customers can either buy the product or not.
- Credit card transactions can be fraudulent or legitimate.
- Patients can be diagnosed as positive or negative for a disease.
-
-Use cases that align with the preceding three genres where a binary
-decision is made are called binary classification problems. Unlike the
-first three, the last two associate an example with multiple classes or
-categories. Such problems are called multiclass classification problems.
-This lab will deal with binary classification problems. Multiclass
-classification will be covered next in *Lab 4*, *Multiclass
-Classification with RandomForest*.
-

 Understanding the Business Context
 ==================================
@@ -91,45 +43,6 @@ buy term deposits.



-Business Discovery
------------------
-
-The first process when embarking on a data science problem like the
-preceding is the business discovery process. This entails understanding
-various drivers influencing the business problem. Getting to know the
-business drivers is important as it will help in formulating hypotheses
-about the business problem, which can be verified during the
-**exploratory data analysis** (**EDA**). The verification of hypotheses
-will help in formulating intuitions for feature engineering, which will
-be critical for the veracity of the models that we build.
-
-Let\'s understand this process in detail from the context of our use
-case. The problem statement is to identify those customers who have a
-propensity to buy term deposits. As you might be aware, term deposits
-are bank instruments where your money will be locked for a certain
-period, assuring higher interest rates than saving accounts or
-interest-bearing checking accounts. From an investment propensity
-perspective, term deposits are generally popular among risk-averse
-customers. Equipped with the business context, let\'s look at some
-questions on business factors influencing a propensity to buy term
-deposits:
-
- Would age be a factor, with more propensity shown by the elderly?
- Is there any relationship between employment status and the
-    propensity to buy term deposits?
- Would the asset portfolio of a customer---that is, house, loan, or
-    higher bank balance---influence the propensity to buy?
- Will demographics such as marital status and education influence the
-    propensity to buy term deposits? If so, how are demographics
-    correlated to a propensity to buy?
-
-Formulating questions on the business context is critical as this will
-help in arriving at various trails that we can take when we do
-exploratory analysis. We will deal with that in the next section. First,
-let\'s explore the data related to the preceding business problem.
-
-
-
 Exercise 3.01: Loading and Exploring the Data from the Dataset
 --------------------------------------------------------------

@@ -1965,17 +1965,6 @@ You should get the following output:

 Caption: Scatter plot of the standardized data

-k-means results are very different from the standardized data. Now we
-can see that there are two main clusters (blue and red) and their
-boundaries are not straight vertical lines anymore but diagonal. So,
-k-means is actually taking into consideration both axes now. The orange
-cluster contains much fewer data points compared to previous iterations,
-and it seems it is grouping all the extreme outliers with high values
-together. If your project was about detecting anomalies, you would have
-found a way here to easily separate outliers from \"normal\"
-observations.
-
-

 Exercise 5.06: Standardizing the Data from Our Dataset
 ------------------------------------------------------
@@ -290,7 +290,7 @@ that give rise to different loss functions. Two of these are:
 - Manhattan distance
 - Euclidean distance

-There are various loss functions for regression, but in this book, we
+There are various loss functions for regression, but in this course, we
 will be looking at two of the commonly used loss functions for
 regression, which are:

@@ -1125,7 +1125,7 @@ Distributions of continuous random variables are a bit more challenging
 in that we cannot calculate an exact `P(X=x)` directly because
 `X` lies on a continuum. We can, however, use integration to
 approximate probabilities between a range of values, but this is beyond
-the scope of this book. The relationship between `X` and
+the scope of this course. The relationship between `X` and
 probability is described using a probability density function,
 `P(X)`. Perhaps the most well-known continuous distribution is
 the normal distribution, which visually takes the form of a bell.
@@ -15,52 +15,6 @@ importance. You will use a partial dependence plot to analyze single
 variables and make use of the lime package for local interpretation.


-Introduction
-============
-
-
-In the previous lab, you saw how to find the optimal hyperparameters
-of some of the most popular machine learning algorithms in order to get
-better predictive performance (that is, more accurate predictions).
-
-Machine learning algorithms are always referred to as black box where we
-can only see the inputs and outputs and the implementation inside the
-algorithm is quite opaque, so people don\'t know what is happening
-inside.
-
-With each day that passes, we can sense the elevated need for more
-transparency in machine learning models. In the last few years, we have
-seen some cases where algorithms have been accused of discriminating
-against groups of people. For instance, a few years ago, a
-not-for-profit news organization called ProPublica highlighted bias in
-the COMPAS algorithm, built by the Northpointe company. The objective of
-the algorithm is to assess the likelihood of re-offending for a
-criminal. It was shown that the algorithm was predicting a higher level
-of risk for specific groups of people based on their demographics rather
-than other features. This example highlighted the importance of
-interpreting the results of your model and its logic properly and
-clearly.
-
-Luckily, some machine learning algorithms provide methods to understand
-the parameters they learned for a given task and dataset. There are also
-some functions that are model-agnostic and can help us to better
-understand the predictions made. So, there are different techniques that
-are either model-specific or model-agnostic for interpreting a model.
-
-These techniques can also differ in their scope. In the literature, we
-either have a global or local interpretation. A global interpretation
-means we are looking at the variables for all observations from a
-dataset and we want to understand which features have the biggest
-overall influence on the target variable. For instance, if you are
-predicting customer churn for a telco company, you may find the most
-important features for your model are customer usage and the average
-monthly amount paid. Local interpretation, on the other hand, focuses
-only on a single observation and analyzes the impact of the different
-variables. We will look at a single specific case and see what led the
-model to make its final prediction. For example, you will look at a
-specific customer who is predicted to churn and will discover that they
-usually buy the new iPhone model every year, in September.
-
 In this lab, we will go through some techniques on how to interpret
 your models or their results.

@@ -26,35 +26,35 @@ Gain expert guidance on how to successfully develop machine learning models in P
 Labs for this course are available at endpoints shared below. Update `<host-ip>` with the lab environment DNS.

 1. ##### Introduction to Data Science in Python
-		* http://<host-ip>/lab/workspaces/lab1_
+		* http://<host-ip>/lab/workspaces/lab1_Introduction
 2. ##### Regression
-		* http://<host-ip>/lab/workspaces/lab2_
+		* http://<host-ip>/lab/workspaces/lab2_Regression
 3. ##### Binary Classification
-		* http://<host-ip>/lab/workspaces/lab3_
+		* http://<host-ip>/lab/workspaces/lab3_Classification
 4. ##### Multiclass Classification with RandomForest
-		* http://<host-ip>/lab/workspaces/lab4_
+		* http://<host-ip>/lab/workspaces/lab4_RandomForest
 5. ##### Performing Your First Cluster Analysis
-		* http://<host-ip>/lab/workspaces/lab5_
+		* http://<host-ip>/lab/workspaces/lab5_Analysis
 6. ##### How to Assess Performance
-		* http://<host-ip>/lab/workspaces/lab6_
+		* http://<host-ip>/lab/workspaces/lab6_Performance
 7. ##### The Generalization of Machine Learning Models
-		* http://<host-ip>/lab/workspaces/lab7_
+		* http://<host-ip>/lab/workspaces/lab7_Models
 8. ##### Hyperparameter Tuning
-		* http://<host-ip>/lab/workspaces/lab8_
+		* http://<host-ip>/lab/workspaces/lab8_Hyperparameter
 9. ##### Interpreting a Machine Learning Model
-		* http://<host-ip>/lab/workspaces/lab9_
+		* http://<host-ip>/lab/workspaces/lab9_ML
 10. ##### Analyzing a Dataset
-		* http://<host-ip>/lab/workspaces/lab10_
+		* http://<host-ip>/lab/workspaces/lab10_Dataset
 11. ##### Data Preparation
-		* http://<host-ip>/lab/workspaces/lab11_
+		* http://<host-ip>/lab/workspaces/lab11_Data
 12. ##### Feature Engineering
-		* http://<host-ip>/lab/workspaces/lab12_
+		* http://<host-ip>/lab/workspaces/lab12_Feature
 13. ##### Imbalanced Datasets
-		* http://<host-ip>/lab/workspaces/lab13_
+		* http://<host-ip>/lab/workspaces/lab13_Imbalanced
 14. ##### Dimensionality Reduction
-		* http://<host-ip>/lab/workspaces/lab14_
+		* http://<host-ip>/lab/workspaces/lab14_Dimensionality
 15. ##### Ensemble Learning
-		* http://<host-ip>/lab/workspaces/lab15_
+		* http://<host-ip>/lab/workspaces/lab15_Ensemble

 ### About

@@ -62,6 +62,6 @@ Where there’s data, there’s insight. With so much data being generated, ther

 The course begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You’ll learn to select a relevant metric and even assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you’ll get hands-on with approaches such as grid search and random search.

-Next, you’ll learn dimensionality reduction techniques to easily handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. In a bid to help you automatically create new features that improve your model, the book demonstrates how to use the automated feature engineering tool. You’ll also understand how to use the orchestration and scheduling workflow to deploy machine learning models in batch.
+Next, you’ll learn dimensionality reduction techniques to easily handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. In a bid to help you automatically create new features that improve your model, the course demonstrates how to use the automated feature engineering tool. You’ll also understand how to use the orchestration and scheduling workflow to deploy machine learning models in batch.

-By the end of this book, you’ll have the skills to start working on data science projects confidently. By the end of this book, you’ll have the skills to start working on data science projects confidently.
+By the end of this course, you’ll have the skills to start working on data science projects confidently. By the end of this course, you’ll have the skills to start working on data science projects confidently.