From e8f9b371e124cfe1b2f90a321ef572b1d6e374ec Mon Sep 17 00:00:00 2001 From: fenago Date: Mon, 8 Feb 2021 20:53:32 +0500 Subject: [PATCH] added --- lab_guides/Lab_11.md | 48 +------- lab_guides/Lab_12.md | 23 ---- lab_guides/Lab_13.md | 10 +- lab_guides/Lab_14.md | 50 -------- lab_guides/Lab_15.md | 263 ------------------------------------------- lab_guides/Lab_2.md | 6 - lab_guides/Lab_3.md | 19 +--- lab_guides/Lab_4.md | 137 ++-------------------- lab_guides/Lab_6.md | 20 ---- lab_guides/Lab_8.md | 58 +--------- lab_guides/Lab_9.md | 17 +-- 11 files changed, 17 insertions(+), 634 deletions(-) diff --git a/lab_guides/Lab_11.md b/lab_guides/Lab_11.md index 3f162d9..a0abafb 100644 --- a/lab_guides/Lab_11.md +++ b/lab_guides/Lab_11.md @@ -641,20 +641,7 @@ dataset, refer to the following note. Let\'s get started: ![](./images/B15019_11_15.jpg) - Caption: List of columns and their assigned data types - Note - - The preceding output has been truncated. - - From *Lab 10*, *Analyzing a Dataset* you know that the - `Id`, `MSSubClass`, `OverallQual`, and - `OverallCond` columns have been incorrectly classified as - numerical variables. They have a finite number of unique values and - you can\'t perform any mathematical operations on them. For example, - it doesn\'t make sense to add, remove, multiply, or divide two - different values from the `Id` column. Therefore, you need - to convert them into categorical variables. 6. Using the `astype()` method, convert the `'Id'` column into a categorical variable, as shown in the following code @@ -694,14 +681,6 @@ dataset, refer to the following note. Let\'s get started: ![](./images/B15019_11_16.jpg) - Caption: List of categories for the four newly converted - variables - - Now, these four columns have been converted into categorical - variables. From the output of *Step 5*, we can see that there are a - lot of variables of the `object` type. Let\'s have a look - at them and see if they need to be converted as well. - 9. Create a new DataFrame called `obj_df` that will only contain variables of the `object` type using the `select_dtypes` method along with the @@ -1348,15 +1327,7 @@ You should get the following output: ![](./images/B15019_11_38.jpg) -Caption: Rows with missing values in CustomerID - -This time, all the transactions look normal, except they are missing -values for the `CustomerID` column; all the other variables -have been filled in with values that seem genuine. There is no other way -to infer the missing values for the `CustomerID` column. These -rows represent almost 25% of the dataset, so we can\'t remove them. - -However, most algorithms require a value for each observation, so you +Most algorithms require a value for each observation, so you need to provide one for these cases. We will use the `.fillna()` method from `pandas` to do this. Provide the value to be imputed as `Missing` and use @@ -1385,15 +1356,6 @@ You should get the following output: ![](./images/B15019_11_40.jpg) -Caption: Summary of missing values for each variable - -You have successfully fixed all the missing values in this dataset. -These methods also work when we want to handle missing numerical -variables. We will look at this in the following exercise. All you need -to do is provide a numerical value when you want to impute a value with -`.fillna()`. 
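To make this concrete, here is a minimal, self-contained sketch of both kinds of imputation. It uses a small made-up DataFrame rather than the Online Retail data, so the column names and values are purely illustrative:

```
import numpy as np
import pandas as pd

# Toy DataFrame with missing values, for illustration only
df = pd.DataFrame({'CustomerID': ['C001', np.nan, 'C003', np.nan], \
                   'Quantity': [5, 3, np.nan, 2]})

# Categorical column: impute a placeholder category
df['CustomerID'] = df['CustomerID'].fillna('Missing')

# Numerical column: provide a numerical value instead, for example the median
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].median())

# Confirm that no missing values remain
print(df.isna().sum())
```

Assigning the result of `.fillna()` back to the column (rather than relying on `inplace=True`) keeps the code compatible with recent versions of `pandas`.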
- - Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset ----------------------------------------------------------------- @@ -1699,48 +1661,40 @@ The following figure illustrates a list of unique values for gaming: ![](./images/B15019_11_49.jpg) -Caption: List of unique values for gaming The following figure displays the data types of each column: ![](./images/B15019_11_50.jpg) -Caption: Data types of each column The following figure displays the updated data types of each column: ![](./images/B15019_11_51.jpg) -Caption: Data types of each column The following figure displays the number of missing values for numerical variables: ![](./images/B15019_11_52.jpg) -Caption: Number of missing values for numerical variables The following figure displays the list of unique values for `int_corr`: ![](./images/B15019_11_53.jpg) -Caption: List of unique values for \'int\_corr\' The following figure displays the list of unique values for numerical variables: ![](./images/B15019_11_54.jpg) -Caption: List of unique values for numerical variables The following figure displays the number of missing values for numerical variables: ![](./images/B15019_11_55.jpg) -Caption: Number of missing values for numerical variables - Summary ======= diff --git a/lab_guides/Lab_12.md b/lab_guides/Lab_12.md index 83f67b2..dd5b86d 100644 --- a/lab_guides/Lab_12.md +++ b/lab_guides/Lab_12.md @@ -38,14 +38,7 @@ You should get the following output. ![](./images/B15019_12_01.jpg) -Caption: First five rows of the Online Retail dataset -Next, we are going to load all the public holidays in the UK into -another `pandas` DataFrame. From *Lab 10*, *Analyzing a -Dataset* we know the records of this dataset are only for the years 2010 -and 2011. So we are going to extract public holidays for those two -years, but we need to do so in two different steps as the API provided -by `date.nager` is split into single years only. Let\'s focus on 2010 first: @@ -759,17 +752,6 @@ You should get the following output: ``` 30 ``` -`30` is the number of unique values for the -`Country_bin` column. So we reduced the number of unique -values in this column from `38` to `30`: - -We just saw how to group categorical values together, but the same -process can be applied to numerical values as well. For instance, it is -quite common to group people\'s ages into bins such as 20s (20 to 29 -years old), 30s (30 to 39), and so on. - -Have a look at *Exercise 12.02*, *Binning the YearBuilt variable from -the AMES Housing dataset*. @@ -1768,8 +1750,3 @@ of a dataset are and identifying data quality issues. We saw how to handle and fix some of the most frequent issues (duplicate rows, type conversion, value replacement, and missing values) using `pandas`\' APIs. Finally, we went through several feature engineering techniques. - -The next lab opens a new part of this course that presents data -science use cases end to end. *Lab 13*, *Imbalanced Datasets*, will -walk you through an example of an imbalanced dataset and how to deal -with such a situation. diff --git a/lab_guides/Lab_13.md b/lab_guides/Lab_13.md index 746b3fc..5aa887f 100644 --- a/lab_guides/Lab_13.md +++ b/lab_guides/Lab_13.md @@ -148,10 +148,6 @@ Classification*, and you will look closely at the metrics: ``` - After the categorical values are transformed, they must be combined - with the scaled numerical values of the data frame to get the - feature-engineered dataset. - 10. 
Create the independent variables, `X`, and dependent variables, `Y`, from the combined dataset for modeling, as in the following code snippet: @@ -171,13 +167,9 @@ Classification*, and you will look closely at the metrics: The output is as follows: -![Caption: The independent variables and the combined data - (truncated) ](./images/B15019_13_03.jpg) +![](./images/B15019_13_03.jpg) - Caption: The independent variables and the combined data - (truncated) - We are now ready for the modeling task. Let\'s first import the necessary packages. diff --git a/lab_guides/Lab_14.md b/lab_guides/Lab_14.md index 01e8cf4..779cb97 100644 --- a/lab_guides/Lab_14.md +++ b/lab_guides/Lab_14.md @@ -1693,45 +1693,6 @@ The following steps will help you complete this exercise: -From this exercise, you may come up with a few questions: - -- How do you think we can improve the classification results using - ICA? -- Increasing the number of components results in a marginal increase - in the accuracy metrics. -- Are there any other side effects because of the strategy adopted to - improve the results? - -Increasing the number of components also results in a longer training -time for the logistic regression model. - - - -Factor Analysis ---------------- - -Factor analysis is a technique that achieves dimensionality reduction by -grouping variables that are highly correlated. Let\'s look at an example -from our context of predicting advertisements. - -In our dataset, there could be many features that describe the geometry -(the size and shape of an image in the ad) of the images on a web page. -These features can be correlated because they refer to specific -characteristics of an image. - -Similarly, there could be many features that describe the anchor text or -phrases occurring in a URL, which are highly correlated. Factor analysis -looks at correlated groups such as these from the data and then groups -them into latent factors. Therefore, if there are 10 raw features -describing the geometry of an image, factor analysis will group them -into one feature that characterizes the geometry of an image. Each of -these groups is called factors. As many correlated features are combined -to form a group, the resulting number of features will be much smaller -in comparison with the original dimensions of the dataset. - -Let\'s now see how factor analysis can be implemented as a technique for -dimensionality reduction. - Exercise 14.06: Dimensionality Reduction Using Factor Analysis @@ -2015,18 +1976,7 @@ You should get the following output: ![](./images/B15019_14_35.jpg) -Caption: Sample data frame -What we will do next is sample some data points with the same shape as -the data frame we created. - -Let\'s sample some data points from a normal distribution that has mean -`0` and standard deviation of `0.1`. We touched -briefly on normal distributions in *Lab 3, Binary Classification.* A -normal distribution has two parameters. The first one is the mean, which -is the average of all the data in the distribution, and the second one -is standard deviation, which is a measure of how spread out the data -points are. By assuming a mean and standard deviation, we will be able to draw samples from a normal distribution using the diff --git a/lab_guides/Lab_15.md b/lab_guides/Lab_15.md index 53ffb93..52707dc 100644 --- a/lab_guides/Lab_15.md +++ b/lab_guides/Lab_15.md @@ -19,109 +19,6 @@ where we will try to predict whether a credit card application will be approved. 
-Introduction -============ - - -In the previous lab, we learned various techniques, such as the -backward elimination technique, factor analysis, and so on, that helped -us to deal with high-dimensional datasets. - -In this lab, we will further enhance our repertoire of skills with -another set of techniques, called **ensemble learning**, in which we -will be dealing with different ensemble learning techniques such as the -following: - -- Averaging -- Weighted averaging -- Max voting -- Bagging -- Boosting -- Blending - - -Ensemble Learning -================= - - -Ensemble learning, as the name denotes, is a method that combines -several machine learning models to generate a superior model, thereby -decreasing variability/variance and bias, and boosting performance. - -Before we explore what ensemble learning is, let\'s look at the concepts -of bias and variance with the help of the classical bias-variance -quadrant, as shown here: - -![](./images/B15019_15_01.jpg) - -Caption: Bias-variance quadrant - - - -Variance --------- - -Variance is the measure of how spread out data is. In the context of -machine learning, models with high variance imply that the predictions -generated on the same test set will differ considerably when different -training sets are used to fit the model. The underlying reason for high -variability could be attributed to the model being attuned to specific -nuances of training data rather than generalizing the relationship -between input and output. Ideally, we want every machine learning model -to have low variance. - - - -Bias ----- - -Bias is the difference between the ground truth and the average value of -our predictions. A low bias will indicate that the predictions are very -close to the actual values. A high bias implies that the model has -oversimplified the relationship between the inputs and outputs, leading -to high error rates on test sets, which again is an undesirable outcome. - -*Figure 15.1* helps us to visualize the trade-off between bias and -variance. The top-left corner is the depiction of a scenario where the -bias is high, and the variance is low. The top-right quadrant displays a -scenario where both bias and variance are high. From the figure, we can -see that when the bias is high, it is further away from the truth, which -in this case, is the *bull\'s eye*. The presence of variance is -manifested as whether the arrows are spread out or congregated in one -spot. - -Ensemble models combine many weaker models that differ in variance and -bias, thereby creating a better model, outperforming the individual -weaker models. Ensemble models exemplify the adage *the wisdom of the -crowds*. In this lab, we will learn about different ensemble -techniques, which can be classified into two types, that is, simple and -advanced techniques: - -![](./images/B15019_15_02.jpg) - -Caption: Different ensemble learning methods - - - -Business Context ----------------- - -You are working in the credit card division of your bank. The operations -head of your company has requested your help in determining whether a -customer is creditworthy or not. You have been provided with credit card -operations data. - -This dataset contains credit card applications with around 15 variables. -The variables are a mix of continuous and categorical data pertaining to -credit card operations. The label for the dataset is a flag, which -indicates whether the application has been approved or not. 
- -You want to fit some benchmark models and try some ensemble learning -methods on the dataset to address the problem and come up with a tool -for predicting whether or not a given customer should be approved for -their credit application. - - Exercise 15.01: Loading, Exploring, and Cleaning the Data --------------------------------------------------------- @@ -783,71 +680,6 @@ the new combination of weights in *iteration 2*: ![](./images/B15019_15_21.jpg) -Caption: Classification report - -In this exercise, we implemented the weighted averaging technique for -ensemble learning. We did two iterations with the weights. We saw that -in the second iteration, where we increased the weight of the logistic -regression prediction from `0.6` to `0.7`, the -accuracy actually improved from `0.89` to `0.90`. -This is a validation of our assumption about the prominence of the -logistic regression model in the ensemble. To check whether there is -more room for improvement, we should again change the weights, just like -we did in iteration `2`, and then validate against the -metrics. We should continue these iterations until there is no further -improvement noticed in the metrics. - -Comparing it with the metrics from the averaging method, we can see that -the accuracy level has gone down from `0.91` to -`0.90`. However, the recall value of class `1` has -gone up from `0.91` to `0.92`, and the corresponding -value for class `0` has gone down from `0.91` to -`0.88`. It could be that the weights that we applied have -resulted in a marginal degradation of the results from what we got from -the averaging method. - -Looking at the results from a business perspective, we can see that with -the increase in the recall value of class `1`, the card -division is getting more creditworthy customers. However, this has come -at the cost of increasing the risk with more unworthy customers, with -`12%` (`100% - 88%`) being tagged as creditworthy -customers. - - - -### Max Voting - -The max voting method works on the principle of majority rule. In this -method, the opinion of the majority rules the roost. In this technique, -individual models, or, in ensemble learning jargon, individual learners, -are fit on the training set and their predictions are then generated on -the test set. Each individual learner\'s prediction is considered to be -a vote. On the test set, whichever class gets the maximum vote is the -ultimate winner. Let\'s demonstrate this with a toy example. - -Let\'s say we have three individual learners who learned on the training -set. Each of them generates their predictions on the test set, which is -tabulated in the following table. The predictions are either for class -\'1\' or class \'0\': - -![](./images/B15019_15_22.jpg) - -Caption: Predictions for learners - -In the preceding example, we can see that for `Example 1` and -`Example 3`, the majority vote is for class \'1,\' and for the -other two examples, the majority of the vote is for class \'0\'. The -final predictions are based on which class gets the majority vote. This -method of voting, where we output a class, is called \"hard \" voting. - -When implementing the max voting method using the -`scikit-learn` library, we use a special function called -`VotingClassifier()`. We provide individual learners as input -to `VotingClassifier` to create the ensemble model. This -ensemble model is then used to fit the training set and then is finally -used to predict on the test sets. 
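As a quick preview, a minimal sketch of hard voting with `VotingClassifier()` could look like the following. The data here is a synthetic placeholder created with `make_classification`, and the three individual learners are chosen purely for illustration rather than being the exercise's exact models:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, \
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, \
                                                    random_state=123)

# Individual learners are provided as (name, estimator) pairs
estimators = [('logreg', LogisticRegression(max_iter=1000)), \
              ('knn', KNeighborsClassifier()), \
              ('rf', RandomForestClassifier(random_state=123))]

# voting='hard' outputs the class that receives the majority of the votes
ensemble = VotingClassifier(estimators=estimators, voting='hard')
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```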
We will explore the dynamics of max -voting in *Exercise 15.04*, *Ensemble Model Using Max Voting*. - Exercise 15.04: Ensemble Model Using Max Voting @@ -967,101 +799,6 @@ regression, KNN, and random forest: ![](./images/B15019_15_24.jpg) -Caption: Classification report - - - -Advanced Techniques for Ensemble Learning -========================================= - - -Having learned simple techniques for ensemble learning, let\'s now -explore some advanced techniques. Among the advanced techniques, we will -be dealing with three different kinds of ensemble learning: - -- Bagging -- Boosting -- Stacking/blending - -Before we deal with each of them, there are some basic dynamics of these -advanced ensemble learning techniques that need to be deciphered. As -described at the beginning of the lab, the essence of ensemble -learning is in combining individual models to form a superior model. -There are some subtle nuances in the way the superior model is generated -in the advanced techniques. In these techniques, the individual models -or learners generate predictions and those predictions are used to form -the final predictions. The individual models or learners, which generate -the first set of predictions, are called **base** **learners** or -**base** **estimators** and the model, which is a combination of the -predictions of the base learners, is called the **meta** **learner** or -**meta estimator**. The way in which the meta learners learn from the -base learners differs for each of the advanced techniques. Let\'s -understand each of the advanced techniques in detail. - - - -Bagging -------- - -Bagging is a pseudonym for **B**ootstrap **Agg**regat**ing**. Before we -explain how bagging works, let\'s describe what bootstrapping is. -Bootstrapping has its etymological origins in the phrase, *Pulling -oneself up by one\'s bootstrap*. The essence of this phrase is to make -the best use of the available resources. In the statistical context, -bootstrapping entails taking samples from the available dataset by -replacement. Let\'s look at this concept with a toy example. - -Suppose we have a dataset consisting of 10 numbers from 1 to 10. We now -need to create 4 different datasets of 10 each from the available -dataset. How do we do this? This is where the concept of bootstrapping -comes in handy. In this method, we take samples from the available -dataset one by one and then replace the number we took before taking the -next sample. We continue with this until we get a sample with the number -of data points we need. - -As we are replacing each number after it is selected, there is a chance -that we might have more than one of a given data point in a sample. This -is explained by the following figure: - -![](./images/B15019_15_25.jpg) - -Caption: Bootstrapping - -Now that we have understood bootstrapping, let\'s apply this concept to -a machine learning context. Earlier in the lab, we discussed that -ensemble learning helps in reducing the variance of predictions. One way -that variance could be reduced is by averaging out the predictions from -multiple learners. In bagging, multiple subsets of the data are created -using bootstrapping. On each of these subsets of data, a base learner is -fitted and the predictions generated. These predictions from all the -base learners are then averaged to get the meta learner or the final -predictions. - -When implementing bagging, we use a function called -`BaggingClassifier()`, which is available in the -`Scikit learn` library. 
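Before looking at its most important arguments, a bare-bones call might look like the following sketch. The data is again a synthetic placeholder, and the KNN base estimator is chosen only for illustration:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, \
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, \
                                                    random_state=123)

# 20 KNN base estimators, each fitted on a bootstrapped sample that uses
# 80% of the rows and 80% of the features
# (recent scikit-learn versions rename base_estimator to estimator)
bagging = BaggingClassifier(base_estimator=KNeighborsClassifier(), \
                            n_estimators=20, max_samples=0.8, \
                            max_features=0.8, random_state=123)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```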
Some of the important arguments that -are provided when creating an ensemble model include the following: - -- `base_estimator`: This argument is to define the base - estimator to be used. -- `n_estimator`: This argument defines the number of base - estimators that will be used in the ensemble. -- `max_samples`: The maximum size of the bootstrapped sample - for fitting the base estimator is defined using this argument. This - is represented as a proportion (0.8, 0.7, and so on). -- `max_features`: When fitting multiple individual learners, - it has been found that randomly selecting the features to be used in - each dataset results in superior performance. The - `max_features` argument indicates the number of features - to be used. For example, if there were 10 features in the dataset - and the `max_features` argument was to be defined as 0.8, - then only 8 (0.8 x 10) features would be used to fit a model using - the base learner. - -Let\'s explore ensemble learning with bagging in *Exercise 15.05*, -*Ensemble Learning Using Bagging*. - - Exercise 15.05: Ensemble Learning Using Bagging ----------------------------------------------- diff --git a/lab_guides/Lab_2.md b/lab_guides/Lab_2.md index b996812..96c0dc1 100644 --- a/lab_guides/Lab_2.md +++ b/lab_guides/Lab_2.md @@ -232,12 +232,6 @@ The following steps will help you to complete this exercise: ``` - The use of the backslash character, `\`, on *line 4* in - the preceding code snippet is to enforce the continuation of code on - to a new line in Python. The `\` character is not required - if you are entering the full line of code into a single line in - your notebook. - You should get the following output: diff --git a/lab_guides/Lab_3.md b/lab_guides/Lab_3.md index 475a19e..fb59038 100644 --- a/lab_guides/Lab_3.md +++ b/lab_guides/Lab_3.md @@ -81,12 +81,6 @@ The following steps will help you to complete this exercise: ``` - Note - - The `#` symbol in the code snippet above denotes a code - comment. Comments are added into code to help explain specific bits - of logic. - The `pd.read_csv()` function\'s arguments are the filename as a string and the limit separator of a CSV, which is `";"`. After reading the file, the DataFrame is printed @@ -289,23 +283,12 @@ their age. We will be using a line graph for this exercise. The following steps will help you to complete this exercise: -1. Begin by defining the hypothesis. - - The first step in the verification process will be to define a - hypothesis about the relationship. A hypothesis can be based on your - experiences, domain knowledge, some published pieces of knowledge, - or your business intuitions. - - Let\'s first define our hypothesis on age and propensity to buy term +1. Let\'s first define our hypothesis on age and propensity to buy term deposits: *The propensity to buy term deposits is more with elderly customers compared to younger ones*. This is our hypothesis. - Now that we have defined our hypothesis, it is time to verify its - veracity with the data. One of the best ways to get business - intuitions from data is by taking cross-sections of our data and - visualizing them. 2. Import the pandas and altair packages: ``` diff --git a/lab_guides/Lab_4.md b/lab_guides/Lab_4.md index 4865c3d..38c78f7 100644 --- a/lab_guides/Lab_4.md +++ b/lab_guides/Lab_4.md @@ -75,21 +75,12 @@ from the DataFrame: ``` target = df.pop('Activity') ``` + Now the response variable is contained in the variable called `target` and all the features are in the DataFrame called `df`. 
-Now we are going to split the dataset into training and testing sets. -The model uses the training set to learn relevant parameters in -predicting the response variable. The test set is used to check whether -a model can accurately predict unseen data. We say the model is -overfitting when it has learned the patterns relevant only to the -training set and makes incorrect predictions about the testing set. In -this case, the model performance will be much higher for the training -set compared to the testing one. Ideally, we want to have a very similar -level of performance for the training and testing sets. This topic will -be covered in more depth in *Lab 7*, *The Generalization of Machine -Learning Models*. + The `sklearn` package provides a function called `train_test_split()` to randomly split the dataset into two @@ -116,6 +107,7 @@ class from `sklearn.ensemble`: ``` from sklearn.ensemble import RandomForestClassifier ``` + Now we can instantiate the Random Forest classifier with some hyperparameters. Remember from *Lab 1, Introduction to Data Science in Python*, a hyperparameter is a type of parameter the model can\'t @@ -203,15 +195,9 @@ The output will be as follows: ![](./images/B15019_04_06.jpg) -Caption: Accuracy score on the training set -Remember, in the last section, we split the dataset into training and -testing sets. We used the training set to fit the model and assess its -predictive power on it. But it hasn\'t seen the observations from the -testing set at all, so we can use it to assess whether our model is -capable of generalizing unseen data. Let\'s calculate the accuracy score -for the testing set: +Let\'s calculate the accuracy score for the testing set: ``` test_preds = rf_model.predict(X_test) @@ -438,94 +424,15 @@ score: -Number of Trees Estimator -------------------------- -Now that we know how to fit a Random Forest classifier and assess its -performance, it is time to dig into the details. In the coming sections, -we will learn how to tune some of the most important hyperparameters for -this algorithm. As mentioned in *Lab 1, Introduction to Data Science -in Python*, hyperparameters are parameters that are not learned -automatically by machine learning algorithms. Their values have to be -set by data scientists. These hyperparameters can have a huge impact on -the performance of a model, its ability to generalize to unseen data, -and the time taken to learn patterns from the data. -The first hyperparameter you will look at in this section is called -`n_estimators`. This hyperparameter is responsible for -defining the number of trees that will be trained by the -`RandomForest` algorithm. - -Before looking at how to tune this hyperparameter, we need to understand -what a tree is and why it is so important for the -`RandomForest` algorithm. - -A tree is a logical graph that maps a decision and its outcomes at each -of its nodes. Simply speaking, it is a series of yes/no (or true/false) -questions that lead to different outcomes. - -A leaf is a special type of node where the model will make a prediction. -There will be no split after a leaf. A single node split of a tree may -look like this: - -![](./images/B15019_04_14.jpg) - -Caption: Example of a single tree node - -A tree node is composed of a question and two outcomes depending on -whether the condition defined by the question is met or not. In the -preceding example, the question is `is avg_rss12 > 41?` If the -answer is yes, the outcome is the `bending_1` leaf and if not, -it will be the `sitting` leaf. 
- -A tree is just a series of nodes and leaves combined together: - -![](./images/B15019_04_15.jpg) - -Caption: Example of a tree - -In the preceding example, the tree is composed of three nodes with -different questions. Now, for an observation to be predicted as -`sitting`, it will need to meet the conditions: -`avg_rss13 <= 41`, `var_rss > 0.7`, and -`avg_rss13 <= 16.25`. - -The `RandomForest` algorithm will build this kind of tree -based on the training data it sees. We will not go through the -mathematical details about how it defines the split for each node but, -basically, it will go through every column of the dataset and see which -split value will best help to separate the data into two groups of -similar classes. Taking the preceding example, the first node with the -`avg_rss13 > 41` condition will help to get the group of data -on the left-hand side with mostly the `bending_1` class. The -`RandomForest` algorithm usually builds several of this kind -of tree and this is the reason why it is called a forest. - -As you may have guessed now, the `n_estimators` hyperparameter -is used to specify the number of trees the `RandomForest` -algorithm will build. For example (as in the previous exercise), say we -ask it to build 10 trees. For a given observation, it will ask each tree -to make a prediction. Then, it will average those predictions and use -the result as the final prediction for this input. For instance, if, out -of 10 trees, 8 of them predict the outcome `sitting`, then the -`RandomForest` algorithm will use this outcome as the final -prediction. - -Note - -If you don\'t pass in a specific `n_estimators` -hyperparameter, it will use the default value. The default depends on -the version of scikit-learn you\'re using. In early versions, the -default value is 10. From version 0.22 onwards, the default is 100. You -can find out which version you are using by executing the following -code: +You can find out which version you are using by executing the following code: `import sklearn` `sklearn.__version__` -For more information, see here: - + In general, the higher the number of trees is, the better the performance you will get. Let\'s see what happens with @@ -1118,31 +1025,8 @@ print(accuracy_score(y_test, test_preds9)) The output will be as follows: -![Caption: Accuracy scores for the training and testing sets for -min\_samples\_leaf=25 ](./images/B15019_04_31.jpg) +![](./images/B15019_04_31.jpg) -Caption: Accuracy scores for the training and testing sets for -min\_samples\_leaf=25 - -Both accuracies for the training and testing sets decreased but they are -quite close to each other now. So, we will keep this value -(`25`) as the optimal one for this dataset as the performance -is still OK and we are not overfitting too much. - -When choosing the optimal value for this hyperparameter, you need to be -careful: a value that\'s too low will increase the chance of the model -overfitting, but on the other hand, setting a very high value will lead -to underfitting (the model will not accurately predict the right -outcome). - -For instance, if you have a dataset of `1000` rows, if you set -`min_samples_leaf` to `400`, then the model will not -be able to find good splits to predict `5` different classes. -In this case, the model can only create one single split and the model -will only be able to predict two different classes instead of -`5`. It is good practice to start with low values first and -then progressively increase them until you reach satisfactory -performance. 
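A minimal sketch of that kind of progression might look like the following. The data here is synthetic, generated with `make_classification`, and the values of `min_samples_leaf` are illustrative; in the lab you would reuse its own training and testing sets:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 5-class data, for illustration only
X, y = make_classification(n_samples=1000, n_features=20, \
                           n_informative=8, n_classes=5, \
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, \
                                                    random_state=42)

# Start with low values of min_samples_leaf and increase them progressively,
# watching the gap between the training and testing accuracy scores
for leaf in [1, 5, 10, 25, 50]:
    rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=leaf, \
                                random_state=42)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    print(leaf, round(train_acc, 3), round(test_acc, 3))
```

As `min_samples_leaf` grows, the training accuracy typically drops towards the testing accuracy; a good value is one where the two scores are close while the testing accuracy is still acceptable.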
@@ -1258,13 +1142,6 @@ We will be using the same zoo dataset as in the previous exercise. ![](./images/B15019_04_33.jpg) - Caption: Accuracy scores for the training and testing sets - - The accuracy score decreased for both the training and testing sets - compared to the best result we got in the previous exercise. Now the - difference between the training and testing sets\' accuracy scores - is much smaller so our model is overfitting less. - 11. Instantiate another `RandomForestClassifier` with `random_state=42`, `n_estimators=30`, `max_depth=2`, and `min_samples_leaf=7`, and diff --git a/lab_guides/Lab_6.md b/lab_guides/Lab_6.md index d88ddf5..8983a5b 100644 --- a/lab_guides/Lab_6.md +++ b/lab_guides/Lab_6.md @@ -77,13 +77,6 @@ The following steps will help you complete the exercise: ![](./images/B15019_06_01.jpg) - Caption: The car dataset without headers - - Note - - Alternatively, you can enter the dataset URL in the browser to view - the dataset. - `CSV` files normally have the name of each column written in the first row of the data. For instance, have a look at this dataset\'s CSV file, which you used in *Lab 3, Binary @@ -1375,19 +1368,6 @@ The following steps will help you accomplish this task: ![](./images/B15019_06_35.jpg) - Caption: Reading the dataset - - You will need to do a few things to work with this file. Skip 15 - rows and specify the column headers and read the file without an - index. - - The code shows how you do that by creating a Python list to hold - your column headers and then read in the file using - `read_csv()`. The parameters that you pass in are the - file\'s location, the column headers as a Python list, the name of - the index column (in this case, it is None), and the number of rows - to skip. - The `head()` method will print out the top five rows and should look similar to the following: diff --git a/lab_guides/Lab_8.md b/lab_guides/Lab_8.md index d46ff0c..174c5f7 100644 --- a/lab_guides/Lab_8.md +++ b/lab_guides/Lab_8.md @@ -784,44 +784,11 @@ dataset we will use contains 1,797 labeled images of handwritten digits. ![](./images/B15019_08_14.jpg) -Caption: Using pandas to visualize the results - - - -Advantages and Disadvantages of Grid Search -------------------------------------------- - -The primary advantage of the grid search compared to a manual search is -that it is an automated process that one can simply set and forget. -Additionally, you have the power to dictate the exact -hyperparameterizations evaluated, which can be a good thing when you -have prior knowledge of what kind of hyperparameterizations might work -well in your context. It is also easy to understand exactly what will -happen during the search thanks to the explicit definitions of the grid. - -The major drawback of the grid search strategy is that it is -computationally very expensive; that is, when the number of -hyperparameterizations to try increases substantially, processing times -can be very slow. Also, when you define your grid, you may inadvertently -omit an hyperparameterization that would in fact be optimal. If it is -not specified in your grid, it will never be tried To overcome these drawbacks, we will be looking at random search in the next section. -Random Search -============= - - -Instead of searching through every hyperparameterizations in a -pre-defined set, as is the case with a grid search, in a random search -we sample from a distribution of possibilities by assuming each -hyperparameter to be a random variable. 
Before we go through the process -in depth, it will be helpful to briefly review what random variables are -and what we mean by a distribution. - - Random Variables and Their Distributions ---------------------------------------- @@ -831,6 +798,7 @@ Random Variables and Their Distributions Caption: Probability mass function for the discrete uniform distribution + The following code will allow us to see the form of this distribution with 10 possible values of X. @@ -900,9 +868,7 @@ p_X_1 = stats.norm.pdf(x=x, loc=0.0, scale=1.0**2) p_X_2 = stats.norm.pdf(x=x, loc=0.0, scale=1.5**2) ``` -Note - -In this case, `loc` corresponds to 𝜇, while `scale` +**Note:** In this case, `loc` corresponds to 𝜇, while `scale` corresponds to the standard deviation, which is the square root of `𝜎``2`, hence why we square the inputs. @@ -1017,9 +983,7 @@ samples = stats.gamma.rvs(a=1, loc=1, scale=2, \ size=n_iter, random_state=100) ``` -Note - -We set a random state to ensure reproducible results. +**Note** We set a random state to ensure reproducible results. Plotting a histogram of the sample, as shown in the following figure, reveals a shape that approximately conforms to the distribution that we @@ -1086,17 +1050,14 @@ The output will be as follows: ![](./images/B15019_08_22.jpg) -Caption: Output for the random search process -Note - -The results will be different, depending on the data used. It is always beneficial to visualize results where possible. Plotting 𝛼 by negative mean squared error as a scatter plot makes it clear that venturing away from 𝛼 = 1 does not result in improvements in predictive performance: + ``` plt.scatter(df_result.alpha, \ df_result.mean_neg_mean_squared_error) @@ -1108,7 +1069,6 @@ The output will be as follows: ![](./images/B15019_08_23.jpg) -Caption: Plotting the scatter plot The fact that we found the optimal 𝛼 to be 1 (its default value) is a special case in hyperparameter tuning in that the optimal @@ -1189,9 +1149,7 @@ The output will be as follows: Caption: Output for tuning using RandomizedSearchCV -Note - -The preceding results may vary, depending on the data. +Note: The preceding results may vary, depending on the data. @@ -1351,12 +1309,6 @@ The following steps will help you complete the exercise. ![](./images/B15019_08_26.jpg) - Caption: Top five hyperparameterizations - - Note - - You may get slightly different results. However, the values you - obtain should largely agree with those in the preceding output. 9. The last step is to visualize the result. Including every parameterization will result in a cluttered plot, so we will filter diff --git a/lab_guides/Lab_9.md b/lab_guides/Lab_9.md index c42fee2..2f4c4a4 100644 --- a/lab_guides/Lab_9.md +++ b/lab_guides/Lab_9.md @@ -364,11 +364,8 @@ The output will be as shown in the following figure: ![](./images/B15019_09_12.jpg) -Caption: Feature importance of a Random Forest model -Note - -Due to randomization, you may get a slightly different result. +**Note:** Due to randomization, you may get a slightly different result. It might be a little difficult to evaluate which importance value corresponds to which variable from this output. Let\'s create a @@ -1378,19 +1375,9 @@ We will be using the same dataset as in the previous exercise. You should get the following output: -![Caption: LIME output for the third observation of the testing - set ](./images/B15019_09_41.jpg) +![](./images/B15019_09_41.jpg) -Caption: LIME output for the third observation of the testing set - - -You have completed the last exercise of this lab. 
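For reference, the core LIME calls used in that exercise follow the general pattern sketched below. This is a sketch only: `X_train`, `X_test`, and `model` stand in for the lab's own training data, testing data, and fitted model, and `mode='regression'` assumes a regression task (for a classifier you would pass `mode='classification'` and use `predict_proba` instead):

```
from lime.lime_tabular import LimeTabularExplainer

# Build an explainer from the training data (X_train and model are
# placeholders for the objects created earlier in the lab)
explainer = LimeTabularExplainer(training_data=X_train.values, \
                                 feature_names=list(X_train.columns), \
                                 mode='regression')

# Explain a single observation, here the third row of the testing set
exp = explainer.explain_instance(X_test.values[2], model.predict, \
                                 num_features=10)
exp.show_in_notebook()
```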
You saw how to use
-LIME to interpret the prediction of single observations. We learned that
-the `a1pop`, `a2pop`, and `a3pop` features
-have a strong negative impact on the first and third observations of the
-testing set.
-


Activity 9.01: Train and Analyze a Network Intrusion Detection Model