Random Forest and Ensembles Learning with Amazon Food Reviews

Sachin D N
Published in Analytics Vidhya
17 min read · Nov 21, 2020


In this blog, we’ll try to understand one of the most important algorithms in machine learning: the Random Forest algorithm. We will look at what makes Random Forest so special and implement it on a real-world dataset.

Contents

  1. What Are Ensembles?
  2. Types of Ensemble Learning.
  3. Bagging.
  4. Random Forest and Construction.
  5. Best and Worst cases of Random Forest.
  6. Boosting.
  7. Types of Boosting.
  8. Gradient Boosting.
  9. AdaBoost (Adaptive Boosting).
  10. XGBoost.
  11. Stacking Classifier.
  12. Cascading Classifier.
  13. Random Forest and XGBOOST with Amazon Food Reviews.

What Are Ensembles?

Commonly, an individual model suffers from bias or variance, and that’s why we need ensemble learning.

An Ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model. A model comprised of many models is called an Ensemble model.

Generally in machine learning, when we have multiple models, we can combine them to build a more powerful model; such combinations are called ensembles.

Types of Ensemble Learning

  • Bagging(Bootstrap aggregating)
  • Boosting
  • Stacking
  • Cascading

The more different the individual models are, the better the ensemble tends to perform.

Bagging (Bootstrap aggregating): Random Forest is the most popular bagging technique.

Bootstrap refers to random sampling with replacement. It allows us to better understand the bias and the variance of the dataset, and it involves repeatedly drawing small random samples from the dataset.

It is a general procedure that can be used to reduce the variance for those algorithms that have high variance, typically decision trees. Bagging makes each model run independently and then aggregates the outputs at the end without preference to any model.

Technique: train each model on a different bootstrap sample, so that each model sees a different subset of the data, and then combine all the models using aggregation.

The typical way to aggregate for classification is to take the majority vote; for regression, take the mean (or median).

When do we perform Bootstrap sampling?

When a model changes a lot with small changes in the training data, it is said to have high variance. When we apply bootstrap sampling and aggregation to such a model, changes in the dataset have much less impact on the final predictions.

Bagging: It is a technique that can reduce variance in a model without impacting the bias.

The base model is a low-bias, high-variance model. By applying bagging to this model, we can reduce the high variance to a reasonable level without changing the bias, thanks to bootstrap sampling and aggregation.

Example: a decision tree whose depth is large (high variance and low bias).
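As a concrete illustration, here is a minimal bagging sketch using scikit-learn's BaggingClassifier with a deep decision tree as the low-bias, high-variance base model. The toy dataset and all parameter values are placeholders, not the settings used later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the training data;
# predictions are aggregated by majority vote.
# (In older scikit-learn versions this argument is named base_estimator.)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),  # deep tree: low bias, high variance
    n_estimators=100,
    bootstrap=True,  # sample with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Bagging test accuracy:", bagging.score(X_test, y_test))
```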

Problems with Decision Trees?

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed the resulting decision tree can be quite different and in turn the predictions can be quite different.

Also, Decision trees are computationally expensive to train, carry a big risk of overfitting and tend to find local optima because they can’t go back after they have made a split.

To address these weaknesses, we turn to Random Forest, which illustrates the power of combining many decision trees into one model.

Random Forest and Construction

Random forest is a Supervised Learning algorithm that uses an ensemble learning method for classification and regression.

Random forest is a bagging technique. The trees in random forests are run in parallel. There is no interaction between these trees while building the trees.

It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

A random forest is a meta-estimator (i.e. it combines the result of multiple predictions) which aggregates many decision trees, with some helpful modifications.

The number of features that can be considered for splitting at each node is limited to some fraction of the total (controlled by a hyperparameter, max_features in scikit-learn). This ensures that the ensemble model does not rely too heavily on any individual feature, and makes fair use of all potentially predictive features.

Each tree draws a random sample from the original data set when generating its splits, adding a further element of randomness that prevents overfitting.
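A minimal sketch of constructing a random forest with scikit-learn follows; the dataset and hyperparameter values are placeholders for illustration, not the tuned values used later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_estimators: number of trees; max_features: how many features each split may consider.
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # a common choice: square root of the total number of features
    max_depth=None,       # grow deep trees; the randomness keeps the variance in check
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))
```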

Out-Of-Bag (OOB) points

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally as follows.

Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.

We can use these out-of-bag points as a built-in validation set, much like cross-validation.
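A small sketch (again on placeholder data) of how the out-of-bag estimate can be read directly from a fitted forest in scikit-learn instead of running a separate cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# oob_score=True evaluates each tree on the samples it never saw
# (the roughly one-third left out of its bootstrap sample).
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)
```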

Best and Worst cases of Random Forest

  1. For classification problems, the Random Forest algorithm greatly reduces the overfitting problem compared with a single decision tree.
  2. For both classification and regression tasks, the same random forest algorithm can be used.
  3. The Random Forest algorithm can be used for identifying the most important features from the training dataset, in other words, feature engineering.
  4. It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  5. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  6. Random forests have been observed to overfit for some datasets with noisy classification (or) regression tasks.
  7. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from the random forests are not reliable for this type of data.
  8. Wherever a single decision tree fails, the random forest tends to fail as well.

Boosting

The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.

Instead of creating a single powerful model, boosting combines multiple simple models into a single composite model. The idea is that, as we introduce more and more simple models, the overall model becomes stronger and stronger. In boosting terminology, the simple models are called weak models (or) weak learners.

Then a second model is built that tries to correct the errors of the first model. This procedure continues, and models are added, until either the complete training data set is predicted correctly or the maximum number of models has been added.

Core idea: boosting reduces bias, so we use a low-variance, high-bias model as the base learner in boosting.

sequential ensemble approach

The main idea of boosting is to add new models to the ensemble sequentially. In essence, boosting attacks the bias-variance-tradeoff by starting with a weak model (e.g., a decision tree with only a few splits) and sequentially boosts its performance by continuing to build new trees, where each new tree in the sequence tries to fix up where the previous one made the biggest mistakes (i.e., each new tree in the sequence will focus on the training rows where the previous tree had the largest prediction errors).

The base learners: Boosting is a framework that iteratively improves any weak learning model. Many gradient boosting implementations allow you to “plug in” various classes of weak learners at your disposal. In practice, however, boosted algorithms almost always use decision trees as the base learner.

Training weak models: A weak model is one whose error rate is only slightly better than random guessing. Boosting improves on this by focusing each new model on the rows of the training data where the previous model had the largest errors or residuals. With decision trees, shallow trees, i.e. trees with relatively few splits, represent weak learners; in boosting, trees with 1–6 splits are most common.

Sequential training with respect to errors: Boosted trees are grown sequentially; each tree is grown using information from the previously grown trees to improve performance. By fitting each tree in the sequence to the previous tree’s residuals, we allow each new tree to focus on the previous tree’s mistakes. The steps are as follows (a minimal code sketch follows the list):

  1. We are given training data Dtrain = {(xi, yi)}. Fit a shallow decision tree F1(x) to the targets yi.
  2. Fit the next decision tree to the residuals, i.e. h1(x) ≈ yi − F1(xi).
  3. Add this new tree to the model: F2(x) = F1(x) + h1(x).
  4. Fit the next decision tree to the residuals of F2(x), i.e. h2(x) ≈ yi − F2(xi).
  5. Add this new tree to the model: F3(x) = F2(x) + h2(x).
  6. Continue this process until the residuals are small; as the training error drops, the bias of the ensemble is reduced.

The final model is a stage-wise additive model of the individual trees: FM(x) = F1(x) + h1(x) + h2(x) + … + hM−1(x).
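The sequential residual-fitting idea above can be written out directly. Below is a minimal sketch with shallow regression trees on toy data; all names and values are illustrative and this is not the formal gradient boosting algorithm, just the core loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

n_rounds, trees = 50, []
F = np.zeros_like(y)                          # current ensemble prediction F_m(x)
for m in range(n_rounds):
    residual = y - F                           # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # weak learner: a shallow tree
    tree.fit(X, residual)                      # h_m(x) fits the residuals
    F += tree.predict(X)                       # F_{m+1}(x) = F_m(x) + h_m(x)
    trees.append(tree)

print("Final training MSE:", np.mean((y - F) ** 2))
```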

Types of Boosting

  1. Gradient Boosting
  2. AdaBoost (Adaptive Boosting)
  3. XGBoost

Gradient Boosting

The name Gradient Boosting Machine comes from the fact that this procedure can be generalized to arbitrary differentiable loss functions.

Gradient boosting is considered a Gradient descent algorithm. Gradient descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to tweak parameter(s) iteratively to minimize a cost function.

Gradient boosted trees solve part of the limitations mentioned above. Instead of training a single tree, multiple trees are trained sequentially. To lower the variance of the trees, however, they are restricted: they are turned into weak learners by limiting the depth of the trees. The depth is often chosen between 3 and 6 levels; we allow a little depth so that the trees can capture interactions between jointly occurring variables.

Where does the gradient appear in gradient boosting?

Residuals contain direction as well as magnitude information, so residuals can be viewed as vectors. Most articles treat residuals as the error between the approximation and the true target, but we generically call them direction vectors to emphasize that they are vectors, not just magnitudes. Treating the residual purely as an error also makes it awkward to talk about gradient boosting machines that optimize the absolute error rather than the squared error.

By adding in approximations to residuals, gradient boosting machines are chasing gradients, hence, the term gradient boosting.

Gradient boosting can minimize any loss function, as long as it is differentiable.

Residual and Loss functions

Consider our boosting model at the end of stage k, Fk(x).

The residual is nothing but

error_i = yi − Fk(xi)

at the end of model k.

Now, if we measure the error with the squared error, the loss function is

L(yi, Fk(xi)) = ½ (yi − Fk(xi))².

Taking the derivative with respect to Fk(xi) and negating it gives

−∂L/∂Fk(xi) = yi − Fk(xi),

which is exactly the residual. This negative gradient is called the pseudo-residual, and it generalizes the idea to any differentiable loss function.

Key idea: we can replace the error with the pseudo-residual (the negative gradient), because then we can use any differentiable loss function.

For more information about loss functions and gradient visit here.

The loss function can be the logistic loss, squared loss, hinge loss, or any other loss, as long as it is differentiable.

The formal steps of the Gradient Boosting algorithm follow the same residual-fitting pattern sketched above.

For more information visit here.

Regularization

Fitting the training set too closely can lead to the degradation of the model’s generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

One natural regularization parameter is the number of gradient boosting iterations M, i.e. the number of trees in the model when the base learner is a decision tree. Increasing M reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.

Another regularization parameter is the depth of the trees. The higher this value the more likely the model will overfit the training data.

Shrinkage

An important part of the gradient boosting method is regularization by shrinkage, which modifies the update rule as follows:

Fm(x) = Fm−1(x) + ν · hm(x),

where the parameter ν (0 < ν ≤ 1) is called the “learning rate”.

Empirically, it has been found that using small learning rates, such as ν < 0.1, yields dramatic improvements in a model’s generalization ability over gradient boosting without shrinkage (ν = 1). However, this comes at the price of increased computational time both during training and querying: a lower learning rate requires more iterations.
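A minimal sketch of how the shrinkage parameter appears in scikit-learn's gradient boosting implementation; the dataset and the specific values of learning_rate, n_estimators, and max_depth are placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# learning_rate is the shrinkage parameter nu; a smaller value usually needs more trees (larger M).
gbm = GradientBoostingClassifier(
    n_estimators=500,    # M: number of boosting iterations
    learning_rate=0.05,  # nu < 0.1, as suggested above
    max_depth=3,         # shallow trees as weak learners
    random_state=1,
)
gbm.fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))
```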

AdaBoost (Adaptive Boosting)

Adaboost combines multiple weak learners into a single strong learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps.

When AdaBoost creates its first decision stump, all observations are weighted equally. To correct the previous error, the observations that were incorrectly classified now carry more weight than the observations that were correctly classified. AdaBoost algorithms can be used for both classification and regression problems.

AdaBoost is extensively used in image processing for face detection. The steps of the AdaBoost algorithm are as follows:

  1. Initialize the dataset and assign equal weight to each of the data points.
  2. Provide this as input to the model and identify the wrongly classified data points.
  3. Increase the weight of the wrongly classified data points.
  4. if (got required results)
    Go to step 5
    else
    Go to step 2
  5. End

Consider this toy dataset in 2D and assume that our weak classifiers are decision stumps (vertical or horizontal half-planes):

The first round: after computing the point weights, we increase the weights of the misclassified points before moving to the next step.

The second round: In this step also find the weights and increase the weights of misclassified points before going to the next step.

The third round: find the weights and increase the weights of misclassified points before going to the next step.

The final classifier: combining all the models in the final step, F(x) = α1h1(x) + α2h2(x) + α3h3(x), gives us the final model F(x).

For more information about AdaBoost visit here.
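A minimal AdaBoost sketch with decision stumps as weak learners, using scikit-learn on placeholder data; the parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)

# max_depth=1 makes each weak learner a decision stump (a single split).
stump = DecisionTreeClassifier(max_depth=1)
# (In older scikit-learn versions the argument is named base_estimator instead of estimator.)
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, learning_rate=0.5, random_state=2)
ada.fit(X, y)
print(ada.predict(X[:5]))
```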

XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally very slow in implementation because of sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on computational speed and model performance.

The library provides a system for use in a range of computing environments, not least:

  • Parallelization of tree construction using all of your CPU cores during training.
  • Distributed Computing for training very large models using a cluster of machines.
  • Out-of-Core Computing for very large datasets that don’t fit into memory.
  • Cache Optimization of data structures and algorithms to make the best use of hardware.
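A minimal XGBoost sketch on placeholder data; the parameter values are illustrative, not the tuned values used later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# n_jobs=-1 uses all CPU cores for parallel tree construction.
xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, n_jobs=-1)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))
```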

Stacking Classifier

Stacking is an ensemble learning technique to combine multiple classification models via a meta-classifier. The individual classification models are trained based on the complete training set; then, the meta-classifier is fitted based on the outputs — meta-features — of the individual classification models in the ensemble. The meta-classifier can either be trained on the predicted class labels or probabilities from the ensemble. The algorithm can be summarized as follows,

This technique is not often used in real-world production systems, but it is used extensively in Kaggle competitions to squeeze out better performance.

In the standard stacking procedure, the first-level classifiers are fit to the same training set that is used to prepare the inputs for the second-level classifier, which may lead to overfitting. The Stacking Classifier, however, uses the concept of cross-validation, the dataset is split into k folds, and in k successive rounds, k-1 folds are used to fit the first level classifier; in each round, the first-level classifiers are then applied to the remaining 1 subset that was not used for model fitting in each iteration. The resulting predictions are then stacked and provided -- as input data -- to the second-level classifier.

After the stacking classifier has been trained in this way, the first-level classifiers are finally refit to the entire dataset.

Because of this time complexity, stacking is rarely used in production systems. For more information about the Stacking Classifier visit here.
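A minimal stacking sketch using scikit-learn's StackingClassifier, which performs the cross-validated fitting of the first-level models described above. The dataset, the choice of first-level models, and the meta-classifier are all placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=4)

# First-level classifiers produce out-of-fold predictions (cv=5) that become
# the meta-features for the second-level (final) classifier.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=4)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```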

Cascading Classifier

This class of models is very very accurate. Cascading is mostly used in scenarios where you cannot afford to make a mistake. For example, a cascading technique is mostly used to detect fraudulent credit card transactions, or maybe when you want to be sure that you don’t have cancer.

Cascade models are mostly used when the cost of making a mistake is very very high.

Given a transaction query point Xq, we feed it to Model 1. Model 1 can be any model: a random forest, logistic regression, anything. What Model 1 does is predict class probabilities, to determine which class the given query point is more likely to belong to.

Let’s say class label 1 means the transaction is fraudulent, and class label 0 means it is not. The predicted probabilities are P(Yq=0) and P(Yq=1), where Yq is the actual class label. Now suppose P(Yq=0), the probability that the transaction is not fraudulent, is high but we are still slightly unsure; in that case we pass the query to another model, Model 2, which receives the same query point and again predicts P(Yq=0). This process continues until we are confident about the prediction.

In a typical cascade, we have several models in a sequence. We train Model 1 on the whole training dataset and evaluate its performance on the test dataset. Wherever we are not sure about some data points, i.e. whether they are fraudulent or not, we pass them to the next stage.

Intuitively the cascades are designed in such a way that the next model in the sequence is only trained on the data points for which the model isn’t sure what the class label is. We must always train our model as per what type of data it would see during runtime in the production environment.
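A minimal cascade sketch: a cheap first model handles the confident cases, and only the uncertain query points are passed to a stronger second model. The threshold, the models, and the data are all illustrative placeholders; in a real cascade the second model would typically also be trained only on the points the first model is unsure about.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: roughly 95% "not fraud" (class 0).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=5)

model1 = LogisticRegression(max_iter=1000).fit(X, y)                          # cheap first stage
model2 = RandomForestClassifier(n_estimators=200, random_state=5).fit(X, y)   # stronger second stage

def cascade_predict(X_query, threshold=0.99):
    proba_not_fraud = model1.predict_proba(X_query)[:, 0]   # P(Yq = 0)
    preds = np.zeros(len(X_query), dtype=int)               # confident cases stay "not fraud"
    unsure = proba_not_fraud < threshold                    # not confident enough -> next stage
    if unsure.any():
        preds[unsure] = model2.predict(X_query[unsure])
    return preds

print(cascade_predict(X[:10]))
```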

Random Forest and XGBOOST with Amazon Food Reviews

Let’s apply the Random Forest and XGBOOST algorithm for the real-world dataset Amazon Fine Food Review Analysis from Kaggle.

First, let’s understand what the Amazon Fine Food Reviews dataset is.

This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.

Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — a rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — Brief summary of the review
  10. Text — Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

Data Preprocessing

Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.

To Know the Complete overview of the Amazon Food review dataset and Featurization visit my previous blog link here.

Train-Test split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.
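A minimal sketch of the split; it assumes the preprocessed reviews live in a dataframe named final with columns preprocessed_reviews and score (these names are assumptions for illustration, not necessarily the ones used in the notebook).

```python
from sklearn.model_selection import train_test_split

# X: cleaned review text, y: binary labels (1 = positive, 0 = negative).
X = final["preprocessed_reviews"].values   # 'final' is the cleaned dataframe (assumed name)
y = final["score"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape[0], "train reviews,", X_test.shape[0], "test reviews")
```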

Text Featurization using Bag of Words
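A minimal Bag-of-Words sketch with CountVectorizer, fit on the training split only to avoid data leakage. Variable names follow the split sketch above, and the min_df and max_features values are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training reviews only, then transform both splits.
vectorizer = CountVectorizer(min_df=10, max_features=5000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
print("BoW matrix shapes:", X_train_bow.shape, X_test_bow.shape)
```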

Hyper Parameter tuning

We want to choose the best depth and the best n_estimators (the number of trees) for the model; we choose these using grid search cross-validation, as sketched below.

We have already defined a grid-search function; when we call it, it returns the result.

After we find the best parameters using GridSearchCV, we want to check the performance on the test data; in this case study, we use AUC as the performance measure.

We have already defined a function for evaluating on the test data; when we call it, it returns the result.
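A minimal sketch of the grid search over depth and number of trees, scored by AUC; the parameter grid is illustrative and the helper function used in the notebook may differ. It assumes the BoW features from the sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [10, 30, 60, None],
    "n_estimators": [100, 300, 500],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",   # AUC is the performance measure used in this case study
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train_bow, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV AUC:", grid.best_score_)
```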

Performance Metrics

Performance metrics are used to measure how well a model’s predictions match the true labels on unseen data; they give us a quantitative basis for comparing models and checking that we are meeting the objective.

To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.

We have already defined a function for the performance metrics; when we call it, it returns the result.

Feature Importance on BOW

The top 20 most important features are represented in a word cloud. To know more about word clouds, visit here.
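A minimal sketch of pulling the top-20 important features from the fitted forest and rendering them as a word cloud; it assumes the vectorizer and the best model from the previous sketches.

```python
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

best_forest = grid.best_estimator_          # fitted random forest from the grid search above
# (Older scikit-learn versions use get_feature_names() instead of get_feature_names_out().)
feature_names = np.array(vectorizer.get_feature_names_out())
top20 = np.argsort(best_forest.feature_importances_)[::-1][:20]

# Weight each word by its importance so that more important words appear larger.
weights = dict(zip(feature_names[top20], best_forest.feature_importances_[top20]))
cloud = WordCloud(background_color="white").generate_from_frequencies(weights)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```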

Similarly, we built both Random Forest and XGBoost models with TFIDF, AvgWord2Vec, and TFIDF-weighted AvgWord2Vec features. To see the full code, please visit my GitHub link.

Conclusions

To present the conclusions in a table, we used the Python library PrettyTable.

PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.
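A minimal PrettyTable sketch; the two rows shown here echo the results reported in the observations below, while the full table in the notebook has one row per featurization and model.

```python
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Vectorizer", "Model", "Best Depth", "Best n_estimators", "Test AUC (%)"]
# Rows taken from the observations below.
table.add_row(["TFIDF", "Random Forest", 60, 500, 93.40])
table.add_row(["TFIDF", "XGBoost", 10, 500, 93.96])
print(table)
```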

Observations

  1. From the above table, we conclude that the Random Forest with TFIDF features, at an optimal depth of 60 and an optimal number of estimators of 500, has the highest AUC score for that model, i.e. 93.40%.
  2. For XGBoost as well, TFIDF with an optimal depth of 10 and an optimal number of estimators of 500 gives the highest AUC score, i.e. 93.96%.
  3. Both the Random Forest and XGBoost models performed reasonably well on the test data.

To Know the Complete overview of the Amazon Food review dataset and Featurization visit my previous blog link here.

To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.

To understand the full code please visit my GitHub link.

Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…

Happy Learning!!
