Decision Tree With Amazon Food Reviews

Sachin D N · Published in Analytics Vidhya · 16 min read · Nov 18, 2020

Decision trees are a popular supervised learning method for a variety of reasons. Their benefits include that they can be used for both regression and classification, they are easy to interpret, and they do not require feature scaling. They also have flaws, most notably a tendency to overfit.

Contents

  1. What are Decision Trees?
  2. Geometric Intuition of Decision Trees.
  3. Entropy.
  4. Information Gain.
  5. Gini impurity.
  6. Play Tennis Dataset Example of Decision Tree.
  7. Steps to Constructing a Decision Tree.
  8. Decision Tree Regression.
  9. Real-world cases of Decision Tree.
  10. Best and Worst cases of Decision Tree Algorithm.
  11. Decision Tree with Amazon Food Reviews.

What are decision trees?

Decision trees, also known as classification and regression trees (CART), a term introduced by Leo Breiman to refer to decision tree algorithms, are supervised learning algorithms with a pre-defined target variable. They are mostly used in non-linear decision making with simple linear decision surfaces. In other words, they are adaptable to solving any kind of problem at hand (classification or regression).

Tree-based algorithms are among the best and most widely used supervised learning methods. They empower predictive modeling with higher accuracy, better stability, and ease of interpretation. Unlike linear modeling techniques, they map non-linear relationships quite well. Methods like decision trees, random forests, and gradient boosting are popularly used in all kinds of data science problems. Hence, it is important for every analyst to learn these algorithms and apply them when modeling.

Geometric Intuition of Decision Trees

The below Image shows the simple implementation of different nodes in the decision tree.

  • Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  • Splitting: It is a process of dividing a node into two or more sub-nodes.
  • Decision Node: When a sub-node splits into further sub-nodes, then it is called a decision node.
  • Leaf (or) Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  • Pruning: When we remove sub-nodes of a decision node, this process is called pruning.
  • Branch (or) Sub-Tree: A sub-section of the entire tree is called a branch (or) sub-tree.
  • Parent and Child Node: A node, which is divided into sub-nodes is called the parent node of sub-nodes where sub-nodes are the child of the parent node.

Let’s now see, geometrically, how decision trees are represented.

There are multiple algorithms written to build a decision tree, which can be used according to the problem characteristics you are trying to solve. Regression trees are used when the dependent variable is continuous and Classification trees are used when the dependent variable is categorical.

Below is a simple decision tree for the IRIS dataset, where a and b are the sepal and petal lengths, respectively.

We can alternatively think of decision trees as a group of nested IF-ELSE conditions, modeled as a tree where decisions are made at the internal nodes and the output is obtained at the leaf nodes.

Geometrically, we can think of decision trees as a set of axis-parallel hyperplanes that divide the space into hypercuboids during inference.

We can see in the above image that the decision boundaries are axis-parallel and that there is a decision surface for each leaf node (Yi).

A simple Play Tennis example of a decision tree applies in any scenario where the learning data has the following traits:

  • The learning data has attribute-value pairs, as in the example shown above: Wind as an attribute has two possible values — strong or weak.
  • The target function has a discrete output. Here, the target function is "should you play tennis?" and the possible outputs are Yes and No.

When we apply the decision tree algorithm to the dataset, it looks like this:

The above image tells us that, starting at the root node, if the outlook is overcast we can play; if the outlook is sunny and the humidity is normal we can play tennis, but when the humidity is high we cannot; the remaining branches are processed in the same way.

Creating a decision tree is just a matter of choosing which attribute should be tested at each node in the tree, and information gain is the measure used to decide which attribute (or feature) should be tested at each node.

Information gain is itself calculated using a measure called entropy, which we first define for the case of a binary decision problem and then for the general case. We define entropy for a binary decision problem first because it is easier to get an idea of what it is trying to calculate.

Entropy

Tom Mitchell states it fairly well:

"In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples."

Mathematically, Entropy can be represented as:
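In LaTeX form, for a node with C classes:

H(S) = -\sum_{j=1}^{C} p_j \log_2 p_j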

p_j: proportion of the samples at a particular node that belong to class j.

Note: This is the definition of entropy for all non-empty classes (p ≠ 0). The entropy is 0 if all samples at a node belong to the same class.

This measure satisfies our criteria because of the -p·log2(p) construction: when p gets close to zero (i.e., the category has only a few examples in it), log(p) becomes a large negative number, but the p factor dominates the calculation, so the term works out to be nearly zero.

Remembering that entropy measures the disorder in the data, this low score is good, as it reflects our desire to reward categories with few examples. Similarly, if p gets close to 1 (i.e., the category has most of the examples in it), then the log(p) part gets very close to zero, and this time it dominates the calculation, so the overall value again gets close to zero.

Hence we see that when a category is nearly or completely empty, or when it nearly or completely contains all the examples, its contribution to the score gets close to zero, which models what we wanted. Note that 0·log(0) is taken to be zero by convention.

Graphical Representation of Entropy

Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

In words, we can also define Information Gain as follows,

The Information Gain criterion helps in making these decisions. The child nodes are created using the values of an independent variable. To calculate the information gain due to a split, we need the entropy of the parent and child nodes. The variable with the highest information gain is selected for the split.

In general, the entropy is maximum if all the outcomes are equally likely; on the other hand, if some outcomes are more probable than others, the entropy decreases.

Gini Impurity

Gini impurity can be considered an alternative to entropy. It is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset.

Mathematically, it can be represented as:
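In LaTeX form, with the same class proportions p_j:

G(S) = 1 - \sum_{j=1}^{C} p_j^2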

The Gini method behaves much the same way as entropy, but entropy involves a logarithm computation while Gini impurity involves a square computation. Since computing a square is cheaper than computing a logarithm, Gini impurity is often preferred over entropy.

The graph below shows that the Gini index and entropy are very similar impurity criteria.

Some of the more popular algorithms are ID3, C4.5, and CART. Scikit-learn uses an optimized version of the CART algorithm. You can learn about the time complexity here.

Let’s understand the example here. Please refer to the play tennis dataset that is pasted above.

We have data for 14 days. We have only two outcomes:

Either we played tennis or we didn’t play.

In the given 14 days, we played tennis on 9 occasions and we did not play on 5 occasions.

Now, we will see the probability of not playing tennis.

And now entropy of outcome,
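Using the counts above:

P(\text{No}) = \tfrac{5}{14} \approx 0.357, \qquad P(\text{Yes}) = \tfrac{9}{14} \approx 0.643

E(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940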

So, the entropy of the whole system before we make our first question is 0.940

Now, we have four features to make a decision and they are:

  1. Outlook
  2. Temperature
  3. Windy
  4. Humidity

Let’s see what happens to entropy when we make our first decision on the basis of Outlook.

If we split the decision tree at level 0 based on Outlook, three branches are possible: Sunny, Overcast, or Rain.

  1. Sunny: In the given data, 5 days were sunny. Among those 5 days, tennis was played on 2 days and not played on 3 days. What is the entropy here?

2. Overcast: In the given data, 4 days were overcast and tennis was played on all four days, so the entropy here is 0.

3. Rain: In the given data, 5 days were rainy. Among those 5 days, tennis was played on 3 days and not played on 2 days. What is the entropy here?

Entropy among the three branches:

What is the reduction in randomness due to choosing an outlook as a decision-maker?
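Working through the numbers with the entropy definition above:

E(\text{Sunny}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971, \qquad E(\text{Overcast}) = 0, \qquad E(\text{Rain}) \approx 0.971

E(\text{Outlook}) = \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) \approx 0.693

\text{Gain}(S, \text{Outlook}) = 0.940 - 0.693 = 0.247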

This reduction in randomness is called Information Gain. A similar calculation can be done for other features.

Temperature

Information Gain = 0.029

Windy

Information Gain = 0.048

Humidity

Information Gain = 0.152

We can see that the decrease in randomness, or information gain, is greatest for Outlook. So we choose Outlook as the first decision-maker.

Steps to Constructing a Decision Tree

  1. Create a feature list, attribute list.

Example: Feature List: Outlook, Windy, Temperature, and Humidity

Attributes for Outlook are Sunny, Overcast, and Rainy.

2. Find the feature with the maximum information gain among all the features and assign it to the root node.

In our example this is Outlook, which has three branches: Sunny, Overcast, and Rainy.

3. Remove the feature assigned in the root node from the feature list and again find the maximum increase in information gain for each branch. Assign the feature as a child node of each branch and remove that feature from the feature list for that branch.

The Sunny branch of the Outlook root node has Humidity as its child node.

4. Repeat step 3 until you get branches with only pure leaves.

In our example, either yes or no.
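The greedy attribute-selection step in these instructions can be sketched in a few lines of Python. This is a minimal illustration (the column names and the pandas/NumPy approach are my own, not the blog's code):

```python
import numpy as np
import pandas as pd

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions at a node
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, feature, target="PlayTennis"):
    # Parent entropy minus the weighted entropy of the child nodes
    parent = entropy(df[target])
    children = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return parent - children

# Hypothetical encoding of the play-tennis counts used in the example above
data = pd.DataFrame({
    "Outlook": ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5,
    "PlayTennis": ["No", "No", "No", "Yes", "Yes"] + ["Yes"] * 4
                  + ["Yes", "Yes", "Yes", "No", "No"],
})
print(information_gain(data, "Outlook"))  # ~0.247, matching the hand calculation
```

Running the same function for every remaining feature and picking the maximum is exactly the greedy step repeated at each node.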

For More Information Visit here.

Stop criterion

If we continue to grow the tree fully until each leaf node reaches the lowest impurity, the model typically overfits the data.

If splitting is stopped too early, the error on the training data remains high and performance suffers due to bias. Thus, preventing both overfitting and underfitting is pivotal while modeling a decision tree, and it can be done in two ways.

  1. Setting constraints on tree size (see the sketch after this list)
  • Requiring a minimum number of samples to split a node.
  • Requiring a minimum number of samples at a terminal node (leaf).
  • Limiting the maximum depth of the tree (vertical depth).
  • Limiting the maximum number of terminal nodes.
  • Limiting the maximum number of features to consider for a split.
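In scikit-learn these constraints map directly onto DecisionTreeClassifier parameters. A minimal sketch (the specific values are illustrative, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the size constraints listed above
clf = DecisionTreeClassifier(
    min_samples_split=20,   # minimum samples required to split a node
    min_samples_leaf=10,    # minimum samples required at a terminal node (leaf)
    max_depth=8,            # maximum (vertical) depth of the tree
    max_leaf_nodes=50,      # maximum number of terminal nodes
    max_features="sqrt",    # maximum features considered for a split
    random_state=42,
)
```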

2. Tree pruning

Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree. It also reduces the complexity of the final classifier, and hence improves predictive accuracy by reducing overfitting.

Tree pruning can be done in two ways by pre-pruning or by post-pruning.

Pre-pruning:

  • Stop splitting the current node if the split does not improve the entropy by at least some pre-set (threshold) value.
  • Stop partitioning if the number of data points falls below some pre-set (threshold) value.
  • Restrict the depth of the tree to some pre-set (threshold) value.

Post-pruning:

  • It can be done by first allowing the tree to grow to its full potential and then pruning it level by level, calculating the cross-validation accuracy at each level.
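One concrete way to post-prune is scikit-learn's cost-complexity pruning. The sketch below (on a toy dataset, with illustrative settings) grows the full tree, computes the candidate pruning strengths, and keeps the one with the best cross-validated accuracy:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data just to make the sketch self-contained
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grow the tree to its full potential, then compute the pruning path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
ccp_alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Choose the pruning strength with the best cross-validated accuracy
scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=alpha),
        X_train, y_train, cv=5,
    ).mean()
    for alpha in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)
print(pruned_tree.score(X_test, y_test))
```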

Where can we use decision trees (for which kinds of problems)?

Decision trees can be used for both regression and classification problems. The implementations are quite different, as regression problems have real numbers as the target variable, whereas classification problems have binary or discrete target variables.

Regression

In regression, the choice of splits heavily affects a tree's accuracy, and the decision criteria are different from those of classification trees. Decision tree regression normally uses the mean squared error (MSE) to decide how to split a node into two or more sub-nodes.

Suppose we are building a binary tree. The algorithm first picks a value and splits the data into two subsets. For each subset, it calculates the MSE separately. The tree chooses the value that results in the smallest MSE.

Let's examine in more detail how splitting is decided for a decision tree regressor. The first step in creating a tree is to make the first binary decision. How do we do it?

  • We need to pick a variable and a value to split on such that the two groups are as different from each other as possible.
  • For each variable, and for each possible value of that variable, check whether the resulting split is better.
  • How do we determine whether it is better? Take the weighted average of the two new nodes (mse * num_samples).

To sum up, we now have:

  • A single number that represents how good a split is: the weighted average of the mean squared errors of the two groups that the split creates.
  • A way to find the best split which is to try every variable and to try every possible value of that variable and see which variable and which value gives us a split with the best score.

This is the entirety of creating a decision tree regressor; the tree stops growing when some stopping condition (defined by hyperparameters) is met:

  • When you hit a limit that was requested (for example, max_depth).
  • When your leaf nodes have only one sample in them (no further split is possible; the training MSE will be zero, but the model will overfit any other set and is not useful).
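A bare-bones version of this search for a single numeric feature might look like the following (my own sketch, using the fact that n * variance equals the total squared error around the mean):

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on feature x that minimizes the total
    weighted squared error (mse * num_samples) of the two children."""
    best_score, best_threshold = np.inf, None
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue  # a split must leave samples on both sides
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold, best_score

# Example: a noisy step function, so the best split should land near x = 5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.1, 200)
print(best_split(x, y))
```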

The simple diagram is shown below.

Splitting of numerical features

Splitting on a numerical feature can be performed by sorting its values in ascending order, trying each value as a candidate threshold, and calculating the information gain for each candidate. The value that gives the maximum information gain is chosen as the split threshold.

Feature scaling: Feature scaling (or column standardization) is not necessary for decision trees, because a decision tree is not a distance-based method like SVM.

Categorical features: To handle categorical features in decision trees, we need not perform one-hot encoding even if the categorical variables are nominal, since most libraries can handle categorical variables automatically; we can still assign a number to each category if desired.

Imbalanced dataset: An imbalanced class distribution has a detrimental impact on the tree's structure, so it should be handled by upsampling or downsampling, depending on the dataset.

Multiclass classification: Decision trees naturally perform multiclass classification; we can also use the one-vs-rest method.

Decision surface: The decision surface consists of axis-parallel hyper-cuboids.

Similarity matrix: Decision trees need explicit features and do not work with similarity matrices.

Large dimensionality: Apart from skewed classes, high dimensionality can also have an adverse effect on the structure of the tree: when the dimensionality is very high we have many features, and finding the splitting criterion at each node consumes a lot of time.

Outlier impact: Outliers also affect the tree's structure; as the depth increases, the chance of the tree fitting outliers increases.

Feature importance: Feature importance can be determined by calculating the normalized reduction in entropy contributed by each feature across the tree; since we want to reduce the entropy, the feature with the highest normalized sum can be regarded as the most important feature, the one with the second-highest normalized sum as the second most important, and so on. One advantage of classification trees is that they are relatively easy to interpret.

Best and Worst cases of Decision Tree Algorithm

  1. Decision trees can inherently perform multiclass classification.
  2. They provide high model interpretability because they are simply a series of if-else conditions.
  3. They can handle both numerical and categorical data.
  4. Nonlinear relationships among features do not affect the performance of the decision trees.
  5. A small change in the dataset can make the tree structure unstable which can cause variance.
  6. Decision tree learners create biased trees if the classes are imbalanced. It is therefore recommended to balance the dataset prior to fitting a decision tree.

Decision Tree with Amazon Food Reviews

Let’s apply the Decision Tree algorithm for the real-world dataset Amazon Fine Food Review Analysis from Kaggle.

First, we want to know: what is Amazon Fine Food Review Analysis?

This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.

Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — a rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — Brief summary of the review
  10. Text — Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

Data Preprocessing

Data preprocessing is a technique used to convert raw data into a clean dataset. In other words, data gathered from different sources is collected in a raw format that is not feasible for analysis.

To Know the Complete overview of the Amazon Food review dataset and Featurization visit my previous blog link here.

Train-Test split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.
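A minimal sketch with scikit-learn (here `reviews` and `labels` are placeholders standing in for the cleaned review texts and 0/1 sentiment targets produced in the preprocessing step; the split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the preprocessed Amazon reviews
reviews = ["great taste would buy again", "arrived stale very disappointed"]
labels = [1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.3, random_state=42
)
```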

Text Featurization using Bag of Words
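A minimal Bag of Words featurization with CountVectorizer might look like this (fit on the training reviews only; min_df and max_features are illustrative and would normally be tuned on the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary on the training split only, then transform both splits
vectorizer = CountVectorizer(min_df=1, max_features=5000)
X_train_bow = vectorizer.fit_transform(X_train)  # sparse document-term matrix
X_test_bow = vectorizer.transform(X_test)
print(X_train_bow.shape)
```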

Hyperparameter tuning

We want to choose the best depth and best min_split for better performance of the model; we choose these using grid search cross-validation.

We have already defined a Grid_search function; when we call it, it will give the result.
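The blog's grid search function is not reproduced here; a rough equivalent with scikit-learn's GridSearchCV could look like the following (the parameter grid and settings are assumptions based on the text):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [1, 5, 10, 50, 100, 500],
    "min_samples_split": [5, 10, 100, 500],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",  # AUC is the performance measure used in this case study
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train_bow, y_train)
print(grid.best_params_)
```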

After we find the best parameters using grid search CV, we want to check the performance on the test data; in this case study, we use AUC as the performance measure.

We have already defined a function for testing on the test data; when we call it, it will give the result.
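That test-time check can be sketched as refitting the tree with the chosen parameters and scoring the held-out predictions with roc_auc_score (again, a sketch rather than the blog's exact function):

```python
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

best_tree = DecisionTreeClassifier(
    max_depth=grid.best_params_["max_depth"],
    min_samples_split=grid.best_params_["min_samples_split"],
    class_weight="balanced",
    random_state=42,
).fit(X_train_bow, y_train)

# AUC needs scores/probabilities, not hard class labels
test_auc = roc_auc_score(y_test, best_tree.predict_proba(X_test_bow)[:, 1])
print(f"Test AUC: {test_auc:.4f}")
```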

Performance Metrics

Performance metrics are used to measure the behavior, activities, and performance of a business. This should be in the form of data that measures required data within a range, allowing a basis to be formed supporting the achievement of overall business goals.

To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.

We have already defined a function for the performance metrics; when we call it, it will give the result.

Feature Importance on BOW

Top 20 important features

Graphviz visualization of Decision Tree BOW Features
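Both of these views come directly from the fitted estimator: feature_importances_ gives the top Bag of Words terms, and export_graphviz writes a .dot file that Graphviz can render. A sketch, assuming the best_tree and vectorizer from the steps above:

```python
import numpy as np
from sklearn.tree import export_graphviz

# Top 20 BoW terms by impurity-based importance
# (use get_feature_names() on older scikit-learn versions)
feature_names = vectorizer.get_feature_names_out()
top20 = np.argsort(best_tree.feature_importances_)[::-1][:20]
for idx in top20:
    print(feature_names[idx], round(float(best_tree.feature_importances_[idx]), 4))

# Export the tree for Graphviz (render with: dot -Tpng tree.dot -o tree.png)
export_graphviz(
    best_tree,
    out_file="tree.dot",
    feature_names=feature_names,
    class_names=["negative", "positive"],
    max_depth=3,   # limit the drawing depth so the plot stays readable
    filled=True,
)
```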

Similarly, we built a Decision Tree model with TFIDF, AvgWord2Vec, TFIDF_AvgWord2Vec features. To understand the full code please visit my GitHub link.

Conclusions

To write the conclusions in a table, we used the Python library PrettyTable.

PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.
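A minimal PrettyTable sketch; only the Bag of Words row reported in the observations below is shown, since the other results live in the full notebook:

```python
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Featurization", "Best depth", "Best min_split", "Test AUC (%)"]
table.add_row(["Bag of Words", 50, 500, 80.65])
print(table)
```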

Observations

  1. Compared to all other feature representations, the Bag of Words features with a depth of 50 and min_split of 500 achieve the highest AUC score of 80.65% on the test data.

2. The optimal depth and min_split values from Grid_search_CV differ from model to model.



Thanks for reading and for your patience. I hope you liked the post; let me know if there are any errors. Let's discuss in the comments if you find anything wrong in the post or if you have anything to add.

Happy Learning!!
