Support Vector Machines with Amazon Food Reviews

Sachin D N
Published in Analytics Vidhya
Oct 21, 2020 · 17 min read


SVM is a supervised Machine Learning algorithm used in many classification and regression problems. It remains one of the most robust and widely used prediction methods for classification tasks.

Contents

  1. Geometric Intuition Of Support Vector Machines.
  2. Mathematical Formulation of Support Vector Machines.
  3. Loss Minimization Interpretation of SVMs.
  4. Dual Form of Support Vector Machines.
  5. Kernel Trick in Support Vector Machines.
  6. Train and Runtime Complexities of SVMs.
  7. Support Vector Machines — Regression (SVR)
  8. Best and Worst cases of Support Vector Machines Algorithm.
  9. Support Vector Machines with Amazon Food Reviews.

Key Idea of SVM

Logistic Regression doesn’t care whether the instances are close to the decision boundary. Therefore, the decision boundary it picks may not be optimal. If a point is far from the decision boundary, we may be more confident in our predictions.

Therefore, the optimal decision boundary should maximize the distance between the decision boundary and the instances closest to it, i.e., maximize the margin. That is why the SVM algorithm is important!

Find a hyperplane that separates the positive points from the negative points with as wide a margin as possible.

The support vector machine works by finding an optimal separating hyperplane that accurately separates two or more classes in a classification problem. The goal is to find this optimal hyperplane by training the SVM algorithm on the (linearly separable) data.

Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify star and circle.

We need to remember a rule of thumb to identify the right hyper-plane: "Select the hyper-plane which segregates the two classes better." In this scenario, hyper-plane "B" has done this job excellently.

Here, we have three hyper-planes (A, B, and C) and all are segregating the classes well. Now, How can we identify the right hyper-plane?

Here, maximizing the distance between the hyper-plane and the nearest data point (of either class) will help us decide the right hyper-plane. This distance is called the margin.

Above, you can see that the margin for hyper-plane C is higher than for both A and B. Hence, C is the right hyper-plane. Another reason for selecting the hyper-plane with a higher margin is robustness.

If we select a hyper-plane having a low margin then there is a high chance of misclassification.

Support Vectors

These are the data points that lie closest to the decision boundary and are therefore the most difficult to classify. In general, the larger the margin, i.e., the distance between the support vectors of the two classes, the easier it is for the algorithm to classify accurately. Hence, once the hyperplane is optimized, it is called the optimal separating hyperplane or the maximum margin classifier.

Alternative Intuition about SVMs

In this section, we will be using the concept of “Convex Hull” in order to find the margin-maximizing hyper-plane.

Convex Hull can be defined as the smallest convex polygon such that all its points are either inside the polygon or on the polygon.

A convex polygon is basically a polygon such that all the points on the shortest line connecting any two points (situated inside or on the polygon) lie inside or on the polygon.

Steps to find the margin-maximizing hyper-plane using the concept of Convex Hull,

  • Draw the convex hull for positive data points.
  • Draw the convex hull for negative data points.
  • Then find the shortest line or hyper-plane connecting the hulls.
  • Finally, draw the hyper-plane that perpendicularly bisects this shortest connecting line. The hyper-plane obtained is referred to as the "margin-maximizing hyper-plane".

Mathematical Formulation of Support Vector Machines

Let us consider that the positive and negative hyper-planes are at unit distance from the separating hyper-plane.

The equation of the separating hyperplane is given by,

w^T x + b = 0

The equation of the positive hyperplane is given by,

w^T x + b = +1

The equation of the negative hyperplane is given by,

w^T x + b = -1

The margin, i.e. the distance between the positive and negative hyperplanes, is given by,

margin = 2 / ||w||

Then, for the SVM, we can write the constrained optimization problem as,

max over (w, b) of 2 / ||w||   subject to   y_i (w^T x_i + b) >= 1 for all i

(equivalently, min over (w, b) of (1/2) ||w||^2 under the same constraints)

This formulation is called the Hard Margin SVM.

The above formulation works only if the given data is linearly separable, i.e., all positive points lie on one side of the plane, all negative points lie on the other side, and no data points lie between the positive and negative hyperplanes.

As most real-world data is not fully linearly separable, we allow some margin violations to occur; this is called soft margin classification. It is often better to have a large margin even though some constraints are violated. A margin violation means choosing a hyperplane that allows some data points to lie either on the incorrect side of the hyperplane or between the margin and the correct side of the hyperplane.

Since the Hard Margin SVM cannot handle this case, we move to a different formulation.

We keep the same separating, positive, and negative hyperplanes as before, but introduce a slack variable ζ_i for every training point that measures by how much that point violates the margin.

Then, for the SVM, we can write the constrained optimization problem as,

min over (w, b, ζ) of (1/2) ||w||^2 + C * Σ_i ζ_i
subject to   y_i (w^T x_i + b) >= 1 - ζ_i   and   ζ_i >= 0 for all i

This formulation is called the Soft Margin SVM.

In the above equation, the first term (before the plus sign) is the regularization term and the second term corresponds to the hinge loss: it penalizes points that violate the margin.

C is a hyper-parameter. If C increases, the model tends toward overfitting (high variance); if C decreases, the model tends toward underfitting (high bias).

For large values of C, the optimization will choose a smaller-margin hyper-plane if that hyper-plane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyper-plane, even if that hyper-plane misclassifies more points.
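As an illustrative sketch (not the author's exact code), the effect of C can be seen by fitting a linear SVM on a toy dataset with different C values; the dataset and values below are assumptions made only for demonstration.

```python
# Illustrative sketch: effect of the C hyper-parameter on a linear SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=42)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width is 2 / ||w||; a larger C usually gives a narrower margin.
    margin = 2 / (clf.coef_ ** 2).sum() ** 0.5
    print(f"C={C:>6}: margin width = {margin:.3f}, "
          f"#support vectors = {clf.n_support_.sum()}")
```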

Loss Minimization Interpretation of SVMs

In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is hinge loss.

On the hinge-loss plot, the y-axis is the hinge loss and the x-axis is the signed score z_i = y_i (w^T x_i + b): the loss is zero once z_i >= 1 and grows linearly as z_i decreases below 1.

We need to minimize this loss to find the max-margin classifier. The optimization problem for the loss-minimization view is given by,

min over (w, b) of Σ_i max(0, 1 - y_i (w^T x_i + b)) + λ ||w||^2

The optimization problem obtained in this section and the one obtained in the previous section are the same; C and λ are simply inversely related, λ = 1/C (C is conventionally used as the regularization coefficient).

The function of the first term, the hinge loss, is to penalize misclassifications: it measures the error due to margin violations. The second term is the regularization term, a technique to avoid overfitting by penalizing large coefficients in the solution vector.

λ (lambda) is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each x_i lies on the correct side of the margin.
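A minimal NumPy sketch (my own illustration, not from the original post) of how the regularized hinge-loss objective above is evaluated; the tiny dataset is made up:

```python
# Minimal sketch of the regularized hinge-loss objective described above.
import numpy as np

def svm_objective(w, b, X, y, lam):
    """lam * ||w||^2 + sum_i max(0, 1 - y_i * (w.x_i + b)); y must be in {-1, +1}."""
    scores = X @ w + b
    hinge = np.maximum(0.0, 1.0 - y * scores)   # zero loss once a point is beyond the margin
    return lam * np.dot(w, w) + hinge.sum()

# Tiny hypothetical example
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```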

Dual Form of Support Vector Machines

The soft margin optimization written above is called the Primal form of the SVM:

min over (w, b, ζ) of (1/2) ||w||^2 + C Σ_i ζ_i   subject to   y_i (w^T x_i + b) >= 1 - ζ_i, ζ_i >= 0

The above equation is equivalently written in its Dual form as,

max over α of Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
subject to   0 <= α_i <= C   and   Σ_i α_i y_i = 0

Both forms are equivalent to each other. In the dual form, α_i = 0 for non-support vectors and α_i > 0 for support vectors.
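In scikit-learn, a fitted SVC exposes the support vectors and their dual coefficients (α_i·y_i), which is a quick way to confirm that only the support vectors carry non-zero α. A small sketch on an assumed toy dataset:

```python
# Sketch: inspecting support vectors and dual coefficients of a fitted SVC.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("Number of support vectors per class:", clf.n_support_)
print("Shape of support vectors:", clf.support_vectors_.shape)
# dual_coef_ holds y_i * alpha_i for the support vectors only;
# all other training points have alpha_i = 0 and are not stored.
print("Dual coefficients shape:", clf.dual_coef_.shape)
```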

Kernel Trick in SVM

All the algorithms we have described so far use the data only through inner products. Because of this, they can be made non-linear in a very general way.

That sounds alright. However, when there are more and more dimensions, computations within that space become more and more expensive. This is when the kernel trick comes in. It allows us to operate in the original feature space without computing the coordinates of the data in a higher-dimensional space.

Let's recap what we have seen. In short, an SVM classifier can be solved by computing the convex Lagrange dual of the max-margin formulation shown in the previous section. Notice that this dual depends on the training data only through the inner products x_i^T x_j, and this is exactly where the kernel enters.

The kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called “generalized dot product”.

For non-linear data, instead of working with the original features of each data point, let us first consider a transformation to a new feature space in which the data point has N features, one for each support vector.

Now consider what if we had data as shown in the image below? Clearly, there is no line that can separate the two classes in this x-y plane. So what do we do?

We apply a transformation and add one more dimension, which we call the z-axis. Let the value of a point along z be z = x² + y², which can be interpreted as the squared distance of the point from the origin. If we now plot the points against the z-axis, a clear separation is visible and a line can be drawn.

What’s left is mapping it back to two dimensions, When we transform back this line to the original plane, it maps to the circular boundary as shown below.

What kernelization does is that it takes data which is ‘d’ dimensional and it does a feature transform internally and implicitly using the kernel trick to a dimension ‘d1’, typically where d1 >d. In ‘d1’, the data becomes linearly separable.
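A small sketch of the circular-data example (an assumed setup, not the article's original code): explicitly adding the z = x² + y² feature makes the data linearly separable, and the RBF kernel achieves the same effect implicitly.

```python
# Sketch: non-linear data (concentric circles) and the kernel trick.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Explicit transform: add z = x^2 + y^2 as a third feature, then a linear SVM works.
Z = np.c_[X, (X ** 2).sum(axis=1)]
print("linear SVM on transformed data:", SVC(kernel="linear").fit(Z, y).score(Z, y))

# Kernel trick: the RBF kernel separates the same data without the explicit transform.
print("RBF SVM on original data:     ", SVC(kernel="rbf").fit(X, y).score(X, y))
```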

If we want to know more about the Kernel trick, you can refer here.

Do note that once the kernel function transformation is applied, the original features of the data point are now irrelevant. It is then represented only in terms of its dot products with support vectors (which are basically special data points chosen by the SVM optimization algorithm).

There are a couple of non-linear methods that are called kernel functions or kernel tricks which can be applied to transform the data.

So if we have a function K defined as follows:

K(x, y) = φ(x)^T φ(y),  i.e. the inner product of the mapped points in the transformed space,

then we just need to know K and not the mapping function φ itself. This function is known as the kernel function, and it removes the need to compute the mapping explicitly. So, the kernel function defines the inner products in the transformed space.

We have to choose the correct kernel based on our application. There are various types of kernels; the most commonly used are the linear, polynomial, RBF (Gaussian), and sigmoid kernels.
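As a quick check (my own sketch, with made-up points), the RBF kernel value between two vectors can be computed directly from its formula and compared with scikit-learn's implementation:

```python
# Sketch: computing the RBF (Gaussian) kernel K(x, y) = exp(-gamma * ||x - y||^2).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 0.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x - y) ** 2))
print(manual, rbf_kernel(x, y, gamma=gamma)[0, 0])  # both values should match
```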

Mercer’s Theorem

Apart from these predefined kernels, what conditions determine which functions can be considered as Kernels? This is given by Mercer’s theorem.

The first condition is rather trivial, i.e. the kernel function must be symmetric. As a kernel function is the dot product (inner product) of the mapping function, we can write:

K(x, y) = φ(x)^T φ(y) = φ(y)^T φ(x) = K(y, x)

Mercer's theorem gives the necessary and sufficient condition for a function to be a valid kernel function: for any finite set of points, the Gram matrix built from K must be symmetric and positive semi-definite.
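A rough way to sanity-check Mercer's condition on a finite sample (a sketch, not a proof; random data assumed) is to verify that the Gram matrix is symmetric and has no significantly negative eigenvalues:

```python
# Sketch: checking that a Gram matrix built from the RBF kernel is symmetric and PSD.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

K = rbf_kernel(X, X, gamma=1.0)
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # should be >= ~0 (up to numerical error)
```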

Train and Runtime Complexities of SVMs

The training time complexity of SVMs is approximately O(n²). If n is very large, O(n²) becomes prohibitive, so SVMs are not preferred for very large datasets or low-latency applications.

The Runtime Complexity is approximately O(k.d) where

k = Number of Support Vectors

d =Dimensionality of data

Support Vector Machines — Regression (SVR)

Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). The Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences.

Suppose we are given training data {(x₁, y₁), …, (xₙ, yₙ)} ⊂ X × ℝ, where X denotes the space of the input patterns (e.g. X = ℝ^d). These might be, for instance, exchange rates for some currency measured on subsequent days together with corresponding econometric indicators.

In ε-SV regression, our goal is to find a function f (x) that has at most ε deviation from the actually obtained targets yi for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this. This may be important if you want to be sure not to lose more than ε money when dealing with exchange rates, for instance.

For pedagogical reasons, we begin by describing the case of linear functions f, taking the form

f(x) = ⟨w, x⟩ + b   with w ∈ X, b ∈ ℝ

However, the main idea is always the same: to minimize error, individualizing the hyper-plane which maximizes the margin, keeping in mind that part of the error is tolerated.

Sometimes, however, this may not be the case, or we may also want to allow for some errors. Analogously to the "soft margin" loss function, we introduce slack variables ξ_i and ξ_i* for deviations above and below the ε-tube, and arrive at the formulation:

min over (w, b, ξ, ξ*) of (1/2) ||w||^2 + C Σ_i (ξ_i + ξ_i*)
subject to   y_i − ⟨w, x_i⟩ − b <= ε + ξ_i,
             ⟨w, x_i⟩ + b − y_i <= ε + ξ_i*,
             ξ_i, ξ_i* >= 0

The constant C > 0 determines the trade-off between the flatness of ‘f’ and the amount up to which deviations larger than ε are tolerated.

This corresponds to dealing with the so-called ε-insensitive loss function |ξ|_ε described by

|ξ|_ε = 0 if |ξ| <= ε, and |ξ| − ε otherwise

To know more about SVR visit here.
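A hedged sketch of ε-SVR with scikit-learn; the toy data and parameter values below are my own assumptions, not the article's:

```python
# Sketch: Support Vector Regression with an epsilon-insensitive tube.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon sets the width of the tube within which errors are ignored;
# C trades off flatness of f against deviations larger than epsilon.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("R^2 on training data:", svr.score(X, y))
```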

If we don’t know which kernel to use we can always use the RBF kernel to solve the problem.

Tuning Parameters

Kernel: We have already discussed how important kernel functions are. Depending on the nature of the problem, the right kernel function has to be chosen, as the kernel function determines the shape of the decision surface.

Regularization: In Python's Scikit-learn, the C parameter is the regularization parameter. C is the penalty parameter of the misclassification or error term; it tells the SVM optimization how much error is bearable. This is how you control the trade-off between the width of the decision boundary and the misclassification term. A smaller value of C creates a wider-margin hyperplane that tolerates more misclassifications, while a larger value of C creates a narrower-margin hyperplane.

Gamma: A lower value of gamma will fit the training dataset loosely, whereas a higher value of gamma will fit the training dataset exactly, which causes over-fitting. In other words, a high value of gamma means only nearby points influence the separation line, while a low value of gamma means far-away points are also considered when calculating the separation line.

Effect of σ on the RBF Kernal Values

When σ = 1, only points within a small range are considered similar to one another.

When σ = 10, comparatively more points have a chance of being similar to one another. When σ is very, very small, only a few points in the immediate vicinity are considered similar.

Let us look at the graph below: sigma, as usually defined for a Gaussian distribution, is the standard deviation, and it determines the width of the Gaussian. (In the RBF kernel, gamma plays the role of 1/(2σ²), so a large σ corresponds to a small gamma.)

When we use the RBF kernel in the dual form of the Soft-Margin SVM, we get different decision boundaries for different values of sigma and C. Both sigma and C are hyper-parameters.

Multiclass classification: For multiclass classification, we typically use the One-vs-Rest method.

Decision surface: The decision surface is linear (a hyperplane).

Similarity matrix: SVMs are easy to kernelize; a similarity function between points can be used directly as the kernel K(x_i, x_j).

Outlier impact: Outliers have less impact because the solution depends only on the support vectors.

Large dimensionality: SVMs are very comfortable with large dimensionality d, but we need to choose the right kernel.

Best and Worst cases of Support Vector Machines Algorithm

  1. SVM Classifiers offer good accuracy and perform faster prediction compared to the Naïve Bayes algorithm.
  2. They also use less memory because they use a subset of training points in the decision phase.
  3. SVM works well with a clear margin of separation and with high dimensional space.
  4. SVM is not suitable for large datasets because of its high training time; it also takes more time to train compared to Naïve Bayes.
  5. It works poorly with overlapping classes and is also sensitive to the type of kernel used.

Support Vector Machines with Amazon Food Reviews

Let's apply the Support Vector Machines algorithm to the real-world Amazon Fine Food Reviews dataset from Kaggle.

First, what is the Amazon Fine Food Reviews dataset?

This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.

Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — a rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — Brief summary of the review
  10. Text — Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

Data Preprocessing

Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.

To Know the Complete overview of the Amazon Food review dataset and Featurization visit my previous blog link here.

Train-Test split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.
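A minimal sketch of the split (the dummy reviews and labels below are placeholders; the original post splits the preprocessed review text and its positive/negative labels):

```python
# Sketch: splitting reviews into train and test sets (dummy data for illustration).
from sklearn.model_selection import train_test_split

reviews = ["great product", "tastes awful", "loved it", "would not buy again"] * 50
labels = [1, 0, 1, 0] * 50   # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.3, random_state=42, stratify=labels)
print(len(X_train), len(X_test))
```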

Text Featurization using Bag of Words
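The original code gist is not reproduced here; a minimal Bag-of-Words sketch with CountVectorizer, fit on the training text only (parameter values are assumptions), would look roughly like this:

```python
# Sketch: Bag-of-Words featurization fit on training data only to avoid leakage.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=10, max_features=5000)
X_train_bow = vectorizer.fit_transform(X_train)   # X_train from the split above
X_test_bow = vectorizer.transform(X_test)
print("BOW matrix shape:", X_train_bow.shape)
```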

Hyper Parameter tuning

We want to choose the best alpha (the regularization hyper-parameter) for better performance of the model; we choose it using grid search cross-validation.

We already defined a Grid_search function; when we call it, it will give the result.

After we find the best alpha using grid search CV, we want to check the performance on the test data; in this case study, we use AUC as the performance measure.

We already defined a function for testing on the test data; when we call it, it will give the result.
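The exact Grid_search function is not shown in this excerpt; a hedged sketch of what it might look like, assuming a linear SVM trained with SGDClassifier (where alpha is the regularization strength) and AUC as the scoring metric:

```python
# Sketch: choosing alpha with grid search cross-validation (AUC as the metric).
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

params = {"alpha": [10 ** i for i in range(-5, 3)]}
base = SGDClassifier(loss="hinge", penalty="l2", random_state=42)

grid = GridSearchCV(base, params, scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X_train_bow, y_train)        # features and labels from the BOW step above
print("best alpha:", grid.best_params_["alpha"], "CV AUC:", grid.best_score_)
```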

Performance Metrics

Performance metrics are used to measure the behaviour and quality of a trained model. They should be quantitative measurements computed on held-out data, giving a basis for judging whether the model achieves the overall objective.

To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.

We already defined a function for the performance metrics; when we call it, it will give the result.
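A sketch of evaluating the tuned model on the held-out test data with AUC, assuming the grid-search sketch above:

```python
# Sketch: evaluating the tuned SVM on the test set with the AUC metric.
from sklearn.metrics import roc_auc_score

best_model = grid.best_estimator_
test_scores = best_model.decision_function(X_test_bow)
print("Test AUC:", roc_auc_score(y_test, test_scores))
```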

Feature Importance on BOW

Top 10 important features
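For a linear SVM, feature importance can be read off the learned weights; a sketch (assuming the vectorizer and model from the earlier sketches) of how the top positive and negative words might be extracted:

```python
# Sketch: top-10 most positive and most negative BOW features of a linear SVM.
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
weights = best_model.coef_.ravel()        # one weight per BOW feature

order = np.argsort(weights)
print("Top 10 negative features:", feature_names[order[:10]])
print("Top 10 positive features:", feature_names[order[-10:][::-1]])
```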

Similarly, we built a Support Vector Machines model with TFIDF, AvgWord2Vec, TFIDF_AvgWord2Vec features with Kernelization also. To understand the full code please visit my GitHub link.

Conclusions

To present the conclusions in a table, we used the Python library PrettyTable.

PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.
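A small PrettyTable sketch; the rows below are placeholders except for the TFIDF AUC score mentioned in the observations:

```python
# Sketch: summarizing results in a PrettyTable (values are placeholders).
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Vectorizer", "Model", "Best hyper-parameter", "Test AUC"]
table.add_row(["BOW", "Linear SVM", "...", "..."])
table.add_row(["TFIDF", "Linear SVM", "...", "93.29"])
print(table)
```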

Observations

  1. Compared to the Bag-of-Words feature representation, TFIDF features with L2 regularization achieve the highest AUC score of 93.29% on the test data.
  2. The best C and gamma values differ from model to model.


References

  • Applied AI
  • Wikipedia
  • Coursera
  • Data Camp

Thanks for reading, and for your patience. I hope you liked the post. Let me know if there are any errors; let's discuss in the comments if you find anything wrong or have anything to add.

Happy Learning!!
