
When (and When Not) to Use Them, and How

Wei Hao Khoong

You may have chanced upon my previous article introducing support vector machines (SVMs), where the key fundamental concepts were presented at a high level. In this article, we discuss when SVMs are not appropriate to use, across classification and regression use cases.

The original SVM implementation is known to have a solid theoretical foundation, but it is not suitable for classification on large datasets for one straightforward reason — the complexity of training the algorithm is highly dependent on the size of the dataset. In other words, training time grows with the dataset to the point where training becomes infeasible due to compute constraints.

On the bright side, there have been several advancements to the SVM since its original implementation by AT&T Bell Laboratories back in 1992 [1]. Training SVMs is now much more scalable with dataset size.
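As a rough sketch of this point, linear SVM solvers such as scikit-learn's LinearSVC (backed by liblinear) scale far better with sample count than the kernelized SVC, whose training cost grows roughly quadratically with the number of samples. The dataset here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Illustrative synthetic dataset; at this size a kernelized SVC is
# already noticeably slower to train, while LinearSVC remains fast.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

clf = LinearSVC(C=1.0, max_iter=10_000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

For even larger datasets, stochastic solvers such as SGDClassifier with a hinge loss approximate a linear SVM while streaming over the data, which is another common way to sidestep the scaling issue.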

There are two more well-documented reasons [2] why SVMs can perform poorly, both related to class imbalance. The first is a weakness of the soft-margin optimization problem: when imbalanced data is used for training, the resulting hyperplane is skewed toward the minority class.

The second arises from an imbalanced support vector ratio: when the ratio of positive to negative support vectors is skewed, data points near the decision boundary have a higher chance of being classified as negative.

There are, however, approaches to reduce this impact. One of the more commonly used is to introduce class weights, so that the influence of the positive (minority-class) support vectors is proportionately higher than that of the negative ones. Class weights are used in other machine learning algorithms as well when training on imbalanced datasets.
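A minimal sketch of this idea in scikit-learn, using a synthetic imbalanced dataset (the 95/5 split and all parameters here are illustrative assumptions, not from the original article):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical imbalanced dataset: roughly 95% negative, 5% positive.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05],
                           random_state=0)

# class_weight='balanced' scales the penalty C for each class inversely
# to its frequency, so misclassifying a minority-class point costs more.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)
```

Passing an explicit dictionary such as `class_weight={1: 10}` achieves the same effect when you want direct control over the per-class penalty.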

Another, somewhat intuitive, consideration is kernel choice. For SVMs nowadays, choosing the right kernel function is key. As an example, using the linear kernel when the data are not linearly separable results in the algorithm performing poorly. However, choosing the 'right' kernel is a problem in its own right; among the techniques used, a popular one is to vary the kernel function as part of the hyperparameter search.
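One way to sketch that search: treat the kernel as just another hyperparameter and let cross-validation pick it. The dataset, grid values, and parameters below are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two interleaving half-moons: a dataset that is not linearly separable.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Include the kernel itself in the grid alongside the usual C values.
param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

On data like this, cross-validation typically selects a nonlinear kernel over the linear one, which is exactly the failure mode described above being caught automatically.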

When the data are noisy, the target classes overlap, in the sense that points from different classes can have very similar or overlapping feature values. This can lead the optimizer to arrive at one of several local optima, due to the nature of the optimization algorithm, especially for…
