A Look at Scikit Learn

Christos Maglaras
6 min read · Feb 16, 2021

With only a few lines of code, Scikit Learn can tackle almost any basic machine learning task. Regression, classification, and clustering are all among Scikit Learn’s strong suits, as are functionalities such as preprocessing and dimensionality reduction. On the other hand, more complicated machine learning tasks such as neural networks and reinforcement learning are not well covered by Scikit Learn; for those workloads TensorFlow and Keras are more appropriate. Scikit Learn is built on the NumPy stack, relying on NumPy for ndarrays and SciPy for sparse matrices, so knowledge of these two foundational libraries will be helpful moving forward. Visualizations are not built into the library either; matplotlib or seaborn will have to be employed to visualize your results, and the statsmodels package will often be needed for certain tasks such as time series forecasting.
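To give a sense of just how few lines that is, here is a minimal sketch of the standard workflow; the bundled iris dataset and logistic regression are arbitrary illustrative choices, and every Scikit Learn estimator follows this same fit/predict pattern.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a bundled dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit an estimator, then score it on the held-out data.
model = LogisticRegression(max_iter=1000)  # raised max_iter so the solver converges
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy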

Scikit Learn’s user guide covers roughly two dozen model families, each with many subtypes detailing specific variants. Every one of these models fits into one of two categories: supervised or unsupervised. The difference is that supervised models are trained with an “answer key” that the performance of the model is graded against. An unsupervised model is not able to grade itself on the difference between ground truth and what it has predicted, as it is not provided labeled data; instead it must learn the distinctions and associations between groups and values on its own. Supervised learning is generally more accurate and trustworthy, but it requires data labeled with the correct output values, which is not always available in a real-world scenario. Unsupervised learning is less accurate, but it is better at uncovering unknown patterns and correlations within the data, and it does not require the training data to include an “answer key”. Both techniques are useful in their own right when applied correctly, and which type to reach for will become clearer as you learn more about the models within each.
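The split shows up directly in the API: supervised estimators take both the data and the labels when fitting, while unsupervised estimators take only the data. A small sketch, where the two model choices are arbitrary examples of each type:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y act as the "answer key" during training.
clf = DecisionTreeClassifier().fit(X, y)

# Unsupervised: only X is passed; the model must find structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:3]))  # predicted labels
print(km.labels_[:3])      # discovered cluster assignments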

Linear Regression is the most basic of supervised models, although it has many variants. As you can guess, linear regression fits a straight best-fit line through the data. The base subtype is called Ordinary Least Squares, and it has a few issues that may arise, such as multicollinearity between independent variables. Subsequent subtypes attempt to remedy the issues with ordinary least squares: the Ridge subtype alleviates some of the multicollinearity problems, while Least Angle Regression expands the ability to handle high-dimensional data. Bayesian regression even adapts its regularization to your specific dataset. While linear regression is undoubtedly a useful tool, there are newer, more advanced methods that Scikit Learn makes just as simple to use.
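As a quick sketch of two of those variants side by side, here Ordinary Least Squares and Ridge are fit to the same synthetic data; the toy data and the alpha penalty value are arbitrary.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data: y = 3x plus a little noise.
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X.ravel() + 0.1 * rng.randn(100)

ols = LinearRegression().fit(X, y)   # ordinary least squares
ridge = Ridge(alpha=1.0).fit(X, y)   # adds an L2 penalty to tame multicollinearity
print(ols.coef_, ridge.coef_)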

SVMs, or support vector machines, are a more widely used method that is effective for both classification and regression, where the technique is called support vector regression, or SVR. SVMs can process high-dimensional data, even where there are more dimensions than samples. One drawback of the model is that it is resource-expensive and does not scale well to larger datasets. There are a number of different pre-built kernels for this model to use, or you can pass in one of your own. The standard options include a linear kernel (or the dedicated LinearSVC estimator), SVC with an RBF kernel, and SVC with polynomial kernels of different degrees.
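A short sketch of two of those kernels on the same data; the dataset and the C, gamma, and degree values are illustrative defaults rather than tuned settings.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel (the SVC default); C and gamma are the usual tuning knobs.
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Polynomial kernel of degree 3, another of the built-in options.
poly = SVC(kernel="poly", degree=3).fit(X_train, y_train)

print(rbf.score(X_test, y_test), poly.score(X_test, y_test))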

SGD, or stochastic gradient descent, is not a model of its own but an efficient way to train one, minimizing the loss one sample at a time. Scikit Learn exposes it for both classification and regression, and it is usually reached for because it trains efficiently on very large datasets. For smaller datasets, plain linear models are more applicable, as there is not as much need for efficiency. Although SGD is simple to implement, parameter tuning is necessary, as it is sensitive to feature scaling and learning-rate settings.
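A minimal sketch of SGD-based classification, with scaling handled in a pipeline since SGD is sensitive to feature scale; the digits dataset and the hinge-loss and alpha values are arbitrary illustrative choices.

from sklearn.datasets import load_digits
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)

# Standardize first, then fit a linear model by stochastic gradient descent.
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0),
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, just to confirm the fit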

Nearest Neighbors is one of the most well-known models, possibly because it comes in both supervised and unsupervised flavors. The supervised variants classify labeled data or compute regression on it, while the unsupervised variant simply finds the points closest to a query in unlabeled data. Either way, it operates by predicting from a predefined number of the closest points, or from all the points within a fixed radius. KDE, or kernel density estimation, is a very popular technique whose Scikit Learn implementation is built on this same neighbor-search machinery.
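Both flavors in a short sketch; the neighbor counts here are arbitrary.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

X, y = load_iris(return_X_y=True)

# Supervised flavor: predict a label from the 5 closest labeled points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:2]))

# Unsupervised flavor: just report each sample's nearest neighbors.
nn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = nn.kneighbors(X[:2])
print(indices)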

Naive Bayes is a method that works quite well for classification, but not very well for estimation. It is “naive” because it assumes the features are independent of one another given the class, an oversimplification that allows it to run extremely fast. Although it makes this strong assumption, it has proven itself to work very well for classification, and its subtypes each adapt the assumption to a different kind of data.
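A sketch of the Gaussian subtype, which models each feature as normally distributed within each class; the dataset is an arbitrary choice.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "naive" independence assumption makes fitting nearly instantaneous.
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))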

Decision Trees are another model type that is among the most popular. A decision tree creates and stores its logic in a branching tree format. There are many unique benefits as well as downsides to this model. A few of its benefits are that it needs little data preparation, it is easily understandable since the model is just a flow of logic, its computational cost scales well, and it is able to handle both continuous and categorical data, although the scikit-learn implementation at the time of writing does not support categorical values. On the downside, overfitting is an often-encountered issue with this model, requiring countermeasures such as limiting the depth of the tree or pruning ‘leaves’ from it.
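A sketch showing both the depth cap and how readable the resulting logic is; max_depth=3 is an arbitrary illustrative limit.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Capping the depth is one simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The fitted model prints as a plain flow of if/else rules.
print(export_text(tree))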

Moving on to unsupervised models, we have Manifold Learning, which is used as a dimensionality reduction tool. At its core it takes a projection of high-dimensional data and transforms it into a lower-dimensional object, much the same way a paper map is a lower-dimensional projection of a globe. Taking that projection from a random angle would lose most of the useful information, so each subtype of this model uses a different algorithm for choosing the projection. A few of the techniques are Multi-dimensional Scaling, Locally Linear Embedding, Spectral Embedding, and Local Tangent Space Alignment. One caveat is that manifold learning does not work well with noisy data, which can connect regions of the space that would otherwise be separated in the dimensionality reduction.
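A short sketch using Locally Linear Embedding to flatten the classic swiss-roll example; the sample count and n_neighbors setting are arbitrary.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A swiss roll is a 2-D sheet curled up in 3-D; LLE "unrolls" it.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)
print(X.shape, "->", X_2d.shape)  # (1000, 3) -> (1000, 2)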

The clustering module is one of the largest in Scikit Learn, as clustering is one of the most popular ways to differentiate items within a dataset. K-Means is a popular general-purpose model for lower-dimensional data with a small number of clusters of roughly even size. Affinity Propagation passes messages between samples to elect exemplar points as cluster centers; it finds the number of clusters on its own and can handle many of them, but its main drawback is computation time. Mean-Shift similarly discovers its own cluster centers, shifting candidate centroids toward the densest nearby regions, and it suits data that forms smooth, uneven blobs. Spectral Clustering is useful for tasks such as image segmentation and can handle only a few clusters of similar size. Hierarchical clustering creates a tree of clusters nested within other clusters, allowing it to support a high number of total clusters; of its many variants, Ward is the main one.
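For instance, a minimal K-Means sketch on synthetic blobs, the kind of evenly sized, low-dimensional clusters it handles best; the blob and cluster counts are arbitrary.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Three well-separated blobs of similar size.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Unlike Affinity Propagation, K-Means needs the cluster count up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_[:10])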

Neural network models are actually built into Scikit Learn: supervised networks like multi-layer perceptrons and unsupervised networks like restricted Boltzmann machines are ready to use. The reason TensorFlow and Keras are encouraged instead is that Scikit Learn can only use the CPU for compute, while TensorFlow is able to leverage GPU power. Neural network training is extremely costly, so attempting to build anything beyond the simplest network will take orders of magnitude longer running solely on a CPU.
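A sketch of a small multi-layer perceptron, about the scale at which CPU-only training remains practical; the layer size and iteration cap are arbitrary.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 units; everything here runs on the CPU.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))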

Preprocessing tools are included in Scikit Learn as well. Support covers standardization, which matters for most estimators because they expect roughly Gaussian-looking features with zero mean and unit variance. Two types of non-linear transformations are supported, quantile and power transformations, which can map features onto a uniform or normal (Gaussian) distribution. Normalization features are included as well to scale individual samples. Various encoding methods are present, such as OneHotEncoder for categorical features. Finally, there are discretization functions to bin continuous values into discrete ones.
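Three of those tools in one short sketch; the toy values are arbitrary.

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer

X = np.array([[1.0], [2.0], [3.0], [10.0]])

# Standardization: rescale to zero mean and unit variance.
print(StandardScaler().fit_transform(X).ravel())

# One-hot encoding of a categorical column.
colors = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(colors).toarray())

# Discretization: bin the continuous values into two ordinal buckets.
print(KBinsDiscretizer(n_bins=2, encode="ordinal").fit_transform(X).ravel())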

Scikit Learn is ultimately an essential part of today’s machine learning environment; the plethora of algorithms offered, combined with its ease of use, place it at the top of the list for the functionality it presents. Its documentation and the project as a whole are among the best maintained of any library. To expand on this short look, feel free to delve into any one of the points above and explore all of the further possibilities.
