
MLlib is a Spark implementation of some common machine learning (ML) functionality, as well as associated tests and data generators. MLlib currently supports four common types of machine learning problem settings, namely binary classification, regression, clustering, and collaborative filtering, as well as an underlying gradient descent optimization primitive. This guide outlines the functionality supported in MLlib and provides an example of invoking it.

Dependencies
MLlib uses the jblas linear algebra library, which itself depends on native Fortran routines. You may
need to install the gfortran runtime library if it is not already present on your nodes. MLlib will throw a
linking error if it cannot detect these libraries automatically.
To use MLlib in Python, you will need NumPy version 1.7 or newer and Python 2.7.

Binary Classification
Binary classification is a supervised learning problem in which we want to classify entities into one of
two distinct categories or labels, e.g., predicting whether or not emails are spam. This problem
involves executing a learning algorithm on a set of labeled examples, i.e., a set of entities
represented via (numerical) features along with underlying category labels. The algorithm returns a
trained model that can predict the label for new entities for which the underlying label is unknown.
MLlib currently supports two standard model families for binary classification, namely Linear Support
Vector Machines (SVMs) and Logistic Regression, along with L1 and L2 regularized variants of each
model family. The training algorithms all leverage an underlying gradient descent primitive
(described below), and take as input a regularization parameter (regParam) along with various
parameters associated with gradient descent (stepSize, numIterations, miniBatchFraction).
Available algorithms for binary classification:

- SVMWithSGD
- LogisticRegressionWithSGD
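
For example, a linear SVM can be trained and applied as in the Scala sketch below. This is a minimal sketch rather than a canonical example: the file path and record format are illustrative, it is written for the Spark shell (where sc is a predefined SparkContext), and it assumes an API in which LabeledPoint pairs a label with an Array[Double] of features.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Load a whitespace-delimited file of "label feature1 feature2 ..."
    // records (the path and format are illustrative).
    val data = sc.textFile("mllib/data/sample_svm_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(' ')
      LabeledPoint(parts(0).toDouble, parts.tail.map(_.toDouble))
    }

    // Train a linear SVM using 20 iterations of stochastic gradient descent.
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Predict the label of an entity from its features.
    val prediction = model.predict(parsedData.first().features)

Swapping SVMWithSGD for LogisticRegressionWithSGD trains a logistic regression model on the same inputs; additional train overloads expose the gradient descent parameters described above.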

Linear Regression
Linear regression is another classical supervised learning setting. In this problem, each entity is
associated with a real-valued label (as opposed to a binary label as in binary classification), and we
want to predict labels as closely as possible given numerical features representing entities. MLlib
supports linear regression as well as L1 (lasso) and L2 (ridge) regularized variants. The regression
algorithms in MLlib also leverage the underlying gradient descent primitive (described below), and
have the same parameters as the binary classification algorithms described above.
Available algorithms for linear regression:

- LinearRegressionWithSGD
- RidgeRegressionWithSGD
- LassoWithSGD
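
A linear regression model can be trained and evaluated similarly, as in the sketch below, again written for the Spark shell with an illustrative path and record format:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Parse "label,feature1 feature2 ..." records
    // (the path and format are illustrative).
    val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(_.toDouble))
    }

    // Train a linear regression model using 100 iterations of SGD.
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate the model on the training data via mean squared error.
    val valuesAndPreds = parsedData.map { point =>
      (point.label, model.predict(point.features))
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

RidgeRegressionWithSGD and LassoWithSGD are invoked the same way, with regParam controlling the strength of the L2 or L1 penalty, respectively.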

Clustering
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with
one another based on some notion of similarity. Clustering is often used for exploratory analysis
and/or as a component of a hierarchical supervised learning pipeline (in which distinct classifiers or
regression models are trained for each cluster). MLlib supports k-means clustering, one of the most
commonly used clustering algorithms, which partitions the data points into a predefined number of
clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||.
The implementation in MLlib has the following parameters:

- k is the number of desired clusters.
- maxIterations is the maximum number of iterations to run.
- initializationMode specifies either random initialization or initialization via k-means||.
- runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
- initializationSteps determines the number of steps in the k-means|| algorithm.
- epsilon determines the distance threshold within which we consider k-means to have converged.

Available algorithms for clustering:

- KMeans
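
For example, the sketch below clusters points read from a text file of space-separated numbers. The path is illustrative, it is written for the Spark shell, and it assumes an API in which data points are represented as Array[Double] vectors.

    import org.apache.spark.mllib.clustering.KMeans

    // Parse each line of space-separated numbers into a feature vector
    // (the path is illustrative).
    val data = sc.textFile("kmeans_data.txt")
    val parsedData = data.map(_.split(' ').map(_.toDouble))

    // Cluster the data into two classes using at most 20 iterations.
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Sum of squared distances from each point to its nearest center,
    // a common measure of clustering quality.
    val cost = clusters.computeCost(parsedData)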

Collaborative Filtering
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in
the missing entries of a user-item association matrix. MLlib currently supports model-based
collaborative filtering, in which users and products are described by a small set of latent factors that
can be used to predict missing entries. In particular, we implement the alternating least squares
(ALS) algorithm to learn these latent factors. The implementation in MLlib has the following
parameters:

- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in our model.
- iterations is the number of iterations to run.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
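
For example, the sketch below factors a ratings matrix built from "user,product,rating" records; the path and record format are illustrative, and it is written for the Spark shell.

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.mllib.recommendation.Rating

    // Parse "user,product,rating" records (path and format are illustrative).
    val data = sc.textFile("mllib/data/als/test.data")
    val ratings = data.map { line =>
      val Array(user, product, rate) = line.split(',')
      Rating(user.toInt, product.toInt, rate.toDouble)
    }

    // Build a recommendation model with 10 latent factors, 20 iterations
    // of ALS, and regularization parameter lambda = 0.01.
    val rank = 10
    val numIterations = 20
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // Predict the rating a given user would assign a given product.
    val predicted = model.predict(1, 1)

For implicit feedback data, ALS.trainImplicit can be used instead, which exposes the implicitPrefs-style behavior along with the alpha parameter described above.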
