associated tests and data generators. MLlib currently supports four common types of machine
learning problem settings, namely binary classification, regression, clustering, and collaborative
filtering, as well as an underlying gradient descent optimization primitive. This guide outlines the
functionality supported in MLlib and provides an example of invoking MLlib.
Dependencies
MLlib uses the jblas linear algebra library, which itself depends on native Fortran routines. You may
need to install the gfortran runtime library if it is not already present on your nodes. MLlib will throw a
linking error if it cannot detect these libraries automatically.
To use MLlib in Python, you will need NumPy version 1.7 or newer and Python 2.7.
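A quick way to sanity-check both dependencies before running jobs (hypothetical commands, assuming a Linux node; these are not part of the guide):

```shell
# Check that the gfortran runtime needed by jblas is visible to the linker (Linux)
ldconfig -p | grep libgfortran

# Check the NumPy version for the Python API (should print 1.7 or newer)
python -c "import numpy; print(numpy.version.version)"
```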
Binary Classification
Binary classification is a supervised learning problem in which we want to classify entities into one of
two distinct categories or labels, e.g., predicting whether or not emails are spam. This problem
involves applying a learning algorithm to a set of labeled examples, i.e., a set of entities
represented via (numerical) features along with underlying category labels. The algorithm returns a
trained model that can predict the label for new entities whose underlying label is unknown.
MLlib currently supports two standard model families for binary classification, namely Linear Support
Vector Machines (SVMs) and Logistic Regression, along with L1 and L2 regularized variants of each
model family. The training algorithms all leverage an underlying gradient descent primitive
(described below), and take as input a regularization parameter (regParam) along with various
parameters associated with gradient descent (stepSize, numIterations, miniBatchFraction).
Available algorithms for binary classification:
SVMWithSGD
LogisticRegressionWithSGD
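The shared gradient descent parameters above can be illustrated with a minimal pure-Python sketch of mini-batch gradient descent for logistic regression (a toy, not MLlib's implementation; the keyword names below simply mirror MLlib's stepSize, numIterations, miniBatchFraction, and regParam):

```python
import math
import random

def train_logistic_sgd(data, num_iterations=100, step_size=1.0,
                       mini_batch_fraction=1.0, reg_param=0.0):
    """Mini-batch gradient descent for L2-regularized logistic regression.

    `data` is a list of (label, features) pairs with label in {0, 1}.
    """
    dim = len(data[0][1])
    w = [0.0] * dim
    batch_size = max(1, int(len(data) * mini_batch_fraction))
    for t in range(1, num_iterations + 1):
        batch = random.sample(data, batch_size)
        grad = [0.0] * dim
        for label, x in batch:
            margin = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-margin))  # predicted P(label = 1)
            for i in range(dim):
                grad[i] += (p - label) * x[i]
        step = step_size / math.sqrt(t)  # decaying step size
        for i in range(dim):
            # average-gradient step plus the L2 penalty's gradient
            w[i] -= step * (grad[i] / batch_size + reg_param * w[i])
    return w

def predict(w, x):
    """Classify by the sign of the linear margin."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

random.seed(42)
training = [(1, [1.0, 1.5]), (1, [2.0, 1.0]),
            (0, [-1.0, -1.5]), (0, [-2.0, -0.5])]
w = train_logistic_sgd(training, num_iterations=200)
```

Setting mini_batch_fraction below 1.0 trades gradient accuracy per step for cheaper iterations, which is the same trade-off MLlib's miniBatchFraction exposes.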
Linear Regression
Linear regression is another classical supervised learning setting. In this problem, each entity is
associated with a real-valued label (as opposed to a binary label as in binary classification), and we
want to predict labels as closely as possible given numerical features representing entities. MLlib
supports linear regression as well as L1 (lasso) and L2 (ridge) regularized variants. The regression
algorithms in MLlib also leverage the underlying gradient descent primitive (described below), and
have the same parameters as the binary classification algorithms described above.
Available algorithms for linear regression:
LinearRegressionWithSGD
RidgeRegressionWithSGD
LassoWithSGD
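The effect of the regularization parameter can be seen in a one-dimensional ridge example (a hand-rolled closed-form sketch, not MLlib code; the exact objective scaling is an assumption and MLlib's convention may differ):

```python
def ridge_1d(xs, ys, reg_param=0.0):
    """Closed-form minimizer of (1/n) * sum((w*x - y)^2) + reg_param * w^2.

    A 1-D illustration of L2 (ridge) shrinkage: larger reg_param pulls
    the coefficient toward zero.
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * reg_param)

# Data generated by y = 2x; without regularization we recover w = 2 exactly
w_plain = ridge_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], reg_param=0.0)
w_ridge = ridge_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], reg_param=1.0)
```

With reg_param = 0 this is ordinary least squares; a positive reg_param shrinks the estimate below 2, which is the behavior RidgeRegressionWithSGD's regParam controls (via SGD rather than a closed form).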
Clustering
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with
one another based on some notion of similarity. Clustering is often used for exploratory analysis
and/or as a component of a hierarchical supervised learning pipeline (in which distinct classifiers or
regression models are trained for each cluster). MLlib supports k-means clustering, one of the most
commonly used clustering algorithms, which partitions the data points into a predefined number of
clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called
kmeans||.
Available algorithms for clustering:
KMeans
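The core assign-then-update loop of k-means (Lloyd's algorithm) can be sketched in plain Python. This is a toy, not MLlib's implementation: it takes the first k points as initial centers for determinism, whereas MLlib offers the parallel kmeans|| initialization:

```python
def kmeans(points, k, max_iterations=20):
    """Plain k-means: alternate nearest-center assignment and mean updates."""
    centers = [list(p) for p in points[:k]]  # naive initialization
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append([sum(p[i] for p in cluster) / len(cluster)
                                    for i in range(dim)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

centers = kmeans([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]], k=2)
```

On the four points above the loop converges to one center near the origin and one near (10, 10).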
Collaborative Filtering
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in
the missing entries of a user-item association matrix. MLlib currently supports model-based
collaborative filtering, in which users and products are described by a small set of latent factors that
can be used to predict missing entries. In particular, we implement the alternating least squares
(ALS) algorithm to learn these latent factors. The implementation in MLlib has the following
parameters:
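Although MLlib's ALS runs on distributed data, the alternating idea itself fits in a few lines of plain Python. The sketch below uses a single latent factor per user and item (MLlib learns a vector of factors); each half-step fixes one side and solves a closed-form 1-D regularized least-squares problem for the other:

```python
def train_als_rank1(ratings, num_users, num_items, iterations=30, reg=0.001):
    """Rank-1 alternating least squares.

    `ratings` is a list of (user, item, rating) triples -- the observed
    entries of the user-item matrix. Returns user and item factors whose
    product approximates the observed entries and fills in missing ones.
    """
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(iterations):
        for i in range(num_users):   # fix item factors, update user factors
            obs = [(j, r) for (usr, j, r) in ratings if usr == i]
            num = sum(r * v[j] for j, r in obs)
            den = sum(v[j] ** 2 for j, r in obs) + reg
            u[i] = num / den
        for j in range(num_items):   # fix user factors, update item factors
            obs = [(i, r) for (i, itm, r) in ratings if itm == j]
            num = sum(r * u[i] for i, r in obs)
            den = sum(u[i] ** 2 for i, r in obs) + reg
            v[j] = num / den
    return u, v

# Observed entries of the rank-1 matrix [1, 2]^T x [1, 2, 3],
# with the (user 1, item 2) entry held out
ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
u, v = train_als_rank1(ratings, num_users=2, num_items=3)
```

The missing entry is predicted as u[1] * v[2], which converges to approximately 6, the value of the underlying rank-1 matrix.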