Xiangrui Meng
What is MLlib?
MLlib is a Spark subproject providing machine learning primitives, with 33 contributors.
Algorithms:
clustering: k-means
Why MLlib?
There are many alternatives:
scikit-learn?
Mahout?
LIBLINEAR?
H2O?
Vowpal Wabbit?
MATLAB?
R?
Weka?
Gradient descent
$w \leftarrow w - \alpha \sum_{i=1}^{n} g(w;\, x_i, y_i)$
where $g(w; x_i, y_i)$ is the (sub)gradient of the loss on example $i$ and $\alpha$ is the step size.
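As a minimal sketch (NumPy, not MLlib's distributed implementation), the update above, stepping against the sum of per-example gradients, looks like this for least squares, where g(w; x_i, y_i) = (w·x_i - y_i)·x_i:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=2000):
    """Minimize 0.5 * sum_i (w.x_i - y_i)^2 by repeatedly stepping
    against the summed per-example gradients g(w; x_i, y_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)   # sum_{i=1}^{n} g(w; x_i, y_i)
        w -= alpha * grad
    return w

# Noiseless data generated from w* = [2, -1]; the iterates converge to w*.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = gradient_descent(X, y)
```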
k-means (scala)
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()
k-means (python)
# Load and parse the data
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans

data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
    array([float(x) for x in line.split(' ')])).cache()

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
    runs=1, initializationMode="k-means||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = (parsedData.map(lambda point: error(point))
    .reduce(lambda x, y: x + y))
print("Sum of squared error = " + str(cost))
Dimension reduction + k-means
// compute principal components
val points: RDD[Vector] = ...
val mat = new RowMatrix(points)
val pc = mat.computePrincipalComponents(20)
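For illustration only (NumPy, not MLlib's code), the principal components are the top right singular vectors of the mean-centered data matrix; projecting onto them gives the lower-dimensional points fed to k-means:

```python
import numpy as np

def principal_components(points, k):
    """Top-k principal components: the leading right singular vectors
    of the mean-centered data matrix (one row per point)."""
    X = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T                 # d x k matrix with orthonormal columns

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 5))
pc = principal_components(points, 2)   # analogous to computePrincipalComponents(2)
projected = points @ pc                # dimension-reduced points for k-means
```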
Collaborative filtering
// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val model = ALS.train(ratings, 1, 20, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
Why MLlib?
Why a smartphone? Start navigation. Out for dinner?
A special-purpose device may be better at one aspect than a
general-purpose device. But the cost of context switching is high.
Streaming + MLlib
// collect tweets using streaming

// train a k-means model
val model: KMeansModel = ...

// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
  statuses.filter(t => model.predict(featurize(t)) == clusterNumber)

// print tweets within this particular cluster
filteredTweets.print()
GraphX + MLlib
// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices

// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq[(Int, Double)]))] = ...
val training: RDD[LabeledPoint] =
  labelAndFeatures.join(pageRank).map {
    case (id, ((label, features), pageRank)) =>
      LabeledPoint(label, Vectors.sparse(1001, features ++ Seq((1000, pageRank))))
  }

// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training, 10)
Why MLlib?
Reads from HDFS, S3, HBase, and any Hadoop data source.
Scalability
Performance
User-friendly APIs
Logistic regression
[Figure: strong scaling for logistic regression on a fixed dataset
(50K images, 160K dense features), showing walltime and speedup versus
number of machines (up to 32) for MLbase, MLlib, VW, and MATLAB,
against an ideal scaling curve. MLlib exhibits better scaling properties
and is faster than VW with 16 and 32 machines.]
System      Lines of Code
MLbase      32
GraphLab    383

In MATLAB, we implement gradient descent instead of SGD, as gradient
descent requires roughly the same number of numeric operations as SGD
but does not require an inner loop to pass over the data.
Collaborative filtering
[Figure: a user-item ratings matrix with missing entries ("?") to predict.]
System      Lines of Code
MATLAB      15443
Mahout      4206
GraphLab    291
MLlib       481
Implementation of k-means
Initialization:
random
k-means++
k-means||
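As a sketch of the seeding strategies above (plain NumPy, not MLlib's code): k-means++ picks each new center with probability proportional to the squared distance from the centers chosen so far, and k-means|| is an oversampling variant of the same idea suited to parallel execution.

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: first center uniform at random; each later
    center is sampled with probability proportional to D(x)^2, the
    squared distance to the nearest center chosen so far."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min(
            [((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0, 0.1, (50, 2)),   # two well-separated blobs
                    rng.normal(5, 0.1, (50, 2))])
centers = kmeans_pp_init(points, 2, rng)
```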
Implementation of k-means
Iterations:
For each point, find its closest center:
$l_i = \arg\min_j \|x_i - c_j\|_2^2$
Then move each center to the mean of its assigned points:
$c_j = \frac{1}{|\{i : l_i = j\}|} \sum_{i : l_i = j} x_i$
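A minimal NumPy sketch of one such iteration (assignment, then mean update), assuming no cluster ends up empty:

```python
import numpy as np

def lloyd_step(points, centers):
    """One k-means iteration: assign each point to its closest center
    (l_i = argmin_j ||x_i - c_j||^2), then move each center to the
    mean of its assigned points."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, new_centers

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = lloyd_step(points, np.array([[0.0, 0.0], [10.0, 10.0]]))
# labels -> [0, 0, 1, 1]; centers -> [[0, 0.5], [10, 10.5]]
```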
Implementation of k-means
The points are usually sparse, but the centers are most likely to be
dense. Computing the distance takes O(d) time. So the time
complexity is O(n d k) per iteration. We dont take any advantage of
sparsity on the running time. However, we have
kx
2
ck2
2
kxk2
2
kck2
2hx, ci
However, is it accurate?
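A quick numeric check of the identity above (illustrative Python, not MLlib's code): the cross term touches only the nonzeros of x once ||x||^2 and ||c||^2 are precomputed. The accuracy caveat is that the final subtraction can cancel catastrophically when x and c are close.

```python
import numpy as np

def dist2_via_norms(x_nnz, x_norm2, c, c_norm2):
    """||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c>, touching only the
    nonzero entries of x. x_nnz is a dict {index: value}."""
    dot = sum(v * c[i] for i, v in x_nnz.items())
    return x_norm2 + c_norm2 - 2.0 * dot

c = np.array([0.5, 1.0, 0.0, 2.0])      # dense center
x = np.array([0.0, 1.0, 0.0, 3.0])      # sparse point (2 nonzeros)
x_nnz = {1: 1.0, 3: 3.0}
d2 = dist2_via_norms(x_nnz, float(x @ x), c, float(c @ c))
# matches the direct computation ((x - c) ** 2).sum() = 1.25
```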
Implementation of ALS
broadcast everything
data parallel
fully parallel
Broadcast everything
Master broadcasts the data and initial models. At each iteration,
updated models are broadcast again. Lots of communication overhead:
doesn't scale well.
[Figure: the master holds the ratings, movie factors, and user factors,
and broadcasts all of them to the workers.]
Data parallel
Master broadcasts the initial models; the ratings are partitioned
across workers. At each iteration, updated models are broadcast again.
Much better scaling; works on large datasets.
[Figure: each worker holds a partition of the ratings, while the movie
and user factors live on the master.]
Fully parallel
Models are instantiated at the workers. At each iteration, models are
shared via join between workers. Much better scalability; works on
large datasets.
[Figure: each worker holds a partition of the ratings together with the
corresponding movie and user factors.]
Implementation of ALS
broadcast everything
data parallel
fully parallel
block-wise parallel
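Whichever distribution strategy is used, each ALS half-step solves a small regularized least-squares problem per user against the current movie factors, then vice versa. A minimal dense sketch (als_step is a hypothetical helper on a fully observed toy matrix, not MLlib's blocked implementation):

```python
import numpy as np

def als_step(R, fixed, lam):
    """Solve for one side's factors with the other side fixed: for each
    row r of R, minimize ||r - f @ fixed.T||^2 + lam * ||f||^2 via the
    normal equations (fixed.T @ fixed + lam*I) f = fixed.T @ r."""
    k = fixed.shape[1]
    A = fixed.T @ fixed + lam * np.eye(k)
    return np.linalg.solve(A, fixed.T @ R.T).T

# Rank-2 toy ratings matrix, fully observed.
users_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
movies_true = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])
R = users_true @ movies_true.T

movies = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # initial guess
for _ in range(20):                     # alternate the two half-steps
    users = als_step(R, movies, lam=1e-6)
    movies = als_step(R.T, users, lam=1e-6)
# users @ movies.T now reconstructs R almost exactly
```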
Sparse data
L-BFGS
Model evaluation
Discretization
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan
Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau,
Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark
Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath,
Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy
Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui
Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Ryan LeCompte
Interested?
Website: http://spark.apache.org
Tutorials: http://ampcamp.berkeley.edu
Github: https://github.com/apache/spark