CS246: Mining Massive Datasets Jure Leskovec,: Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University
http://cs246.stanford.edu
Would like to do prediction: estimate a function f(x) so that y = f(x) Where y can be:
Real number: Regression Categorical: Classification Complex object:
X Y
Ranking of items, Parse tree, etc.
X Training and test set
Data is labeled:
Have many pairs {(x, y)}
x vector of real valued features y class ({+1, -1}, or a real number)

Jure Leskovec, Stanford C246: Mining Massive Datasets 2
2/14/2011
We will talk about the following methods:

k-Nearest Neighbor (Instance based learning) Perceptron algorithm Support Vector Machines Decision trees
Main question: How to efficiently train (build a model/find model parameters)?

2/14/2011
Instance based learning Example: Nearest neighbor

Keep the whole training dataset: {(x, y)} A query example (vector) q comes Find closest example(s) x* Predict y*
Can be used both for regression and classification

Recommendation systems
2/14/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 4
To make Nearest Neighbor work we need 4 things:

Distance metric:
Euclidean One
How many neighbors to look at? Weighting function (optional): How to fit with the local points?
Unused Just predict the same output as the nearest neighbor
2/14/2011
Jure Leskovec, Stanford C246: Mining Massive Datasets
Suppose x1,, xm are two dimensional: One can draw nearest neighbor regions:
x1=(x11,x12), x2=(x21,x22),
d(xi,xj) = (xi1-xj1)2 + (xi2-xj2)2

2/14/2011
d(xi,xj) = (xi1-xj1)2 + (3xi2-3xj2)2
Distance metric: How many neighbors to look at? Weighting function (optional): How to fit with the local points?
Unused Just predict the average output among k nearest neighbors k Euclidean
k=9
Distance metric: How many neighbors to look at? Weighting function:

wi = All of them (!) exp(-d(xi, q)2/Kw)
wi
Euclidean
d(xi, q) = 0
How to fit with the local points?

K=10 K=20
Nearby points to query q are weighted more strongly. Kwkernel width.
Predict weighted average: wiyi/wi

K=80
2/14/2011
Given: a set P of n points in Rd Goal: Given a query point q

NN: find the nearest neighbor p of q in P Range search: find one/all points in P within distance r from q
p q
2/14/2011
Main memory:
Linear scan Tree based:
Quadtree kd-tree
Hashing:
Secondary storage:
R-trees
Locality-Sensitive Hashing
2/14/2011
10
Simplest spatial structure on Earth! Split the space into 2d equal subsquares Repeat until done:
only one pixel left only one point left only a few points left split only one dimension at a time Kd-trees (in a moment)
Variants:
2/14/2011
11
Range search:
Put root node on the stack Repeat:
pop the next node T from the stack for each child C of T:
if C is a leaf, examine point(s) in C if C intersects with the ball of radius r around q, add C to the stack
Nearest neighbor:
Start range search with r = Whenever a point is found, update r Only investigate nodes with respect to current r
2/14/2011
Quadtrees work great for 2 to 3 dimensions Problems:

Empty spaces: if the points form sparse clouds, it takes a while to reach them Space exponential in dimension Time exponential in dimension, e.g., points on the hypercube
2/14/2011
13
Main ideas [Bentley 75] :
Only one-dimensional splits Choose the split carefully:
Advantages:
Queries: as for quadtrees
E.g., Pick dimension of largest variance and split at median (balanced split) Do SVD or CUR, project and split
Query time at most:
no (or less) empty spaces only linear space Min[dn, exponential(d)]

2/14/2011
Range search:
Put root node on the stack Repeat:
pop the next node T from the stack for each child C of T:
if C is a leaf, examine point(s) in C if C intersects with the ball of radius r around q, add C to the stack
In what order we search the children?

Best-Bin-First (BBF), Last-Bin-First (LBF)
Performance of a single Kd-tree is low Randomized Kd-trees: Build several trees

Find top few dimensions of largest variance Randomly select one of these dimensions; split on median Construct many complete (i.e., one point per leaf) trees Drawbacks:
More memory Additional parameter to tune: number of trees
Search
Descend through each tree until leaf is reached Maintain a single priority queue for all the trees For approximate search, stop after a certain number of nodes have been examined
2/14/2011

2/14/2011
d=128, n=100k
[Muja-Lowe, 2010]
17
Overlapped partitioning reduces boundary errors Spilling

no backtracking necessary Increases tree depth
more memory slower to build
Better when split passes through sparse regions Lower nodes may spill too much
hybrid of spill and non-spill nodes
Designing a good spill factor hard

For high dim. data, use randomized projections (CUR) or SVD Use Best-Bin-First (BBF)
Make a priority queue of all unexplored nodes Visit them in order of their closeness to the query
Space permitting:
Closeness is defined by distance to a cell boundary
Keep extra statistics on lower and upper bound for each cell and use triangle inequality to prune space Use spilling to avoid backtracking Use lookup tables for fast distance computation
Bottom-up approach [Guttman 84]

Start with a set of points/rectangles Partition the set into groups of small cardinality For each group, find minimum rectangle containing objects from this group (MBR) Repeat
Advantages:
Supports near(est) neighbor search (similar as before) Works for points and rectangles Avoids empty spaces
2/14/2011
R-trees with fan-out 4:

group nearby rectangles to parent MBRs
I A B E D C F G H J
2/14/2011
#21

every parent node completely covers its children
P3 A B E P2 D C F P4 G H J
P1
A B C D E
F G
2/14/2011
#22

every parent node completely covers its children
P1
P1 P2 P3 P4
A B C D E
F G
2/14/2011
#23
Example of a range search query

P1
P1 P2 P3 P4
A B C D E
F G
2/14/2011
#24
Example of a range search query

P1
P1 P2 P3 P4
A B C D E
F G
2/14/2011
#25
Insertion of point x:
Find MBR intersecting with x and insert If a node is full, then a split:
Linear choose far apart nodes as ends. Randomly choose nodes and assign them so that they require the smallest MBR enlargement Quadratic choose two nodes so the dead space between them is maximized. Insert nodes so area enlargement is minimized P1 P2 P3 P4
C F E
A B P1 P2
2/14/2011
G P3
H J
P4
A B C D E
D
F G
#26
Approach [Weber, Schek, Blott98]

In high-dimensional spaces, all tree-based indexing structures examine large fraction of leaves If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether 1 seek = transfer of few hundred KB
2/14/2011
27
Natural question: How to speed-up linear scan? Answer: Use approximation

Use only i bits per dimension (and speed-up the scan by a factor of 32/i) Identify all points which could be returned as an answer Verify the points using original data set
2/14/2011
28
Example: Spam filtering
Instance space X: Class Y:

Binary feature vectors x of word occurrences d features (words + other things, d~100,000) y: Spam (+1), Ham (-1)
2/14/2011
30
Binary classification: f (x) = Input: Vectors xi and labels yi Goal: Find vector w = (w1, w2 ,... , wn)
Each wi is a real number
wx=
1 0
if w1 x1 + w2 x2 +. . . wn xn otherwise
wx=0
-- - - -- - -- Jure Leskovec, Stanford C246: Mining Massive Datasets
Note:
x x, 1
w w,
31
2/14/2011
(very) Loose motivation: Neuron Inputs are feature values Each feature has a weight wi w1 x1 Activation is the sum: w2 If the f(x) is:
f(x) = i wixi = wx - Positive: predict +1 Negative: predict -1
nigeria wx=0 x2 Ham=-1
x2 w3 x3 w4 x4
>0?
x1 w
Spam=1
viagra
32
2/14/2011
Perceptron: y = sign(wx) How to find parameters w?
Start with w0 = 0 Pick training examples x one by one (from disk) Predict class of x using current weights If y is correct (i.e., y=y): If y is wrong: adjust w wt+1 = wt + y x
where:
is the learning rate parameter x is the training example y is true class label ({+1, -1})
y = sign(wt x)
No change: wt+1 = wt
yx
wt wt+1 x
2/14/2011
Perceptron Convergence Theorem:

If there exist a set of weights that are consistent (i.e., the data is linearly separable) the perceptron learning algorithm will converge
How long would it take to converge? Perceptron Cycling Theorem:
How to provide robustness, more expressivity?

If the training data is not linearly separable the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop
2/14/2011
34
Separability: some parameters get training set perfectly Convergence: if training set is separable, perceptron will converge (binary case) Mistake bound: number of mistakes < 1/2
2/14/2011
If more than 2 classes:

Weight vector wc for each class Calculate activation for each class
f(x,c)= i wc,ixi = wcx
Highest activation wins:

c = arg maxc f(x,c)
w3x biggest w3 w1 w1x biggest w2 w2x biggest
2/14/2011
36
Overfitting: Regularization: if the data is not separable weights dance around Mediocre generalization:
Finds a barely separating solution
2/14/2011
37
Winnow algorithm
Similar to perceptron, just different updates
Initialize : = n; w i = 1 Prediction is 1 iff w x If no mistake : do nothing If f(x) = 1 but w x < , w i 2w i (if x i = 1) (promotion ) If f(x) = 0 but w x , w i w i /2 (if x i = 1) (demotion)
Learns linear threshold functions
2/14/2011
38
Algorithm learns monotone functions For the general case:

Duplicate variables:
To negate variable xi, introduce a new variable xi=-xi Learn monotone functions over 2n variables
Balanced version:
Keep two weights for each variable; effective weight is the difference
Update Rule : If f ( x) = 1 but ( w + w ) x , If f ( x) = 0 but ( w + w ) x ,
2/14/2011
wi+ 2 wi+ wi wi+
1 wi where xi = 1 (promotion ) 2
1 + wi wi 2 wi where xi = 1 (demotion) 2
39
Thick Separator (aka Perceptron with Margin)

(Applies both for Perceptron and Winnow) Promote if: wx= wx> + -- Demote if: - - -- - -- wx< -
wx=0
Note: is a functional margin. Its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition.
Examples : x {0,1} n ; Prediction is 1 iff
Hypothesis : w R n w x
w w + i y j x j
Additive weight update algorithm [Perceptron, Rosenblatt, 1958]
If Class = 1 but w x , w i w i + 1 (if x i = 1) (promotion ) If Class = 0 but w x , w i w i - 1 (if x i = 1) (demotion)
Multiplicative weight update algorithm w w i exp{yjxj} [Winnow, Littlestone, 1988]
If Class = 1 but w x , w i 2w i (if x i = 1) (promotion) If Class = 0 but w x , w i w i /2 (if x i = 1) (demotion)

Perceptron Online: can adjust to changing target, over time Advantages Simple Guaranteed to learn a linearly separable problem Limitations only linear separations only converges for linearly separable data not really efficient with many features
2/14/2011
Winnow Online: can adjust to changing target, over time Advantages Simple Guaranteed to learn a linearly separable problem Suitable for problems with many irrelevant attributes Limitations only linear separations only converges for linearly separable data not really efficient with many features
42

CS246: Mining Massive Datasets Jure Leskovec,: Stanford University

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CS246: Mining Massive Datasets Jure Leskovec,: Stanford University

Uploaded by

Copyright:

Available Formats

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Ranking of items, Parse tree, etc.

X Training and test set

Have many pairs {(x, y)}

x vector of real valued features y class ({+1, -1}, or a real number)

We will talk about the following methods:

Main question: How to efficiently train (build a model/find model parameters)?

Instance based learning Example: Nearest neighbor

Can be used both for regression and classification

To make Nearest Neighbor work we need 4 things:

Jure Leskovec, Stanford C246: Mining Massive Datasets

d(xi,xj) = (xi1-xj1)2 + (xi2-xj2)2

d(xi,xj) = (xi1-xj1)2 + (3xi2-3xj2)2

Jure Leskovec, Stanford C246: Mining Massive Datasets

Distance metric: How many neighbors to look at? Weighting function:

How to fit with the local points?

Nearby points to query q are weighted more strongly. Kwkernel width.

Predict weighted average: wiyi/wi

Jure Leskovec, Stanford C246: Mining Massive Datasets

Given: a set P of n points in Rd Goal: Given a query point q

Jure Leskovec, Stanford C246: Mining Massive Datasets

Jure Leskovec, Stanford C246: Mining Massive Datasets

Jure Leskovec, Stanford C246: Mining Massive Datasets

Quadtrees work great for 2 to 3 dimensions Problems:

Jure Leskovec, Stanford C246: Mining Massive Datasets

Main ideas [Bentley 75] :

Only one-dimensional splits Choose the split carefully:

Queries: as for quadtrees

Query time at most:

no (or less) empty spaces only linear space Min[dn, exponential(d)]

In what order we search the children?

Performance of a single Kd-tree is low Randomized Kd-trees: Build several trees

Jure Leskovec, Stanford C246: Mining Massive Datasets

Overlapped partitioning reduces boundary errors Spilling

Designing a good spill factor hard

Closeness is defined by distance to a cell boundary

Bottom-up approach [Guttman 84]

R-trees with fan-out 4:

Jure Leskovec, Stanford C246: Mining Massive Datasets

R-trees with fan-out 4:

Jure Leskovec, Stanford C246: Mining Massive Datasets

R-trees with fan-out 4:

Jure Leskovec, Stanford C246: Mining Massive Datasets

Example of a range search query

Jure Leskovec, Stanford C246: Mining Massive Datasets

Example of a range search query

Jure Leskovec, Stanford C246: Mining Massive Datasets

Approach [Weber, Schek, Blott98]

Jure Leskovec, Stanford C246: Mining Massive Datasets

Natural question: How to speed-up linear scan? Answer: Use approximation

Jure Leskovec, Stanford C246: Mining Massive Datasets

Example: Spam filtering

Instance space X: Class Y:

Jure Leskovec, Stanford C246: Mining Massive Datasets

-- - - -- - -- Jure Leskovec, Stanford C246: Mining Massive Datasets

Jure Leskovec, Stanford C246: Mining Massive Datasets

Perceptron: y = sign(wx) How to find parameters w?

Perceptron Convergence Theorem:

How long would it take to converge? Perceptron Cycling Theorem:

How to provide robustness, more expressivity?

If more than 2 classes:

Highest activation wins: