Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 by MB
Event Data
Finance Gaming Monitoring
Advertisement
Sensor Networks
Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
100 events per second
360k events per hour
8.6M events per day
260M events per month
3.2B events per year
http://wordle.net
http://www.flickr.com/photos/arenamontanus/269158554/
Stream Queries
Continuous Stream of Data
Answer stream queries with finite resources.
Typical examples:
how often does an item appear in a stream?
how many distinct elements are in the stream?
what are the top-k most frequent items?
The Trade-Off
[Figure: trade-off between Big Data, Fast, and Exact. Labels: Stream Mining; Map Reduce and friends]
First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users).
Interested in the most active elements only.
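[Figure: counter table with items 132, 142, 432, 553, 712, 023 and counts 15, 12, 8, 5, 3, 2]
Case 1: element already in database (142: count 12 becomes 13)
Case 2: new element (713 replaces 023, which had count 2, and gets count 3)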
Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
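The two cases can be sketched in a few lines of Python. This is a minimal illustration in the style of the cited paper, not its implementation: the class name `SpaceSaving`, the `capacity` parameter, and the linear scan for the smallest counter are assumptions made for brevity (the paper describes a more efficient data structure).

```python
class SpaceSaving:
    """Approximate heavy-hitter counts using a fixed number of counter slots."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}  # item -> estimated count

    def update(self, item):
        if item in self.counts:
            # Case 1: element already tracked, increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Free slot available: start counting the new element.
            self.counts[item] = 1
        else:
            # Case 2: new element, table full. Evict the element with the
            # smallest count and let the newcomer inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[item] = inherited + 1

    def top(self, k):
        """Return the k items with the largest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Replaying the numbers from the slide's figure.
ss = SpaceSaving(capacity=6)
ss.counts = {"132": 15, "142": 12, "432": 8, "553": 5, "712": 3, "023": 2}
ss.update("142")  # case 1: 12 becomes 13
ss.update("713")  # case 2: 713 replaces 023 and gets count 3
```

Because an evicted element's count is inherited, estimates can overshoot the true count but memory stays bounded by `capacity`.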
Keep quite a big log (a month?)
Constant write/erase in database
Alternative: Exponential decay
[Figure labels: Time, DB]
Exponential Decay
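One common way to realize exponential decay without keeping a log is to store, per key, only a score and the time of its last update, and to decay lazily whenever the key is touched. The sketch below assumes a half-life parameterization and non-decreasing timestamps; the class name `DecayedCounters` and the `half_life` parameter are illustrative, while `update` and `score` follow the names used in the TF-IDF pseudocode of this talk.

```python
class DecayedCounters:
    """Per-key counters whose value halves every `half_life` time units."""

    def __init__(self, half_life):
        if half_life <= 0:
            raise ValueError("half_life must be positive")
        self.half_life = half_life
        self.state = {}  # key -> (score, time of last update)

    def _decay(self, score, elapsed):
        # Assumes elapsed >= 0, i.e. timestamps arrive in order.
        return score * 0.5 ** (elapsed / self.half_life)

    def update(self, key, t, value=1.0):
        """Decay the stored score up to time t, then add the new value."""
        score, last = self.state.get(key, (0.0, t))
        self.state[key] = (self._decay(score, t - last) + value, t)

    def score(self, key, t):
        """Return the decayed score of key as seen at time t."""
        score, last = self.state.get(key, (0.0, t))
        return self._decay(score, t - last)


counters = DecayedCounters(half_life=10.0)
counters.update("user-a", t=0.0)
counters.update("user-a", t=10.0)       # old event now weighs 0.5, new one 1.0
print(counters.score("user-a", t=20.0))  # prints 0.75
```

Nothing ever has to be erased explicitly: old events simply fade, which avoids the constant write/erase cycle of a sliding-window log.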
Count-Min Sketches
Summarize histograms over large feature sets
Like Bloom filters, but better
[Figure: table of counters with m bins and n different hash functions]
Query result: 1
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
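A count-min sketch can be sketched as a table with one row per hash function and m bins per row: adding an item increments one bin in every row, and a query returns the minimum over those bins. The code below is a minimal illustration, not the reference implementation; the class and parameter names are assumptions, and salted BLAKE2b stands in for the pairwise-independent hash family used in the paper.

```python
import hashlib


class CountMinSketch:
    """Approximate counts for a large item set using a fixed-size table."""

    def __init__(self, num_hashes, num_bins):
        self.num_hashes = num_hashes
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row, item):
        # One hash function per row, derived by salting with the row index.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.num_bins

    def add(self, item, count=1):
        for row in range(self.num_hashes):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Collisions only ever inflate a bin, so the minimum over all rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bin(row, item)]
            for row in range(self.num_hashes)
        )


sketch = CountMinSketch(num_hashes=4, num_bins=8)
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    sketch.add(ip)
estimate = sketch.query("10.0.0.1")  # at least 2, possibly more on collisions
```

Like a Bloom filter, the structure never underestimates; unlike a Bloom filter, it returns counts rather than only membership.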
Counting is statistics!
Counting is Statistics
Empirical mean:
Correlations:
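As an illustration of why counting is statistics: both the empirical mean and the correlation of two quantities can be recovered from a handful of running sums, so they fit the same update-a-counter pattern. The sketch below is an assumption about how one might do this (the class name `RunningStats` is hypothetical), and it uses the textbook sums-of-products formulas, which are simple but not the most numerically robust choice.

```python
import math


class RunningStats:
    """Mean and Pearson correlation of (x, y) pairs from six running sums."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_yy = 0.0
        self.sum_xy = 0.0

    def update(self, x, y):
        # Each observation only increments counters; no data is stored.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_yy += y * y
        self.sum_xy += x * y

    def mean_x(self):
        return self.sum_x / self.n

    def mean_y(self):
        return self.sum_y / self.n

    def correlation(self):
        mx, my = self.mean_x(), self.mean_y()
        cov = self.sum_xy / self.n - mx * my
        var_x = self.sum_xx / self.n - mx * mx
        var_y = self.sum_yy / self.n - my * my
        return cov / math.sqrt(var_x * var_y)


stats = RunningStats()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    stats.update(x, y)
# stats.mean_x() is 2.0; stats.correlation() is 1.0 up to rounding
```

Replacing the plain sums with decayed counters would give the same statistics weighted toward recent data.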
do
Then, reconstruct
As a reminder:
Example 2: Maximum-Likelihood
based on
Once you have a model, you can compute p-values (based on recent time frames!)
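A minimal sketch of the idea, under a loudly stated assumption: the talk does not say which model is used, so the example below assumes the number of events per time frame follows a Poisson distribution whose rate is estimated from recent frames. The function name `poisson_tail_p_value` is hypothetical.

```python
import math


def poisson_tail_p_value(observed, rate):
    """P(X >= observed) for X ~ Poisson(rate); the Poisson model is an assumption."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    term = math.exp(-rate)  # P(X = 0)
    below = 0.0             # accumulates P(X < observed)
    for k in range(observed):
        below += term
        term *= rate / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - below)


# Estimate the rate from recent time frames, then score the newest frame.
recent_counts = [4, 6, 5, 5]
rate = sum(recent_counts) / len(recent_counts)
p = poisson_tail_p_value(observed=15, rate=rate)
is_unusual = p < 0.001  # threshold chosen only for illustration
```

Because the rate comes from recent frames only, the notion of "unusual" adapts as the stream changes.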
Example 4: Clustering
Online clustering
[Figure: count-min sketch counter table]
Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009
Example 5: TF-IDF
for each word: update(word, t, 1.0)
for each document: update(#docs, t, 1.0)
query: score(word) / score(#docs)
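The three lines above can be made concrete with a pair of decayed counters. This is a sketch under assumptions, not streamdrill's API: `update` and `score` are reimplemented here with a hypothetical `HALF_LIFE`, `score` takes an explicit time argument, and each word is counted once per document so that the ratio reads as a decayed document frequency.

```python
HALF_LIFE = 3600.0  # seconds; hypothetical choice
DOCS = "#docs"      # special key counting documents

_state = {}  # key -> (score, time of last update)


def update(key, t, value):
    """Decay the key's score up to time t, then add value."""
    old, last = _state.get(key, (0.0, t))
    _state[key] = (old * 0.5 ** ((t - last) / HALF_LIFE) + value, t)


def score(key, t):
    """Decayed score of key at time t (0.0 for unseen keys)."""
    old, last = _state.get(key, (0.0, t))
    return old * 0.5 ** ((t - last) / HALF_LIFE)


def observe_document(words, t):
    # Assumption: count each distinct word once per document.
    for word in set(words):
        update(word, t, 1.0)
    update(DOCS, t, 1.0)


def document_frequency(word, t):
    """score(word) / score(#docs): share of recent documents containing word."""
    total = score(DOCS, t)
    return score(word, t) / total if total > 0 else 0.0


observe_document(["stream", "mining"], t=0.0)
observe_document(["stream", "scaling"], t=0.0)
# document_frequency("stream", 0.0) is 1.0, document_frequency("mining", 0.0) is 0.5
```

An IDF-style weight can then be derived from the ratio, for example as the negative logarithm of the document frequency.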
class priors
Priors
ICML 2003
transform TF to log( . + 1)
IDF-style normalization
square length normalization
use complement probability
another log
normalize those weights again
Predict linearly using those weights
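The steps above can be sketched as a small batch classifier. The reading of each step below is an assumption: it follows what appears to be the transformed weight-normalized complement naive Bayes recipe of Rennie et al. (ICML 2003), and the function names and the smoothing parameter `alpha` are hypothetical. In a streaming setting the sums would be maintained as counters rather than recomputed.

```python
import math
from collections import defaultdict


def transform(docs):
    """docs: list of {word: raw count}. Applies log TF, IDF, length normalization."""
    doc_freq = defaultdict(int)
    for doc in docs:
        for word in doc:
            doc_freq[word] += 1
    transformed = []
    for doc in docs:
        vec = {
            w: math.log(c + 1) * math.log(len(docs) / doc_freq[w])
            for w, c in doc.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values()))
        transformed.append({w: v / norm for w, v in vec.items()} if norm > 0 else vec)
    return transformed


def train(docs, labels, alpha=1.0):
    """Return {class: {word: weight}} built from complement-class statistics."""
    vecs = transform(docs)
    vocab = sorted({w for doc in docs for w in doc})
    weights = {}
    for c in sorted(set(labels)):
        # Complement probability: accumulate over documents NOT in class c.
        comp = defaultdict(float)
        for vec, y in zip(vecs, labels):
            if y != c:
                for w, v in vec.items():
                    comp[w] += v
        total = sum(comp.values()) + alpha * len(vocab)
        logs = {w: math.log((comp[w] + alpha) / total) for w in vocab}  # another log
        scale = sum(abs(v) for v in logs.values())  # normalize the weights again
        weights[c] = {w: v / scale for w, v in logs.items()}
    return weights


def predict(weights, doc):
    """Linear prediction: the class whose complement fits the document worst."""
    def cost(c):
        return sum(count * weights[c].get(w, 0.0) for w, count in doc.items())
    return min(weights, key=cost)


train_docs = [
    {"buy": 2, "cheap": 1}, {"cheap": 2, "offer": 1},
    {"meeting": 2, "agenda": 1}, {"agenda": 1, "lunch": 2},
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train(train_docs, train_labels)
label = predict(model, {"cheap": 1, "buy": 1})  # "spam" for this toy data
```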
elements!
Streamdrill
Heavy Hitters counting + exponential decay
Instant counts & top-k results over time windows.
Indices!
Snapshots for historical analysis
Beta demo available at http://streamdrill.com, launch imminent
Architecture Overview
http://play.streamdrill.com/vis/
Summary
Doesn't always have to be scaling!
Stream mining: Approximate results with finite resources.
streamdrill: stream analysis engine