Le Song
Machine Learning I
CSE 6740, Fall 2013
What is Machine Learning (ML)?
Study of algorithms that improve their performance at some
task with experience
Common to industrial-scale problems
6 billion photos
Organizing Images
Image databases
Organizing documents
Reading, digesting, and categorizing a vast text database is too much for a human!
Predict numeric values
Example weather forecast: 40 °F, wind NE at 14 km/h, humidity 83%
Face Detection
Understanding brain activity
Product Recommendation
Handwritten digit recognition / text annotation
Inter-character dependency
Inter-word dependency
Example: "Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosnt mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe."
What are the inputs (data)?
What are the desired outcomes?
Audio signals
DNA sequence data:
cacatcgctgcgtttcggcagctaattgccttttagaaattattttcccatttcgagaaactcgtgtgggatgccggatgcggctttcaatcacttctggcccgggatcggattgggtcacattgtctgcgggctctattgtctcgatccgc
ggcgcagttcgcgtgcttagcggtcagaaaggcagagattcggttcggattgatgcgctggcagcagggcacaaagatctaatgactggcaaatcgctacaaataaattaaagtccggcggctaattaatgagcggactgaagccactttgg
attaaccaaaaaacagcagataaacaaaaacggcaaagaaaattgccacagagttgtcacgctttgttgcacaaacatttgtgcagaaaagtgaaaagcttttagccattattaagtttttcctcagctcgctggcagcacttgcgaatgta
Similar problem: webpage classification
Company homepage vs. University homepage
Robot Control
Now cars can find their own way!
[Figure: linear SVM decision boundaries vs. nonlinear decision boundaries]
Nonconventional clusters
Syllabus
Covers the most commonly used machine learning algorithms, with sufficient detail on their mechanisms.
Organization
Unsupervised learning (data exploration)
Clustering, dimensionality reduction, density estimation, novelty detection
Prerequisites
Probabilities
Distributions, densities, marginalization, conditioning
Basic statistics
Moments, classification, regression, maximum likelihood estimation
Algorithms
Dynamic programming, basic data structures, complexity
Programming
Mostly your choice of language, but Matlab will be very useful
The class will be fast-paced
Ability to deal with abstract mathematical concepts is required
Textbooks
Pattern Recognition and Machine Learning, Chris Bishop
The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman
Machine Learning, Tom Mitchell
Grading
6 assignments (60%)
Approximately 1 assignment every 4 lectures
Start early
Homeworks
Zero credit after each deadline
Collaboration
You may discuss the questions
Each student writes their own answers
Write on your homework the names of anyone with whom you collaborated
Each student must write their own code for the programming part
Staff
Instructor: Le Song, Klaus 1340
More information:
http://www.cc.gatech.edu/~lsong/teaching/CSE6740fall13.html
Today
Probabilities
Independence
Conditional Independence
Random Variables (RV)
Data may contain many different attributes
Age, grade, color, location, coordinate, time
For a continuous random variable X with density p(x): ∫ p(x) dx = 1 and p(x) ≥ 0
Shorthand: P(x) for P(X = x)
Interpretations of probability
Frequentist interpretation
P(A) is the frequency of A in the limit of infinitely many repeated trials
Many arguments against this interpretation
What is the frequency of the event "it will rain tomorrow"?
Subjective interpretation
P(A) is my degree of belief that A will happen
What does "degree of belief" mean?
If P(A) = 0.8, then I am willing to bet on A at the corresponding (4-to-1) odds
Conditional probability
After we have seen Y = y, how do we feel about X = x happening?
P(x | y) is shorthand for P(X = x | Y = y)
Definition: P(x | y) = P(x, y) / P(y)
Two of the most important rules: I. The chain rule
P(x, y) = P(x) P(y | x)
More generally:
P(x1, x2, …, xk) = P(x1) P(x2 | x1) ⋯ P(xk | x1, x2, …, x(k−1))
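As a sanity check, the chain rule factorization can be verified numerically. A minimal Python sketch (the joint table below is invented for illustration):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over binary X (rows) and Y (columns).
P_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

P_x = P_xy.sum(axis=1)             # marginal P(x)
P_y_given_x = P_xy / P_x[:, None]  # conditional P(y|x)

# Chain rule: P(x, y) = P(x) * P(y|x)
reconstructed = P_x[:, None] * P_y_given_x
print(np.allclose(reconstructed, P_xy))  # True
```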
Two of the most important rules: II. Bayes rule
P(x | y) = P(x, y) / P(y) = P(y | x) P(x) / Σ over x′ ∈ Val(X) of P(x′, y)
Here P(y | x) is the likelihood and P(x) is the prior.
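A classic numeric illustration of Bayes' rule applied to a binary test (all numbers below are hypothetical):

```python
# Hypothetical disease-test example: x = has disease, y = test is positive.
p_disease = 0.01                  # prior P(x)
p_pos_given_disease = 0.95        # likelihood P(y|x)
p_pos_given_healthy = 0.05

# Evidence: P(y) = sum over Val(X) of P(y|x) P(x)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(x|y) = P(y|x) P(x) / P(y)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # 0.161
```

Note how a rare prior keeps the posterior low even with an accurate test.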
Independence
X and Y are independent if P(x | y) = P(x) for all x, y
Equivalently, P(x, y) = P(x) P(y)
Conditional independence
Independence is rarely true; conditional independence is more prevalent
X and Y are conditionally independent given Z if and only if P(x, y | z) = P(x | z) P(y | z)
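A small sketch showing that a joint built as P(z) P(x|z) P(y|z) indeed satisfies this factorization (all tables below are made-up illustrative numbers):

```python
import numpy as np

# Made-up conditional tables over binary Z, X, Y.
P_z = np.array([0.4, 0.6])
P_x_given_z = np.array([[0.2, 0.8],   # row z: P(x|z)
                        [0.7, 0.3]])
P_y_given_z = np.array([[0.5, 0.5],   # row z: P(y|z)
                        [0.1, 0.9]])

# Joint P(x, y, z) = P(z) P(x|z) P(y|z), indexed [z, x, y].
P_xyz = P_z[:, None, None] * P_x_given_z[:, :, None] * P_y_given_z[:, None, :]

# Verify P(x, y | z) factorizes as P(x|z) P(y|z) for each z.
P_xy_given_z = P_xyz / P_xyz.sum(axis=(1, 2), keepdims=True)
factorized = P_x_given_z[:, :, None] * P_y_given_z[:, None, :]
print(np.allclose(P_xy_given_z, factorized))  # True
```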
Joint distribution, marginalization
Two random variables: Grade (G) and Intelligence (I)
P(G, I):
          I = VH   I = H
  G = A    0.7     0.1
  G = B    0.15    0.05
For binary variables, the table (multiway array) gets really big:
P(X1, X2, …, Xn) has 2^n entries!
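Using the table above, the marginals come from summing rows or columns. A minimal NumPy sketch:

```python
import numpy as np

# Joint table P(G, I) from the slide: rows G in {A, B}, columns I in {VH, H}.
P_GI = np.array([[0.7,  0.1],
                 [0.15, 0.05]])

P_G = P_GI.sum(axis=1)  # marginalize out I -> P(G) = [P(A), P(B)]
P_I = P_GI.sum(axis=0)  # marginalize out G -> P(I) = [P(VH), P(H)]
print(P_G, P_I)  # marginals: P(G) ≈ [0.8, 0.2], P(I) ≈ [0.85, 0.15]
```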
Marginalization: the general case
Compute the marginal distribution of a subset of variables from P(x1, …, xi, x(i+1), …, xn):
P(x1, …, xi) = Σ over x(i+1), …, xn of P(x1, …, xi, x(i+1), …, xn)
Similarly, P(xi, …, xn) = Σ over x1, …, x(i−1) of P(x1, …, xn)
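In code, the general case is just a sum over the axes of the unwanted variables. An illustrative sketch with a randomly generated joint:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint P(x1, x2, x3) over three ternary variables (for illustration).
P = rng.random((3, 3, 3))
P /= P.sum()  # normalize so the table is a valid distribution

# Marginalize out x2 and x3 to get P(x1): sum over the trailing axes.
P_x1 = P.sum(axis=(1, 2))
print(np.isclose(P_x1.sum(), 1.0))  # True
```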
Example problem
Estimate the probability θ that a biased coin lands heads
Model: Bernoulli, P(x | θ) = θ^x (1 − θ)^(1 − x), i.e.
P(x | θ) = 1 − θ for x = 0, and θ for x = 1
Frequentist Parameter Estimation
Frequentists think of a parameter as a fixed, unknown
constant, not a random variable
MLE for Biased Coin
Objective function: the log likelihood
L(θ; D) = log P(D | θ) = log [θ^#heads (1 − θ)^#tails] = #heads log θ + #tails log(1 − θ)
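A quick sketch verifying that the closed-form MLE, the sample frequency of heads, maximizes this log likelihood (the coin-flip data below is made up):

```python
import numpy as np

# Coin flips: 1 = heads, 0 = tails (made-up sample with 7 heads, 3 tails).
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n_heads = flips.sum()
n_tails = len(flips) - n_heads

# Log likelihood: #heads * log(theta) + #tails * log(1 - theta)
def log_lik(theta):
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

# Closed-form MLE is the sample frequency of heads.
theta_mle = n_heads / len(flips)

# Numerical check against a grid search over (0, 1).
grid = np.linspace(0.01, 0.99, 99)
theta_grid = grid[np.argmax(log_lik(grid))]
print(theta_mle, theta_grid)  # both ≈ 0.7
```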
Bayesian Parameter Estimation
Bayesians treat the unknown parameter θ as a random variable whose distribution can be inferred using Bayes' rule:
P(θ | D) = P(D | θ) P(θ) / P(D)
For iid data, the likelihood is
P(D | θ) = ∏(i=1..n) P(xi | θ) = ∏(i=1..n) θ^xi (1 − θ)^(1 − xi) = θ^#heads (1 − θ)^#tails
Posterior distribution of θ:
P(θ | x1, …, xn) = P(x1, …, xn | θ) P(θ) / P(x1, …, xn)
With a Beta(α, β) prior P(θ) ∝ θ^(α − 1) (1 − θ)^(β − 1):
P(θ | x1, …, xn) ∝ θ^#heads (1 − θ)^#tails · θ^(α − 1) (1 − θ)^(β − 1) = θ^(#heads + α − 1) (1 − θ)^(#tails + β − 1)
α and β are hyperparameters and correspond to the number of virtual heads and tails (pseudo counts)
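A tiny sketch of the Beta-Bernoulli update with pseudo counts (the prior and data values below are invented for illustration):

```python
# Beta(alpha, beta) prior + Bernoulli likelihood -> Beta posterior.
alpha, beta_ = 2, 2          # prior: 2 virtual heads, 2 virtual tails
n_heads, n_tails = 7, 3      # observed data (made up)

# Posterior is Beta(alpha + #heads, beta + #tails).
post_a = alpha + n_heads
post_b = beta_ + n_tails

# Posterior mean pulls the MLE (0.7) toward the prior mean (0.5).
posterior_mean = post_a / (post_a + post_b)
print(posterior_mean)  # 9/14 ≈ 0.643
```

With more data, the pseudo counts matter less and the posterior mean approaches the MLE.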
Bayesian Estimation for Bernoulli
Posterior distribution
P(θ | x1, …, xn) = P(x1, …, xn | θ) P(θ) / P(x1, …, xn) ∝ θ^(#heads + α − 1) (1 − θ)^(#tails + β − 1)
Prior strength: A = α + β
A can be interpreted as the size of an imaginary dataset (α virtual heads, β virtual tails)