
Introduction to Machine Learning

Ibrahim Sabek
Computer and Systems Engineering Department, Faculty of Engineering, Alexandria University, Egypt


Agenda
1. Machine learning overview and applications
2. Supervised vs. Unsupervised learning
3. Generative vs. Discriminative models
4. Overview of Classification
5. The big picture
6. Bayesian inference
7. Summary
8. Feedback

Machine learning overview and applications

What is Machine Learning (ML)?


Definition: algorithms for inferring unknowns from knowns.
What do we mean by "inferring"? How do we get unknowns from knowns?

ML applications
Spam detection
Handwriting recognition
Speech recognition
Netflix recommendation system

Classes of ML models
Supervised vs. Unsupervised
Generative vs. Discriminative


Supervised vs. Unsupervised learning

Supervised vs. Unsupervised


Supervised: Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), choose a function f such that f(x_i) = y_i
x_i ∈ R^2 (data points), y_i = class/value
Classification: y_i ∈ {finite set}
Regression: y_i ∈ R

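As a small illustration of the supervised setting above, here is a minimal sketch (assuming NumPy is available, with made-up toy data): a least-squares fit for the regression case (y_i ∈ R) and a nearest-class-mean rule for the classification case (y_i in a finite set). Both model choices are illustrative, not something prescribed by the slides.

```python
import numpy as np

# Toy supervised data: x_i in R^2, with a real-valued target and a class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                             # data points x_i
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)  # regression target, y_i in R
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)                               # classification target, y_i in {0, 1}

# Regression: choose f(x) = w . x by ordinary least squares.
w, *_ = np.linalg.lstsq(X, y_reg, rcond=None)

# Classification: a minimal "nearest class mean" rule.
means = np.array([X[y_cls == c].mean(axis=0) for c in (0, 1)])

x_new = np.array([0.5, -1.0])
print("regression prediction:", x_new @ w)
print("class prediction:", np.argmin(np.linalg.norm(means - x_new, axis=1)))
```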

Supervised vs. Unsupervised learning

Supervised vs. Unsupervised


Unsupervised: Given (x_1, x_2, ..., x_n), find patterns in the data.
x_i ∈ R^2 (data points)
Clustering
Density estimation
Dimensionality reduction

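As an illustration of the clustering bullet above, here is a minimal k-means (Lloyd's algorithm) sketch, assuming NumPy and made-up two-blob data; k = 2 and the iteration count are illustrative choices.

```python
import numpy as np

# Unlabeled points x_i in R^2 drawn from two illustrative blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[-2, 0], size=(50, 2)),
               rng.normal(loc=[+2, 0], size=(50, 2))])

# k-means (Lloyd's algorithm): alternate cluster assignment and centroid updates.
# (A robust implementation would also handle clusters that become empty.)
k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("cluster centers:\n", centers)
```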

Supervised vs. Unsupervised learning

Variations on Supervised and Unsupervised


Semi-supervised: Given (x_1, y_1), (x_2, y_2), ..., (x_k, y_k), x_{k+1}, x_{k+2}, ..., x_n, predict y_{k+1}, y_{k+2}, ..., y_n
Active learning: the learner chooses which unlabeled points x_i to query for labels


Supervised vs. Unsupervised learning

Variations on Supervised and Unsupervised


Decision theory: measure the prediction performance on unlabeled data

Reinforcement learning:
maximize rewards (minimize losses) by taking actions
maximize the overall lifetime reward
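To make the reinforcement-learning bullets concrete, here is a minimal epsilon-greedy multi-armed-bandit sketch, assuming NumPy; the reward probabilities, epsilon = 0.1, and the horizon are made-up illustrative values. The agent picks actions, observes rewards, and accumulates lifetime reward.

```python
import numpy as np

rng = np.random.default_rng(2)
true_reward_prob = np.array([0.2, 0.5, 0.8])   # assumed environment, unknown to the agent
n_actions = len(true_reward_prob)
value_est = np.zeros(n_actions)                # running estimate of each action's reward
counts = np.zeros(n_actions)
total_reward = 0.0

for t in range(1000):
    # epsilon-greedy: mostly exploit the best-looking action, sometimes explore.
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(value_est))
    r = float(rng.random() < true_reward_prob[a])    # Bernoulli reward for the chosen action
    counts[a] += 1
    value_est[a] += (r - value_est[a]) / counts[a]   # incremental mean update
    total_reward += r

print("estimated values:", value_est.round(2), "lifetime reward:", total_reward)
```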


Generative vs. Discriminative models

Generative vs. Discriminative models


Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), and a new point (x, y)

Discriminative:
you want to estimate p(y = 1 | x) and p(y = 0 | x) for y ∈ {0, 1}

Generative:
you want to estimate the joint distribution p(x, y)

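A minimal sketch of the contrast, assuming scikit-learn is available and using made-up Gaussian data: logistic regression models p(y | x) directly (discriminative), while Gaussian naive Bayes models p(x | y) p(y) and derives p(y | x) via Bayes' rule (generative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # discriminative: models p(y | x)
from sklearn.naive_bayes import GaussianNB             # generative: models p(x | y) p(y)

rng = np.random.default_rng(3)
X0 = rng.normal(loc=[-1, -1], size=(100, 2))
X1 = rng.normal(loc=[+1, +1], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

disc = LogisticRegression().fit(X, y)
gen = GaussianNB().fit(X, y)

x_new = np.array([[0.3, 0.2]])
print("discriminative p(y=1|x):", disc.predict_proba(x_new)[0, 1])
print("generative     p(y=1|x):", gen.predict_proba(x_new)[0, 1])  # via Bayes' rule from p(x, y)
```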

Overview of Classification

k-Nearest Neighbor classification (kNN)


Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new point (x, y), where x_i ∈ R, y_i ∈ {0, 1}

Dissimilarity metric: d(x, x') = ||x − x'||_2 (k = 1 case)

Probabilistic interpretation:
Given fixed k, p(y | x, D) = fraction of points x_i in N_k(x) s.t. y_i = y
ŷ = argmax_y p(y | x, D)
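A minimal NumPy sketch of the kNN rule above, using the Euclidean dissimilarity, the neighborhood N_k(x), and the argmax over label fractions; the toy data and the choice k = 5 are illustrative assumptions.

```python
import numpy as np

def knn_predict(X, y, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    dists = np.linalg.norm(X - x_new, axis=1)         # d(x, x_i) = ||x - x_i||_2
    neighbors = np.argsort(dists)[:k]                 # indices of N_k(x)
    labels, votes = np.unique(y[neighbors], return_counts=True)
    return labels[np.argmax(votes)]                   # argmax_y p(y | x, D)

# Illustrative dataset with y_i in {0, 1}.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([0.8, 0.9]), k=5))
```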

Overview of Classification

Classification trees (CART)


Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new x, where x_i ∈ R, y_i ∈ {0, 1}
Build a binary tree by recursively splitting the data
Minimize the error in each leaf

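A minimal sketch of a classification tree, assuming scikit-learn's DecisionTreeClassifier as an off-the-shelf CART-style learner; the data and max_depth are illustrative. The same API with DecisionTreeRegressor covers the regression-tree case on the next slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # CART-style binary tree

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(200, 1))            # x_i in R
y = (X[:, 0] > 0.5).astype(int)                  # y_i in {0, 1}

# Each split is chosen to reduce the impurity (error) of the resulting leaves.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[0.7], [-1.2]]))             # predicted classes for new x values
```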

Overview of Classification

Regression trees (CART)


Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} and a new x, where x_i ∈ R, y_i ∈ R


Overview of Classification

Bootstrap aggregation (Bagging)


Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} drawn i.i.d. from a distribution P, and a new x where x_i ∈ R, y_i ∈ R; we need to find its y value

Intuition: averaging makes your prediction closer to the true label
Different training datasets D_i are generated by drawing points (x_k, y_k) uniformly from D, i.i.d. (i.e., sampling with replacement)
The final label y is the average of the labels generated from the different datasets

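A minimal sketch of bagging, assuming NumPy and scikit-learn: each round draws a bootstrap sample uniformly from D with replacement, fits a tree to it, and the final prediction is the average over rounds; B = 25 and the base learner are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)    # y_i in R

B = 25
x_new = np.array([[1.0]])
preds = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))            # uniform(D), i.i.d. = bootstrap sample
    tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
    preds.append(tree.predict(x_new)[0])

print("bagged prediction:", np.mean(preds), "vs sin(1.0) =", np.sin(1.0))
```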

Overview of Classification

Random forests
Given D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i ∈ R, y_i ∈ R
For i = 1, ..., B:
Choose a bootstrap sample D_i from D
Construct a tree T_i using D_i s.t. at each node we choose a random subset of features and only consider splitting on these features.

Given x, take the majority vote (for classification) or the average (for regression).

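A minimal sketch assuming scikit-learn's RandomForestRegressor, which implements the recipe above: n_estimators plays the role of B, and max_features controls the random feature subset considered at each split; all settings and data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# n_estimators = B bootstrap trees; max_features = size of the random feature subset per split.
forest = RandomForestRegressor(n_estimators=100, max_features=2, random_state=0).fit(X, y)
print(forest.predict(X[:3]))      # averaged predictions over the B trees
```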

The big picture

The big picture


Given the expected loss E[L(y, f(x))] and D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i ∈ R, y_i ∈ R, we want to estimate p(y | x)

Discriminative: Estimate p(y | x) directly using D.
KNN, Trees, SVM

Generative: Estimate p(x, y) directly using D, and then


p(y | x) = p(x, y) / p(x), and also p(x, y) = p(x | y) p(y)

Parameters/Latent variables θ: by including parameters, we have p(x, y | θ)


For a discrete parameter space: p(y | x, D) = Σ_θ p(y | x, D, θ) p(θ | x, D)
p(y | x, D, θ) is nice
p(θ | x, D) is nasty (called the posterior distribution on θ)
The summation (or integration, for a continuous θ) is nasty and often intractable
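A small worked example of the discrete-θ sum above, with a made-up three-point parameter space and an assumed posterior p(θ | x, D); it just weights each p(y | x, θ) by the posterior and sums.

```python
import numpy as np

# Illustrative discrete parameter space with three values of theta.
thetas = np.array([0.2, 0.5, 0.8])          # each theta is p(y=1 | x, theta) in this toy model
posterior = np.array([0.1, 0.3, 0.6])       # assumed p(theta | x, D), sums to 1

# p(y=1 | x, D) = sum_theta p(y=1 | x, D, theta) * p(theta | x, D)
p_y1 = np.sum(thetas * posterior)
print("p(y=1 | x, D) =", p_y1)              # 0.02 + 0.15 + 0.48 = 0.65
```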

The big picture

The big picture


p(y | x, D) = Σ_θ p(y | x, D, θ) p(θ | x, D)

Exact inference:
Multivariate Gaussian
Graphical models

Point estimate of θ
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori (MAP) estimation: θ_MAP = argmax_θ p(θ | x, D)

Deterministic Approximation
Laplace approximation
Variational methods

Stochastic Approximation
Importance sampling
Gibbs sampling
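A small worked example of the two point estimates above for a Beta-Bernoulli coin model; the counts (7 heads in 10 flips) and the Beta(2, 2) prior are assumptions, and the closed-form MLE and MAP formulas for this model are used.

```python
# Coin-flip model: theta = p(heads); data D = k heads out of n flips (assumed counts).
k, n = 7, 10
a, b = 2, 2                    # assumed Beta(a, b) prior on theta

theta_mle = k / n                          # argmax_theta p(D | theta)
theta_map = (k + a - 1) / (n + a + b - 2)  # argmax_theta p(theta | D) with the Beta prior

print("MLE:", theta_mle)       # 0.7
print("MAP:", theta_map)       # 8 / 12 = 0.667, pulled toward the prior mean 0.5
```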

Bayesian inference

Bayesian inference
Put distributions on everything, then use the rules of probability to infer values

Aspects of Bayesian inference
Priors: assume a prior distribution p(θ)
Procedures: minimize the expected loss (averaging over θ)

Pros:
Directly answers questions
Avoids overfitting

Cons:
Must assume a prior
Exact computation can be intractable


Bayesian inference

Directed graphical models


Bayesian networks, a.k.a. conditional independence diagrams:
Why? Tractable inference.

Factorization of the probabilistic model
Notational device
Visualization for inference algorithms
Example of thinking graphically about p(a, b, c):
p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)

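A small numeric check of the factorization above, with made-up conditional probability tables for binary a, b, c; multiplying p(c | a, b) p(b | a) p(a) gives a valid joint that sums to 1.

```python
import itertools

# Assumed conditional probability tables for binary a, b, c.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}               # p_b_given_a[a][b]
p_c_given_ab = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}     # p_c_given_ab[(a, b)][c]

def p_abc(a, b, c):
    # p(a, b, c) = p(c | a, b) p(b | a) p(a)
    return p_c_given_ab[(a, b)][c] * p_b_given_a[a][b] * p_a[a]

total = sum(p_abc(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
print("p(a=1, b=0, c=1) =", p_abc(1, 0, 1))
print("sum over all (a, b, c) =", total)     # should be 1.0
```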

Summary

Summary
Machine learning is an essential field in our lives.
Machine learning is a broad world; we have only just started exploring it in this session :D


Feedback

Feedback
Your feedback is welcome at alex.acm.org/feedback/machine/

