
BMI 704 – Machine Learning Lab
030719
Topics
• Introduction to Supervised Learning
• Introduction to Unsupervised Learning

• Algorithms and Packages


Supervised Learning
• Outcome
  • You know the outcome (a labelled variable, Y)
  • Continuous or binary
• Features
  • i.e. variables (Xs)
  • Inputs you use to predict the outcome

• Models
  • Diabetes = 0.5*age + 0.2*sex + 2.1*BMI + …
  • Height = 0.2*age + 0.8*sex + 1.3*weight + …
• Using a model
  • 1) Pick a person
  • 2) Substitute their features into the model
  • 3) Now you know their predicted outcome
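
For example, substituting one person's feature values into the hypothetical diabetes model above, in R (both the coefficients and this person's values are illustrative only):

    # Substitute one person's features into the hypothetical diabetes model;
    # the coefficients and feature values are made up for illustration
    age <- 54; sex <- 1; bmi <- 31
    diabetes_score <- 0.5 * age + 0.2 * sex + 2.1 * bmi
    diabetes_score   # the model's predicted outcome for this person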
Where does the predictive model come from?
• 1) Pick an algorithm
  • Linear model
  • Y = X1 + X2 + X3

• 2) Split your data set into training and test sets (e.g. 80/20 or 70/30)

• 3) Build your model using the training data set
  • Use cross-validation to find the best model parameters

• 4) Run your optimized model on the test data set

• 5) Report model performance and your results
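
As a concrete illustration, here is a minimal sketch of steps 2–5 using the caret package (introduced on the Algorithms and Packages slide later); the data frame df and its outcome column y are hypothetical:

    library(caret)

    set.seed(704)
    # 2) split the data 80/20 into training and test sets
    idx   <- createDataPartition(df$y, p = 0.8, list = FALSE)
    train <- df[idx, ]
    test  <- df[-idx, ]

    # 3) build the model on the training set, using 5-fold cross-validation
    #    to pick the best model parameters
    ctrl <- trainControl(method = "cv", number = 5)
    fit  <- train(y ~ ., data = train, method = "glmnet", trControl = ctrl)

    # 4) run the optimized model on the held-out test set
    pred <- predict(fit, newdata = test)

    # 5) report model performance
    postResample(pred, test$y)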


How do you measure how well your algorithm did?
• Loss function
  • An objective metric to maximize or minimize
• Simple regression
  • R² – the amount of variance explained
• Multiple regression with varying model size
  • Adjusted R²
  • AIC/BIC/Cp
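
For example, base R reports all of these regression metrics directly (df, y, x1, x2 are hypothetical):

    # Fit a linear model and read off its fit metrics
    fit <- lm(y ~ x1 + x2, data = df)
    summary(fit)$r.squared       # R^2: variance explained
    summary(fit)$adj.r.squared   # adjusted R^2: penalizes extra predictors
    AIC(fit)                     # Akaike information criterion
    BIC(fit)                     # Bayesian information criterion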
How do you measure how well your algorithm did? (con’t)
• Classification (Y = binary)
  • Receiver operating characteristic (ROC) curve and area under the curve (AUC)
• If Y = 1 or 0:
  • High sensitivity: Y = 1 ➙ Ŷ = 1
  • High specificity: Y = 0 ➙ Ŷ = 0
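
One common way to get the ROC curve and AUC in R is the pROC package (one choice among several); y is the observed 0/1 outcome and prob the model's predicted probabilities, both hypothetical:

    library(pROC)
    roc_obj <- roc(response = y, predictor = prob)  # build the ROC curve
    plot(roc_obj)                                   # sensitivity vs. specificity
    auc(roc_obj)                                    # area under the curve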
Which model (algorithm) should you use?
Unsupervised Learning
• Not interested in predicting Y; instead, exploratory analysis of the Xs
  • Discovering patterns
  • Finding subgroups you don’t know about in advance
  • Visualizing the results
• Results are hard to validate
• Principal component analysis (PCA)
  • X1, X2, X3, X4, …, Xn ➙ create latent variables (principal components, PCs)
  • A few latent variables capture most of the information in the data
    • i.e. the variance explained
  • Variance explained: PC1 > PC2 > PC3 > …

[Figure: score plot and loading plot, with the % variance explained on each axis]
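
A minimal PCA sketch with base R's prcomp, assuming a hypothetical numeric matrix X:

    pca <- prcomp(X, center = TRUE, scale. = TRUE)  # standardize features before PCA
    summary(pca)          # proportion of variance explained: PC1 > PC2 > PC3 > ...
    pca$x[, 1:2]          # scores on PC1 and PC2 (for a score plot)
    pca$rotation[, 1:2]   # loadings on PC1 and PC2 (for a loading plot)
    biplot(pca)           # score plot with the loadings overlaid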
Unsupervised Learning
• Clustering
  • PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance;
  • clustering looks to find homogeneous subgroups among the observations.
• K-means clustering
• Hierarchical clustering
K-means clustering
• Partitions a data set into K distinct, non-overlapping clusters
• You specify how many clusters you want
• The algorithm only finds a local optimum
• Run it a few times to see the different solutions
Hierarchical clustering
• A tree-based representation of the observations, called a dendrogram
• Bottom-up clustering
Algorithms and Packages
• ML algorithms (many, many, many!)
  • Basics: linear-based
  • Shrinkage methods
    • Lasso and ridge regression
    • ElasticNet
  • Non-linear methods
    • Splines
    • Support vector machines
  • Tree-based methods
    • Decision trees
    • Random forests
• Packages in R
  • Individual packages for each algorithm, e.g. glmnet (see the sketch below)
  • Meta-packages, e.g. caret
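
As a taste of the individual-package style, a minimal lasso/ridge sketch with glmnet; X (a numeric predictor matrix) and y are hypothetical:

    library(glmnet)
    cv_fit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 is the lasso; alpha = 0 is ridge
    coef(cv_fit, s = "lambda.min")         # coefficients at the CV-chosen penalty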
Unsupervised Learning (con’t)
• Clustering
  • Partitional methods
  • K-means: partition {x1, …, xn} into K clusters, where K is predefined
    • Build a new partition by associating each point with the nearest centroid
    • Compute the centroid (mean point) of each cluster; repeat until convergence
  • The “kmeans” function in R (see the sketch below)
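
A minimal sketch of the kmeans function, assuming a hypothetical numeric matrix X:

    set.seed(704)
    # nstart = 20 reruns the algorithm from 20 random starts, since k-means
    # only finds a local optimum
    km <- kmeans(X, centers = 3, nstart = 20)
    km$cluster   # cluster assignment for each observation
    km$centers   # the K centroids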
Unsupervised Learning (con’t)
• Not interested in predicting, but in discovering patterns
  • Find subgroups you don’t know about in advance
  • Visualize the results
• Principal component analysis
• Clustering
  • Hierarchical clustering: build a hierarchy of clusters (see the sketch below)
    • Agglomerative: a “bottom up” approach. You start with each element in a separate cluster, then merge them according to a given property.
    • Divisive: a “top down” approach. All elements start in one all-inclusive cluster, then you split recursively.
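
A minimal sketch of agglomerative (“bottom up”) clustering with base R's hclust; X is a hypothetical numeric matrix:

    d  <- dist(X)                        # pairwise distances between observations
    hc <- hclust(d, method = "complete") # merge clusters bottom-up
    plot(hc)                             # the dendrogram
    cutree(hc, k = 3)                    # cut the tree into 3 clusters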
