
random forests

Amalia Goia
January 12, 2016
Babeș-Bolyai University, Cluj-Napoca
Aims

Provide an overview of how Random Forests work


Understand the approach taken to aspects such as variable
importance and performance measurement
Present recent research topics that extend the original method
Present experiments and setups where the technique is recommended
Analyze advantages and disadvantages

1
Motivation

No silver bullet in ML - however, RF seem to perform well in
practice => curiosity: what ideas are they based on?
Theoretical: a high level of randomness gives good results (see
extra-trees) - why?
Practical: suitable for medicine, -omics data and wide data in general
Fairly simple principle of operation (few parameters to tune)
Because I already know a bit about SVMs, and next on the
to-(roughly)-understand list were tree-based methods

2
Learning

Originally for supervised learning (from examples to hypothesis =>
inductive) - an eager learner
For unsupervised learning, synthetic data are used to compute a
dissimilarity measure (Horvath et al., Unsupervised Learning
With Random Forest Predictors [2006], for detecting tumor sample
clusters) - see the sketch below
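A minimal sketch of the synthetic-data trick, assuming scikit-learn and numpy (not the authors' implementation): permute each column independently to build a synthetic class, train a forest to tell real from synthetic, and turn the trees' leaf co-occurrence (proximity) into a dissimilarity usable for clustering.

```python
# Hedged sketch of an RF-based dissimilarity in the spirit of Horvath et al.;
# assumes scikit-learn and numpy, X is an (n_samples, n_features) numeric array.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_trees=500, random_state=0):
    rng = np.random.default_rng(random_state)
    # Synthetic data: permute each feature column independently,
    # destroying the joint dependence structure of the real data.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    X_all = np.vstack([X, X_synth])
    y_all = np.r_[np.ones(len(X)), np.zeros(len(X_synth))]  # 1 = real, 0 = synthetic

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    forest.fit(X_all, y_all)

    # Proximity: fraction of trees in which two real samples fall in the same leaf.
    leaves = forest.apply(X)                       # (n_samples, n_trees) leaf indices
    prox = np.zeros((len(X), len(X)))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= leaves.shape[1]
    return 1.0 - prox                              # dissimilarity matrix
```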

3
background
Ensemble learning

better than single classifiers
reduce variance in the generalization error
from a statistical point of view, averaging predictions lowers the risk of
choosing a bad hypothesis or getting stuck in a local optimum
(combining multiple hypotheses) > (single one) when representing
the true function
RF = bagging + random subspace method - a minimal sketch follows
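A minimal sketch, in Python with numpy, of the two randomisation steps named above; the helper functions are illustrative, not part of any particular library.

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def bootstrap_sample(X, y, rng):
    # Bagging: each tree sees a bootstrap sample (rows drawn with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def choose_split(X_node, y_node, max_features, rng):
    # Random subspace: only a random subset of features is searched at this node.
    features = rng.choice(X_node.shape[1], size=max_features, replace=False)
    best = None                                    # (weighted impurity, feature, threshold)
    for f in features:
        for t in np.unique(X_node[:, f]):
            left = X_node[:, f] <= t
            if left.all() or (~left).all():
                continue                           # degenerate split, skip
            impurity = (left.mean() * gini(y_node[left])
                        + (~left).mean() * gini(y_node[~left]))
            if best is None or impurity < best[0]: # lowest child impurity = highest gain
                best = (impurity, f, t)
    return best
```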

5
OOB testing and attribute importance

OOB = out-of-bag data
for one data point, count only the votes of trees in whose growing it
was not used
as good an estimate as having a test set of the same size as the
training set
attribute importance - scramble a feature's values across data examples
and measure the damage (illustrated below)
still open: how to deal with missing values
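A hedged illustration with scikit-learn: the OOB score comes directly from bagging, and a feature's importance is approximated by scrambling its values and measuring the drop in accuracy. For brevity the damage is measured on the training data here; Breiman's original scheme permutes within each tree's OOB samples.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy:", forest.oob_score_)          # uses only out-of-bag votes

baseline = forest.score(X, y)
rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])                      # scramble one feature's values
    drop = baseline - forest.score(X_perm, y)      # accuracy damage = importance
    print(f"feature {j}: importance ~ {drop:.4f}")
```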

6
recent research
Oblique Random Forests

Menze et al. in On Oblique Random Forests, [2011]


oblique - the hyperplanes used to separate the data are not parallel to
the axes

different from the original RF: out of the randomly chosen candidate
features for a split, do not use just the single best one
instead, use a combination of all of them (with non-random weights) -
see the sketch below
better performance (especially on spectral and numerical data),
slower training
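A rough sketch of one oblique split, assuming scikit-learn and binary labels; the ridge classifier stands in for the linear models discussed by Menze et al. and is not their exact procedure.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

def oblique_split(X_node, y_node, max_features, rng):
    # Random subspace as in ordinary RF...
    feats = rng.choice(X_node.shape[1], size=max_features, replace=False)
    # ...but the split uses a non-random weighted combination of all of them.
    model = RidgeClassifier().fit(X_node[:, feats], y_node)   # binary labels assumed
    weights = model.coef_.ravel()
    projection = X_node[:, feats] @ weights
    threshold = np.median(projection)              # simple choice of cutting point
    go_left = projection <= threshold
    return feats, weights, threshold, go_left
```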
8
Extremely Randomized Trees

Geurts et al., [2006]


the same (full) data set for each tree - no bagging
randomly chosen cutting point for each candidate feature
extreme version: the split attribute itself is chosen at random (no
search for the maximizing one)
compared with single CART, bagging, the random subspace method
and RF
never score last among the mentioned methods
advantages: speed, triviality of implementation (usage sketched below)
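For reference, scikit-learn ships this method as ExtraTreesClassifier (with bootstrap=False by default, matching the use of the full data set per tree); a minimal, purely illustrative comparison:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)      # 5-fold CV accuracy
    print(Model.__name__, scores.mean().round(3))
```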

9
Decision Jungles

Shotton et al., [2013]


why - data sets with a large number of data points lead to huge trees
how - by using rooted DAGs instead of trees
from root to leaf there can be more than one path
merging nodes improves the generalization capacity
... and uses less memory
however, they take longer to train

10
experiments
Cardiac arrhythmia

improve the diagnosis of cardiac arrhythmia

452 samples, each with 279 features
aim - distinguish between 14 types of arrhythmia
some classes have fewer than 15 samples
feature selection - keep only the 25 most discriminant features
train 150 trees, 10-fold cross-validation
compute accuracy from the confusion matrix -> 90% (setup sketched below)
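A hedged reconstruction of this setup with scikit-learn; synthetic data of the same shape stands in for the arrhythmia table, so the printed accuracy will not reproduce the reported 90%.

```python
# Illustrative pipeline only, not the original authors' code.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Stand-in data with the study's shape: 452 samples, 279 features, 14 classes.
X, y = make_classification(n_samples=452, n_features=279, n_informative=30,
                           n_classes=14, random_state=0)

pipeline = make_pipeline(
    SelectKBest(f_classif, k=25),                  # keep 25 discriminant features
    RandomForestClassifier(n_estimators=150, random_state=0),
)
y_pred = cross_val_predict(pipeline, X, y, cv=10)  # 10-fold cross-validation
print(confusion_matrix(y, y_pred))                 # accuracy is read off this matrix
print("accuracy:", accuracy_score(y, y_pred))
```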

12
Mortality prediction

... short-term, for hospitalized patients with Systemic Lupus
Erythematosus

why - identifying high-risk patients enables better monitoring and
helps avoid death
data set: information about patients older than 18, collected
between 1996 and 2000

13
Mortality prediction

a total of 47 features (age, sex, presence of anemia, nephritis,
psychosis, etc.)

node-level split done by choosing the best out of 7 randomly chosen
features
use the Gini index to compute purity
average of 10 forests, each consisting of 500 trees
109 (2.8%) dying patients out of 3839
most important variables proved to be: Charlson Index, respiratory
failure, the SLE Comorbidity Index, age, sepsis
obtained classification error rate of 10.4% (protocol sketched below)
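A hedged sketch of the reported protocol with scikit-learn on synthetic, similarly imbalanced data; the parameter choices mirror the slide, everything else is illustrative (and slow to run at full size).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in cohort: 3839 patients, 47 features, ~2.8% positive (death) class.
X, y = make_classification(n_samples=3839, n_features=47, n_informative=10,
                           weights=[0.972], random_state=0)

errors, importances = [], []
for seed in range(10):                              # average over 10 forests
    forest = RandomForestClassifier(n_estimators=500, criterion="gini",
                                    max_features=7, random_state=seed)
    acc = cross_val_score(forest, X, y, cv=5).mean()
    errors.append(1.0 - acc)
    importances.append(forest.fit(X, y).feature_importances_)

print("mean error rate:", np.mean(errors).round(3))
print("top 5 features:", np.argsort(np.mean(importances, axis=0))[::-1][:5])
```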

14
Other usage scenarios

Predicting in-vitro drug sensitivity


Real-time 3D face analysis
Astronomy - hyperspectral data

15
Conclusions

+ few parameters to set (the number of attributes considered at each
node split and the number of trees to build)
+ easy to parallelize
+ able to deal with small sample sizes and a high number of
features (as is often the case with -omics data)
- data with many records result in huge trees
- their theoretical foundations are not yet well explored,
but advances are being made in this respect, such as
proving their consistency (Scornet et al., [2015])

16
Thank you for your attention! :)

17
