ALGORITHM
2016
Random forests are an instance of the general technique of random decision forests, an
ensemble learning method for classification, regression and other tasks that operates by
constructing a multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Random decision forests correct for decision trees' habit of overfitting to their training set.
The algorithm for inducing Breiman's random forest was developed by Leo
Breiman and Adele Cutler, and "Random Forests" is their trademark.[6] The method combines
Breiman's "bagging" idea with the random selection of features, introduced independently by
Ho and by Amit and Geman, in order to construct a collection of decision trees with controlled
variance.
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the
classifier be M.
2. Let m be the number of input variables used to determine the
decision at a node of the tree; m should be much less than M.
3. Choose a training set for this tree by choosing N times with replacement from
all N available training cases (i.e. take a bootstrap sample). Use the rest of the
cases to estimate the error of the tree, by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the
decision at that node. Calculate the best split based on these m variables in the
training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a
normal tree classifier).
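Steps 3 and 4 above can be sketched in plain Python (a minimal illustration with hypothetical helper names, not a full tree-induction routine):

```python
import random

def bootstrap_sample(case_ids, seed=None):
    # Step 3: draw N cases with replacement; cases never drawn are "out-of-bag".
    rng = random.Random(seed)
    n = len(case_ids)
    sample = [rng.choice(case_ids) for _ in range(n)]
    drawn = set(sample)
    oob = [c for c in case_ids if c not in drawn]
    return sample, oob

def candidate_features(M, m, seed=None):
    # Step 4: at each node, choose m of the M variables at random (m << M);
    # the best split at that node is then computed over these m variables only.
    rng = random.Random(seed)
    return rng.sample(range(M), m)
```

In a full implementation, `bootstrap_sample` would be called once per tree and `candidate_features` once per node, with the split chosen by an impurity criterion such as Gini over the m candidates.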
The forest can then be run again using only the most important variables from the first run.
For each case, consider all the trees for which it is out-of-bag (oob). Subtract the percentage of
votes for the correct class in the variable-m-permuted oob data from the percentage of votes
for the correct class in the untouched oob data.
Random forest comes at the expense of some loss of interpretability, but generally
greatly boosts the performance of the final model.
Estimating the importance of each predictor:
Denote by E the OOB estimate of the loss when using the original training set D.
For each predictor x_p, where p is in {1, ..., K}:
Randomly permute the p-th predictor to generate a new set of samples
D' = {(y_1, x'_1), ..., (y_N, x'_N)}.
Compute E_p, the OOB estimate of the prediction error with the new samples.
A measure of importance of predictor x_p is E_p - E, the increase in error due to random
perturbation of the p-th predictor.
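The permutation-importance procedure can be sketched in plain Python (a minimal illustration with hypothetical helper names; a real forest would compute this per tree on its own out-of-bag cases and average):

```python
import random

def oob_error(predict, oob_cases):
    # Fraction of out-of-bag cases (x, y) that the model misclassifies.
    wrong = sum(1 for x, y in oob_cases if predict(x) != y)
    return wrong / len(oob_cases)

def importance(predict, oob_cases, p, seed=None):
    # Increase in OOB error after randomly permuting predictor p.
    rng = random.Random(seed)
    base = oob_error(predict, oob_cases)
    xs = [list(x) for x, _ in oob_cases]
    ys = [y for _, y in oob_cases]
    col = [x[p] for x in xs]          # extract the p-th column...
    rng.shuffle(col)                  # ...permute it...
    for x, v in zip(xs, col):
        x[p] = v                      # ...and write it back
    return oob_error(predict, list(zip(xs, ys))) - base
```

An uninformative predictor leaves the error unchanged (importance near zero), while permuting a predictor the model relies on drives the error, and hence the importance score, up.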
The number of trees necessary for good performance grows with the number
of predictors. The best way to determine how many trees are necessary is to compare
predictions made by a forest to predictions made by a subset of a forest. When the
subsets work as well as the full forest, you have enough trees.
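That comparison can be sketched as follows, assuming each tree is represented as a predict function (hypothetical helper names, not a library API):

```python
from collections import Counter

def forest_vote(trees, x):
    # Majority vote over the per-tree predictions for input x.
    return Counter(t(x) for t in trees).most_common(1)[0][0]

def subset_agreement(trees, k, inputs):
    # Fraction of inputs where the first k trees already agree with the
    # full forest; when this is close to 1.0, k trees are enough.
    agree = sum(forest_vote(trees[:k], x) == forest_vote(trees, x) for x in inputs)
    return agree / len(inputs)
```

Growing k until `subset_agreement` plateaus near 1.0 gives a practical stopping point for the number of trees.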
4. Conclusion
Random Forest is fast to build, and even faster to predict.
Automatic predictor selection from a large number of candidates.
Resistance to overtraining.
Ability to handle data without preprocessing:
data does not need to be rescaled, transformed, or modified;
resistant to outliers;
automatic handling of missing values.
Cluster identification: sample proximities can be used to generate
tree-based clusters.
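The proximity idea can be sketched as follows, assuming each tree reports the index of the leaf it assigns to every case (a hypothetical `leaf_ids` layout, not a specific library's output):

```python
def proximity_matrix(leaf_ids):
    # leaf_ids[t][i] is the leaf that tree t sends case i to.
    # Proximity of cases i and j = fraction of trees placing them
    # in the same leaf; this matrix can then feed a clustering method.
    n_trees, n_cases = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for i in range(n_cases):
        for j in range(n_cases):
            same = sum(1 for t in range(n_trees)
                       if leaf_ids[t][i] == leaf_ids[t][j])
            prox[i][j] = same / n_trees
    return prox
```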
5. References
Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König -
Overview of Random Forest Methodology and Practical Guidance with Emphasis on
Computational Biology and Bioinformatics, July 25, 2012
Yanli Liu, Yourong Wang, Jian Zhang - New Machine Learning Algorithm: Random
Forest
http://stats.stackexchange.com/questions/2344/best-way-to-present-a-random-forest-in-a-publication
https://www.researchgate.net/post/How_to_determine_the_number_of_trees_to_be_generated_in_Random_Forest_algorithm
https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#features
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213
http://www.listendata.com/2014/11/random-forest-with-r.html
http://wgrass.media.osaka-cu.ac.jp/gisideas10/viewpaper.php?id=342
https://en.wikipedia.org/wiki/Random_forest#Algorithm
https://epub.ub.uni-muenchen.de/13766/1/TR.pdf
http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf