ALGORITHM
2016
Random forests are an instance of the general technique of random decision forests, an
ensemble learning method for classification, regression and other tasks that operates by
constructing a multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Random decision forests correct for decision trees' habit of overfitting to their training set.
The algorithm for inducing Breiman's random forest was developed by Leo
Breiman and Adele Cutler, and "Random Forests" is their trademark.[6] The method combines
Breiman's "bagging" idea with the random selection of features, introduced independently by
Ho and by Amit and Geman, in order to construct a collection of decision trees with controlled
variance.
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the
classifier be M.
2. Let m be the number of input variables used to determine the
decision at a node of the tree; m should be much less than M.
3. Choose a training set for this tree by choosing N times with replacement from
all N available training cases (i.e. take a bootstrap sample). Use the rest of the
cases to estimate the error of the tree, by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the
decision at that node. Calculate the best split based on these m variables in the
training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a
normal tree classifier).
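Steps 3 and 4 above can be sketched in plain Python (a minimal illustration with hypothetical helper names, not a full tree-induction routine):

```python
import random

def bootstrap_sample(case_ids, seed=None):
    # Step 3: draw N cases with replacement; cases never drawn are "out-of-bag".
    rng = random.Random(seed)
    n = len(case_ids)
    sample = [rng.choice(case_ids) for _ in range(n)]
    drawn = set(sample)
    oob = [c for c in case_ids if c not in drawn]
    return sample, oob

def candidate_features(M, m, seed=None):
    # Step 4: at each node, choose m of the M variables at random (m << M);
    # the best split at that node is then computed over these m variables only.
    rng = random.Random(seed)
    return rng.sample(range(M), m)
```

In a full implementation, `bootstrap_sample` would be called once per tree and `candidate_features` once per node, with the split chosen by an impurity criterion such as Gini over the m candidates.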
The forest can then be run again using only the most important variables from the first run.
For each case, consider all the trees for which it is out-of-bag (oob). Subtract the percentage of
votes for the correct class in the variable-m-permuted oob data from the percentage of votes
for the correct class in the untouched oob data.
Random forest comes at the expense of some loss of interpretability, but generally
greatly boosts the performance of the final model.
Estimating the importance of each predictor:
Denote by E the OOB estimate of the loss when using the original training set D.
For each predictor x_p, where p is in {1, ..., K}:
Randomly permute the p-th predictor to generate a new set of samples
D' = {(y_1, x'_1), ..., (y_N, x'_N)}.
Compute E_p, the OOB estimate of the prediction error with the new samples.
A measure of importance of predictor x_p is E_p - E, the increase in error due to random
perturbation of the p-th predictor.
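The permutation-importance procedure can be sketched in plain Python (a minimal illustration with hypothetical helper names; a real forest would compute this per tree on its own out-of-bag cases and average):

```python
import random

def oob_error(predict, oob_cases):
    # Fraction of out-of-bag cases (x, y) that the model misclassifies.
    wrong = sum(1 for x, y in oob_cases if predict(x) != y)
    return wrong / len(oob_cases)

def importance(predict, oob_cases, p, seed=None):
    # Increase in OOB error after randomly permuting predictor p.
    rng = random.Random(seed)
    base = oob_error(predict, oob_cases)
    xs = [list(x) for x, _ in oob_cases]
    ys = [y for _, y in oob_cases]
    col = [x[p] for x in xs]          # extract the p-th column...
    rng.shuffle(col)                  # ...permute it...
    for x, v in zip(xs, col):
        x[p] = v                      # ...and write it back
    return oob_error(predict, list(zip(xs, ys))) - base
```

An uninformative predictor leaves the error unchanged (importance near zero), while permuting a predictor the model relies on drives the error, and hence the importance score, up.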
The number of trees necessary for good performance grows with the number
of predictors. The best way to determine how many trees are necessary is to compare
predictions made by a forest to predictions made by a subset of a forest. When the
subsets work as well as the full forest, you have enough trees.
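That comparison can be sketched as follows, assuming each tree is represented as a predict function (hypothetical helper names, not a library API):

```python
from collections import Counter

def forest_vote(trees, x):
    # Majority vote over the per-tree predictions for input x.
    return Counter(t(x) for t in trees).most_common(1)[0][0]

def subset_agreement(trees, k, inputs):
    # Fraction of inputs where the first k trees already agree with the
    # full forest; when this is close to 1.0, k trees are enough.
    agree = sum(forest_vote(trees[:k], x) == forest_vote(trees, x) for x in inputs)
    return agree / len(inputs)
```

Growing k until `subset_agreement` plateaus near 1.0 gives a practical stopping point for the number of trees.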
4. Conclusion
Random Forest is fast to build, and even faster to predict.
Automatic predictor selection from a large number of candidates.
Resistance to overtraining.
Ability to handle data without preprocessing:
data does not need to be rescaled, transformed, or modified;
resistant to outliers;
automatic handling of missing values.
Cluster identification: sample proximities can be used to generate
tree-based clusters.
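The proximity idea can be sketched as follows, assuming each tree reports the index of the leaf it assigns to every case (a hypothetical `leaf_ids` layout, not a specific library's output):

```python
def proximity_matrix(leaf_ids):
    # leaf_ids[t][i] is the leaf that tree t sends case i to.
    # Proximity of cases i and j = fraction of trees placing them
    # in the same leaf; this matrix can then feed a clustering method.
    n_trees, n_cases = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for i in range(n_cases):
        for j in range(n_cases):
            same = sum(1 for t in range(n_trees)
                       if leaf_ids[t][i] == leaf_ids[t][j])
            prox[i][j] = same / n_trees
    return prox
```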
5. References
Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König -
Overview of Random Forest Methodology and Practical Guidance with Emphasis on
Computational Biology and Bioinformatics, July 25, 2012
Yanli Liu, Yourong Wang, Jian Zhang - New Machine Learning Algorithm: Random
Forest
http://stats.stackexchange.com/questions/2344/best-way-to-present-a-random-forest-in-a-publication
https://www.researchgate.net/post/How_to_determine_the_number_of_trees_to_be_generated_in_Random_Forest_algorithm
https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#features
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213
http://www.listendata.com/2014/11/random-forest-with-r.html
http://wgrass.media.osaka-cu.ac.jp/gisideas10/viewpaper.php?id=342
https://en.wikipedia.org/wiki/Random_forest#Algorithm
https://epub.ub.uni-muenchen.de/13766/1/TR.pdf
http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf