
DECISION TREES

GENERAL INFO

Decision trees are a type of supervised machine learning.
They use known training data to create a process that predicts the results of that data. That process can then be used to predict the results for unknown data.
A decision tree processes data into groups based on the values of the features it is provided.
At each decision gate, the data gets split into two separate branches, which might be split again.
The final groupings of the data are called leaf nodes.
Decision trees can be used for regression to get a real numeric value, or for classification to split data into distinct categories.

PICKING THE SPLITS

Decision trees look through every possible split and pick the best split between two adjacent points in a given feature.
If the feature is real-valued, the split occurs halfway between the two points.
Splits are always selected on a single feature only, not on any interaction between multiple features.
That means if there is a relationship between multiple features, for instance, if density is the critical feature but the data is provided in volume and mass, the decision tree will split the data as a series of splits on single features.
It will not do what a human might do, shown below.
[Figure: the split a human might choose]

Time Complexity
For a single decision tree, the most expensive operation is picking the best splits.
Picking the best split requires evaluating every possible split.
Finding every possible split requires sorting all the data on every feature.
That means for M features and N points, this takes O(M * N lg(N)) time.
For machine learning algorithms that use multiple decision trees, those sorts can be done a single time and cached.

Overfitting
Overfitting, which means matching the training data too closely at the expense of performance on the test data, is a key concern for decision trees.
Different stopping criteria should be evaluated with cross-validation to mitigate overfitting.

STOPPING CRITERIA

Additional splits can occur until a stopping criterion is reached.
Common stopping criteria are:
Maximum Depth: the maximum number of splits in a row has been reached
Maximum Leaves: don't allow any more leaves
Min Samples Split: only split a node if it has at least this many samples
Min Samples Leaf: only split a node if both resulting children have at least this many samples
Min Weight Fraction: only split a node if it has at least this percentage of the total samples

RANDOMIZATION

When a single decision tree is run, it usually looks at every point in the available data.
However, some algorithms combine multiple decision trees to capture their benefits while mitigating overfitting.
Two widely used algorithms that use multiple decision trees are Random Forests and Gradient Boosted Trees.
Random Forests use a large number of slightly different decision trees run in parallel and averaged to get a final result.
Gradient Boosted Trees use decision trees run in series, with later trees correcting the errors of previous trees.
If multiple trees are combined, it can be advantageous to have a random element in the creation of the trees, which can mitigate overfitting.
Bagging (short for bootstrap aggregating) is drawing samples from the data set with replacement.
If you had 100 data points, you would randomly draw 100 points and on average get about 63.2% unique points and 36.8% repeats.
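The 63.2% figure for bagging can be checked with a short sketch. The function names below are hypothetical and only the Python standard library is used. The closed form 1 - (1 - 1/N)^N is the expected fraction of distinct points when drawing N times with replacement from N points; it approaches 1 - 1/e (about 0.632) as N grows, and is roughly 0.634 for N = 100.

```python
import math
import random


def expected_unique_fraction(n):
    """Expected fraction of distinct points when drawing n times
    with replacement from n points."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, trials=2000, seed=0):
    """Estimate the same fraction by repeatedly drawing bootstrap samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total = 0.0
    for _ in range(trials):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        total += len(set(sample)) / n  # fraction of distinct indices drawn
    return total / trials


if __name__ == "__main__":
    print("closed form, N=100:", expected_unique_fraction(100))
    print("simulated,   N=100:", simulated_unique_fraction(100))
    print("limit 1 - 1/e:     ", 1.0 - 1.0 / math.e)
```

The simulated value should land close to the closed form, with the remaining draws (about 36.8%) being repeats.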

Visit www.FairlyNerdy.com for more statistics and machine learning information
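The split search described under PICKING THE SPLITS can be sketched as follows. This is a minimal illustration, not how any particular library implements it: `best_split` and `sse` are hypothetical names, candidate splits are scored by summed squared error (an assumption; other criteria work the same way), and each candidate is re-scored from scratch for clarity, whereas practical implementations update running sums after the one-time sort.

```python
def sse(values):
    """Sum of squared differences from the mean of a group."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature_index, threshold, score) for the lowest-error split.

    rows: list of feature lists, one per data point
    targets: list of numeric target values, same length as rows
    """
    best = None
    for f in range(len(rows[0])):
        # Sort point indices on this single feature (N lg N per feature).
        order = sorted(range(len(rows)), key=lambda i: rows[i][f])
        for k in range(1, len(order)):
            lo = rows[order[k - 1]][f]
            hi = rows[order[k]][f]
            if lo == hi:
                continue  # no split exists between identical values
            threshold = (lo + hi) / 2.0  # halfway between adjacent points
            left = [targets[i] for i in order[:k]]
            right = [targets[i] for i in order[k:]]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


if __name__ == "__main__":
    rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
    targets = [0.0, 0.0, 10.0, 10.0]
    print(best_split(rows, targets))
```

Each candidate looks at one feature at a time, so a relationship that only appears in a combination of features (such as mass divided by volume) is never tested directly.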

Sub-Sampling is drawing a smaller set than your data set. For instance, if you have 100 points, randomly draw 60 of them (typically without replacement).
Random Forests tend to use bagging; Gradient Boosted Trees tend to use sub-sampling.
Additionally, the features the tree splits on can be randomized. When searching for the optimum split, a subset of the features can be evaluated (frequently the square root of the number of features).

REGRESSION DECISION TREES

Regression trees default to splitting data based on mean squared error (MSE).
To calculate MSE: take the average value of all points in each group, and for each point in that group subtract the average value from the true value, square the error, sum all of the squares and take the average.
The chart below shows data that was split in order to generate two new groups that had the lowest possible MSE.
[Chart: data split into two groups with the lowest possible MSE]
MSE by default tends to focus on points with the highest error and split them into their own groups, since their error has a squared effect.
Mean Absolute Error (MAE) is also an option for picking splits instead of MSE.
To calculate MAE: take the median (not mean) of all points in the group, and for each point subtract the median value from the true value, take the absolute value of that error, sum the results and take the average.

EXAMPLE REGRESSION

This is an example of a regression tree attempting to match a sine wave using a single split.
[Figure: regression tree fit to a sine wave, single split]
Below is an example of a depth 2 regression tree, which gets 3 splits (i.e. 4 final groups).
[Figure: depth 2 regression tree]

INFORMATION GAIN

Decision trees pick their splits in order to maximize Information Gain.
Information is a metric which tells you how much error you probably have in your decision tree.
Information gain is how much less error you have after the split than before the split.

Regression Tree Information Gain
Regression trees use either MSE or MAE as their metric.
The information gain is the MSE / MAE before the split minus the MSE / MAE after the split (weighted by the number of data points that split operated on).
The best MSE / MAE is zero.

Classification Information Gain
Classification trees measure their information gain either by Entropy or Gini Impurity.
The best result for those values is also zero, and the worst result occurs when every category has an equal likelihood of being at any given node.
Gini equation
$G = 1 - \sum_i p_i^2$
Where $p_i$ is the probability of having a given data class $i$ in your data set.
Entropy Equation
$H = -\sum_i p_i \log_2 p_i$
Those results end up being fairly similar.

FEATURE IMPORTANCES

Decision trees lend themselves to identifying which features were the most important.
This is done by calculating which feature resulted in the most information gain (weighted by number of data points), across all of the splits.
The information gain can be MSE / MAE for regression trees or Entropy / Gini for classification trees.
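The error metrics and the information gain calculation described in this sheet can be sketched in a few lines. All function names here are hypothetical. Two assumptions are made: entropy uses a base-2 logarithm, and each child's error is weighted by its share of the parent's points, which is one common reading of "weighted by the number of data points".

```python
import math
import statistics
from collections import Counter


def mse(values):
    """Mean squared error of a group around its mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def mae(values):
    """Mean absolute error of a group around its median."""
    med = statistics.median(values)
    return sum(abs(v - med) for v in values) / len(values)


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits (base-2 log, an assumption of this sketch)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, metric):
    """Error before the split minus the size-weighted error after it."""
    n = len(parent)
    after = (len(left) / n) * metric(left) + (len(right) / n) * metric(right)
    return metric(parent) - after


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(information_gain(labels, ["a", "a"], ["b", "b"], gini))
    print(information_gain(labels, ["a", "a"], ["b", "b"], entropy))
    values = [0.0, 0.0, 10.0, 10.0]
    print(information_gain(values, [0.0, 0.0], [10.0, 10.0], mse))
```

Passing the metric as an argument keeps one gain function for both tree types: `mse` or `mae` for regression, `gini` or `entropy` for classification. A perfectly pure group scores zero under all four.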