Use standard deviation (not mad about MAD)
Nassim Nicholas Taleb recently wrote an article advocating the
abandonment of the use of standard deviation[1] and advocating the
use of mean absolute deviation. Mean absolute deviation is indeed
an interesting and useful measure, but there is a reason that
standard deviation is important even if you do not like it: it prefers
models that get totals and averages correct. Absolute deviation
measures do not prefer such models. So while MAD may be great for
reporting, it can be a problem when used to optimize models.
Let's suppose we have 2 boxes of 10 lottery tickets: all tickets were
purchased for $1 each for the same game in an identical fashion at
the same time. For our highfalutin data science project let's look at
the payoffs on the tickets in the first box and try to build a best
predictive model for the tickets in the second box (without looking
at the payoffs in the second box). We then use our model to predict
the total value of the 10 tickets in the second box.
Now since all tickets are identical, if we are making a mere
point-prediction (a single number value estimate for each ticket instead of
a detailed posterior distribution) then there is an optimal prediction
that is a single number V. Let's explore potential values for V and
how they differ if we use different measures of variation (square
error, mean absolute deviation and median absolute deviation). To
get the ball rolling let's further suppose the payoffs of the tickets in
the first box are nine zeros and one $5 payoff. We are going to use a
general measure of model goodness called a loss function[2] or
"loss" and ignore any issues of parametric modeling, incorporating
prior knowledge or distributional summaries.
Suppose we use mean absolute deviation as our measure of model
quality. Then the loss (or badness) of a value V is loss(V) = 9*|V-0| +
1*|V-5|, which is minimized at V=$0. That is, it says the best model
under mean absolute error is that all the lottery tickets are
worthless. I personally feel that way about lotteries, but the mean
absolute deviation is missing a lot of what is going on. In fact, if we
have nine tickets with zero payoff and a single ticket with a
non-zero payoff, the mean absolute deviation is minimized at V=0 for
any positive payoff on the last ticket. The mean absolute deviation
says the best model for a lottery ticket, given 9 non-payoffs and one
$1,000,000 payoff, is that tickets are worth $0. This means we may
not always want to think in terms of the mean absolute deviation
summary.
Here is some R[3]-code demonstrating what models (values of V)
total absolute deviation prefers (for our original problem):

library(ggplot2)
# candidate point-predictions V for a single ticket's value
d <- data.frame(V=seq(-5,10,by=0.1))
# total absolute deviation loss for nine $0 tickets and one $5 ticket
f <- function(V) { 9*abs(V-0) + 1*abs(V-5) }
d$loss <- f(d$V)
# plot the loss curve: slope change at V=5, minimum at V=0
ggplot(data=d,aes(x=V,y=loss)) + geom_line()

Notice that while there is a slope-change at V=$5, the minimum is at
V=$0.
Suppose instead we use median absolute deviation as our measure
of model quality (the more standard expansion of the MAD
acronym[4]). Things are pretty much just as bad: V=$0 is the optimal
model for 10 tickets, 9 of which pay off zero, no matter what the
payoff of the last ticket is.
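Here is a small R sketch of that claim (the $1,000,000 payoff and the
variable names are illustrative choices, not part of the original
example): the median absolute deviation loss still picks V=$0.

# nine worthless tickets and one hypothetical $1,000,000 winner
payoffs <- c(rep(0, 9), 1e6)
# candidate point-predictions V for a single ticket's value
Vs <- seq(0, 10, by = 0.1)
# median absolute deviation of the residuals, used as a loss
madLoss <- sapply(Vs, function(V) median(abs(payoffs - V)))
Vs[which.min(madLoss)]   # 0: the "tickets are worthless" model wins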
Finally suppose instead of trendy MAD measures we use plain old
square error like poor old Karl Pearson used in the 19th century.
Then for our original example we have: loss(V) = 9*(V-0)^2 +
1*(V-5)^2, which is minimized at V=$0.5 (set the derivative
18*V + 2*(V-5) = 20*V - 10 to zero). This says these lottery tickets
seem to be worth about $0.50 each while they cost $1 each (typical of
lotteries). Also notice that 10*V equals $5, the actual total value
of all of the tickets in the first box of lottery tickets. This is a key
advantage of RMSE: it gets group totals and averages right even
when it doesn't know how to value individual tickets. You want this
property.
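Here is a short numeric check of the same point (an illustrative
sketch; the function and variable names are not from the original
code): minimizing the square-error loss recovers both the $0.50
per-ticket value and the $5 box total.

# nine $0 tickets and one $5 ticket, as in the original example
payoffs <- c(rep(0, 9), 5)
# square-error loss for a single point-prediction V
sqLoss <- function(V) sum((payoffs - V)^2)
fit <- optimize(sqLoss, interval = c(-5, 10))
fit$minimum        # about 0.5, the mean payoff per ticket
10 * fit$minimum   # about 5, the true total value of the box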
How can we design loss functions that get totals correct? What we
want is a loss function such that, when we optimize to minimize the loss, we
end up recovering the totals in our original data. That is, the loss
function, whatever it is, should have a stationary point[5] when we
try to use it to recover a total. So in our original example we should
have d(10*loss(V))/dV = 0 at V=$0.5 (the value that recovers the
total we are after, since 10*V = $5). Any loss function of the form
loss(V) = f(9*(V-0)^2 + (V-5)^2) has a stationary point at V=$0.5
(just an application of the chain rule for derivatives). This is why
square error, root mean square error and the standard deviation all
pick the same optimal V=$0.5. This is the core point of regression
and logistic regression, which both emphasize getting totals
correct[6]. This is the other reason you report RMSE: it is what
regression optimizers are minimizing (so it is a good diagnostic). We
can also say that, in some sense, loss functions that get totals and
averages right have derivatives that look locally a bit like RMSE
(near the average value), which implies the loss function looks a bit
like RMSE (or some transform of it) near the average value. This is
one reason logistic regression can be related to standard regression
by the idea of iterative re-weighting.
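The chain-rule argument is easy to check numerically. The sketch
below (illustrative code; sqLoss and rmseLoss are made-up names)
shows that a monotone transform of the square-error loss, such as an
RMSE-style loss, keeps the same optimal V=$0.5.

payoffs <- c(rep(0, 9), 5)
# plain sum-of-squares loss
sqLoss   <- function(V) sum((payoffs - V)^2)
# a monotone transform of it: root mean square error
rmseLoss <- function(V) sqrt(sqLoss(V) / length(payoffs))
optimize(sqLoss,   interval = c(-5, 10))$minimum   # about 0.5
optimize(rmseLoss, interval = c(-5, 10))$minimum   # also about 0.5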
The overall point is: there are a lot of different useful measures of
error and fit. Taleb is correct: the measure you use should not
depend on mere habit or ritual. But the measure you use should
depend on your intended application[7] (in this case preferring
models that get expected values and totals correct) and not merely
on your taste and sophistication. We also like non-variance based
methods (like quantile regression, see this example[8]) but find for
many problems you really have to pick your measure correctly. RMSE
itself is often mis-used: it is not the right measure for scoring
classification and ranking models (you want to prefer something
like precision/recall or deviance[9]).

Links
1. http://www.edge.org/response-detail/25401
2. http://en.wikipedia.org/wiki/Loss_function
3. http://cran.r-project.org/
4. http://en.wikipedia.org/wiki/Median_absolute_deviation
5. http://en.wikipedia.org/wiki/Stationary_point
6. http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/
7. http://www.win-vector.com/blog/2013/05/bayesian-and-frequentist-approaches-ask-the-right-question/
8. http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/
9. http://en.wikipedia.org/wiki/Deviance_(statistics)
