By David Taylor
Data Scientist
Biotechnologist
Writer
www.prooffreader.com
prooffreader is misspelled: that's the joke!
Tufte
Cairo
Sona
Yau
Chen
Krum
Have you:
Data Visualization
before?
This is a chart
Yep, I'm a rebel.
You don't need to program to create good dataviz, but it can magnificently expand
your capabilities. I'm mostly familiar with Python, but there are many other
environments.
R: ggplot2, rCharts
JavaScript: d3.js, Highcharts.js, many more
Tableau Public
NetworkX: Graphing
Online tools
Plot.ly, Infoactive.co, many many more
Pattern recognition
Spatial awareness
Aesthetics
Anscombe's Quartet:
Four distributions, same typical summary statistics ...
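As a rough illustration of the point above, the sketch below computes the mean of x, the mean of y and the Pearson correlation for the four Anscombe data sets using only the Python standard library. The helper name `pearson` and the dictionary layout are choices made for this example, not part of the original slides.

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One row of summary statistics per data set: (mean x, mean y, correlation).
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

All four rows come out nearly identical, which is why the numbers alone cannot tell the four shapes apart; only plotting them does.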
Both statistics and data visualization have a narrative, determined by what you
choose to visualize. Some are benign ...
Credit: Yau
[Figure: two bar charts of the same data; left y-axis ticks 180, 190, 200; right y-axis ticks 100, 200]
These two charts show the same data, but because the one on the left does not have a zero
origin, the amount of relative ink between the bars is misleading.
I've seen highly educated postdoctoral fellows inadvertently misrepresent data this way at
conferences.
Just because you know science or you know math doesn't mean you automatically know
visualization.
Luckily, it's easy to learn.
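To put a number on the zero-origin problem described above, here is a minimal sketch that compares the ratio of drawn bar heights with and without a truncated axis. The function name `ink_ratio` and the values 185, 195 and 180 are hypothetical, chosen only for illustration.

```python
def ink_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical data values and a hypothetical truncated axis origin.
low, high = 185.0, 195.0
truncated_origin = 180.0

honest = ink_ratio(high, low)                       # axis starts at zero
misleading = ink_ratio(high, low, truncated_origin) # axis starts at 180

print(f"zero origin:      taller bar has {honest:.2f}x the ink")
print(f"truncated origin: taller bar has {misleading:.2f}x the ink")
```

With a zero origin the taller bar is only about 5% taller; with the truncated axis it is drawn three times as tall, even though the data are unchanged.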
Mark Twain: "There are three kinds of lies: lies, damned lies, and statistics."
(and Data Visualizations)
small slope?
big slope?
The golden rule: use the golden ratio for aspect ratios (a 1.618 : 1 rectangle)
Squares sound elegant, but they have experimentally been shown to create a perception of tension.
Rectangles breathe.
1 / 1.618 = 0.618
Painting is 1.618 : 1
Focal point is 61.8% of height x 61.8% of width
Seurat, Bathers at Asnières (1884)
The Parthenon is full of golden ratios
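The numbers 1.618 and 0.618 quoted above follow from the definition of the golden ratio. The sketch below derives them and applies them to a chart size. The helper names `golden_size` and `focal_point` and the 10-unit width are assumptions made for this example.

```python
import math

# The golden ratio: the positive solution of phi**2 = phi + 1.
PHI = (1 + math.sqrt(5)) / 2

def golden_size(width):
    """Return (width, height) for a landscape rectangle with a 1.618 : 1 aspect ratio."""
    return (width, width / PHI)

def focal_point(width, height):
    """Return the point at 61.8% of the width and 61.8% of the height."""
    return (width / PHI, height / PHI)

width, height = golden_size(10.0)  # hypothetical 10-unit-wide chart
fx, fy = focal_point(width, height)

print(f"phi = {PHI:.3f}, 1/phi = {1 / PHI:.3f}")
print(f"chart size: {width:.2f} x {height:.2f}")
print(f"focal point: ({fx:.2f}, {fy:.2f})")
```

The returned pair can be passed as a figure size to most plotting libraries, which generally accept a (width, height) tuple.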
Sneakier: the exact same bar charts, with different aspect ratios for their bars.
The relative amounts of ink and the area ratios are identical, but the charts imply different effects.
Humans are better at discriminating some types of visual input than others
What goes in our eyes is not what goes in our brains
Add an obviously modified face behind it, and it's weird, but still nothing special.
Watch what happens when we remove the shapes.
Different mappings of quantitative data to visual relationships have different accuracies
(The Cleveland-McGill scale)
Source: Cairo
The human brain has difficulty judging trends without a common baseline:
It is much easier to
see this as one
plot with one
baseline.
Speaking of relative vs. absolute values, Fox News and conservatives do not have
a monopoly on data visualization chicanery:
A 10 nm change here is imperceptible
A 10 nm change here is very dramatic
Multi-hue color maps cause perceptual clustering of evenly spaced data points.
The effect is worse the more colors you use, e.g. full-spectrum red to blue.
This is a geographical map with a color map automatically assigned by the imaging program so that wavelength is in linear proportion to the total range of data (in this case, elevation).
Do you recognize it?
It's the southeastern United States, a fact that is made much more obvious by manually inserting a color transition at the natural breakpoint between positive and negative elevation relative to sea level.
Clustering still happens in the color transition around 1000 feet below sea level, however.
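One way to place a color transition at a natural breakpoint is to normalize the data piecewise, so the breakpoint always lands at the middle of the color map. The sketch below compares that with a plain linear mapping. The function names and the elevation range of -5000 to 2000 feet are hypothetical, not taken from the map on the slide.

```python
def linear_norm(value, vmin, vmax):
    """Map value linearly onto [0, 1] across the full data range."""
    return (value - vmin) / (vmax - vmin)

def centered_norm(value, vmin, vcenter, vmax):
    """Map value onto [0, 1] so that vcenter always lands on 0.5.

    Each side of vcenter is scaled linearly on its own, so the midpoint
    of the color map coincides with the chosen breakpoint.
    """
    if value < vcenter:
        return 0.5 * (value - vmin) / (vcenter - vmin)
    return 0.5 + 0.5 * (value - vcenter) / (vmax - vcenter)

# Hypothetical elevation range in feet; sea level (0) is the natural breakpoint.
vmin, vmax = -5000.0, 2000.0

sea_level_linear = linear_norm(0.0, vmin, vmax)
sea_level_centered = centered_norm(0.0, vmin, 0.0, vmax)

print(f"linear mapping puts sea level at {sea_level_linear:.3f} of the color map")
print(f"centered mapping puts sea level at {sea_level_centered:.3f} of the color map")
```

With the linear mapping, sea level falls at an arbitrary color that depends on the data range; with the centered mapping, the coastline always coincides with the color transition.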
Exploratory visualizations
You've got a data set, and you want to know if there is anything noteworthy in it.
Can't you just get a computer to do it?
Computers are great at answering questions.
Computers suck at asking questions.
With data visualization, you can leverage your brain's visual pattern recognition
ability to understand your data.
Here are three useful tools:
Scatterplot matrix (one category, many variables)
Trellis plot (two variables, many categories)
Histogram
To save space, sometimes variable names go here; sometimes other info
Proportional
relationship
X
Inverse relationship
Opposite cells
show flipped plot
No correlation
Scatterplot matrix for the common Iris machine learning dataset of four flower
variables in three species
Three colors distinguish
three species
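The layout of a scatterplot matrix can be sketched by enumerating its cells. For the four Iris variables, the grid has one cell per ordered pair of variables, and each plot above the diagonal is the flipped version of one below it. The variable list below uses the standard Iris measurement names; the names `cells` and `unique_pairs` are chosen for this example.

```python
from itertools import combinations

# The four measurements in the Iris dataset.
variables = ["sepal length", "sepal width", "petal length", "petal width"]

# Every cell of the matrix: (row variable on the y-axis, column variable on the x-axis).
cells = [(row, col) for row in variables for col in variables]

# Distinct relationships: each unordered pair appears twice, once flipped.
unique_pairs = list(combinations(variables, 2))

print(f"{len(cells)} cells, {len(variables)} on the diagonal, "
      f"{len(unique_pairs)} distinct pairwise relationships")
for a, b in unique_pairs:
    print(f"  {a} vs. {b}")
```

The diagonal cells compare a variable with itself, which is why they are free to hold variable names or other information instead of a scatterplot.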
Trellis plots show the same two variables on the x and y axes in every panel, but show how
their relationship changes across pairwise combinations of two other variables. For the Iris
dataset, this might be just petal length and sepal width, comparing four temperatures and four
amounts of light.
10 °C
6 hrs of light per day
12 hrs of light per day
18 hrs of light per day
24 hrs of light per day
20 °C
30 °C
40 °C
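The data preparation behind a trellis plot like this one amounts to grouping the observations by the two conditioning variables, then laying the groups out on a grid. The sketch below does this with the standard library only. The records are invented placeholders, not real measurements, and the assignment of temperature to rows and light to columns is an assumption.

```python
from collections import defaultdict

# Hypothetical records: (temperature in °C, hours of light, petal length, sepal width).
records = [
    (10, 6, 1.4, 3.5),
    (10, 6, 1.5, 3.1),
    (20, 12, 4.5, 2.8),
    (40, 24, 5.9, 3.0),
]

def facet(rows):
    """Group (x, y) points by their (temperature, light) combination."""
    panels = defaultdict(list)
    for temp, light, petal_length, sepal_width in rows:
        panels[(temp, light)].append((petal_length, sepal_width))
    return panels

temperatures = [10, 20, 30, 40]  # one grid row per temperature (assumed layout)
light_hours = [6, 12, 18, 24]    # one grid column per amount of light (assumed layout)

panels = facet(records)
grid = [[panels.get((t, l), []) for l in light_hours] for t in temperatures]

for t, row in zip(temperatures, grid):
    print(f"{t} °C:", [len(points) for points in row])
```

Each cell of `grid` holds the points for one small scatterplot; every panel uses the same x and y variables, so differences between panels reflect the conditions rather than the axes.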
This is a trellis plot comparing price and size of diamonds for different qualities of
cut and clarity
Small sample
size
The expensive
ones are of lesser
clarity!
Persistent gap
in size; bias?
Get a bigger
diamond for
not much more
money
Rugplot
Histogram
- Must choose
bin width
Histograms can be transformed into curves with a Kernel Density Estimate.
Bandwidth is like a histogram's bin size.
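The parallel between bin width and bandwidth can be sketched with a hand-rolled histogram and a Gaussian kernel density estimate. This is a minimal illustration using only the standard library, not a substitute for a plotting library's implementation. The sample values, bin widths and bandwidths are hypothetical.

```python
import math

def histogram(samples, bin_width, start=0.0):
    """Count samples per bin; bin i covers [start + i*w, start + (i+1)*w)."""
    counts = {}
    for s in samples:
        index = math.floor((s - start) / bin_width)
        counts[index] = counts.get(index, 0) + 1
    return counts

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps centered on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

samples = [1.0, 1.2, 1.9, 2.1, 2.2, 3.8, 4.0, 4.1]  # hypothetical data

fine_bins = histogram(samples, bin_width=1.0)
coarse_bins = histogram(samples, bin_width=2.0)

narrow = gaussian_kde(samples, bandwidth=0.2)  # wiggly curve, like small bins
wide = gaussian_kde(samples, bandwidth=1.0)    # smooth curve, like large bins

print("bin width 1.0:", dict(sorted(fine_bins.items())))
print("bin width 2.0:", dict(sorted(coarse_bins.items())))
print(f"density at x = 2.1: narrow = {narrow(2.1):.3f}, wide = {wide(2.1):.3f}")
```

A small bandwidth keeps the two clusters in the data distinct but noisy; a large one smooths them together, just as wide bins do.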
I've used pie charts unapologetically, but only when there's just one slice to show one
relation of a part to a whole. In this case, it's about 25%.
Whoops, make that 20%. It works best when you start at the top.
Example
All you know about this plot is that the y-axis is logarithmic and that the data points are
equally vertically spaced, i.e. their plotted heights satisfy
y3 - y2 = y2 - y1
You know nothing about the scale or intercept.
You know nothing about the x-axis at all, even whether it is quantitative or categorical.
What can you tell me about the mathematical relationship between these points?
[Plot: three points labeled y1, y2, y3 from bottom to top]
Now let:
y1 = 30
y2 = 60
If the y axis were linear
would we know y3?
[Plot: on the logarithmic y-axis, the equally spaced labels read 30, 60 and y3; the answer revealed is y3 = 120]
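The reasoning behind the example can be spelled out: equal spacing on a logarithmic axis means equal differences of logarithms, which is the same as equal ratios, whereas equal spacing on a linear axis means equal differences. The sketch below works this through for y1 = 30 and y2 = 60. The function names are chosen for this illustration.

```python
import math

def next_on_log_axis(y1, y2):
    """Equal spacing on a log axis: log(y3) - log(y2) = log(y2) - log(y1),
    so y3 / y2 = y2 / y1, i.e. the points form a geometric sequence."""
    return y2 * (y2 / y1)

def next_on_linear_axis(y1, y2):
    """Equal spacing on a linear axis: y3 - y2 = y2 - y1,
    i.e. the points form an arithmetic sequence."""
    return y2 + (y2 - y1)

y1, y2 = 30, 60
y3_log = next_on_log_axis(y1, y2)
y3_linear = next_on_linear_axis(y1, y2)

print(f"log axis:    y3 = {y3_log:g}  (each step multiplies by {y2 / y1:g})")
print(f"linear axis: y3 = {y3_linear:g}  (each step adds {y2 - y1:g})")

# The log spacing really is equal once the values are transformed.
step_1 = math.log10(y2) - math.log10(y1)
step_2 = math.log10(y3_log) - math.log10(y2)
```

So equally spaced points on a log axis describe a constant growth factor, not a constant increment, whatever the scale or intercept turns out to be.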
And yet the vast majority of
people who make logarithmic
plots do not know this, let
alone the ones who read it.
Therefore, wrong conclusions
can easily be drawn by those
who think the only function of
a log scale is to make all the
data fit on a graph.
If you then force your audience to change their first impression they
might feel: betrayed, stupid, frustrated, disgusted
A visualization that joins references between scientific papers into clusters of subject
matter
ArXiv is an online archive that stores hundreds of thousands of scientific papers in
physics, mathematics, and other fields. The citations in those papers link to one
another, forming a web, but you're not going to see those connections just by sifting
through the archive.
Paperscape is an interactive infographic that beautifully and intuitively charts the
papers.
Wickham & Stryjewski 2011
Jorge Camoes, www.excelcharts.com
Credit: Yau
or inline in text
Credit: Tufte