
Introduction to Data Visualization

By David Taylor
Data Scientist
Biotechnologist
Writer

www.prooffreader.com
"prooffreader" is misspelled: that's the joke!

What is the purpose of this presentation?

1. To describe the discipline of data visualization
2. To communicate some of the hidden features of data visualization
3. To show you some creative examples of data visualization
4. To share some resources that can help you expand your data visualization possibilities

I will be skipping around these purposes as I follow my own narrative flow,
instead of artificially separating this presentation into sections.
Structure vs. narrative: an important theme in data visualization

Some good sources for further reading

Tufte

Cairo

Sona

Yau

Chen

Krum

Have you:

Ever made a graph?

Heard the term

Data Visualization

before?

Heard the term dataviz before?


(There's no non-awkward term for "dataviz practitioner", unfortunately)

Ever designed a graph without:


- writing/drawing materials
- Microsoft Excel

What you call a graph, data visualization practitioners call a chart.

This is a graph; it shows network geometry

This is a chart

This is another graph.


We call this a hairball
Its usefulness is...
specialized.

It's okay, I call charts graphs all the time.
I also use data as a singular noun.
And I sometimes use pie charts, which are frowned upon by the dataviz cognoscenti.

Yep, I'm a rebel.

You don't need to program to create good dataviz, but it can magnificently expand
your capabilities. I'm mostly familiar with Python, but there are many other
environments.

Useful Python libraries:

Pandas: To create dataframes (built atop NumPy)
Matplotlib: Basic charting (similar to MATLAB)
Seaborn: Intermediate charting
NetworkX: Graphing
Basemap: Geographical mapping
Pillow: Image processing (fork of PIL)
Bokeh: Interactive (JavaScript) charting
Plot.ly (online API)

Other common dataviz environments:

R: ggplot2, rCharts
JavaScript: d3.js, Highcharts.js, many more
Tableau Public
Microsoft Excel (hey, don't knock it)
Online tools: Plot.ly, Infoactive.co, many many more

Pandas with IPython gives you powerful data tables in an interactive programming environment

Matplotlib is Python's default charting module
Based on MATLAB syntax
Both a function-based and object-based API.

Seaborn has R-style plots with more analytical features

Bokeh exports plots as interactive JavaScript that can be hosted on a web page

Plot.ly is an online service with APIs for several programming languages, including Python and R.
With a free account, all visualizations must be public

WHAT IS DATA VISUALIZATION?

The study and practice of the visual presentation of quantitative information by leveraging the human mind's
facility with:

Pattern recognition
Spatial awareness
Aesthetics

... to tell a story


descriptive or persuasive

Like statistics, Data Visualization is not a summary, it is an abstraction.

Anscombe's Quartet:
Four distributions, same typical summary statistics ...

... but very different visualizations


No method other than visualization exists to efficiently describe
the differences between these distributions
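
The claim about Anscombe's Quartet can be checked numerically. The sketch below hard-codes the four published data sets and computes a few typical summary statistics with the standard library only; the choice of statistics and the rounding to two decimals are assumptions made for illustration, not part of the original slides.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's four data sets (x values, y values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def summarize(xs, ys):
    """A few typical summary statistics, rounded to two decimals (illustrative choice)."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_x": round(variance(xs), 2),
        "r": round(pearson_r(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, summary in SUMMARIES.items():
    print(name, summary)
```

At this rounding the four summaries come out the same, which is the point of the slide: only plotting the points reveals how different the four shapes are.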

Those with some mathematical skill often think automated, statistical methods must exist that can impart the same
information as a visualization.

There aren't any. Really, there aren't.

Data visualization isn't something you tack on the end of an analysis to make it take up less space and be a little easier to
grasp than a table. It's an integral part of analysis.

Both statistics and Data Visualization have a narrative, determined by what you
choose to visualize. Some are benign ...

Three choices of level of abstraction can show three kinds of periodicity

Simple plot with time on the x axis: not very illuminating

Credit: Yau

... some are not benign, but are misleading

[Figure: two bar charts of the same data. Left: y-axis ticks at 180, 190, 200. Right: y-axis ticks at 100, 200.]

These two charts show the same data, but because the one on the left does not have a zero
origin, the amount of relative ink between the bars is misleading.
I've seen highly educated postdoctoral fellows inadvertently misrepresent data this way at
conferences.
Just because you know science or you know math, doesn't mean you automatically know
visualization.
Luckily, it's easy to learn.
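
To put a number on the distortion, here is a minimal sketch that compares the ratio of two values with the ratio of the bar lengths actually drawn when the axis starts above zero. The values 190 and 200 and the baseline of 180 are hypothetical, chosen only to resemble the tick labels on the slide; `drawn_ratio` is an illustrative helper, not a library function.

```python
def drawn_ratio(small, large, baseline=0.0):
    """Ratio of the two bar lengths as drawn when the value axis starts at `baseline`."""
    if baseline >= small:
        raise ValueError("baseline must be below the smaller value")
    return (large - baseline) / (small - baseline)


# Hypothetical values; the slide does not state the underlying data.
a, b = 190.0, 200.0

TRUE_RATIO = drawn_ratio(a, b)                        # axis starts at zero
TRUNCATED_RATIO = drawn_ratio(a, b, baseline=180.0)   # axis starts at 180

print(f"zero origin: second bar is {TRUE_RATIO:.2f}x the first")
print(f"origin at 180: second bar is {TRUNCATED_RATIO:.2f}x the first")
```

With these numbers a difference of about 5% is drawn as a bar twice as long, which is the kind of inadvertent misrepresentation described above.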

This happens in the real world.

Mark Twain: "There are three kinds of lies: lies, damned lies, and statistics."
(and Data Visualizations)

More ways to mislead while telling the truth:


Both of these graphs show the same data, but the graphs have different
aspect ratios

small slope?
big slope?

Slopes are relative; our brains are not.

The golden rule: use the golden ratio for aspect ratios (a 1.618 : 1 rectangle)
Squares sound elegant, but they have experimentally been shown to create
a perception of tension

Rectangles breathe

What's so special about the golden ratio?

$\varphi \approx 1.618$

$1/\varphi = \varphi - 1$, i.e. $1/1.618 \approx 0.618$

and it's the limit of the ratio between consecutive terms
of the Fibonacci sequence
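
Both properties are easy to verify. The following sketch computes the golden ratio from its closed form, checks the reciprocal identity, and shows the ratios of consecutive Fibonacci terms converging to it; `golden_figsize` is a hypothetical helper for picking a chart width from a height, not a function from any plotting library.

```python
from math import sqrt

PHI = (1 + sqrt(5)) / 2  # the golden ratio, about 1.618


def fibonacci_ratios(n):
    """Ratios of consecutive Fibonacci terms: 2/1, 3/2, 5/3, 8/5, ..."""
    a, b = 1, 1
    ratios = []
    for _ in range(n):
        a, b = b, a + b
        ratios.append(b / a)
    return ratios


def golden_figsize(height):
    """Hypothetical helper: (width, height) of a 1.618 : 1 rectangle."""
    return (height * PHI, height)


RATIOS = fibonacci_ratios(40)
print(f"phi = {PHI:.6f}, 1/phi = {1 / PHI:.6f}, phi - 1 = {PHI - 1:.6f}")
print("first ratios:", [round(r, 4) for r in RATIOS[:6]])
print(f"40th ratio:  {RATIOS[-1]:.6f}")
print("figure size for height 4:", golden_figsize(4))
```

The resulting (width, height) pair could be passed as a figure size to whichever charting tool is in use.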

Painting is 1.618 : 1

Focal point is 61.8% of height x 61.8% of width

Seurat, Bathers at Asnières (1884)

The Parthenon is full of golden ratios

Sneakier: exact same bar charts with different aspect ratios of their bars.
The relative amounts of ink (the area ratios) are identical, but the charts imply different effects.

It took a Herculean effort to rise from 90 to 100

There's so much whitespace, the difference is no big whoop.

Humans are better at discriminating some types of visual input than others
What goes in our eyes is not what goes in our brains

The magenta circles are the same size. And yet bubble charts are relatively common.
Credit: Kaiser Fung, Junk Charts

Seeing is not perception

Due to neighboring cell inhibition, our eyes perceive black dots just outside of our field of vision

Some of the ways our brains can respond to visual stimulus are quite remarkable.
These shapes on their own are not remarkable.

Add an obviously modified face behind it, and it's weird, but still nothing special.
Watch what happens when we remove the shapes.

Ah, that's better. The more stylized it is, the less it affects the brain.
The incredible face-processing power of the brain leads to a phenomenon called pareidolia, where people see faces where they aren't, such as on the moon or in the burn marks on a tortilla.
These are more extreme examples than you usually get in Data Visualization, but they serve to underline that sight involves more than receiving photons.

Different mappings of quantitative data to visual relationships have different accuracies
(The Cleveland-McGill scale)

Which of these better represents the numerical relationship between 10 and 7?

Source: Cairo

The human brain has difficulty judging trends without a common baseline:

<- Absolute values with no common baseline


When was the trade balance the greatest? (the
difference in height between the blue and red
lines?)
The brain is not good at parsing a changing
baseline.

It is much easier to
see this as one
plot with one
baseline.

It's much easier to see the trend in the red series, so it is emphasized in the brain
more than the others

Speaking of relative vs. absolute values, Fox News and conservatives do not have
a monopoly on data visualization chicanery:

Relative values, from U.S.


Department of Labor Statistics
Absolute values

Heatmaps are becoming more and more popular as scientific journals allow color graphics
Human color perception requires special care, however.
Red-black-green diverging is the most common color scheme used, but there's a big
problem with it ...

Colorblindness: 8.5% of men, 0.5% of women

There are many kinds and degrees of colorblindness; red-black-green is problematic in ALL
of them.
The solution: use orange-white-blue

This took 30 seconds in Photoshop

Spectral color perception is subject to perceptual distortion

A 10 nm change here
Is imperceptible

A 10 nm change here
Is very dramatic

Multi-hue color maps cause perceptual clustering of evenly spaced data points.
The effect is worse the more colors you use, e.g. full spectrum red to blue.
This is a geographical map with a color map automatically assigned by the imaging program so that wavelength is in linear proportion to the total range of data (in this case, elevation).
Do you recognize it?

It's the southeastern United States, a fact that is made much more obvious by manually inserting a color transition at the natural breakpoint between positive and negative elevation relative to sea level.
Clustering still happens in the color transition around 1000 feet below sea level, however.

All color maps except monochromatic will cause clustering

The other golden rule:


Know your audience.

Exploratory data visualization: the audience is you


Informal data visualization: the audience is your peers
Formal data visualization: the audience is your betters
Public data visualization: the audience is everybody

Exploratory visualizations
You've got a data set, and you want to know if there is anything noteworthy in it.
Can't you just get a computer to do it?
Computers are great at answering questions.
Computers suck at asking questions.

With data visualization, you can leverage your brain's visual pattern recognition
ability to understand your data.
Here are three useful tools:
Scatterplot matrix (one category, many variables)
Trellis plot (two variables, many categories)
Histogram

A Scatterplot matrix presents every pairwise relationship between a set of variables.


If you have 4 variables, w, x, y, and z, it shows the relationships between:
w+x, w+y, w+z, x+y, x+z and y+z
[Figure: schematic scatterplot matrix of w, x, y and z, with these annotations:]
- To save space, sometimes variable names go here; sometimes other info
- Proportional relationship
- Inverse relationship
- Opposite cells show flipped plot
- No correlation
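
The list of pairs in a scatterplot matrix is simply every 2-combination of the variables. A minimal sketch, using the placeholder variable names w, x, y and z from the slide (`n_panels` is an illustrative helper):

```python
from itertools import combinations

variables = ["w", "x", "y", "z"]

# Every unordered pair of variables gets one panel (plus its flipped mirror image).
PAIRS = list(combinations(variables, 2))


def n_panels(k):
    """Number of distinct pairwise relationships among k variables: k(k-1)/2."""
    return k * (k - 1) // 2


print(PAIRS)
print(n_panels(len(variables)), "distinct panels for", len(variables), "variables")
```

The count grows quadratically, so a scatterplot matrix stays readable only for a modest number of variables.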

Scatterplot matrix for the common Iris machine learning dataset of four flower
variables in three species
Three colors distinguish
three species

You can see both clustering and linear relationships

Trellis plots show the same two variables on the x and y axes in every panel, but show how their
relationship changes across pairwise combinations of two other variables. For the Iris dataset, this might
be just petal length and sepal width, comparing four temperatures and four
amounts of light.

[Figure: 4 x 4 grid of panels labeled by temperature (10 °C, 20 °C, 30 °C, 40 °C) and by light (6, 12, 18, 24 hrs of light per day)]

This is a trellis plot comparing price and size of diamonds for different qualities of
cut and clarity

Small sample
size

The expensive
ones are of lesser
clarity!

Persistent gap
in size; bias?

Get a bigger
diamond for
not much more
money

What do you do if you just have a 1D series of data?

Rugplot

Histogram
- Must choose bin width

Histograms are dependent on bin widths to show coarse or fine features
Rules of thumb exist (e.g. Sturges, Freedman-Diaconis), but there is no generic formula that
determines the optimal bin width for every data set
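
A small sketch of the bin-width effect, using only the standard library. The sample values are made up for illustration, and `histogram` is a hypothetical helper rather than any charting library's API.

```python
from collections import Counter
from math import floor


def histogram(data, bin_width, origin=0.0):
    """Count values per bin; returns {left edge of bin: count}. Empty bins are omitted."""
    counts = Counter(floor((v - origin) / bin_width) for v in data)
    return {origin + i * bin_width: counts[i] for i in sorted(counts)}


# Made-up 1D sample, for illustration only.
SAMPLE = [1.1, 1.3, 1.4, 2.2, 2.3, 2.9, 3.1, 3.2, 3.3, 3.8, 5.6, 5.7]

COARSE = histogram(SAMPLE, bin_width=2.0)
FINE = histogram(SAMPLE, bin_width=0.5)

for label, bins in (("width 2.0", COARSE), ("width 0.5", FINE)):
    print(label)
    for left, count in bins.items():
        print(f"  {left:4.1f} {'#' * count}")
```

The coarse version shows a single central hump; the fine version exposes the gaps and small clusters inside it. Neither is wrong, they answer different questions.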

Histograms can be transformed into curves with a Kernel Density Estimate. Bandwidth is
like a histogram's bin size
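
As a rough illustration of the bandwidth analogy, here is a hand-written Gaussian kernel density estimate. The Gaussian kernel, the sample values and the two bandwidths are assumptions chosen for the sketch; plotting libraries ship their own, more careful implementations.

```python
from math import exp, pi, sqrt


def gaussian_kde(data, bandwidth):
    """Return a density function: the average of Gaussian bumps centred on each data point."""
    norm = 1.0 / (len(data) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)

    return density


# Made-up 1D sample, for illustration only.
SAMPLE = [1.1, 1.3, 1.4, 2.2, 2.3, 2.9, 3.1, 3.2, 3.3, 3.8, 5.6, 5.7]

NARROW = gaussian_kde(SAMPLE, bandwidth=0.2)  # like small bins: bumpy, shows fine detail
WIDE = gaussian_kde(SAMPLE, bandwidth=1.0)    # like large bins: smooth, hides the gap

for x in (1.0, 2.0, 3.0, 4.5, 5.5):
    print(f"x={x:3.1f}  narrow={NARROW(x):.3f}  wide={WIDE(x):.3f}")
```

In the empty stretch around x = 4.5 the narrow estimate drops close to zero while the wide one smooths straight over it, just as a wide histogram bin would.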

Scatter plot with a histogram and KDE for each axis

Key concepts from dataviz pioneer Edward Tufte:


Maximize data to ink ratio
Minimize chart junk

Beware pie charts!


Most of them aren't
this bad, but they
rely on perception of
area and angle,
which are
particularly weak in
the Cleveland-McGill
hierarchy.
Plus, dataviz
professionals
instinctively sneer at
them, so you risk
making yourself look
amateurish.
If you do use them,
make absolutely sure
they represent parts
of the whole, and
that all of those parts
add up to 100%.

Here's why dataviz professionals can't stand them.

Angle, area, no common baseline.

I've used pie charts unapologetically, but only when there's just one slice to show one
relation of a part to a whole. In this case, it's about 25%

Whoops, make that 20%. It works best when you start at the top.

The jobs of a data


visualization
(from Alberto Cairo)

PRESENT the variables


Allow COMPARISON of variables
ORGANIZE categories
Show CORRELATIONS or
RELATIONSHIPS

This is a chord diagram.

Logarithmic scales: a cautionary tale


Or: Know your audience
These are not trick questions.
Well, they're sort of trick questions, but the good kind.
To demonstrate a useful point, not to make you feel bad.
How many of you:
- Have ever heard of a logarithm
- Know what a logarithm is
- Consider themselves to understand what a logarithm is
- Know when to use a logarithmic scale on a graph?
- What does a logarithmic plot show?
- What do you do with the intercept? (log 0 = undefined)

Example
All you know about this plot is that the y-axis is logarithmic and that the data points are equally
vertically spaced, i.e.

$y_3 - y_2 = y_2 - y_1$ (measured as distances along the axis)

You know nothing about the scale or intercept.
You know nothing about the x-axis at all, even whether it is quantitative or categorical
What can you tell me about the mathematical relationship between these points?

[Plot: three points labeled y1, y2 and y3, equally spaced vertically]

Now let:
y1 = 30
y2 = 60
If the y axis were linear
would we know y3?

Given that the y axis is logarithmic, do we know y3? Is there any additional information
we need?

[Plot: y-axis labels 30, 60 and y3]

Answer: y3 = 120. In a logarithmic plot, equal linear distance implies equal
proportions (e.g., every time you go up an inch, you double the value)
It does not matter whether you use log 10, log 2 or log e. All that changes is the position of
the tick marks (if you choose to include them)

This is the basic, fundamental nature of logarithms

And yet the vast majority of people who make logarithmic plots do not know this, let
alone the ones who read them.
Therefore, wrong conclusions can easily be drawn by those who think the only function of
a log scale is to make all the data fit on a graph.

[Plot: y-axis labels 30, 60 and 120]
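
The answer can be checked in a few lines. The sketch below takes a point's position on a log axis to be the logarithm of its value (ignoring any scale factor or offset, which is an assumption of the sketch) and confirms that 30, 60 and 120 are equally spaced whatever the base. The function names are illustrative.

```python
from math import e, isclose, log


def next_equally_spaced(y1, y2):
    """On a log axis, equal steps mean equal ratios: y3 / y2 == y2 / y1."""
    return y2 * (y2 / y1)


def axis_position(value, base):
    """Position along a log axis, up to an arbitrary scale and offset."""
    return log(value, base)


Y3 = next_equally_spaced(30, 60)
print("y3 =", Y3)

for base in (2, 10, e):
    lower_gap = axis_position(60, base) - axis_position(30, base)
    upper_gap = axis_position(Y3, base) - axis_position(60, base)
    print(f"base {base:.3g}: gaps {lower_gap:.4f} and {upper_gap:.4f}")
```

The size of the gap changes with the base, but the two gaps always match each other, which is why the choice of base only moves the tick marks. On a linear axis the same equal spacing would instead give y3 = 90.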

The other other golden rule:

Don't betray your audience's first impression.

A recent study showed the first impression of a data visualization is
formed in a period of time briefer than a visual saccade (a single rapid eye movement).
And first impressions are difficult to change.

If you then force your audience to change their first impression they
might feel: betrayed, stupid, frustrated, disgusted

They will no longer be focusing on your message.

Enough of the heavy stuff, the rest of the presentation is made
up of stuff I find cool.
BTW, don't use Comic Sans. It has become associated
with people who know nothing about aesthetics.
That's probably not the message you want to send.

A visualization that joins references between scientific papers into clusters of subject
matter

ArXiv is an online archive that stores hundreds of thousands of scientific papers in physics, mathematics, and other fields. The citations in those papers link to one another, forming a web, but you're not going to see those connections just by sifting through the archive.
Paperscape is an interactive infographic that beautifully and intuitively charts the papers.

Newspapers are full of examples of how not to make a visualization.
Why is the time axis curved, exactly?

Boxplots are an extremely effective tool for abstracting 1D distributions for
comparison. Unfortunately, their communicative value is limited because they are
poorly understood; violin plots are an alternative in this case.

Wickham & Stryjewski 2011
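
For readers unsure what a boxplot abstracts, it is essentially a drawing of the five-number summary. A minimal sketch with the standard library follows; the "inclusive" quartile method is one convention among several, and whisker rules (such as 1.5 times the interquartile range) vary between tools, so treat both as assumptions here.

```python
from statistics import quantiles


def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))


# Made-up values, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
SUMMARY = five_number_summary(values)
print(dict(zip(("min", "q1", "median", "q3", "max"), SUMMARY)))
```

The box spans q1 to q3 with a line at the median. Everything else about the distribution's shape is discarded, which is the trade-off a violin plot is meant to soften.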

Small multiples allow


coarse comparisons
between many
categories

Jorge Camoes, www.excelcharts.com

Small multiples from the year 1626

Radar charts can be used to show categorical data

Credit: Yau

Sparklines can be used to show small multiples in tables

or inline in text

Credit: Tufte
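
A text-only sparkline is easy to sketch with Unicode block characters. This is an illustrative toy, not Tufte's specification: the eight-level quantization and the made-up series are assumptions.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values):
    """Map each value to one of eight block characters, scaled to the series' own range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return BARS[0] * len(values)  # flat series: draw a flat line
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * top)] for v in values)


# Made-up monthly figures, for illustration only.
series = [3, 5, 4, 8, 12, 9, 7, 10]
print("Sales", sparkline(series), "latest:", series[-1])
```

Because each sparkline is scaled to its own range, a column of them in a table compares shapes rather than magnitudes; sharing one `lo` and `hi` across rows would be the alternative design choice.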

A line graph from


the New York
Times where the
x-axis does not
represent time,
but imputes it.
Time is
represented by
the lines
between data
points, not by the
x axis

Dashboards can show


an overview of data
Widely used in
business to get an
abstraction of a
complex situation at a
glance

Heatmaps can be used as a


calendar

Finally, my absolute favourite data visualization:
