Professional Documents
Culture Documents
Data Science?
The future belongs to the companies
and people that turn data into products
Data scientists............................................................. 8
2010 OReilly Media, Inc. OReilly logo is a registered trademark of OReilly Media, Inc.
ii : An OReilly Radar Report: What is Data Science? All other trademarks are the property of their respective owners. 10478.1
What is
Data Science?
T
he Web is full of data-driven apps. Almost any Google is a master at creating data products. Here are a
e-commerce application is a data-driven applica- few examples:
tion. Theres a database behind a web front end,
and middleware that talks to a number of other n Googles breakthrough was realizing that a search
databases and data services (credit card processing engine could use input other than the text on the page.
companies, banks, and so on). But merely using data isnt Googles PageRank algorithm was among the first to
really what we mean by data science. A data application use data outside of the page itself, in particular, the
acquires its value from the data itself, and creates more number of links pointing to a page. Tracking links made
data as a result. Its not just an application with data; Google searches much more useful, and PageRank has
its a data product. Data science enables the creation of been a key ingredient to the companys success.
data products.
One of the earlier data products on the Web was the n Spell checking isnt a terribly difficult problem, but by
CDDB database. The developers of CDDB realized that any suggesting corrections to misspelled searches, and
CD had a unique signature, based on the exact length (in observing what the user clicks in response, Google
samples) of each track on the CD. Gracenote built a data- made it much more accurate. Theyve built a dictionary
base of track lengths, and coupled it to a database of of common misspellings, their corrections, and the
album metadata (track titles, artists, album titles). If youve contexts in which they occur.
ever used iTunes to rip a CD, youve taken advantage of
this database. Before it does anything else, iTunes reads n Speech recognition has always been a hard problem,
the length of every track, sends it to CDDB, and gets back and it remains difficult. But Google has made huge
the track titles. If you have a CD thats not in the database strides by using the voice data theyve collected, and
(including a CD youve made yourself), you can create an has been able to integrate voice search into their core
entry for an unknown album. While this sounds simple search engine.
enough, its revolutionary: CDDB views music as data, not
as audio, and creates new value in doing so. Their business n During the Swine Flu epidemic of 2009, Google was
is fundamentally different from selling music, sharing able to track the progress of the epidemic by follow-
music, or analyzing musical tastes (though these can also ing searches for flu-related topics.
be data products). CDDB arises entirely from viewing a
musical problem as a data problem.
Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Center for Disease Control by analyzing
searches that people were making in different regions of the country.
Google isnt the only company that knows how to use In the last few years, there has been an explosion in
data. Facebook and LinkedIn use patterns of friendship the amount of data thats available. Whether were talking
relationships to suggest other people you may know, or about web server logs, tweet streams, online transaction
should know, with sometimes frightening accuracy. records, citizen science, data from sensors, government
Amazon saves your searches, correlates what you search data, or some other source, the problem isnt finding data,
for with what other users search for, and uses it to create its figuring out what to do with it. And its not just com-
surprisingly appropriate recommendations. These recom- panies using their own data, or the data contributed by
mendations are data products that help to drive Amazons their users. Its increasingly common to mashup data
more traditional retail business. They come about because from a number of sources. Data Mashups in R analyzes
Amazon understands that a book isnt just a book, a camera mortgage foreclosures in Philadelphia County by taking
isnt just a camera, and a customer isnt just a customer; a public report from the county sheriffs office, extracting
customers generate a trail of data exhaust that can be addresses and using Yahoo! to convert the addresses to
mined and put to use, and a camera is a cloud of data that latitude and longitude, then using the geographical data
can be correlated with the customers behavior, the data to place the foreclosures on a map (another data source),
they leave every time they visit the site. and group them by neighborhood, valuation, neighbor-
The thread that ties most of these applications together hood per-capita income, and other socio-economic factors.
is that data collected from users provides added value. The question facing every company today, every
Whether that data is search terms, voice samples, or startup, every non-profit, every project site that wants to
product reviews, the users are in a feedback loop in attract a community, is how to use data effectivelynot
which they contribute to the products they use. Thats just their own data, but all the data thats available and
the beginning of data science. relevant. Using data effectively requires something differ-
ent from traditional statistics, where actuaries in business
Its not easy to get a handle on jobs in data science. However, data from OReilly Research shows a steady year-over-year increase in
Hadoop and Cassandra job listings, which are good proxies for the data science market as a whole. This graph shows the increase in
Cassandra jobs, and the companies listing Cassandra positions, over time.
think outside the box to come up with new ways to view ence of millions of travellers, or studying the URLs that
the problem, or to work with very broadly defined prob- people pass to others, the next generation of successful
lems: heres a lot of data, what can you make from it? businesses will be built around data. The part of Hal
The future belongs to the companies who figure out Varians quote that nobody remembers says it all:
how to collect and use data successfully. Google,
The ability to take datato be able to understand it,
Amazon, Facebook, and LinkedIn have all tapped into
to process it, to extract value from it, to visualize it,
their datastreams and made that the core of their suc-
to communicate itthats going to be a hugely
cess. They were the vanguard, but newer companies like
important skill in the next decades.
bit.ly are following their path. Whether its mining your
personal biology, building maps from the shared experi- Data is indeed the new Intel Inside.
1
The NASA article denies this, but also says that in 1984, they decided that the low values (which went back to the 70s)
were real. Whether humans or software decided to ignore anomalous data, it appears that data was ignored.
2
Information Platforms as Dataspaces, by Jeff Hammerbacher (in Beautiful Data)
3
Information Platforms as Dataspaces, by Jeff Hammerbacher (in Beautiful Data)
Smart companies are betting on data-driven insight to understand customer behavior, create better products, and gain
true competitive advantage in the marketplace. Strata is a new conference focusing on the people, tools, and technolo-
gies putting data to work. Happening February 1-3, 2011 in Santa Clara, CA, Strata will bring together decision makers,
managers, and data practitioners for three days of training, sessions, discussions, events, and exhibits showcasing the
new data ecosystem.
For details, visit strataconf.com. Use discount code str11clpdf when you register and save 20%.
radar.oreilly.com/data