
What is Data Science?
The future belongs to the companies and people that turn data into products

An O'Reilly Radar Report
By Mike Loukides
Contents

Where data comes from
Working with data at scale
Making data tell its story
Data scientists
The Web is full of data-driven apps. Almost any e-commerce application is a data-driven application. There's a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn't really what we mean by "data science." A data application acquires its value from the data itself, and creates more data as a result. It's not just an application with data; it's a data product. Data science enables the creation of data products.

One of the earlier data products on the Web was the CDDB database. The developers of CDDB realized that any CD had a unique signature, based on the exact length (in samples) of each track on the CD. Gracenote built a database of track lengths, and coupled it to a database of album metadata (track titles, artists, album titles). If you've ever used iTunes to rip a CD, you've taken advantage of this database. Before it does anything else, iTunes reads the length of every track, sends it to CDDB, and gets back the track titles. If you have a CD that's not in the database (including a CD you've made yourself), you can create an entry for an unknown album. While this sounds simple enough, it's revolutionary: CDDB views music as data, not as audio, and creates new value in doing so. Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be data products). CDDB arises entirely from viewing a musical problem as a data problem.
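The mechanics are easy to sketch. Here is a minimal illustration of the idea in Python; it is not Gracenote's actual algorithm (real CDDB disc IDs are computed differently), just the "music as data" insight reduced to code, with invented track lengths and metadata:

```python
import hashlib

def disc_signature(track_lengths):
    """Reduce a list of track lengths (in samples) to a stable
    lookup key. Real CDDB disc IDs are computed differently;
    this only captures the idea of a CD as data."""
    canonical = ",".join(str(n) for n in track_lengths)
    return hashlib.sha1(canonical.encode("ascii")).hexdigest()

# A toy metadata database keyed by signature (entries invented).
albums = {
    disc_signature([1042344, 876512, 1534200]): {
        "artist": "Example Artist",
        "album": "Example Album",
        "tracks": ["Track One", "Track Two", "Track Three"],
    },
}

def identify(track_lengths):
    """What iTunes does conceptually: send lengths, get titles back."""
    return albums.get(disc_signature(track_lengths), "unknown album")

print(identify([1042344, 876512, 1534200]))
```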
Google is a master at creating data products. Here are a few examples:

- Google's breakthrough was realizing that a search engine could use input other than the text on the page. Google's PageRank algorithm was among the first to use data outside of the page itself: in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient of the company's success. (A toy version of the computation appears after this list.)
- Spell checking isn't a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They've built a dictionary of common misspellings, their corrections, and the contexts in which they occur.
- Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they've collected, and has been able to integrate voice search into their core search engine.
- During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
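As promised, here is a toy power-iteration version of the link-counting idea behind PageRank. It is a sketch of the published algorithm's core on an invented four-page web, not anything resembling Google's production system:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration. `links` maps each page
    to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))  # "c" ranks highest: the most links point to it
```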

[Figure: Flu Trends. Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Centers for Disease Control, by analyzing searches that people were making in different regions of the country.]

Google isn't the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy. Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations. These recommendations are "data products" that help to drive Amazon's more traditional retail business. They come about because Amazon understands that a book isn't just a book, a camera isn't just a camera, and a customer isn't just a customer; customers generate a trail of "data exhaust" that can be mined and put to use, and a camera is a cloud of data that can be correlated with the customers' behavior, the data they leave every time they visit the site.

The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.

In the last few years, there has been an explosion in the amount of data that's available. Whether we're talking about web server logs, tweet streams, online transaction records, "citizen science," data from sensors, government data, or some other source, the problem isn't finding data, it's figuring out what to do with it. And it's not just companies using their own data, or the data contributed by their users. It's increasingly common to mash up data from a number of sources. Data Mashups in R analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff's office, extracting addresses and using Yahoo! to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and grouping them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors.
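The shape of that pipeline is worth seeing in miniature. The sketch below is in Python rather than the book's R, and its `geocode()` is a stand-in for a real service (the mashup itself called Yahoo!'s geocoder); the addresses and neighborhoods are invented:

```python
from collections import Counter

def geocode(address):
    """Stand-in for a real geocoding service. A real version returns
    latitude and longitude; here we pretend every address resolves
    directly to a neighborhood name."""
    fake_neighborhoods = {"12 Main St": "Fishtown", "9 Oak Ave": "Point Breeze"}
    return fake_neighborhoods.get(address, "unknown")

def foreclosures_by_neighborhood(addresses):
    """Group scraped foreclosure addresses by neighborhood."""
    return Counter(geocode(a) for a in addresses)

print(foreclosures_by_neighborhood(["12 Main St", "9 Oak Ave", "12 Main St"]))
```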
The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively: not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

To get a sense for what skills are required, let's look at the data life cycle: where it comes from, how you use it, and where it goes.

Where data comes from

Data is everywhere: your government, your web server, your business partners, even your body. While we aren't drowning in a sea of data, we're finding that almost everything can be (or has been) instrumented. At O'Reilly, we frequently combine publishing industry data from Nielsen BookScan with our own sales data, publicly available Amazon data, and even job data to see what's happening in the publishing industry. Sites like Infochimps and Factual provide access to many large datasets, including climate data, MySpace activity streams, and game logs from sporting events. Factual enlists users to update and improve its datasets, which cover topics ranging from endocrinologists to hiking trails.

Much of the data we currently work with is the direct consequence of Web 2.0, and of Moore's Law applied to data. The Web has people spending more time online, and leaving a trail of data wherever they go. Mobile applications leave an even richer data trail, since many of them are annotated with geolocation, or involve video or audio, all of which can be mined. Point-of-sale devices and frequent-shopper's cards make it possible to capture all of your retail transactions, not just the ones you make online. All of this data would be useless if we couldn't store it, and that's where Moore's Law comes in. Since the early '80s, processor speed has increased from 10 MHz to 3.6 GHz, a 360-fold increase (not counting increases in word length and number of cores). But we've seen much bigger increases in storage capacity, on every level. RAM has moved from $1,000/MB to roughly $25/GB: a price reduction of about 40,000 times, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed.

[Figure: One of the first commercial disk drives from IBM. It has a 5 MB capacity, and it's stored in a cabinet roughly the size of a luxury refrigerator. In contrast, a 32 GB microSD card measures around 5/8 x 3/8 inch and weighs about 0.5 gram. (Photo: Mike Loukides. Disk drive on display at IBM Almaden Research.)]

The importance of Moore's Law as applied to data isn't just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. The data exhaust you leave behind whenever you surf the Web, friend someone on Facebook, or make a purchase in your local supermarket is all carefully collected and analyzed. Increased storage capacity demands increased sophistication in the analysis and use of that data. That's the foundation of data science.

So, how do we make that data useful? The first step of any data analysis project is "data conditioning," or getting data into a state where it's usable. We are seeing more data in formats that are easier to consume: Atom data feeds, web services, microformats, and other newer technologies provide data in formats that are directly machine-consumable. But old-style screen scraping hasn't died, and isn't going to die. Many sources of "wild data" are extremely messy. They aren't well-behaved XML files with all the metadata nicely in place. The foreclosure data used in Data Mashups in R was posted on a public website by the Philadelphia county sheriff's office. This data was presented as an HTML file that was probably generated automatically from a spreadsheet. If you've ever seen the HTML that's generated by Excel, you know that's going to be fun to process.

Data conditioning can involve cleaning up messy HTML with tools like Beautiful Soup, natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work. You're likely to be dealing with an array of data sources, all in different forms. It would be nice if there were a standard set of tools to do the job, but there isn't. To do data conditioning, you have to be ready for whatever comes, and be willing to use anything from ancient Unix utilities such as awk to XML parsers and machine learning libraries. Scripting languages, such as Perl and Python, are essential.
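As a taste of what that cleanup looks like, here is a minimal Beautiful Soup sketch; the HTML fragment is invented, but it is typical of spreadsheet-generated markup:

```python
from bs4 import BeautifulSoup

# A fragment of the kind of machine-generated HTML a county
# website might publish (layout invented for illustration).
html = """
<table><tr><td><b>Address </b></td><td>Sale&nbsp;Price</td></tr>
<tr><td>12 Main St</td><td>$40,000</td></tr>
<tr><td>9 Oak Ave</td><td>$52,500 </td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.find_all("tr")[1:]:  # skip the header row
    address, price = (td.get_text(strip=True) for td in row.find_all("td"))
    records.append((address, int(price.strip("$").replace(",", ""))))

print(records)  # [('12 Main St', 40000), ('9 Oak Ave', 52500)]
```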
Once you've parsed the data, you can start thinking about its quality. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn't always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It's reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low.[1] In data science, what you have is frequently all you're going to get. It's usually impossible to get "better" data, and you have no alternative but to work with the data at hand.
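A small sketch of that judgment call, using pandas (my choice of tool here, not one the report prescribes). The point is to flag suspicious readings for a human rather than silently discarding them, which is exactly the failure in the ozone story; the sensor values are invented:

```python
import pandas as pd

readings = pd.Series([290, 285, 281, 120, 288, 279])  # invented sensor data

# Option 1 (dangerous): silently drop "impossible" values.
cleaned = readings[readings > 200]

# Option 2 (better): keep everything, but flag outliers for review.
median = readings.median()
flagged = readings[(readings - median).abs() > 2 * readings.std()]
print(flagged)  # the 120 reading: equipment failure, or the real story?
```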
If the problem involves human language, understanding the data adds another dimension to the problem. Roger Magoulas, who runs the data analysis group at O'Reilly, was recently searching a database for Apple job listings requiring geolocation skills. While that sounds like a simple task, the trick was disambiguating "Apple" from the many job postings in the growing Apple industry. To do it well, you need to understand the grammatical structure of a job posting; you need to be able to parse the English. And that problem is showing up more and more frequently. Try using Google Trends to figure out what's happening with the Cassandra database or the Python language, and you'll get a sense of the problem: Google has indexed many, many websites about large snakes. Disambiguation is never an easy task, but tools like the Natural Language Toolkit library can make it simpler.
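For a flavor of what the Natural Language Toolkit gives you, here is a toy heuristic along the lines of the Apple problem. It is nothing like a production disambiguator; it just shows that grammatical structure is available to reason with:

```python
import nltk
# One-time setup (model names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

posting = "Apple is hiring an engineer with geolocation experience."

tokens = nltk.word_tokenize(posting)
tagged = nltk.pos_tag(tokens)  # [('Apple', 'NNP'), ('is', 'VBZ'), ...]

# A crude heuristic, not real disambiguation: "Apple" as a proper noun
# in subject position is more likely the employer than a keyword match.
is_employer = bool(tagged) and tagged[0] == ("Apple", "NNP")
print(is_employer)
```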
When natural language processing fails, you can replace artificial intelligence with human intelligence. That's where services like Amazon's Mechanical Turk come in. If you can split your task up into a large number of subtasks that are easily described, you can use Mechanical Turk's marketplace for cheap labor. For example, if you're looking at job listings and want to know which originated with Apple, you can have real people do the classification for roughly $0.01 each. If you have already reduced the set to 10,000 postings with the word "Apple," paying humans $0.01 apiece to classify them only costs $100.

Working with data at scale

We've all heard a lot about "big data," but "big" is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today's "big" is certainly tomorrow's "medium" and next week's "small." The most meaningful definition I've heard: "big data" is when the size of the data itself becomes part of the problem. We're discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

What are we trying to do with data that's different? According to Jeff Hammerbacher[2] (@hackingdata), we're trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.

Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to define a schema in advance conflicts with the reality of multiple, unstructured data sources, in which you may not know what's important until after you've analyzed the data. Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it's not really necessary for the kind of analysis we're discussing here. Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you're asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren't concerned about the difference between 5.92 percent annual growth and 5.93 percent.

To store huge datasets effectively, we've seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful. They group together fundamentally dissimilar products by telling you what they aren't. Many of these databases are the logical descendants of Google's BigTable and Amazon's Dynamo, and are designed to be distributed across many nodes, to provide "eventual consistency" but not absolute consistency, and to have very flexible schemas. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:

- Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.
- HBase: Part of the Apache Hadoop project, and modelled on Google's BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the "map" stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single "reduce" task. In hindsight, MapReduce seems like an obvious solution to Google's biggest problem, creating large searches. It's easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What's less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.
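The canonical illustration is counting words. Here is a single-process sketch of the two stages; a real framework such as Hadoop runs the map tasks on many machines and shuffles the intermediate pairs to many reducers:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: combine intermediate counts into totals."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["data science enables data products",
             "data is the new Intel Inside"]
# On a cluster, the map tasks would run in parallel; here we
# simply chain their outputs together before reducing.
print(reduce_phase(chain.from_iterable(map_phase(d) for d in documents)))
```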

The most popular open source implementation of MapReduce is the Hadoop project. Yahoo!'s claim that they had built the world's largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon's Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it's the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

Hadoop has been instrumental in enabling "agile" data analysis. In software development, "agile practices" are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times: if you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) makes it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It's easier to consult with clients to figure out whether you're asking the right questions, and it's possible to pursue intriguing possibilities that you'd otherwise have to drop for lack of time.

Hadoop is essentially a batch system, but the Hadoop Online Prototype (HOP) is an experimental project that enables stream processing: it processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require "soft" real-time; reports on trending topics don't require millisecond accuracy. As with the number of followers on Twitter, a "trending topics" report only needs to be current to within five minutes, or even an hour. According to Hilary Mason (@hmason), data scientist at bit.ly, it's possible to precompute much of the calculation, then use one of the experiments in real-time MapReduce to get presentable results.
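The precomputation idea is easy to sketch: count into small time buckets as data arrives, so that a "trending topics" report only has to merge the last few buckets. This is a generic illustration, not a description of bit.ly's or Twitter's systems:

```python
from collections import Counter, deque

class TrendingTopics:
    """Soft real-time trend counter: maintain per-minute tag counts
    so a report only merges a handful of recent buckets."""
    def __init__(self, window_minutes=5):
        self.window = deque([Counter() for _ in range(window_minutes)],
                            maxlen=window_minutes)

    def observe(self, tag):
        self.window[-1][tag] += 1      # count into the current minute

    def tick(self):
        self.window.append(Counter())  # call once per minute; oldest bucket expires

    def top(self, n=10):
        return sum(self.window, Counter()).most_common(n)

trends = TrendingTopics()
for tag in ["#hadoop", "#nosql", "#hadoop"]:
    trends.observe(tag)
print(trends.top(3))  # [('#hadoop', 2), ('#nosql', 1)]
```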
Machine learning is another essential tool for the data scientist. We now expect web and mobile applications to incorporate recommendation engines, and building a recommendation engine is a quintessential artificial intelligence problem. You don't have to look at many modern web applications to see classification, error detection, image matching (behind Google Goggles and SnapTell), and even face detection; an ill-advised mobile application lets you take someone's picture with a cell phone, and look up that person's identity using photos available online. Andrew Ng's Machine Learning course (http://www.youtube.com/watch?v=UzxYlbK2c7E) is one of the most popular courses in computer science at Stanford, with hundreds of students.

There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de facto standard.

Mechanical Turk is also an important part of the toolbox. Machine learning almost always requires a "training set," or a significant body of known data with which to develop and tune the application. The Turk is an excellent way to develop training sets. Once you've collected your training data (perhaps a large collection of public photos from Twitter), you can have humans classify them inexpensively, possibly sorting them into categories, possibly drawing circles around faces, cars, or whatever interests you. It's an excellent way to classify a few thousand data points at a cost of a few cents each. Even a relatively large job only costs a few hundred dollars.

While I haven't stressed traditional statistics, building statistical models plays an important role in any data analysis. According to Mike Driscoll (@dataspora), statistics is the "grammar of data science." It is crucial to "making data speak coherently." We've all heard the joke that eating pickles causes death, because everyone who dies has eaten pickles. That joke doesn't work if you understand what correlation means. More to the point, it's easy to notice that one advertisement for R in a Nutshell generated 2 percent more conversions than another. But it takes statistics to know whether this difference is significant, or just a random fluctuation. Data science isn't just about the existence of data, or making guesses about what that data might mean; it's about testing hypotheses and making sure that the conclusions you're drawing from the data are valid. Statistics plays a role in everything from traditional business intelligence (BI) to understanding how Google's ad auctions work. Statistics has become a basic skill. It isn't superseded by newer techniques from machine learning and other disciplines; it complements them.
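That distinction is easy to demonstrate. Here is a sketch in Python with SciPy (though the report's own statistical tool of choice is R), using invented conversion counts: the question is whether a 2.04 percent conversion rate genuinely beats a 2.00 percent one:

```python
from scipy.stats import chi2_contingency

# Invented counts: ad A converted 510 of 25,000 impressions (2.04%),
# ad B converted 500 of 25,000 (2.00%). Is A really better?
table = [[510, 25000 - 510],
         [500, 25000 - 500]]

chi2, p_value, _, _ = chi2_contingency(table)
print(f"p = {p_value:.2f}")  # p is large (roughly 0.8): a difference
                             # this small is consistent with random noise
```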
While there are many commercial statistical packages, the open source R language, with its comprehensive package library, CRAN, is an essential tool. Although R is an odd and quirky language, particularly to someone with a background in computer science, it comes close to providing "one-stop shopping" for most statistical work. It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions take R into distributed computing. If there's a single tool that provides an end-to-end solution for statistics work, R is it.

Making data tell its story

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte's Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that's not really what concerns us here. Visualization is crucial to each stage of the data scientist's work. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you've gotten some hints at what the data might be saying, you can follow up with more detailed analysis.
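In the spirit of that dozen-scatter-plots habit, here is a minimal sketch using pandas and matplotlib (the file name is hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("new_dataset.csv")  # hypothetical file

# First look at a new dataset: scatter every numeric column against
# every other, and see what jumps out before doing any real analysis.
numeric = df.select_dtypes(include="number")
pd.plotting.scatter_matrix(numeric, figsize=(8, 8), diagonal="hist")
plt.show()
```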
There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Casey Reas and Ben Fry's Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM's Many Eyes, many of the visualizations are full-fledged interactive applications.

Nathan Yau's FlowingData blog is a great place to look for creative visualizations. One of my favorites is the animation of the growth of Walmart over time (http://flowingdata.com/2010/04/07/watching-the-growth-of-walmart-now-with-100-more-sams-club/). And this is one place where "art" comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn't just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That's not a question we could even have asked a few years ago: there was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It's the kind of question we now ask routinely.

Data scientists

Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

"On any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization."[3]

Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you've just spent a lot of grant money generating data, you can't just throw the data out if it isn't as clean as you'd like. You have to make it tell its story. You need some creativity for when the story the data is telling isn't what you think it's telling.

Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn's membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members' profiles and made recommendations accordingly, asking things like, did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn's data scientists started looking at events that members attended, then at books members had in their libraries. The result was a valuable data product that analyzed a huge database, but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.

This is the heart of what Patil calls "data jiujitsu": using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable; see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.

Entrepreneurship is another piece of the puzzle. Patil's first, flippant answer to "what kind of person are you looking for when you hire a data scientist?" was "someone you would start a company with." That's an important insight: we're entering the era of products that are built on data. We don't yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they're all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they're entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: "here's a lot of data, what can you make from it?"

[Figure: Hiring trends for data science. It's not easy to get a handle on jobs in data science. However, data from O'Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the data science market as a whole. The graph shows the increase in Cassandra jobs, and the companies listing Cassandra positions, over time.]

The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it's mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian's quote that nobody remembers says it all:

"The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that's going to be a hugely important skill in the next decades."

Data is indeed the new "Intel Inside."

[1] The NASA article denies this, but also says that in 1984, they decided that the low values (which went back to the '70s) were real. Whether humans or software decided to ignore anomalous data, it appears that data was ignored.
[2] "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data).
[3] "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data).