Introduction To Data Science

Step 1) Brush up on your math

At a minimum, I would recommend a strong understanding of:
Multivariate Calculus: https://www.quora.com/What-are-the-best-resources-for-mastering-multivariable-calculus
Numerical Linear Algebra / Computational Linear Algebra / Matrix Algebra: Linear Algebra
Much of probability, statistics, and machine learning builds on a strong
understanding of the above.
Step 2) Set up your tools
Install Python, IPython, and related libraries (guide)
Install R and RStudio
Install Sublime Text
Step 3) Learn to use your tools
Learn R with swirl
Learn Python with Codecademy
What's the best way to learn to use Sublime Text?
What is the best way to learn SQL basics? (I don't think you need to install it on your
computer, but learning the syntax will be helpful on the job.)
Step 4) Learn Probability and Statistics

Be sure to go through a course that involves heavy application in R or Python.
Python Application: Think Stats: Probability and Statistics for Programmers (Python focus)
R Applications: Introduction to Statistical Learning (Book) (MOOC)
Step 5) Complete Harvard's Data Science Course

This course is developed in part by a fellow Quora user, Professor Joe Blitzstein.
Intro to the class
What is it like to design a data science class?
What is it like to take CS 109/Statistics 121 (Data Science) at Harvard?

Lectures and Slides
(2013) Lecture Videos
(2013) Slides
(2014) Lecture Videos
(2014) Slides
2013 Assignments
Intro to Python, Numpy, Matplotlib (Homework 0) (solutions)
Poll aggregation, web scraping, plotting, model evaluation, and forecasting (Homework 1) (solutions)
Data prediction, manipulation, and evaluation (Homework 2) (solutions)
Predictive modeling, model calibration, sentiment analysis (Homework 3) (solutions)
Recommendation engines, using MapReduce (Homework 4) (solutions)
Network visualization and analysis (Homework 5) (solutions)

2014 Assignments
Data manipulation, modeling, plotting (Homework 1) (solutions)

2013 Labs
Lab 2: Web Scraping
Lab 3: EDA, Pandas, Matplotlib
Lab 4: Scikit-Learn, Regression, PCA
Lab 5: Bias, Variance, Cross-Validation
Lab 6: Bayes, Linear Regression, and Metropolis Sampling
Lab 7: Gibbs Sampling
Lab 8: MapReduce
Lab 9: Networks
Lab 10: Support Vector Machines
Step 6) Do most of Kaggle's Getting Started and Playground Competitions
I would NOT recommend doing any of the prize-money competitions. They usually have
datasets that are too large, complicated, or annoying, and are not good for learning
(Kaggle.com).
Start by learning scikit-learn, playing around, reading through tutorials and forums at Data
Science London + Scikit-learn for a simple, synthetic, binary classification task.
Next, play around some more and check out the tutorials for Titanic: Machine Learning
from Disaster with a slightly more complicated binary classification task (with
categorical variables, missing values, etc.)
Afterwards, try some multi-class classification with Forest Cover Type Prediction.
Now, try a regression task, Bike Sharing Demand, that involves incorporating timestamps.
Try out some natural language processing with Sentiment Analysis on Movie Reviews.
Finally, try out any of the other knowledge-based competitions that interest you!
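The competitions above mention categorical variables, missing values, and timestamps. Here is a minimal sketch of how those three might be turned into numeric features using only the Python standard library. The rows, column names, and the timestamp format are invented for illustration and are not taken from any Kaggle dataset.

```python
from datetime import datetime
from statistics import median

# Hypothetical rows: the column names and values are made up for this sketch.
rows = [
    {"timestamp": "2011-01-01 08:00:00", "season": "winter", "temp": 9.8},
    {"timestamp": "2011-07-04 17:00:00", "season": "summer", "temp": None},
    {"timestamp": "2011-10-15 23:00:00", "season": "fall", "temp": 14.2},
]

def timestamp_features(ts):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into numeric parts."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # weekday() counts from Monday = 0 to Sunday = 6.
    return {"hour": dt.hour, "weekday": dt.weekday(), "month": dt.month}

def one_hot(value, categories):
    """Encode a categorical value as 0/1 indicator columns."""
    return {f"is_{c}": int(value == c) for c in categories}

def preprocess(rows):
    # Fill missing temperatures with the median of the observed ones.
    temp_median = median(r["temp"] for r in rows if r["temp"] is not None)
    categories = sorted({r["season"] for r in rows})
    out = []
    for r in rows:
        features = timestamp_features(r["timestamp"])
        features.update(one_hot(r["season"], categories))
        features["temp"] = r["temp"] if r["temp"] is not None else temp_median
        out.append(features)
    return out

features = preprocess(rows)
for f in features:
    print(f)
```

Median imputation and one-hot encoding are only one reasonable starting point. Trying alternatives and comparing scores is a large part of what these competitions teach.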
Step 7) Learn More

A/B testing is essentially a rebranded version of the randomized controlled trials that
pharmaceutical companies have been running for decades.
Learn more about A/B testing here: The Ultimate Guide to A/B Testing - Smashing
Magazine
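To make the idea concrete, here is a minimal sketch of one common way to compare two variants: a two-proportion z-test on conversion counts. The function name and the counts are hypothetical, and the normal approximation assumes reasonably large samples. It is an illustration, not the method the linked guide prescribes.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 5,000 visitors per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be noise, but it says nothing about whether the lift is large enough to matter for the product.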
Step 8) Do Side Projects
What are some good "toy problems" in data science?
How can I start building a recommendation engine?
What are some ideas for a quick weekend Python project?
What is a good measure of the influence of a Twitter user?
Where can I find large datasets open to the public?
What are some good algorithms for a prioritized inbox?
Step 9) Code in Public

Create public GitHub repositories, make a blog, and post your work, side projects, Kaggle
solutions, insights, and thoughts! This helps you gain visibility, build a portfolio for your
resume, and connect with other people working on the same tasks.
Step 10) Attend a local meetup

Check out Meetup to find some that interest you! Attend an interesting talk, and meet people
in your area who share your interests and aspirations. Perhaps even get some leads for
a job!
Step 11) Think like a Data Scientist

In addition to the concrete steps I listed above to develop the skillset of a data scientist, I
include seven challenges below so you can learn to think like a data scientist and
develop the right attitude to become one.
(1) Satiate your curiosity through data
As a data scientist you write your own questions and answers.

Data scientists are naturally curious about the data that they're looking at, and are creative
with ways to approach and solve whatever problem needs to be solved.
Much of data science is not the analysis itself, but discovering an interesting
question and figuring out how to answer it.
Here are two great examples:
Hilary: the most poisoned baby name in US history
A Look at Fire Response Data

Challenge: Think of a problem or topic you're interested in and answer it with data!
(2) Read news with a skeptical eye

Much of the contribution of a data scientist (and why it's really hard to replace a data
scientist with a machine) is that a data scientist will tell you what's important and what's
spurious. This persistent skepticism is healthy in all sciences, and is especially necessary in
a fast-paced environment where it's too easy to let a spurious result be misinterpreted.
You can adopt this mindset yourself by reading news with a critical eye. Many news
articles have inherently flawed main premises. Try these two articles. Sample
answers are available in the comments.
Easier: You Love Your iPhone. Literally.
Harder: Who predicted Russia's military intervention?
Challenge: Do this every day when you encounter a news article. Comment on the article
and point out the flaws.
(3) See data as a tool to improve consumer products

Visit a consumer internet product (preferably one that you know doesn't already do extensive
A/B testing), and then think about its main funnel. Does it have a checkout funnel? Does it
have a signup funnel? Does it have a virality mechanism? Does it have an engagement
funnel?
Go through the funnel multiple times and hypothesize about different ways it could be
improved to increase a core metric (conversion rate, shares, signups, etc.). Design an
experiment to verify whether your suggested change can actually change the core metric.
Challenge: Share it with the feedback email for the consumer internet site!
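Part of designing such an experiment is estimating how many users each variant needs. Here is a rough sketch using the standard normal-approximation formula for comparing two proportions. The baseline and target rates, and the defaults of 5% significance and 80% power, are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, target, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect baseline -> target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_power) ** 2 * variance / (target - baseline) ** 2)

# Hypothetical funnel: 4% of visitors convert today, and the change aims for 5%.
n = sample_size_per_group(baseline=0.04, target=0.05)
print(n)
```

The required sample grows quickly as the expected lift shrinks, which is why small sites often can only test bold changes.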
(4) Think like a Bayesian

To think like a Bayesian, avoid the base rate fallacy. This means that to form new beliefs, you
must incorporate both newly observed information AND prior information formed through
intuition and experience.
Checking your dashboard, user engagement numbers are significantly down
today. Which of the following is most likely?
1. Users are suddenly less engaged
2. A feature of the site broke
3. The logging feature broke
Even though explanation #1 completely explains the drop, #2 and #3 should be more likely
because they have a much higher prior probability.
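The dashboard reasoning above can be worked through with Bayes' rule. In this sketch, every prior and likelihood is a made-up number chosen only to illustrate the mechanics. In practice you would estimate them from how often each kind of incident has happened before.

```python
def posterior(priors, likelihoods):
    """Combine priors with likelihoods of the observation via Bayes' rule."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Hypothetical priors: how plausible each explanation is before looking closer.
priors = {
    "users less engaged": 0.05,
    "site feature broke": 0.45,
    "logging broke": 0.50,
}
# Hypothetical likelihoods: chance of seeing this drop if the explanation is true.
likelihoods = {
    "users less engaged": 1.0,
    "site feature broke": 0.8,
    "logging broke": 0.9,
}

post = posterior(priors, likelihoods)
for hypothesis, probability in post.items():
    print(f"{hypothesis}: {probability:.2f}")
```

Even though the first explanation fits the drop perfectly, its low prior keeps its posterior well below the other two.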
You're in senior management at Tesla, and five of Tesla's Model S's have caught
fire in the last five months. Which is more likely?
1. Manufacturing quality has decreased and Teslas should now be deemed unsafe.
2. Safety has not changed and fires in Tesla Model S's are still much rarer than their
counterparts in gasoline cars.
While #1 is an easy explanation (and great for media coverage), your prior should be strong
on #2 because of your regular quality testing. However, you should still be seeking
information that can update your beliefs on #1 versus #2 (and still find ways to improve
safety). Question for thought: what information should you seek?
Challenge: Identify the last time you committed the base rate fallacy. Avoid committing
the fallacy from now on.
(5) Know the limitations of your tools

Knowledge is knowing that a tomato is a fruit, wisdom is not putting it in a fruit salad. - Miles Kington
Knowledge is knowing how to perform an ordinary linear regression, wisdom is realizing how
rarely it applies cleanly in practice.
Knowledge is knowing five different variations of K-means clustering, wisdom is realizing
how rarely actual data can be cleanly clustered, and how poorly K-means clustering can
work with too many features.
Knowledge is knowing a vast range of sophisticated techniques, but wisdom is being able to
choose the one that will provide the most impact for the company in a reasonable
amount of time.
You may develop a vast range of tools while you go through your Coursera or EdX courses,
but your toolbox is not useful until you know which tools to use.
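One commonly cited reason K-means struggles with too many features is that Euclidean distances between random points become nearly indistinguishable as dimensions are added. Here is a small sketch of that effect on uniformly random data. The point counts, dimensions, and the "contrast" measure are choices made for this illustration only.

```python
import random
from math import dist

def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbor of one point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    distances = [dist(query, p) for p in others]
    return (max(distances) - min(distances)) / min(distances)

low = distance_contrast(n_points=200, n_dims=2)
high = distance_contrast(n_points=200, n_dims=500)
print(f"contrast with 2 features:   {low:.2f}")
print(f"contrast with 500 features: {high:.2f}")
```

When the contrast is close to zero, "nearest centroid" carries little information, so cluster assignments become close to arbitrary.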
Challenge: Apply several tools to a real dataset and discover the tradeoffs and limitations
of each tool. Which tools worked best, and can you figure out why?
(6) Teach a complicated concept

How does Richard Feynman distinguish which concepts he understands and which concepts
he doesn't?
Feynman was a truly great teacher. He prided himself on being able to devise ways to
explain even the most profound ideas to beginning students. Once, I said to him, "Dick,
explain to me, so that I can understand it, why spin one-half particles obey Fermi-Dirac
statistics." Sizing up his audience perfectly, Feynman said, "I'll prepare a freshman lecture
on it." But he came back a few days later to say, "I couldn't do it. I couldn't reduce it to the
freshman level. That means we don't really understand it." - David L. Goodstein, Feynman's
Lost Lecture: The Motion of Planets Around the Sun
What distinguished Richard Feynman was his ability to distill complex concepts into
comprehensible ideas. Similarly, what distinguishes top data scientists is their ability to
cogently share their ideas and explain their analyses.
Check out Edwin Chen's answers to these questions for examples of cogently explained
technical concepts:
Is there any summary of top models for the Netflix prize?
What is a good explanation of Latent Dirichlet Allocation?
What is Least Angle Regression and when should it be used?

Challenge: Teach a technical concept to a friend or on a public forum, like Quora or
YouTube.
(7) Convince others about what's important

Perhaps even more important than a data scientist's ability to explain their analysis is their
ability to communicate the value and potential impact of the actionable insights.
Certain tasks of data science will be commoditized as data science tools become
better and better. New tools will make certain tasks obsolete, such as writing dashboards,
unnecessary data wrangling, and even specific kinds of predictive modeling.
However, the need for a data scientist to extract and communicate what's
important will never be made obsolete. With increasing amounts of data and potential
insights, companies will always need data scientists (or people in data science-like roles) to
triage all that can be done and prioritize tasks based on impact.
The data scientist's role in the company is to serve as the ambassador between the
data and the company. The success of a data scientist is measured by how well they
can tell a story and make an impact. Every other skill is amplified by this ability.
Challenge: Tell a story with statistics. Communicate the important findings in a dataset.
Make a convincing presentation that your audience cares about.
If you liked this material, please consider following:

1) My personal blog, Storytelling with Statistics
2) Learn Data Science, where I am curating material on Quora that is relevant for anyone
seeking to become a data scientist!
