
Welcome to IST 380!

Data Science
Programming

We don't have strong enough words to describe this class.
- US News and Course Report

an advocate of
concrete computing
and HMC's mascot

When the course was over, I knew it was a good thing.
- New York Times Review of Courses

We give this course two thumbs!
- Ebert and Roeper

About myself

Who: Zach Dodds
Where: Harvey Mudd College
What: Research includes robotics and computer vision
When: Mondays 7-10 pm here in ACB 119

Contact Information

dodds@cs.hmc.edu
909-607-0867
Office Hours: Friday mornings, 9-11 am, or set up a time...
HMC Beckman B111

TMI?

fan of low-tech games


fan of low-level AI

IST 380 ~ the big picture
What is it?

Why me?

Data Science
Venn Diagram

Hmmm where am I
on this diagram?

Data?!

Neighbor's name
A place they consider home
Are they working at a company now? Where?
How many U.S. states have they visited?
Their favorite unhealthy food?
Do they have any "Data Science" background? (statistics, machine learning, CS)

state reminders

Data!

Neighbor's name: Zachary Dodds
A place they consider home: Pittsburgh, PA
Are they working at a company now? Where? Harvey Mudd
How many U.S. states have they visited? 44
Their favorite unhealthy food? M&Ms
Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me

This class is truly seminar-style: we're developing expertise in this field together.

be sure to set up your login + profile for the submission site

Data Science concerns

Is "Data Science" important or just trendy?

Hmmm... the companies are expanding as fast as the data!

Data, data everywhere

There's certainly a lot of it!

[Chart: data produced each year, logarithmic scale, with gridlines at 1 Petabyte, 1 Exabyte, and 1 Zettabyte]

Year    Data produced
2002    5 EB
2006    161 EB
2009    800 EB
2011    1.8 ZB
2015    8.0 ZB

For comparison:
100 years of HD video + audio: 60 PB
Human brain's capacity: 14 PB
120 PB (unlabeled point on the chart)

Units: 1 Petabyte == 1000 TB; 1 TB = 1000 GB
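As a rough check on the scale of these figures, the sketch below converts the cited yearly totals to bytes and computes the growth they imply. It assumes decimal units (1 EB = 10^18 bytes, 1 ZB = 10^21 bytes), in line with the slide's "1 TB = 1000 GB"; the variable names are illustrative only.

```r
# Decimal storage units, in bytes (assumption: powers of 1000, as on the slide)
EB <- 1e18
ZB <- 1e21

# Yearly totals as cited in the references
year  <- c(2002, 2006, 2009, 2011, 2015)
bytes <- c(5 * EB, 161 * EB, 800 * EB, 1.8 * ZB, 8.0 * ZB)

# Overall growth factor from 2002 to 2015
overall <- bytes[length(bytes)] / bytes[1]

# Implied average growth factor per year over that span
annual <- overall^(1 / (year[length(year)] - year[1]))

print(overall)
print(annual)

# On a log10 scale the points are far more evenly spaced than the raw values,
# which is why the chart uses a logarithmic axis
print(log10(bytes))
```

Under these assumptions the 2015 figure is 1600 times the 2002 figure, which works out to growth of somewhat under a doubling per year on average.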
References
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-valuefrom-chaos-ar.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universeare-you-ready.pdf
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idcwhite-paper.pdf

(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info2003/execsum.htm


(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-thehuman-brain-store

I'd call it
data, not
information

wisdom
knowledge
information
data

Big Data?

I agree with this

Make data easier to use ~ by using it!

It may be true that Data Science isn't a science, but that doesn't mean it's not useful!

IST 380 ~ the big picture
What?
Data Science
Programming

Why?

Data Rules

All of our insights, large and small, permanent and ephemeral, natural and artificial, come about through the integration of lots of data.
Data Science simply recognizes that the rules and skills behind those insights are widely applicable.

A few examples
Make3d

Andrew Ng ~ Computers and Thought award, 2009

How is this being done?


and how do we succeed?

Data Science is at the heart of computer science

Learning to Powerslide
Stanford's Autonomous Vehicles project (Thrun et al.)

Learning ground from obstacles
"my summer was finding that red line"

classification

segmentation

Insights beyond science

Marketing

Visualization

Motivation

Recommender Systems

predicting movie ratings

Netflix Prize

(I don't know this guy) Bob Bell, winner of the "Netflix prize"

Films shown: Napoleon Dynamite, Finding Nemo, Batman Begins, Lord of the ...

Some films are difficult to predict and others are easier!

Why IST 380?

Specific skills:
R statistical environment (and the S programming language)
Experience with several statistical analyses (descriptive statistics)
Experience with predictive statistics (modeling) and machine learning algorithms

Broad background:
Final project ~ open-ended with datasets of your choice
You'll be confident and capable with whatever datasets you encounter in the future, on your own or as part of a team.

About IST 380

Details
Web Page:
http://www.cs.hmc.edu/~dodds/IST380

Assignments, online text, necessary files, lecture slides are linked
First week's assignment: Getting started with R

Textbook

An Introduction to Data Science

freely available online

jsresearch.net/groups/teachdatascience/

and many online resources

Programming: R
www.r-project.org/

Grab both of these now

Homepage
Go to the course page

Grab R and the text from these two links

Homework
Assignments
~2-5 problems/week ~ 100 points; extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.

1 week + 1 day

Working on programs:

On your own or in groups of 2.
Divide the work at the keyboard evenly!

Submitting programs: at the submission website

Today's Lab:

install software ~ ensure accounts are working
try out R ~ the first HW is officially due on 2/5

Outline
approximate!

Weeks 1-5: "Data Science"
using R
descriptive statistics
predictive statistics
probability distributions
statistical modeling

Weeks 6-10: "Machine Learning"
support vector machines (SVMs)
nearest neighbors (NN)
random forests
k-means algorithm

Weeks 11-15: Final Project

No breaks?!

Grading
Grades
Based on points percentage
~800 points for assignments
~400 points for the final project

if score >= 0.95: grade = "A"
elif score >= 0.90: grade = "A-"
elif score >= 0.86: grade = "B+"

see the course syllabus for the full list...
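The same cutoffs can be written in R, the language used in this course. This is only a sketch: the function name letter_grade is hypothetical, the default of 1200 total points assumes the approximate 800 + 400 split above, and anything below the three cutoffs shown is deferred to the syllabus rather than guessed.

```r
# Hypothetical helper: maps points earned to a letter grade using only the
# three cutoffs shown on the slide; lower grades are left to the syllabus.
letter_grade <- function(points_earned, points_possible = 1200) {
  score <- points_earned / points_possible
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    "see syllabus"
  }
}

print(letter_grade(1150))  # 1150 / 1200 is about 0.958
print(letter_grade(1090))  # 1090 / 1200 is about 0.908
print(letter_grade(1040))  # 1040 / 1200 is about 0.867
```

Chaining the tests with else matters: with three independent if statements, a score of 0.97 would pass all three tests and end up with the last grade assigned.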

Final project
the last ~4 weeks will work towards a larger, final project
there will be a short design phase and a short final presentation
choose your own problem to study (I'll have some suggestions, too.)
I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.

Academic Honesty
This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies.
Your work must be your own. This must be true for the whole team, if you're working in a pair.
Consulting with others (other than team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy.
A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.

Thoughts?

Getting to know R

http://langindex.sourceforge.net/#categ

R is the programmer's toolkit for statistics; SAS, Stata, and SPSS are preferred by those in business intelligence

Free and very well supported online

R is responsive, up to date, and flexible: Data Science vs. Statistics

Getting to know R
1) Find the IST 380 course webpage
www.cs.hmc.edu/~dodds/IST380/

2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380

Try it!
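For reference, here are the three commands with comments on what each one does. The set.seed() call and the variable name draws are additions for illustration (they are not on the slide); fixing the seed simply makes the random draws repeatable.

```r
6 * 7                # arithmetic at the prompt; R prints the result

set.seed(380)        # assumption: fix the seed so the draws are repeatable
draws <- rnorm(10)   # ten draws from a standard normal distribution
print(draws)

x <- 380             # "<-" is R's assignment operator
print(x)
```

Without set.seed(), rnorm(10) returns a different set of ten numbers each time it is called.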

Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started
4) When you finish, save your session and submit it!

This is problem 1 this week

Saving your session
1) Create a folder named hw1, perhaps on your desktop
2) Use "Save to file" (Windows) or "Save as" (Mac) in order to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!

Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password to something you know, and submit hw1.zip
5) You can submit again ~ all copies are saved
You've completed Problem 1!

troubles? email me!


This webserver can be
spacey -- I should know!

Reflection
Assignment?
Creating a vector?
Printing?
Average and standard deviation?
Comments?
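Each of these reflection prompts corresponds to a line or two of R. The snippet below is a minimal illustration with made-up values, not a copy of the tutorial session.

```r
# Comments: everything after "#" on a line is ignored by R

x <- c(1, 2, 4)   # assignment: store a newly created vector in x
print(x)          # printing (typing x alone at the prompt also prints it)

m <- mean(x)      # average
s <- sd(x)        # sample standard deviation (divides by n - 1)
print(m)
print(s)
```

Note that sd() computes the sample standard deviation, so for three values it divides the summed squared deviations by 2, not 3.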

R types

You can use mode() to view the type of a variable.
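A few calls to mode() on different kinds of values show what it reports. The values are arbitrary examples.

```r
mode(380)          # "numeric"
mode(380L)         # "numeric" (integers report the same mode)
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
mode(mean)         # "function"

x <- 380
print(mode(x))     # works the same way on a variable
```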

Where's the big data?
c ~ concatenate

the colon : also creates vectors

Vectors are R lists of a single type of
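Both ways of building vectors can be tried directly at the prompt. The names a, b, ab, and mixed are illustrative.

```r
a <- c(3, 1, 4)        # c() concatenates values into a vector
b <- 1:5               # the colon builds the sequence 1, 2, 3, 4, 5
ab <- c(a, b)          # c() also joins existing vectors end to end
print(ab)
print(length(ab))      # 8

# A vector holds a single type, so mixed inputs are coerced to one type
mixed <- c(1, "two", 3)
print(mode(mixed))     # "character"
```

The last two lines show the single-type rule in action: the numbers are silently converted to strings so that every element of mixed has the same type.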

Analyzing vectors ~ try these

you can use a boolean vector to subset another vector

Square brackets [] can "subset" (or
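Subsetting with square brackets, including with a boolean vector, can be sketched as follows. The vector v and its values are made up for illustration.

```r
v <- c(3, 8, 1, 9, 5)

v[2]               # one element by position (R counts from 1): 8
v[c(1, 4)]         # several positions at once: 3 9
v[-1]              # a negative index drops that position: 8 1 9 5

big <- v > 4       # a boolean (logical) vector: FALSE TRUE FALSE TRUE TRUE
picked <- v[big]   # keeps the elements where big is TRUE
print(picked)      # 8 9 5
```

The two steps are usually written in one line as v[v > 4].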

NA
R uses NA to represent data that is "not available"
The function is.na( ) tests for NA
What is going on here?

This uses subsetting to remove NA values!
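The slide's own example is not reproduced in this text, so here is a small stand-in with made-up values showing is.na() and subsetting working together.

```r
h <- c(64, NA, 70, 68, NA)

is.na(h)                 # FALSE TRUE FALSE FALSE TRUE
mean(h)                  # NA: one missing value makes the result missing

clean <- h[!is.na(h)]    # keep only the positions where is.na() is FALSE
print(clean)             # 64 70 68
print(mean(clean))       # 67.33333
```

Many summary functions also accept na.rm = TRUE, so mean(h, na.rm = TRUE) gives the same result without building clean first.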

Data frames

R's fundamental data structures are data frames
The next tutorial will introduce them

Irises

virginica

setosa

data() yields many built-in data files. This is iris
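A quick way to get oriented in the iris data set is to load it and look at its shape, its column names, and its first few rows.

```r
data(iris)                    # load the built-in iris data set

print(dim(iris))              # 150 rows, 5 columns
print(names(iris))            # four measurements plus Species
print(head(iris, 3))          # first three rows
print(levels(iris$Species))   # "setosa" "versicolor" "virginica"
```

Calling data() with no arguments lists the other built-in data sets that are available.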

Subsetting iris data
df[rows,cols]

As with vectors, you can "subset" data frames.
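The df[rows, cols] pattern can be sketched on iris as follows; leaving either side of the comma empty means "all rows" or "all columns". The variable names are illustrative.

```r
data(iris)

first_rows <- iris[1:5, ]                            # rows 1-5, all columns
two_cols   <- iris[, c("Sepal.Length", "Species")]   # all rows, two columns
setosa     <- iris[iris$Species == "setosa", ]       # rows picked by a boolean vector

print(first_rows)
print(nrow(setosa))                # 50
print(mean(setosa$Sepal.Length))   # 5.006
```

The third line is the data-frame version of boolean subsetting on vectors: the comparison produces one TRUE or FALSE per row, and only the TRUE rows are kept.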

Lab
The 2nd part of each class meeting is dedicated to lab work.
I welcome you to stay for the lab, but it is not required.
Today's lab:
Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_1.txt, pr2_1.txt, pr2_1.txt, and pr2_1.txt.
This is a nice reinforcement of vectors, introduction to data frames, and a look at the graphics that R supports.

Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and data analysis introduction from pr2's tutorial.

Problem 4: Introduction to Data Science, early chapters
This is a fuller background on R and the field of data science

(submit your console session for both of these)

Lab!

CS vs. IS and IT?
greater integration
system-wide issues

smaller details
machine specifics

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

Where will IS go?
Where will IT go?

The bigger picture

Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance

Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
