Data Science
Programming
We don't have strong enough words to describe this class.
- US News and Course Report
an advocate of
concrete computing
and HMC's mascot
When the course was over, I knew it was a good thing.
- New York Times Review of Courses
We give this course two thumbs!
- Ebert and Roeper
Welcome to IST 380!
About myself
Who: Zach Dodds
Where: Harvey Mudd College
What: Research includes robotics and computer vision
When: Mondays 7-10 pm, here in ACB 119
Contact Information:
dodds@cs.hmc.edu
909-607-0867
Office Hours: Friday mornings, 9-11 am, or set up a time...
HMC Beckman B111
TMI?
IST 380 ~ the big picture
What is it?
Why me?
Data Science
Venn Diagram
Hmmm, where am I
on this diagram?
Data?!
Neighbor's name
A place they consider home
Are they working at a company now? Where?
How many U.S. states have they visited?
Their favorite unhealthy food?
Do they have any "Data Science" background? (statistics, machine learning, CS)
state reminders
Data!
Neighbor's name: Zachary Dodds
A place they consider home: Pittsburgh, PA
Are they working at a company now? Where? Harvey Mudd
How many U.S. states have they visited? 44
Their favorite unhealthy food? M&Ms
Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me
This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field.
Data Science concerns
Is "Data Science" important or just trendy?
Hmmm, the companies are expanding as fast as the data!
Figure (logarithmic scale, 2002-2015): data volume over time. Labeled values: 14 PB, 60 PB, 120 PB, 5 EB, 800 EB, 1.8 ZB, 8.0 ZB. Scale marks: 1 Petabyte, 1 Exabyte, 1 Zettabyte.
1 Petabyte == 1000 TB
1 TB == 1000 GB
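The unit relations on this slide can be checked with a little arithmetic. The sketch below assumes decimal (powers of 1000) units throughout, as the slide's "1 Petabyte == 1000 TB" suggests; `units_in_gb` and `convert_size` are hypothetical names introduced only for illustration.

```r
# Decimal storage units, each expressed in gigabytes
# (assumption: every step up is a factor of 1000, as on the slide).
units_in_gb <- c(GB = 1, TB = 1e3, PB = 1e6, EB = 1e9, ZB = 1e12)

# Hypothetical helper: convert a size from one unit to another.
convert_size <- function(value, from, to) {
  unname(value * units_in_gb[from] / units_in_gb[to])
}

convert_size(1, "PB", "TB")     # one petabyte expressed in terabytes
convert_size(1.8, "ZB", "EB")   # the 2011 estimate expressed in exabytes

# Growth factor between the 2009 (800 EB) and 2015 (8 ZB) estimates.
growth <- convert_size(8, "ZB", "EB") / 800
growth
```

Putting both estimates into the same unit first is what makes the growth factor meaningful.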
References
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-valuefrom-chaos-ar.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universeare-you-ready.pdf
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idcwhite-paper.pdf
I'd call it
data, not
information
wisdom
knowledge
information
data
Big Data?
IST 380 ~ the big picture
What?
Data Science
Programming
Why?
Data Rules
All of our insights, large and small, permanent and
ephemeral, natural and artificial, come about
through the integration of lots of data.
Data Science simply recognizes that the rules and
skills behind those insights are widely applicable
A few examples
Make3d: Andrew Ng ~ Computers and Thought award, 2009
Learning to Powerslide: Stanford's Autonomous Vehicles project (Thrun et al.)
Learning ground from obstacles: classification, segmentation
"my summer was finding that red line"
Insights beyond science
Marketing
Visualization
Motivation
Recommender Systems
predicting movie ratings
Netflix Prize
Napoleon Dynamite = 1.22
Finding Nemo = .67
Batman Begins = .75
Lord of the ...
Some films are difficult to predict and others are easier!
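One way to see why some films are harder to predict than others is to measure prediction error per film. The sketch below assumes the slide's per-film numbers are an error measure such as root-mean-squared error (RMSE); the slide does not say so. The ratings are invented for illustration, and `rmse`, `polarizing`, and `crowd_fav` are hypothetical names.

```r
# Root-mean-squared error between actual and predicted ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up 1-5 star ratings for two hypothetical films.
polarizing <- c(1, 5, 1, 5, 2, 5)   # viewers disagree strongly
crowd_fav  <- c(4, 5, 4, 4, 5, 4)   # viewers mostly agree

# Simplest possible predictor: guess the film's average rating for everyone.
rmse(polarizing, mean(polarizing))  # large error: hard to predict
rmse(crowd_fav,  mean(crowd_fav))   # small error: easy to predict
```

A film that splits its audience leaves a large error even for a sensible guess, while a film that everyone rates similarly does not.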
Why IST 380?
Specific skills:
R statistical environment (and the S programming
language)
Experience with several statistical analyses
(descriptive statistics)
Experience with predictive statistics (modeling)
and machine learning algorithms
Broad background:
Final project ~ open-ended with datasets of your
choice
You'll be confident and capable with whatever datasets you
encounter in the future, on your own or as part of a team.
About IST 380
Details
Web Page: http://www.cs.hmc.edu/~dodds/IST380
Assignments, online text, necessary files, lecture slides are linked
First week's assignment: Getting started with R
Textbook: An Introduction to Data Science
jsresearch.net/groups/teachdatascience/
and many online resources
Programming: R
www.r-project.org/
Grab both of these now
Homepage
Go to the course page
Grab R and the text from these two links
http://www.cs.hmc.edu/~dodds/IST380/
Homework
Assignments
~2-5 problems/week, ~100 points; extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
1 week + 1 day
Working on programs:
On your own or in groups of 2.
Divide the work at the keyboard evenly!
Submitting programs: at the submission website
Today's Lab:
install software, ensure accounts are working
try out R; the first HW is officially due on 2/5
Outline (approximate!)
Weeks 1-5, "Data Science": using R, descriptive statistics, predictive statistics, probability distributions, statistical modeling
Weeks 6-10, "Machine Learning": support vector machines (SVMs), nearest neighbors (NN), random forests, k-means algorithm
Weeks 11-15: Final Project
No breaks?!
Grading
Grades
Based on points percentage
~800 points for assignments
~400 points for the final project
see the course syllabus for the full list...
Final project
the last ~4 weeks will work towards a larger, final project
there will be a short design phase and a short final presentation
choose your own problem to study (I'll have some suggestions, too.)
I'd encourage you to connect R and our Data Science techniques
to other datasets or projects that you use/need/like, etc.
Academic Honesty
This course operates under CGU's (and all of Claremont Schools')
Academic Honesty policies
Your work must be your own. This must be true for the whole
team, if you're working in a pair.
Consulting with others (except team members or myself) is
encouraged, but has to be limited to discussion and debugging
of problems. Sharing of written, electronic, or verbal
solutions/files/code is a violation of CGU's academic honesty
policy.
A reasonable guideline: Work is your own if you could delete
all of it and recreate it yourself.
Thoughts?
Getting to know R
http://langindex.sourceforge.net/#categ
Free and very well supported online
R is responsive, up to date, and flexible: Data Science vs. Statistics
1) Find the IST 380 course webpage
www.cs.hmc.edu/~dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380
Try it!
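As a rough guide to what those three commands do at the prompt, here is an annotated version. The name `draws` is a hypothetical addition so the random values can be inspected; the slide's own commands are the multiplication, the `rnorm(10)` call, and the assignment to `x`.

```r
6 * 7               # arithmetic at the prompt; R prints [1] 42
draws <- rnorm(10)  # ten random draws from a standard normal distribution
draws               # the values differ on every run
x <- 380            # assignment with <- prints nothing
x                   # typing a name prints its value: [1] 380
```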
Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5-minute example session"
3) Try out the commands in section 2.2 to get started
4) When you finish, save your session and submit it!
This is problem 1 this week
Saving your session
1) Create a folder named hw1, perhaps on your desktop
2) Use Save to file (Windows) or Save as (Mac) in order to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!
Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password to something you know, and submit hw1.zip
5) You can submit again; all copies are saved
You've completed Problem 1!
Reflection
Assignment?
Creating a vector?
Printing?
Average and standard deviation?
Comments?
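Each of the reflection prompts corresponds to a line or two of R. The sketch below is an illustration written for this purpose, not a copy of the session shown on the slide; the vector values are arbitrary.

```r
# Comments: everything after a # on a line is ignored by R.

x <- c(1, 2, 4)   # assignment (<-) and creating a vector (c)

print(x)          # printing explicitly
x                 # at the prompt, a bare name prints as well

mean(x)           # average
sd(x)             # (sample) standard deviation
```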
R types
You can use mode() to view the type of a variable.
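A few quick calls show what mode() reports for common kinds of values. The values themselves are arbitrary examples chosen here.

```r
mode(380)          # "numeric"
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
mode(c(1, 2, 4))   # "numeric": a vector has the mode of its elements
mode(mean)         # "function"
```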
Where's the big data?
c ~ concatenate
the colon : also creates vectors
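Both ways of building vectors can be tried directly. This is a small sketch with made-up values; `a` and `b` are hypothetical names.

```r
a <- c(3, 8, 0)    # concatenate three numbers into one vector
b <- c(a, a, 42)   # c() flattens its arguments into a single vector
length(b)          # 7 elements

1:10               # the colon builds a sequence of consecutive integers
10:1               # it counts down when the left end is larger
c(1:3, 7:9)        # sequences can be concatenated too
```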
Analyzing vectors ~ try these
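The slide's own list of commands did not survive extraction, so the following is only a guess at the kind of thing to try: a few built-in summary functions applied to an arbitrary vector `v` (a hypothetical name).

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

length(v)    # how many elements
sum(v)       # their total
median(v)    # the middle value
range(v)     # smallest and largest
summary(v)   # several summaries at once

v > 4        # a comparison gives one TRUE/FALSE per element
v[v > 4]     # and can be used to pick out matching elements
```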
NA
R uses NA to represent data that is "not available"
The function is.na( ) tests for NA
What is going on here?
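The example the slide asks about is not recoverable, but the usual surprise with NA is that it propagates through calculations. Here is a minimal sketch with invented survey answers; `states_visited` is a hypothetical name.

```r
states_visited <- c(44, NA, 12, 30)   # one neighbor did not answer

is.na(states_visited)                 # TRUE only at the missing entry
sum(is.na(states_visited))            # count of missing values

mean(states_visited)                  # NA: an unknown value makes the mean unknown
mean(states_visited, na.rm = TRUE)    # drop NAs first, then average

NA > 1                                # NA: the comparison cannot be decided
NA == NA                              # also NA, which is why is.na() exists
```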
Data frames
Irises
virginica
setosa
Subsetting iris data
df[rows,cols]
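The df[rows,cols] pattern can be tried on R's built-in iris data frame. The particular rows and columns selected below are illustrative choices, and `setosa_petals` is a hypothetical name.

```r
head(iris)                      # first six rows of the built-in data frame

iris[1:5, ]                     # rows 1-5, all columns (empty cols = everything)
iris[, "Sepal.Length"]          # all rows, one column (returned as a vector)
iris[1:3, c("Species", "Petal.Width")]   # rows and columns together

# A logical condition in the rows position keeps only matching rows.
setosa_petals <- iris[iris$Species == "setosa",
                      c("Petal.Length", "Petal.Width")]
nrow(setosa_petals)             # number of setosa rows
```

Leaving either side of the comma empty means "keep all of them".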
Lab
The 2nd part of each class meeting is dedicated to lab work.
I welcome you to stay for the lab, but it is not required.
Today's lab:
Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.
This is a nice reinforcement of vectors, introduction to data frames, and a look at the graphics that R supports.
Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and data analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early chapters
This is a fuller background on R and the field of data science
(submit your console session for both of these)
Lab!
CS vs. IS and IT?
Figure axes: greater integration, system-wide issues; smaller details, machine specifics
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
Where will IS go?
Where will IT go?
The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam