RPubs Data Processing with dplyr & tidyr
Data Processing with dplyr & tidyr
Brad Boehmke
February 13th, 2015
Introduction
  Analytic Process
  Data Manipulation
  Why Use tidyr & dplyr
%>% Operator
tidyr Operations
  gather() function:
  separate() function:
  unite() function:
  spread() function:
dplyr Operations
  select() function:
  filter() function:
  group_by() function:
  summarise() function:
  arrange() function:
  join() functions:
  mutate() function:
Additional Resources
This tutorial can be accessed at http://rpubs.com/bradleyboehmke/data_wrangling
Introduction
Analytic Process
Analysts tend to follow 4 fundamental processes to turn data into understanding, knowledge & insight:
1. Data manipulation
2. Data visualization
3. Statistical analysis/modeling
4. Deployment of results
10/27/2016
This tutorial will focus on data manipulation.
Data Manipulation
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson, 2003).
Well-structured data serves two purposes:
1. Makes data suitable for software processing, whether that be mathematical functions, visualization, etc.
2. Reveals information and insights
Hadley Wickham's paper on Tidy Data provides a great explanation behind the concept of "tidy data".
Why Use tidyr & dplyr
Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding. Their inability to easily flow together leads to difficult-to-read nested functions and/or choppy code.
RStudio is driving a lot of new packages to collate data management tasks and better integrate them with other analysis activities. This work is led by Hadley Wickham and the RStudio team: Garrett Grolemund, Winston Chang and Yihui Xie, among others.
As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways, which leads to:
More efficient code
Easier to remember syntax
Easier to read syntax
Packages Utilized
library(dplyr)
library(tidyr)
The tidyr and dplyr packages provide fundamental functions for Cleaning, Processing, & Manipulating Data
tidyr
gather()
spread()
separate()
unite()
dplyr
select()
filter()
group_by()
summarise()
arrange()
join()
mutate()
%>% Operator
Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.
This operator will forward a value, or the result of an expression, into the next function call/expression. For instance, a function to filter data can be written as:
filter(data, variable == numeric_value)
or
data %>% filter(variable == numeric_value)
The advantage shows once several functions are chained. Filtering, summarising and then ordering the result can be written in three equivalent ways:
Nested Option:
arrange(
  summarize(
    filter(data, variable == numeric_value),
    Total = sum(variable)
  ),
  desc(Total)
)
or
Multiple Object Option:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
or
%>% Option:
data %>%
  filter(variable == numeric_value) %>%
  summarise(Total = sum(variable)) %>%
  arrange(desc(Total))
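To see that the nested and piped forms really are interchangeable, here is a minimal sketch on a small made-up data frame. The `sales` data and its column names are hypothetical and not part of the tutorial, and a `group_by()` step is added so that the final ordering has more than one row to sort.

```r
library(dplyr)

# Hypothetical toy data; names and values are illustrative only
sales <- data.frame(region = c("East", "East", "West", "West"),
                    units  = c(10, 20, 5, 40),
                    stringsAsFactors = FALSE)

# Nested form: has to be read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(sales, units > 5), region),
    Total = sum(units)
  ),
  desc(Total)
)

# Piped form: reads top to bottom in the order the steps are applied
piped <- sales %>%
  filter(units > 5) %>%
  group_by(region) %>%
  summarise(Total = sum(units)) %>%
  arrange(desc(Total))

# Both objects hold the same rows in the same order
identical(as.data.frame(nested), as.data.frame(piped))
```

The piped version avoids both the deep nesting and the throwaway intermediate objects, which is the readability argument made above.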
tidyr Operations
There are four fundamental functions of data tidying:
gather() takes multiple columns and gathers them into key-value pairs: it makes "wide" data longer
spread() takes two columns (key & value) and spreads them into multiple columns: it makes "long" data wider
separate() splits a single column into multiple columns
unite() combines multiple columns into a single column
gather() function:
Objective: Reshaping wide format to long format
Description: There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
Complement to: spread()
Function: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as: data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)
Arguments:
  data: data frame
  key: column name representing new variable
  value: column name representing variable values
  ...: names of columns to gather (or not gather)
  na.rm: option to remove observations with missing values (represented by NAs)
  convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
Example
We'll start with the following data set:
## Source: local data frame [12 x 6]
##
##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
## 4      1 2009    10    14    20    16
## 5      2 2006    12    13    25    18
## 6      2 2007    16    14    21    19
## 7      2 2008    13    11    29    15
## 8      2 2009    23    20    26    20
## 9      3 2006    11    12    22    16
## 10     3 2007    13    11    27    21
## 11     3 2008    17    12    23    19
## 12     3 2009    14     9    31    24
This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To restructure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.
long_DF <- DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
head(long_DF, 24)  # note, for brevity, I only show the data for the first two years
## Source: local data frame [24 x 4]
##
##    Group Year Quarter Revenue
## 1      1 2006   Qtr.1      15
## 2      1 2007   Qtr.1      12
## 3      1 2008   Qtr.1      22
## 4      1 2009   Qtr.1      10
## 5      2 2006   Qtr.1      12
## 6      2 2007   Qtr.1      16
## 7      2 2008   Qtr.1      13
## 8      2 2009   Qtr.1      23
## 9      3 2006   Qtr.1      11
## 10     3 2007   Qtr.1      13
## ..   ...  ...     ...     ...
These all produce the same results:
DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
DF %>% gather(Quarter, Revenue, -Group, -Year)
DF %>% gather(Quarter, Revenue, 3:6)
DF %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4)
Also note that if you do not supply arguments for na.rm or convert, the default values are used.
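The equivalence of these column specifications can be checked directly. The sketch below retypes only the first four rows of the example data by hand (an assumption made to keep it self-contained) and compares three of the calls.

```r
library(tidyr)

# First four rows of the example data, retyped by hand from the table above
DF <- data.frame(Group = c(1, 1, 1, 1),
                 Year  = c(2006, 2007, 2008, 2009),
                 Qtr.1 = c(15, 12, 22, 10),
                 Qtr.2 = c(16, 13, 22, 14),
                 Qtr.3 = c(19, 27, 24, 20),
                 Qtr.4 = c(17, 23, 20, 16))

by_range    <- gather(DF, Quarter, Revenue, Qtr.1:Qtr.4)    # name range
by_exclude  <- gather(DF, Quarter, Revenue, -Group, -Year)  # everything except these
by_position <- gather(DF, Quarter, Revenue, 3:6)            # column positions

# 4 rows x 4 quarter columns become 16 rows of key-value pairs
nrow(by_range)
isTRUE(all.equal(by_range, by_exclude))
isTRUE(all.equal(by_range, by_position))
```

In tidyr 1.0.0 and later, gather() still works but is marked as superseded by pivot_longer(); the calls shown in this tutorial are unaffected.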
separate() function:
Objective: Splitting a single variable into two
Description: Many times a single column variable will capture multiple variables, or even parts of a variable you just don't care about. Some examples include:
##   Grp_Ind    Yr_Mo       City_State        First_Last Extra_variable
## 1     1.a 2006_Jan      Dayton (OH) George Washington   XX01person_1
## 2     1.b 2006_Feb Grand Forks (ND)        John Adams   XX02person_2
## 3     1.c 2006_Mar       Fargo (ND)  Thomas Jefferson   XX03person_3
## 4     2.a 2007_Jan   Rochester (MN)     James Madison   XX04person_4
## 5     2.b 2007_Feb     Dubuque (IA)      James Monroe   XX05person_5
## 6     2.c 2007_Mar Ft. Collins (CO)        John Adams   XX06person_6
## 7     3.a 2008_Jan   Lake City (MN)    Andrew Jackson   XX07person_7
## 8     3.b 2008_Feb    Rushford (MN)  Martin Van Buren   XX08person_8
## 9     3.c 2008_Mar          Unknown  William Harrison   XX09person_9
In each of these cases, our objective may be to separate characters within the variable string. This can be accomplished using the separate() function, which turns a single character column into multiple columns.
Complement to: unite()
Function: separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
Same as: data %>% separate(col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
Arguments:
  data: data frame
  col: column name representing current variable
  into: names of variables representing new variables
  sep: how to separate current variable (char, num, or symbol); by default the split is made at any run of non-alphanumeric characters
  remove: if TRUE, remove input column from output data frame
  convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
Example
We can go back to our long_DF data frame we created above, in which we may desire to clean up or separate the Quarter variable.
## Source: local data frame [6 x 4]
##
##   Group Year Quarter Revenue
## 1     1 2006   Qtr.1      15
## 2     1 2007   Qtr.1      12
## 3     1 2008   Qtr.1      22
## 4     1 2009   Qtr.1      10
## 5     2 2006   Qtr.1      12
## 6     2 2007   Qtr.1      16
separate_DF <- long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
head(separate_DF, 10)
## Source: local data frame [10 x 5]
##
##    Group Year Time_Interval Interval_ID Revenue
## 1      1 2006           Qtr           1      15
## 2      1 2007           Qtr           1      12
## 3      1 2008           Qtr           1      22
## 4      1 2009           Qtr           1      10
## 5      2 2006           Qtr           1      12
## 6      2 2007           Qtr           1      16
## 7      2 2008           Qtr           1      13
## 8      2 2009           Qtr           1      23
## 9      3 2006           Qtr           1      11
## 10     3 2007           Qtr           1      13
These produce the same results:
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.")
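A small runnable sketch of the two calls, using two rows of long_DF retyped by hand. It also illustrates the convert argument, which the example above leaves at its default; that extra call is an addition for illustration rather than part of the tutorial.

```r
library(tidyr)

# Two rows of long_DF, retyped by hand from the output above
long_DF <- data.frame(Group   = c(1, 1),
                      Year    = c(2006, 2007),
                      Quarter = c("Qtr.1", "Qtr.1"),
                      Revenue = c(15, 12),
                      stringsAsFactors = FALSE)

# Default separator: split at any run of non-alphanumeric characters
default_sep <- separate(long_DF, Quarter, c("Time_Interval", "Interval_ID"))

# Explicit separator: a literal dot, escaped because sep is a regular expression
explicit_sep <- separate(long_DF, Quarter, c("Time_Interval", "Interval_ID"),
                         sep = "\\.")

# convert = TRUE turns the character "1" into an integer
converted <- separate(long_DF, Quarter, c("Time_Interval", "Interval_ID"),
                      convert = TRUE)

identical(default_sep, explicit_sep)
class(default_sep$Interval_ID)
class(converted$Interval_ID)
```

Without convert = TRUE the new columns stay character, which matters if Interval_ID is later used for arithmetic or sorting.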
unite() function:
Objective: Merging two variables into one
Description: There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.
Complement to: separate()
Function: unite(data, col, ..., sep = "_", remove = TRUE)
Same as: data %>% unite(col, ..., sep = "_", remove = TRUE)
Arguments:
  data: data frame
  col: column name of new "merged" column
  ...: names of columns to merge
  sep: separator to use between merged values
  remove: if TRUE, remove input column from output data frame
Example
Using the separate_DF data frame we created above, we can re-unite the Time_Interval and Interval_ID variables we created and re-create the original Quarter variable we had in the long_DF data frame.
unite_DF <- separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")
head(unite_DF, 10)
## Source: local data frame [10 x 4]
##
##    Group Year Quarter Revenue
## 1      1 2006   Qtr.1      15
## 2      1 2007   Qtr.1      12
## 3      1 2008   Qtr.1      22
## 4      1 2009   Qtr.1      10
## 5      2 2006   Qtr.1      12
## 6      2 2007   Qtr.1      16
## 7      2 2008   Qtr.1      13
## 8      2 2009   Qtr.1      23
## 9      3 2006   Qtr.1      11
## 10     3 2007   Qtr.1      13
These produce the same results:
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_")
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID)
If no separator is identified, "_" will automatically be used.
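The effect of the sep and remove arguments can be seen on a two-row version of separate_DF, retyped by hand here so the sketch runs on its own.

```r
library(tidyr)

# Two rows of separate_DF, retyped by hand from the output above
separate_DF <- data.frame(Group         = c(1, 1),
                          Year          = c(2006, 2007),
                          Time_Interval = c("Qtr", "Qtr"),
                          Interval_ID   = c("1", "1"),
                          Revenue       = c(15, 12),
                          stringsAsFactors = FALSE)

with_dot     <- unite(separate_DF, Quarter, Time_Interval, Interval_ID, sep = ".")
with_default <- unite(separate_DF, Quarter, Time_Interval, Interval_ID)

# remove = FALSE keeps the two input columns next to the merged one
kept <- unite(separate_DF, Quarter, Time_Interval, Interval_ID,
              sep = ".", remove = FALSE)

with_dot$Quarter      # joined with "."
with_default$Quarter  # joined with the default "_"
names(kept)
```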
spread() function:
Objective: Reshaping long format to wide format
Description: There are times when we are required to turn long formatted data into wide formatted data. The spread() function spreads a key-value pair across multiple columns.
Complement to: gather()
Function: spread(data, key, value, fill = NA, convert = FALSE)
Same as: data %>% spread(key, value, fill = NA, convert = FALSE)
Arguments:
  data: data frame
  key: column values to convert to multiple columns
  value: single column values to convert to multiple columns' values
  fill: if there isn't a value for every combination of the other variables and the key column, this value will be substituted
  convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
Example
wide_DF <- unite_DF %>% spread(Quarter, Revenue)
head(wide_DF, 24)
## Source: local data frame [12 x 6]
##
##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
## 4      1 2009    10    14    20    16
## 5      2 2006    12    13    25    18
## 6      2 2007    16    14    21    19
## 7      2 2008    13    11    29    15
## 8      2 2009    23    20    26    20
## 9      3 2006    11    12    22    16
## 10     3 2007    13    11    27    21
## 11     3 2008    17    12    23    19
## 12     3 2009    14     9    31    24
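Since spread() is the complement of gather(), a round trip should return the original wide table. The sketch below checks that on a four-row subset with two quarter columns (retyped by hand), and also shows what fill does when a key-value combination is missing; the dropped row is an illustrative assumption.

```r
library(tidyr)

# Four rows and two quarter columns of the example data, retyped by hand
DF <- data.frame(Group = c(1, 1, 2, 2),
                 Year  = c(2006, 2007, 2006, 2007),
                 Qtr.1 = c(15, 12, 12, 16),
                 Qtr.2 = c(16, 13, 13, 14))

long_DF <- gather(DF, Quarter, Revenue, Qtr.1:Qtr.2)  # wide -> long
wide_DF <- spread(long_DF, Quarter, Revenue)          # long -> wide again

# Drop one observation (Group 1, 2006, Qtr.1) to create a missing combination,
# then fill the resulting gap with 0 instead of the default NA
filled <- spread(long_DF[-1, ], Quarter, Revenue, fill = 0)

wide_DF
filled
```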
dplyr Operations
There are seven fundamental functions of data transformation:
select() selecting variables
filter() provides basic filtering capabilities
group_by() groups data by categorical levels
summarise() summarise data by functions of choice
arrange() ordering data
join() joining separate data frames
mutate() create new variables
For these examples we'll use the following census data, which includes the K-12 public school expenditures by state. This data frame currently is 50 x 16 and includes expenditure data for 14 unique years.
Left half of data:
##   Division      State   X1980    X1990    X2000    X2001    X2002    X2003
## 1        6    Alabama 1146713  2275233  4176082  4354794  4444390  4657643
## 2        9     Alaska  377947   828051  1183499  1229036  1284854  1326226
## 3        8    Arizona  949753  2258660  4288739  4846105  5395814  5892227
## 4        7   Arkansas  666949  1404545  2380331  2505179  2822877  2923401
## 5        9 California 9172158 21485782 38129479 42908787 46265544 47983402
## 6        8   Colorado 1243049  2451833  4401010  4758173  5151003  5551506
Right half of data:
##      X2004    X2005    X2006    X2007    X2008    X2009    X2010    X2011
## 1  4812479  5164406  5699076  6245031  6832439  6683843  6670517  6592925
## 2  1354846  1442269  1529645  1634316  1918375  2007319  2084019  2201270
## 3  6071785  6579957  7130341  7815720  8403221  8726755  8482552  8340211
## 4  3109644  3546999  3808011  3997701  4156368  4240839  4459910  4578136
## 5 49215866 50918654 53436103 57352599 61570555 60080929 58248662 57526835
## 6  5666191  5994440  6368289  6579053  7338766  7187267  7429302  7409462
select() function:
Objective: Reduce data frame size to only desired variables for current task
Description: When working with a sizable data frame, often we desire to only assess specific variables. The select() function allows you to select and/or rename variables.
Function: select(data, ...)
Same as: data %>% select(...)
Arguments:
  data: data frame
  ...: call variables by name or by function
Special functions:
  starts_with(x, ignore.case = TRUE): names starts with x
  ends_with(x, ignore.case = TRUE): names ends in x
  contains(x, ignore.case = TRUE): selects all variables whose name contains x
  matches(x, ignore.case = TRUE): selects all variables whose name matches the regular expression x
Example
Let's say our goal is to only assess the 5 most recent years' worth of expenditure data. Applying the select() function we can select only the variables of concern.
sub.exp <- expenditures %>% select(Division, State, X2007:X2011)
head(sub.exp)  # for brevity only display first 6 rows
##   Division      State    X2007    X2008    X2009    X2010    X2011
## 1        6    Alabama  6245031  6832439  6683843  6670517  6592925
## 2        9     Alaska  1634316  1918375  2007319  2084019  2201270
## 3        8    Arizona  7815720  8403221  8726755  8482552  8340211
## 4        7   Arkansas  3997701  4156368  4240839  4459910  4578136
## 5        9 California 57352599 61570555 60080929 58248662 57526835
## 6        8   Colorado  6579053  7338766  7187267  7429302  7409462
The special functions can be used in the same way. For example, selecting every variable whose name starts with "X" returns only the year columns:
##     X1980    X1990    X2000    X2001    X2002    X2003    X2004    X2005
## 1 1146713  2275233  4176082  4354794  4444390  4657643  4812479  5164406
## 2  377947   828051  1183499  1229036  1284854  1326226  1354846  1442269
## 3  949753  2258660  4288739  4846105  5395814  5892227  6071785  6579957
## 4  666949  1404545  2380331  2505179  2822877  2923401  3109644  3546999
## 5 9172158 21485782 38129479 42908787 46265544 47983402 49215866 50918654
## 6 1243049  2451833  4401010  4758173  5151003  5551506  5666191  5994440
##      X2006    X2007    X2008    X2009    X2010    X2011
## 1  5699076  6245031  6832439  6683843  6670517  6592925
## 2  1529645  1634316  1918375  2007319  2084019  2201270
## 3  7130341  7815720  8403221  8726755  8482552  8340211
## 4  3808011  3997701  4156368  4240839  4459910  4578136
## 5 53436103 57352599 61570555 60080929 58248662 57526835
## 6  6368289  6579053  7338766  7187267  7429302  7409462
You can also de-select variables by using "-" prior to a name or function. The following produces the inverse of the functions above:
expenditures %>% select(-X1980:-X2006)
expenditures %>% select(-starts_with("X"))
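Selection and de-selection can be compared side by side on a cut-down stand-in for the expenditures data. The two-row, three-year data frame below is retyped from the tables above and is an assumption made only so the sketch runs on its own.

```r
library(dplyr)

# Two states and three years of the census data, retyped by hand
expenditures <- data.frame(Division = c(6, 9),
                           State    = c("Alabama", "Alaska"),
                           X2006    = c(5699076, 1529645),
                           X2007    = c(6245031, 1634316),
                           X2008    = c(6832439, 1918375),
                           stringsAsFactors = FALSE)

recent        <- expenditures %>% select(Division, State, X2007:X2008)
years_only    <- expenditures %>% select(starts_with("X"))
no_years      <- expenditures %>% select(-starts_with("X"))
dropped_range <- expenditures %>% select(-(X2006:X2007))

names(recent)
names(years_only)
names(no_years)
names(dropped_range)
```

Wrapping the range in parentheses before negating it, as in -(X2006:X2007), is one unambiguous way to drop a run of adjacent columns.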
filter() function:
Objective: Reduce rows/observations with matching conditions
Description: Filtering data is a common task to identify/select observations in which a particular variable matches a specific value/condition. The filter() function provides this capability.
Function: filter(data, ...)
Same as: data %>% filter(...)
Arguments:
  data: data frame
  ...: conditions to be met
Examples
Continuing with our sub.exp data frame, which includes only the recent 5 years' worth of expenditures, we can filter by Division:
sub.exp %>% filter(Division == 3)
##   Division     State    X2007    X2008    X2009    X2010    X2011
## 1        3  Illinois 20326591 21874484 23495271 24695773 24554467
## 2        3   Indiana  9497077  9281709  9680895  9921243  9687949
## 3        3  Michigan 17013259 17053521 17217584 17227515 16786444
## 4        3      Ohio 18251361 18892374 19387318 19801670 19988921
## 5        3 Wisconsin  9029660  9366134  9696228  9966244 10333016
Several conditions can be combined. For instance, we can filter for Division 3 and expenditures in 2011 that were greater than $10B. This results in Indiana, which is in Division 3, being excluded since its expenditures were < $10B (FYI the raw census data are reported in units of $1,000).
sub.exp %>% filter(Division == 3, X2011 > 10000000)  # Raw census data are in units of $1,000
##   Division     State    X2007    X2008    X2009    X2010    X2011
## 1        3  Illinois 20326591 21874484 23495271 24695773 24554467
## 2        3  Michigan 17013259 17053521 17217584 17227515 16786444
## 3        3      Ohio 18251361 18892374 19387318 19801670 19988921
## 4        3 Wisconsin  9029660  9366134  9696228  9966244 10333016
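The two filters above can be reproduced on a reduced copy of sub.exp. Only the X2011 column is retyped here, and one Division 6 row is included so there is something to filter out. The `&` form is shown because comma-separated conditions in filter() are combined with a logical AND.

```r
library(dplyr)

# Division 3 states plus one Division 6 state, X2011 only, retyped by hand
sub.exp <- data.frame(Division = c(3, 3, 3, 3, 3, 6),
                      State    = c("Illinois", "Indiana", "Michigan",
                                   "Ohio", "Wisconsin", "Alabama"),
                      X2011    = c(24554467, 9687949, 16786444,
                                   19988921, 10333016, 6592925),
                      stringsAsFactors = FALSE)

div3     <- sub.exp %>% filter(Division == 3)
big_div3 <- sub.exp %>% filter(Division == 3, X2011 > 10000000)
with_and <- sub.exp %>% filter(Division == 3 & X2011 > 10000000)

big_div3$State                 # Indiana is dropped: 9,687,949 is below the cut-off
identical(big_div3, with_and)  # comma and & behave the same here
```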
group_by() function:
Objective: Group data by categorical variables
Description: Often, observations are nested within groups or categories and our goal is to perform statistical analysis both at the observation level and also at the group level. The group_by() function allows us to create these categorical groupings.
Function: group_by(data, ...)
Same as: data %>% group_by(...)
Arguments:
  data: data frame
  ...: variables to group_by
*Use ungroup(x) to remove groups
group.exp <- sub.exp %>% group_by(Division)
head(group.exp)
## Source: local data frame [6 x 7]
## Groups: Division
##
##   Division      State    X2007    X2008    X2009    X2010    X2011
## 1        6    Alabama  6245031  6832439  6683843  6670517  6592925
## 2        9     Alaska  1634316  1918375  2007319  2084019  2201270
## 3        8    Arizona  7815720  8403221  8726755  8482552  8340211
## 4        7   Arkansas  3997701  4156368  4240839  4459910  4578136
## 5        9 California 57352599 61570555 60080929 58248662 57526835
## 6        8   Colorado  6579053  7338766  7187267  7429302  7409462
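Grouping does not change the data itself; it only attaches grouping information that later verbs use. A minimal sketch, using six states with the X2011 column retyped by hand, inspects that information with dplyr's group_vars() and n_groups() helpers and then removes it again.

```r
library(dplyr)

# Six states with their Division and 2011 expenditure, retyped by hand
sub.exp <- data.frame(Division = c(6, 9, 8, 7, 9, 8),
                      State    = c("Alabama", "Alaska", "Arizona",
                                   "Arkansas", "California", "Colorado"),
                      X2011    = c(6592925, 2201270, 8340211,
                                   4578136, 57526835, 7409462),
                      stringsAsFactors = FALSE)

group.exp <- sub.exp %>% group_by(Division)

group_vars(group.exp)  # the grouping variable(s)
n_groups(group.exp)    # number of distinct Divisions in these rows

# ungroup() returns the same rows with the grouping removed
ungrouped <- ungroup(group.exp)
group_vars(ungrouped)
```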
summarise() function:
Objective: Perform summary statistics on variables
Description: Obviously the goal of all this data wrangling is to be able to perform statistical analysis on our data. The summarise() function allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.
Function: summarise(data, ...)
Same as: data %>% summarise(...)
Arguments:
  data: data frame
  ...: Name-value pairs of summary functions like min(), mean(), max() etc.
*Developer is from New Zealand...can use "summarise(x)" or "summarize(x)"
Examples
Let's get the mean expenditure value across all states in 2011:
sub.exp %>% summarise(Mean_2011 = mean(X2011))
##   Mean_2011
## 1  10513678
Not too bad, let's get some more summary stats:
sub.exp %>% summarise(Min = min(X2011, na.rm = TRUE),
                      Median = median(X2011, na.rm = TRUE),
                      Mean = mean(X2011, na.rm = TRUE),
                      Var = var(X2011, na.rm = TRUE),
                      SD = sd(X2011, na.rm = TRUE),
                      Max = max(X2011, na.rm = TRUE),
                      N = n())
##       Min  Median     Mean         Var       SD      Max  N
## 1 1049772 6527404 10513678 1.48619e+14 12190938 57526835 50
This information is useful, but being able to compare summary statistics at multiple levels is when you really start to gather some insights. This is where the group_by() function comes in. First, let's group by Division and see how the different regions compared in 2010 and 2011.
sub.exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE))
## Source: local data frame [9 x 3]
##
##   Division Mean_2010 Mean_2011
## 1        1   5121003   5222277
## 2        2  32415457  32877923
## 3        3  16322489  16270159
## 4        4   4672332   4672687
## 5        5  10975194  11023526
## 6        6   6161967   6267490
## 7        7  14916843  15000139
## 8        8   3894003   3882159
## 9        9  15540681  15468173
Now we're starting to see some differences pop out. How about we compare states within a Division? We can start to apply multiple functions we've learned so far to get the 5-year average for each state within Division 3.
sub.exp %>%
  gather(Year, Expenditure, X2007:X2011) %>%   # this turns our wide data to a long format
  filter(Division == 3) %>%                    # we only want to compare states within Division 3
  group_by(State) %>%                          # we want to summarize data at the state level
  summarise(Mean = mean(Expenditure),
            SD = sd(Expenditure))
## Source: local data frame [5 x 3]
##
##       State     Mean        SD
## 1  Illinois 22989317 1867527.7
## 2   Indiana  9613775  238971.6
## 3  Michigan 17059665  180245.0
## 4      Ohio 19264329  705930.2
## 5 Wisconsin  9678256  507461.2
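The same gather, group and summarise chain can be run on a two-state, two-year subset. The values are retyped by hand from the Division 3 output above; the reduced size is an assumption for brevity, so the means differ from the 5-year figures shown.

```r
library(dplyr)
library(tidyr)

# Indiana and Ohio, 2010 and 2011 only, retyped by hand
sub.exp <- data.frame(Division = c(3, 3),
                      State    = c("Indiana", "Ohio"),
                      X2010    = c(9921243, 19801670),
                      X2011    = c(9687949, 19988921),
                      stringsAsFactors = FALSE)

# One summary row for the whole data frame
overall <- sub.exp %>% summarise(Mean_2011 = mean(X2011))

# One summary row per state after reshaping to long format
state_means <- sub.exp %>%
  gather(Year, Expenditure, X2010:X2011) %>%
  group_by(State) %>%
  summarise(Mean = mean(Expenditure),
            N    = n())

overall
state_means
```

Without group_by(), summarise() collapses everything to a single row; with it, the result has one row per group.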
arrange() function:
Objective: Order variable values
Description: Often, we desire to view observations in rank order for a particular variable(s). The arrange() function allows us to order data by variables in ascending or descending order.
Function: arrange(data, ...)
Same as: data %>% arrange(...)
Arguments:
  data: data frame
  ...: Variable(s) to order
*use desc(x) to sort variable in descending order
Examples
For instance, in the summarise example we compared the mean expenditures for each division. We could apply the arrange() function at the end to order the divisions from lowest to highest expenditure for 2011. This makes it easier to see the significant differences between Divisions 8, 4, 1 & 6 as compared to Divisions 5, 7, 9, 3 & 2.
sub.exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(Mean_2011)
## Source: local data frame [9 x 3]
##
##   Division Mean_2010 Mean_2011
## 1        8   3894003   3882159
## 2        4   4672332   4672687
## 3        1   5121003   5222277
## 4        6   6161967   6267490
## 5        5  10975194  11023526
## 6        7  14916843  15000139
## 7        9  15540681  15468173
## 8        3  16322489  16270159
## 9        2  32415457  32877923
We can also apply a descending argument to rank-order from highest to lowest. The following shows the same data but in descending order by applying desc() within the arrange() function.
sub.exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(desc(Mean_2011))
## Source: local data frame [9 x 3]
##
##   Division Mean_2010 Mean_2011
## 1        2  32415457  32877923
## 2        3  16322489  16270159
## 3        9  15540681  15468173
## 4        7  14916843  15000139
## 5        5  10975194  11023526
## 6        6   6161967   6267490
## 7        1   5121003   5222277
## 8        4   4672332   4672687
## 9        8   3894003   3882159
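Both orderings can be reproduced from the nine division means alone. The sketch below retypes the Mean_2011 column by hand rather than recomputing it from the full census data.

```r
library(dplyr)

# 2011 division means, retyped by hand from the summary table above
div_means <- data.frame(Division  = 1:9,
                        Mean_2011 = c(5222277, 32877923, 16270159,
                                      4672687, 11023526, 6267490,
                                      15000139, 3882159, 15468173))

ascending  <- div_means %>% arrange(Mean_2011)        # lowest first
descending <- div_means %>% arrange(desc(Mean_2011))  # highest first

ascending$Division
descending$Division
```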
join() functions:
Objective: Join two datasets together
Description: Often we have separate data frames that can have common and differing variables for similar observations and we wish to join these data frames together. The multiple xxx_join() functions provide multiple ways to join data frames.
Function:
  inner_join(x, y, by = NULL)
  left_join(x, y, by = NULL)
  semi_join(x, y, by = NULL)
  anti_join(x, y, by = NULL)
Arguments:
  x, y: data frames to join
  by: a character vector of variables to join by. If NULL, the default, join will do a natural join, using all variables with common names across the two tables.
Example
Our public education expenditure data represents then-year dollars. To make any accurate assessments of longitudinal trends and comparisons we need to adjust for inflation. I have the following data frame, which provides inflation adjustment factors for base-year 2012 dollars (obviously I should use 2014 values but I had these easily accessible and it only serves for illustrative purposes).
##    Year  Annual Inflation
## 28 2007 207.342 0.9030811
## 29 2008 215.303 0.9377553
## 30 2009 214.537 0.9344190
## 31 2010 218.056 0.9497461
## 32 2011 224.939 0.9797251
## 33 2012 229.594 1.0000000
To join to my expenditure data I obviously need to get my expenditure data in the proper form that allows me to join these two data frames. I can apply the following functions to accomplish this:
long.exp <- sub.exp %>%
  gather(Year, Expenditure, X2007:X2011) %>%          # turn to long format
  separate(Year, into = c("x", "Year"), sep = "X") %>% # separate "X" from year value
  select(-x)                                           # remove "x" column
long.exp$Year <- as.numeric(long.exp$Year)  # convert from character to numeric
head(long.exp)
##   Division      State Year Expenditure
## 1        6    Alabama 2007     6245031
## 2        9     Alaska 2007     1634316
## 3        8    Arizona 2007     7815720
## 4        7   Arkansas 2007     3997701
## 5        9 California 2007    57352599
## 6        8   Colorado 2007     6579053
head(join.exp)
##   Year Division      State Expenditure  Annual Inflation
## 1 2007        6    Alabama     6245031 207.342 0.9030811
## 2 2007        9     Alaska     1634316 207.342 0.9030811
## 3 2007        8    Arizona     7815720 207.342 0.9030811
## 4 2007        7   Arkansas     3997701 207.342 0.9030811
## 5 2007        9 California    57352599 207.342 0.9030811
## 6 2007        8   Colorado     6579053 207.342 0.9030811
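The call that builds join.exp is not shown in the text above, so the following is only a sketch of one way to get a table with the same columns. The `inflation` object name, the use of left_join(), and the sub() call that strips the "X" prefix (swapped in for the separate() step) are all assumptions; the values are retyped from the tables above. Column order may differ from the output shown.

```r
library(dplyr)
library(tidyr)

# Two states and two years of sub.exp, retyped by hand
sub.exp <- data.frame(Division = c(6, 9),
                      State    = c("Alabama", "Alaska"),
                      X2007    = c(6245031, 1634316),
                      X2008    = c(6832439, 1918375),
                      stringsAsFactors = FALSE)

# Hypothetical name for the inflation table; values retyped from above
inflation <- data.frame(Year      = c(2007, 2008),
                        Annual    = c(207.342, 215.303),
                        Inflation = c(0.9030811, 0.9377553))

# Long format, with "X2007" converted to the number 2007
long.exp <- sub.exp %>%
  gather(Year, Expenditure, X2007:X2008) %>%
  mutate(Year = as.numeric(sub("^X", "", Year)))

# Keep every expenditure row and attach the matching inflation factors
join.exp <- left_join(long.exp, inflation, by = "Year")
head(join.exp)
```

The join only works once Year has the same type in both data frames, which is why the character-to-numeric conversion matters.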
To illustrate the other joining methods we can use these two simple data frames:
Data frame "x":
##     name instrument
## 1   John     guitar
## 2   Paul       bass
## 3 George     guitar
## 4  Ringo      drums
## 5 Stuart       bass
## 6   Pete      drums
Data frame "y":
##     name  band
## 1   John  TRUE
## 2   Paul  TRUE
## 3 George  TRUE
## 4  Ringo  TRUE
## 5  Brian FALSE
inner_join(): Include only rows in both x and y that have a matching value
inner_join(x, y)
##     name instrument band
## 1   John     guitar TRUE
## 2   Paul       bass TRUE
## 3 George     guitar TRUE
## 4  Ringo      drums TRUE
left_join(): Include all of x, and matching rows of y
left_join(x, y)
##     name instrument band
## 1   John     guitar TRUE
## 2   Paul       bass TRUE
## 3 George     guitar TRUE
## 4  Ringo      drums TRUE
## 5 Stuart       bass <NA>
## 6   Pete      drums <NA>
semi_join(): Include rows of x that match y but only keep the columns from x
semi_join(x, y)
##     name instrument
## 1   John     guitar
## 2   Paul       bass
## 3 George     guitar
## 4  Ringo      drums
anti_join(): Opposite of semi_join; include only the rows of x that have no match in y
anti_join(x, y)
##     name instrument
## 1   Pete      drums
## 2 Stuart       bass
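The four joins can be run end to end by retyping the two small data frames. The band column is stored as character here, which is an assumption based on the <NA> shown in the left_join output; by = "name" is spelled out to make the join key explicit.

```r
library(dplyr)

x <- data.frame(name       = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
                instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"),
                stringsAsFactors = FALSE)

# band is assumed to be character, matching the <NA> in the printed output
y <- data.frame(name = c("John", "Paul", "George", "Ringo", "Brian"),
                band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"),
                stringsAsFactors = FALSE)

inner <- inner_join(x, y, by = "name")  # rows with a match in both
left  <- left_join(x, y, by = "name")   # all rows of x, NA where y has no match
semi  <- semi_join(x, y, by = "name")   # rows of x with a match, columns of x only
anti  <- anti_join(x, y, by = "name")   # rows of x without a match

nrow(inner)
nrow(left)
semi$name
anti$name
```

Every row of x lands in exactly one of semi_join() and anti_join(), so the two row counts add up to nrow(x). The row order returned by anti_join() may differ from the printed output above depending on the dplyr version.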
mutate() function:
Objective: Creates new variables
Description: Often we want to create a new variable that is a function of the current variables in our data frame, or even just add a new variable. The mutate() function allows us to add new variables while preserving the existing variables.
Function:
  mutate(data, ...)
Same as: data %>% mutate(...)
Arguments:
  data: data frame
  ...: Expression(s)
Examples
If we go back to our previous join.exp data frame, remember that we joined inflation rates to our non-inflation adjusted expenditures for public schools. The data frame looks like:
##   Year Division      State Expenditure  Annual Inflation
## 1 2007        6    Alabama     6245031 207.342 0.9030811
## 2 2007        9     Alaska     1634316 207.342 0.9030811
## 3 2007        8    Arizona     7815720 207.342 0.9030811
## 4 2007        7   Arkansas     3997701 207.342 0.9030811
## 5 2007        9 California    57352599 207.342 0.9030811
## 6 2007        8   Colorado     6579053 207.342 0.9030811
Adding an inflation-adjusted expenditure variable, Adj_Exp (Expenditure divided by Inflation), gives the inflation_adj data frame:
head(inflation_adj)
##   Year Division      State Expenditure  Annual Inflation  Adj_Exp
## 1 2007        6    Alabama     6245031 207.342 0.9030811  6915249
## 2 2007        9     Alaska     1634316 207.342 0.9030811  1809711
## 3 2007        8    Arizona     7815720 207.342 0.9030811  8654505
## 4 2007        7   Arkansas     3997701 207.342 0.9030811  4426735
## 5 2007        9 California    57352599 207.342 0.9030811 63507696
## 6 2007        8   Colorado     6579053 207.342 0.9030811  7285119
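The Adj_Exp values above are consistent with dividing Expenditure by Inflation. A minimal sketch of that calculation with mutate() follows; the formula is inferred from the printed numbers rather than taken from code shown in the text, and only two rows are retyped.

```r
library(dplyr)

# Two rows of join.exp, retyped by hand from the table above
join.exp <- data.frame(Year        = c(2007, 2007),
                       State       = c("Alabama", "Alaska"),
                       Expenditure = c(6245031, 1634316),
                       Inflation   = c(0.9030811, 0.9030811),
                       stringsAsFactors = FALSE)

# Assumed formula: adjusted value = then-year value / inflation factor
inflation_adj <- join.exp %>%
  mutate(Adj_Exp = Expenditure / Inflation)

# Existing columns are preserved; Adj_Exp is appended
names(inflation_adj)
round(inflation_adj$Adj_Exp)
```

Because the Inflation factor is printed to only seven decimal places, the recomputed values can differ from the table in the last digit.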
Let's say we wanted to create a variable that rank-orders state-level expenditures (inflation adjusted) for the year 2010 from the highest level of expenditures to the lowest.
rank_exp <- inflation_adj %>%
  filter(Year == 2010) %>%
  arrange(desc(Adj_Exp)) %>%
  mutate(Rank = 1:length(Adj_Exp))
head(rank_exp)
##   Year Division      State Expenditure  Annual Inflation  Adj_Exp Rank
## 1 2010        9 California    58248662 218.056 0.9497461 61330774    1
## 2 2010        2   New York    50251461 218.056 0.9497461 52910417    2
## 3 2010        7      Texas    42621886 218.056 0.9497461 44877138    3
## 4 2010        3   Illinois    24695773 218.056 0.9497461 26002501    4
## 5 2010        2 New Jersey    24261392 218.056 0.9497461 25545135    5
## 6 2010        5    Florida    23349314 218.056 0.9497461 24584797    6
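mutate() can also derive a variable from neighbouring rows, such as the year-over-year percent change (Perc_Chg) in Ohio's inflation-adjusted expenditures:
##   Year Division State Expenditure  Annual Inflation  Adj_Exp     Perc_Chg
## 1 2007        3  Ohio    18251361 207.342 0.9030811 20210102           NA
## 2 2008        3  Ohio    18892374 215.303 0.9377553 20146378 -0.003153057
## 3 2009        3  Ohio    19387318 214.537 0.9344190 20747992  0.029862103
## 4 2010        3  Ohio    19801670 218.056 0.9497461 20849436  0.004889357
## 5 2011        3  Ohio    19988921 224.939 0.9797251 20402582 -0.021432441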
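The code behind the Perc_Chg column is not shown in the text above. One way to compute a year-over-year change, assuming that is what the column represents, is with dplyr's lag(), as sketched below on the five Ohio Adj_Exp values retyped by hand.

```r
library(dplyr)

# Ohio's inflation-adjusted expenditures, retyped by hand from the table above
ohio <- data.frame(Year    = 2007:2011,
                   Adj_Exp = c(20210102, 20146378, 20747992, 20849436, 20402582))

# Assumed definition: (this year - previous year) / previous year
ohio_chg <- ohio %>%
  arrange(Year) %>%
  mutate(Perc_Chg = (Adj_Exp - dplyr::lag(Adj_Exp)) / dplyr::lag(Adj_Exp))

ohio_chg
```

The first year has no predecessor, so its Perc_Chg is NA. The 2008 and 2011 values come out negative because Adj_Exp fell in those years.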
You could also look at what percent of all US expenditures each state made up in 2011. In this case we use mutate() to take each state's inflation-adjusted expenditure and divide by the sum of the entire inflation-adjusted expenditure column. We also apply a second function within mutate() that provides the cumulative percent in rank-order. This shows that in 2011, the top 8 states with the highest expenditures represented over 50% of the total U.S. expenditures in K-12 public schools. (I remove the non-inflation adjusted Expenditure, Annual & Inflation columns so that the columns don't wrap on the screen view.)
perc.of.whole <- inflation_adj %>%
  filter(Year == 2011) %>%
  arrange(desc(Adj_Exp)) %>%
  mutate(Perc_of_Total = Adj_Exp / sum(Adj_Exp),
         Cum_Perc = cumsum(Perc_of_Total)) %>%
  select(-Expenditure, -Annual, -Inflation)
head(perc.of.whole, 8)
##   Year Division        State  Adj_Exp Perc_of_Total  Cum_Perc
## 1 2011        9   California 58717324    0.10943237 0.1094324
## 2 2011        2     New York 52575244    0.09798528 0.2074177
## 3 2011        7        Texas 43751346    0.08154005 0.2889577
## 4 2011        3     Illinois 25062609    0.04670957 0.3356673
## 5 2011        5      Florida 24364070    0.04540769 0.3810750
## 6 2011        2   New Jersey 24128484    0.04496862 0.4260436
## 7 2011        2 Pennsylvania 23971218    0.04467552 0.4707191
## 8 2011        3         Ohio 20402582    0.03802460 0.5087437
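The share and cumulative-share logic is easier to follow on round numbers. The sketch below uses four made-up states (hypothetical data, not from the census table) and adds a rank column with row_number(), a dplyr alternative to 1:length(x).

```r
library(dplyr)

# Hypothetical data with round numbers so the shares are easy to check by eye
adj <- data.frame(State   = c("A", "B", "C", "D"),
                  Adj_Exp = c(10, 40, 30, 20),
                  stringsAsFactors = FALSE)

perc <- adj %>%
  arrange(desc(Adj_Exp)) %>%                      # largest spender first
  mutate(Perc_of_Total = Adj_Exp / sum(Adj_Exp),  # share of the column total
         Cum_Perc      = cumsum(Perc_of_Total),   # running share in rank order
         Rank          = row_number())            # 1 = largest

perc
```

Because mutate() evaluates its arguments in order, Cum_Perc can refer to Perc_of_Total, which is created earlier in the same call.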
Additional Resources
This tutorial simply touches on the basics that these two packages can do. There are several other resources you can check out to learn more. In addition, much of what I have learned and, therefore, much of the content in this tutorial is simply a modified regurgitation of the wonderful resources provided by RStudio, Hadley Wickham, and Garrett Grolemund.
RStudio's Data wrangling with R and RStudio webinar
RStudio's Data wrangling GitHub repository
RStudio's Data wrangling cheat sheet
Hadley Wickham's dplyr tutorial at useR! 2014, Part 1
Hadley Wickham's dplyr tutorial at useR! 2014, Part 2
Hadley Wickham's paper on Tidy Data
Special thanks to Tom Filloon and Jason Freels for providing constructive comments while developing this tutorial.