
10/27/2016

RPubs - Data Processing with dplyr & tidyr

Data Processing with dplyr & tidyr
Brad Boehmke
February 13th, 2015
Introduction
  Analytic Process
  Data Manipulation
  Why Use tidyr & dplyr
%>% Operator
tidyr Operations
  gather() function:
  separate() function:
  unite() function:
  spread() function:
dplyr Operations
  select() function:
  filter() function:
  group_by() function:
  summarise() function:
  arrange() function:
  join() functions:
  mutate() function:
Additional Resources

This tutorial can be accessed at http://rpubs.com/bradleyboehmke/data_wrangling

Introduction
Analytic Process
Analysts tend to follow 4 fundamental processes to turn data into understanding, knowledge & insight:
1. Data manipulation
2. Data visualization
3. Statistical analysis/modeling
4. Deployment of results

This tutorial will focus on data manipulation.

Data Manipulation
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
Well structured data serves two purposes:
1. Makes data suitable for software processing, whether that be mathematical functions, visualization, etc.
2. Reveals information and insights
Hadley Wickham's paper on Tidy Data provides a great explanation behind the concept of tidy data.

Why Use tidyr & dplyr
Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together. This leads to difficult-to-read nested functions and/or choppy code.
RStudio is driving a lot of new packages to collate data management tasks and better integrate them with other analysis activities, led by Hadley Wickham and the RStudio team (Garrett Grolemund, Winston Chang, and Yihui Xie, among others).
As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways, which leads to:
- More efficient code
- Easier-to-remember syntax
- Easier-to-read syntax

Packages Utilized
library(dplyr)
library(tidyr)

The tidyr and dplyr packages provide fundamental functions for cleaning, processing, & manipulating data:
tidyr
- gather()
- spread()
- separate()
- unite()

dplyr
- select()
- filter()
- group_by()
- summarise()
- arrange()
- join()
- mutate()

%>% Operator
Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.
This operator will forward a value, or the result of an expression, into the next function call/expression. For instance, a function to filter data can be written as:
filter(data, variable == numeric_value)
or
data %>% filter(variable == numeric_value)

Both functions complete the same task, and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. For instance, if we want to filter some data, summarize it, and then order the summarized results, we would write it out as:

Nested Option:
arrange(
  summarize(
    filter(data, variable == numeric_value),
    Total = sum(variable)
  ),
  desc(Total)
)

or
Multiple Object Option:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))

or
%>% Option:
data %>%
  filter(variable == numeric_value) %>%
  summarise(Total = sum(variable)) %>%
  arrange(desc(Total))

As your function tasks get longer, the %>% operator becomes more efficient and makes your code more legible. In addition, although not covered in this tutorial, the %>% operator allows you to flow from data manipulation tasks straight into visualization functions (via ggplot and ggvis) and also into many analytic functions.
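To make the comparison concrete, here is a small runnable sketch of the nested and piped styles on a toy data frame. The data frame `sales` and its columns `region` and `units` are made-up names for illustration only, and `group_by()` (covered later in this tutorial) is added so that the final ordering step has more than one row to sort.

```r
library(dplyr)

# Hypothetical toy data; `sales`, `region`, and `units` are illustrative names.
sales <- data.frame(
  region = c("East", "East", "West", "West", "North"),
  units  = c(10, 20, 5, 40, 15),
  stringsAsFactors = FALSE
)

# Nested option: has to be read from the innermost call outward.
nested <- arrange(
  summarise(
    group_by(filter(sales, units > 5), region),
    Total = sum(units)
  ),
  desc(Total)
)

# Pipe option: the same steps, read top to bottom.
piped <- sales %>%
  filter(units > 5) %>%
  group_by(region) %>%
  summarise(Total = sum(units)) %>%
  arrange(desc(Total))

print(piped)
```

Both objects should hold the same three rows, ordered from the largest total to the smallest.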
To learn more about the %>% operator and the magrittr package, visit any of the following:
http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
http://www.r-bloggers.com/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/
http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html

tidyr Operations
There are four fundamental functions of data tidying:
- gather() takes multiple columns and gathers them into key-value pairs: it makes wide data longer
- spread() takes two columns (key & value) and spreads them into multiple columns: it makes long data wider
- separate() splits a single column into multiple columns
- unite() combines multiple columns into a single column

gather() function:
Objective: Reshaping wide format to long format
Description: There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
Complement to: spread()

Function: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as: data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)

Arguments:
data:     data frame
key:      column name representing new variable
value:    column name representing variable values
...:      names of columns to gather (or not gather)
na.rm:    option to remove observations with missing values (represented by NAs)
convert:  if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate

Example
We'll start with the following data set:
## Source: local data frame [12 x 6]
##
##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
## 4      1 2009    10    14    20    16
## 5      2 2006    12    13    25    18
## 6      2 2007    16    14    21    19
## 7      2 2008    13    11    29    15
## 8      2 2009    23    20    26    20
## 9      3 2006    11    12    22    16
## 10     3 2007    13    11    27    21
## 11     3 2008    17    12    23    19
## 12     3 2009    14     9    31    24

This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To restructure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.
long_DF <- DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
head(long_DF, 24)  # note, for brevity, I only show the data for the first two years

## Source: local data frame [24 x 4]
##
##    Group Year Quarter Revenue
## 1      1 2006   Qtr.1      15
## 2      1 2007   Qtr.1      12
## 3      1 2008   Qtr.1      22
## 4      1 2009   Qtr.1      10
## 5      2 2006   Qtr.1      12
## 6      2 2007   Qtr.1      16
## 7      2 2008   Qtr.1      13
## 8      2 2009   Qtr.1      23
## 9      3 2006   Qtr.1      11
## 10     3 2007   Qtr.1      13
## ..   ...  ...     ...     ...

These all produce the same results:
DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
DF %>% gather(Quarter, Revenue, -Group, -Year)
DF %>% gather(Quarter, Revenue, 3:6)
DF %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4)

Also note that if you do not supply arguments for na.rm or convert, then the defaults are used.
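The tutorial does not show how DF was built, so the following is a sketch under that assumption: it types in only the first three rows of the table above by hand and checks that selecting the quarter columns directly and excluding the id columns give the same long result.

```r
library(dplyr)
library(tidyr)

# First three rows of the DF shown above, entered by hand for illustration.
DF <- data.frame(
  Group = c(1, 1, 1),
  Year  = c(2006, 2007, 2008),
  Qtr.1 = c(15, 12, 22),
  Qtr.2 = c(16, 13, 22),
  Qtr.3 = c(19, 27, 24),
  Qtr.4 = c(17, 23, 20)
)

# Name the columns to gather...
long_DF <- DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)

# ...or name the columns NOT to gather; the same four columns are collapsed.
long_DF2 <- DF %>% gather(Quarter, Revenue, -Group, -Year)

print(head(long_DF, 3))
```

Three wide rows with four quarter columns become twelve long rows. In tidyr 1.0.0 and later, gather() still works but is documented as superseded by pivot_longer(), so newer code may prefer that function.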

separate() function:
Objective: Splitting a single variable into two
Description: Many times a single column variable will capture multiple variables, or even parts of a variable you just don't care about. Some examples include:
##   Grp_Ind    Yr_Mo       City_State        First_Last Extra_variable
## 1     1.a 2006_Jan      Dayton (OH) George Washington   XX01person_1
## 2     1.b 2006_Feb Grand Forks (ND)        John Adams   XX02person_2
## 3     1.c 2006_Mar       Fargo (ND)  Thomas Jefferson   XX03person_3
## 4     2.a 2007_Jan   Rochester (MN)     James Madison   XX04person_4
## 5     2.b 2007_Feb     Dubuque (IA)      James Monroe   XX05person_5
## 6     2.c 2007_Mar Ft. Collins (CO)        John Adams   XX06person_6
## 7     3.a 2008_Jan   Lake City (MN)    Andrew Jackson   XX07person_7
## 8     3.b 2008_Feb    Rushford (MN)  Martin Van Buren   XX08person_8
## 9     3.c 2008_Mar          Unknown  William Harrison   XX09person_9

In each of these cases, our objective may be to separate characters within the variable string. This can be accomplished using the separate() function, which turns a single character column into multiple columns.
Complement to: unite()

Function: separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
Same as: data %>% separate(col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)

Arguments:
data:     data frame
col:      column name representing current variable
into:     names of variables representing new variables
sep:      how to separate current variable (char, num, or symbol); by default, any run of non-alphanumeric characters
remove:   if TRUE, remove input column from output data frame
convert:  if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate

Example
We can go back to our long_DF data frame we created above, in which we may desire to clean up or separate the Quarter variable.
## Source: local data frame [6 x 4]
##
##   Group Year Quarter Revenue
## 1     1 2006   Qtr.1      15
## 2     1 2007   Qtr.1      12
## 3     1 2008   Qtr.1      22
## 4     1 2009   Qtr.1      10
## 5     2 2006   Qtr.1      12
## 6     2 2007   Qtr.1      16

By applying the separate() function we get the following:

separate_DF <- long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
head(separate_DF, 10)

## Source: local data frame [10 x 5]
##
##    Group Year Time_Interval Interval_ID Revenue
## 1      1 2006           Qtr           1      15
## 2      1 2007           Qtr           1      12
## 3      1 2008           Qtr           1      22
## 4      1 2009           Qtr           1      10
## 5      2 2006           Qtr           1      12
## 6      2 2007           Qtr           1      16
## 7      2 2008           Qtr           1      13
## 8      2 2009           Qtr           1      23
## 9      3 2006           Qtr           1      11
## 10     3 2007           Qtr           1      13

These produce the same results:
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.")
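As a further sketch, the same function can split a column shaped like the Yr_Mo variable shown earlier. The data frame `dates` and the output names `Year` and `Month` are hypothetical; the point is the effect of an explicit `sep` together with `convert = TRUE`.

```r
library(dplyr)
library(tidyr)

# Hypothetical frame modeled on the Yr_Mo column from the earlier listing.
dates <- data.frame(
  Yr_Mo = c("2006_Jan", "2006_Feb", "2007_Mar"),
  stringsAsFactors = FALSE
)

# sep is treated as a regular expression; convert = TRUE re-types the pieces,
# so the year becomes an integer while the month stays character.
split_dates <- dates %>%
  separate(Yr_Mo, into = c("Year", "Month"), sep = "_", convert = TRUE)

str(split_dates)
```

Without `convert = TRUE` both new columns would remain character, which matters if the year is later used in arithmetic or joined against a numeric column.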

unite() function:
Objective: Merging two variables into one
Description: There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.
Complement to: separate()
Function: unite(data, col, ..., sep = "_", remove = TRUE)
Same as: data %>% unite(col, ..., sep = "_", remove = TRUE)

Arguments:
data:    data frame
col:     column name of new "merged" column
...:     names of columns to merge
sep:     separator to use between merged values
remove:  if TRUE, remove input column from output data frame

Example
Using the separate_DF data frame we created above, we can re-unite the Time_Interval and Interval_ID variables we created and re-create the original Quarter variable we had in the long_DF data frame.
unite_DF <- separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")
head(unite_DF, 10)

## Source: local data frame [10 x 4]
##
##    Group Year Quarter Revenue
## 1      1 2006   Qtr.1      15
## 2      1 2007   Qtr.1      12
## 3      1 2008   Qtr.1      22
## 4      1 2009   Qtr.1      10
## 5      2 2006   Qtr.1      12
## 6      2 2007   Qtr.1      16
## 7      2 2008   Qtr.1      13
## 8      2 2009   Qtr.1      23
## 9      3 2006   Qtr.1      11
## 10     3 2007   Qtr.1      13

These produce the same results:
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_")
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID)

If no separator is identified, "_" will automatically be used.
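A minimal sketch of the separator behavior, using a two-row stand-in for separate_DF typed in by hand (the object names `with_dot` and `with_default` are illustrative):

```r
library(dplyr)
library(tidyr)

# Two-row stand-in for separate_DF, entered by hand for illustration.
separate_DF <- data.frame(
  Group = c(1, 1),
  Year = c(2006, 2007),
  Time_Interval = c("Qtr", "Qtr"),
  Interval_ID = c("1", "1"),
  Revenue = c(15, 12),
  stringsAsFactors = FALSE
)

# Explicit separator: rebuilds the original "Qtr.1" labels.
with_dot <- separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")

# No separator given: the default "_" is used instead.
with_default <- separate_DF %>% unite(Quarter, Time_Interval, Interval_ID)

print(with_dot)
print(with_default)
```

The new Quarter column takes the position of the first input column, and because remove defaults to TRUE the two source columns are dropped.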

spread() function:
Objective: Reshaping long format to wide format
Description: There are times when we are required to turn long formatted data into wide formatted data. The spread() function spreads a key-value pair across multiple columns.
Complement to: gather()
Function: spread(data, key, value, fill = NA, convert = FALSE)
Same as: data %>% spread(key, value, fill = NA, convert = FALSE)

Arguments:
data:     data frame
key:      column values to convert to multiple columns
value:    single column values to convert to multiple columns' values
fill:     if there isn't a value for every combination of the other variables and the key column, this value will be substituted
convert:  if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate

Example
wide_DF <- unite_DF %>% spread(Quarter, Revenue)
head(wide_DF, 24)

## Source: local data frame [12 x 6]
##
##    Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1      1 2006    15    16    19    17
## 2      1 2007    12    13    27    23
## 3      1 2008    22    22    24    20
## 4      1 2009    10    14    20    16
## 5      2 2006    12    13    25    18
## 6      2 2007    16    14    21    19
## 7      2 2008    13    11    29    15
## 8      2 2009    23    20    26    20
## 9      3 2006    11    12    22    16
## 10     3 2007    13    11    27    21
## 11     3 2008    17    12    23    19
## 12     3 2009    14     9    31    24
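Since spread() and gather() are complements, a round trip is a quick sanity check. The sketch below uses a four-row long frame typed in from the values above; `back_to_long` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Small long-format frame, entered by hand from the values shown above.
long_DF <- data.frame(
  Group = c(1, 1, 1, 1),
  Year = c(2006, 2006, 2007, 2007),
  Quarter = c("Qtr.1", "Qtr.2", "Qtr.1", "Qtr.2"),
  Revenue = c(15, 16, 12, 13),
  stringsAsFactors = FALSE
)

# Long to wide: one column per distinct Quarter value.
wide_DF <- long_DF %>% spread(Quarter, Revenue)

# Wide back to long: gather() undoes the spread.
back_to_long <- wide_DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.2)

print(wide_DF)
```

The wide frame has one row per Group and Year combination; gathering it again restores the original four observations, although possibly in a different row order.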

dplyr Operations
There are seven fundamental functions of data transformation:
- select() selecting variables
- filter() provides basic filtering capabilities
- group_by() groups data by categorical levels
- summarise() summarise data by functions of choice
- arrange() ordering data
- join() joining separate data frames
- mutate() create new variables

For these examples we'll use the following census data, which includes the K-12 public school expenditures by state. This data frame currently is 50 x 16 and includes expenditure data for 14 unique years.
Left half of data:
##   Division      State   X1980    X1990    X2000    X2001    X2002    X2003
## 1        6    Alabama 1146713  2275233  4176082  4354794  4444390  4657643
## 2        9     Alaska  377947   828051  1183499  1229036  1284854  1326226
## 3        8    Arizona  949753  2258660  4288739  4846105  5395814  5892227
## 4        7   Arkansas  666949  1404545  2380331  2505179  2822877  2923401
## 5        9 California 9172158 21485782 38129479 42908787 46265544 47983402
## 6        8   Colorado 1243049  2451833  4401010  4758173  5151003  5551506

Right half of data:
##      X2004    X2005    X2006    X2007    X2008    X2009    X2010    X2011
## 1  4812479  5164406  5699076  6245031  6832439  6683843  6670517  6592925
## 2  1354846  1442269  1529645  1634316  1918375  2007319  2084019  2201270
## 3  6071785  6579957  7130341  7815720  8403221  8726755  8482552  8340211
## 4  3109644  3546999  3808011  3997701  4156368  4240839  4459910  4578136
## 5 49215866 50918654 53436103 57352599 61570555 60080929 58248662 57526835
## 6  5666191  5994440  6368289  6579053  7338766  7187267  7429302  7409462

select() function:
Objective: Reduce data frame size to only desired variables for current task
Description: When working with a sizable data frame, often we desire to only assess specific variables. The select() function allows you to select and/or rename variables.
Function: select(data, ...)
Same as: data %>% select(...)

Arguments:
data:  data frame
...:   call variables by name or by function

Special functions:
starts_with(x, ignore.case = TRUE):  names starts with x
ends_with(x, ignore.case = TRUE):    names ends in x
contains(x, ignore.case = TRUE):     selects all variables whose name contains x
matches(x, ignore.case = TRUE):      selects all variables whose name matches the regular expression x

Example
Let's say our goal is to only assess the 5 most recent years' worth of expenditure data. Applying the select() function we can select only the variables of concern.
sub.exp <- expenditures %>% select(Division, State, X2007:X2011)
head(sub.exp)  # for brevity only display first 6 rows

##   Division      State    X2007    X2008    X2009    X2010    X2011
## 1        6    Alabama  6245031  6832439  6683843  6670517  6592925
## 2        9     Alaska  1634316  1918375  2007319  2084019  2201270
## 3        8    Arizona  7815720  8403221  8726755  8482552  8340211
## 4        7   Arkansas  3997701  4156368  4240839  4459910  4578136
## 5        9 California 57352599 61570555 60080929 58248662 57526835
## 6        8   Colorado  6579053  7338766  7187267  7429302  7409462

We can also apply some of the special functions within select(). For instance, we can select all variables that start with "X":
head(expenditures %>% select(starts_with("X")))

##     X1980    X1990    X2000    X2001    X2002    X2003    X2004    X2005
## 1 1146713  2275233  4176082  4354794  4444390  4657643  4812479  5164406
## 2  377947   828051  1183499  1229036  1284854  1326226  1354846  1442269
## 3  949753  2258660  4288739  4846105  5395814  5892227  6071785  6579957
## 4  666949  1404545  2380331  2505179  2822877  2923401  3109644  3546999
## 5 9172158 21485782 38129479 42908787 46265544 47983402 49215866 50918654
## 6 1243049  2451833  4401010  4758173  5151003  5551506  5666191  5994440
##      X2006    X2007    X2008    X2009    X2010    X2011
## 1  5699076  6245031  6832439  6683843  6670517  6592925
## 2  1529645  1634316  1918375  2007319  2084019  2201270
## 3  7130341  7815720  8403221  8726755  8482552  8340211
## 4  3808011  3997701  4156368  4240839  4459910  4578136
## 5 53436103 57352599 61570555 60080929 58248662 57526835
## 6  6368289  6579053  7338766  7187267  7429302  7409462

You can also de-select variables by using "-" prior to a name or function. The following produce the inverse of the functions above:
expenditures %>% select(-(X1980:X2006))
expenditures %>% select(-starts_with("X"))
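The selection styles above can be tried on a small stand-in. The sketch below assumes a two-row version of `expenditures` typed in from the table shown earlier (only three of the year columns are included), and `recent`, `no_years`, and `renamed` are illustrative names.

```r
library(dplyr)

# Two-row, three-year stand-in for the expenditures data, entered by hand.
expenditures <- data.frame(
  Division = c(6, 9),
  State = c("Alabama", "Alaska"),
  X2009 = c(6683843, 2007319),
  X2010 = c(6670517, 2084019),
  X2011 = c(6592925, 2201270),
  stringsAsFactors = FALSE
)

# Select by name and by a range of adjacent columns.
recent <- expenditures %>% select(State, X2010:X2011)

# De-select with "-" in front of a helper function.
no_years <- expenditures %>% select(-starts_with("X"))

# Rename while selecting: new_name = old_name.
renamed <- expenditures %>% select(State, Latest = X2011)

print(names(recent))
print(names(no_years))
print(names(renamed))
```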

filter() function:
Objective: Reduce rows/observations with matching conditions
Description: Filtering data is a common task to identify/select observations in which a particular variable matches a specific value/condition. The filter() function provides this capability.
Function: filter(data, ...)
Same as: data %>% filter(...)

Arguments:
data:  data frame
...:   conditions to be met

Examples
Continuing with our sub.exp data frame, which includes only the recent 5 years' worth of expenditures, we can filter by Division:
sub.exp %>% filter(Division == 3)

##   Division     State    X2007    X2008    X2009    X2010    X2011
## 1        3  Illinois 20326591 21874484 23495271 24695773 24554467
## 2        3   Indiana  9497077  9281709  9680895  9921243  9687949
## 3        3  Michigan 17013259 17053521 17217584 17227515 16786444
## 4        3      Ohio 18251361 18892374 19387318 19801670 19988921
## 5        3 Wisconsin  9029660  9366134  9696228  9966244 10333016

We can apply multiple logic rules in the filter() function such as:

<    Less than                   !=        Not equal to
>    Greater than                %in%      Group membership
==   Equal to                    is.na     is NA
<=   Less than or equal to       !is.na    is not NA
>=   Greater than or equal to    &, |, !   Boolean operators

For instance, we can filter for Division 3 and expenditures in 2011 that were greater than $10B. This results in Indiana, which is in Division 3, being excluded since its expenditures were < $10B (FYI the raw census data are reported in units of $1,000).
sub.exp %>% filter(Division == 3, X2011 > 10000000)  # Raw census data are in units of $1,000

##   Division     State    X2007    X2008    X2009    X2010    X2011
## 1        3  Illinois 20326591 21874484 23495271 24695773 24554467
## 2        3  Michigan 17013259 17053521 17217584 17227515 16786444
## 3        3      Ohio 18251361 18892374 19387318 19801670 19988921
## 4        3 Wisconsin  9029660  9366134  9696228  9966244 10333016
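The operators in the table above combine freely. The sketch below uses a four-row stand-in for sub.exp with only the X2011 column, typed in from values shown in this tutorial; `big_div3`, `picked`, and `complete` are illustrative names.

```r
library(dplyr)

# Four-row stand-in for sub.exp (X2011 only), entered by hand.
sub.exp <- data.frame(
  Division = c(3, 3, 3, 6),
  State = c("Illinois", "Indiana", "Ohio", "Alabama"),
  X2011 = c(24554467, 9687949, 19988921, 6592925),
  stringsAsFactors = FALSE
)

# Conditions separated by commas are combined with AND.
big_div3 <- sub.exp %>% filter(Division == 3, X2011 > 10000000)

# %in% tests group membership; | is a logical OR.
picked <- sub.exp %>% filter(State %in% c("Ohio", "Alabama") | X2011 < 7000000)

# !is.na() keeps only rows where the value is present.
complete <- sub.exp %>% filter(!is.na(X2011))

print(big_div3)
```

Here Indiana drops out of the first result because its 2011 value is below the threshold, matching the behavior described above.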

group_by() function:
Objective: Group data by categorical variables
Description: Often, observations are nested within groups or categories and our goal is to perform statistical analysis both at the observation level and also at the group level. The group_by() function allows us to create these categorical groupings.
Function: group_by(data, ...)
Same as: data %>% group_by(...)

Arguments:
data:  data frame
...:   variables to group_by

*Use ungroup(x) to remove groups

Example
The group_by() function is a silent function in which no observable manipulation of the data is performed as a result of applying the function. Rather, the only change you'll notice is that, if you print the data frame, an indicator of what variable the data is grouped by appears underneath the Source information and prior to the actual data frame. The real magic of the group_by() function comes when we perform summary statistics, which we will cover shortly.
group.exp <- sub.exp %>% group_by(Division)

head(group.exp)

## Source: local data frame [6 x 7]
## Groups: Division
##
##   Division      State    X2007    X2008    X2009    X2010    X2011
## 1        6    Alabama  6245031  6832439  6683843  6670517  6592925
## 2        9     Alaska  1634316  1918375  2007319  2084019  2201270
## 3        8    Arizona  7815720  8403221  8726755  8482552  8340211
## 4        7   Arkansas  3997701  4156368  4240839  4459910  4578136
## 5        9 California 57352599 61570555 60080929 58248662 57526835
## 6        8   Colorado  6579053  7338766  7187267  7429302  7409462
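Because the grouping is silent, it can help to inspect it directly. This sketch assumes a four-row stand-in for sub.exp and uses the dplyr helpers group_vars() and n_groups() to show what group_by() and ungroup() change.

```r
library(dplyr)

# Four-row stand-in for sub.exp (X2011 only), entered by hand.
sub.exp <- data.frame(
  Division = c(3, 3, 6, 9),
  State = c("Illinois", "Ohio", "Alabama", "Alaska"),
  X2011 = c(24554467, 19988921, 6592925, 2201270),
  stringsAsFactors = FALSE
)

group.exp <- sub.exp %>% group_by(Division)

# The rows and values are untouched; only grouping metadata is added.
print(group_vars(group.exp))  # names of the grouping variables
print(n_groups(group.exp))    # number of distinct groups

# ungroup() removes the grouping again.
ungrouped <- group.exp %>% ungroup()
print(group_vars(ungrouped))
```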

summarise() function:
Objective: Perform summary statistics on variables
Description: Obviously the goal of all this data wrangling is to be able to perform statistical analysis on our data. The summarise() function allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.
Function: summarise(data, ...)
Same as: data %>% summarise(...)

Arguments:
data:  data frame
...:   name-value pairs of summary functions like min(), mean(), max() etc.

*Developer is from New Zealand...can use "summarise(x)" or "summarize(x)"

Examples
Let's get the mean expenditure value across all states in 2011:
sub.exp %>% summarise(Mean_2011 = mean(X2011))

##   Mean_2011
## 1  10513678

Not too bad, let's get some more summary stats:
sub.exp %>% summarise(Min = min(X2011, na.rm = TRUE),
                      Median = median(X2011, na.rm = TRUE),
                      Mean = mean(X2011, na.rm = TRUE),
                      Var = var(X2011, na.rm = TRUE),
                      SD = sd(X2011, na.rm = TRUE),
                      Max = max(X2011, na.rm = TRUE),
                      N = n())

##       Min  Median     Mean         Var       SD      Max  N
## 1 1049772 6527404 10513678 1.48619e+14 12190938 57526835 50

This information is useful, but being able to compare summary statistics at multiple levels is when you really start to gather some insights. This is where the group_by() function comes in. First, let's group by Division and see how the different regions compared in 2010 and 2011.
sub.exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE))

## Source: local data frame [9 x 3]
##
##   Division Mean_2010 Mean_2011
## 1        1   5121003   5222277
## 2        2  32415457  32877923
## 3        3  16322489  16270159
## 4        4   4672332   4672687
## 5        5  10975194  11023526
## 6        6   6161967   6267490
## 7        7  14916843  15000139
## 8        8   3894003   3882159
## 9        9  15540681  15468173

Now we're starting to see some differences pop out. How about we compare states within a Division? We can start to apply multiple functions we've learned so far to get the 5-year average for each state within Division 3.
sub.exp %>%
  gather(Year, Expenditure, X2007:X2011) %>%  # this turns our wide data to a long format
  filter(Division == 3) %>%                   # we only want to compare states within Division 3
  group_by(State) %>%                         # we want to summarize data at the state level
  summarise(Mean = mean(Expenditure),
            SD = sd(Expenditure))

## Source: local data frame [5 x 3]
##
##       State     Mean        SD
## 1  Illinois 22989317 1867527.7
## 2   Indiana  9613775  238971.6
## 3  Michigan 17059665  180245.0
## 4      Ohio 19264329  705930.2
## 5 Wisconsin  9678256  507461.2
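The examples above pass `na.rm = TRUE` to most summary functions. The sketch below shows why, using deliberately made-up numbers (not census values) in a small frame named `toy.exp`, with one missing value.

```r
library(dplyr)

# Made-up values for illustration only; one observation is missing.
toy.exp <- data.frame(
  Division = c(3, 3, 6, 6),
  X2011 = c(100, 300, 50, NA)
)

# Without na.rm, a single NA makes the whole group summary NA.
without_na_rm <- toy.exp %>%
  group_by(Division) %>%
  summarise(Mean_2011 = mean(X2011), N = n())

# With na.rm = TRUE the missing value is dropped before averaging.
with_na_rm <- toy.exp %>%
  group_by(Division) %>%
  summarise(Mean_2011 = mean(X2011, na.rm = TRUE), N = n())

print(without_na_rm)
print(with_na_rm)
```

Note that n() counts rows, so N still reports two observations for the group with the missing value.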

arrange() function:
Objective: Order variable values
Description: Often, we desire to view observations in rank order for a particular variable(s). The arrange() function allows us to order data by variables in ascending or descending order.
Function: arrange(data, ...)
Same as: data %>% arrange(...)

Arguments:
data:  data frame
...:   variable(s) to order

*use desc(x) to sort variable in descending order

Examples
For instance, in the summarise example we compared the mean expenditures for each division. We could apply the arrange() function at the end to order the divisions from lowest to highest expenditure for 2011. This makes it easier to see the significant differences between Divisions 8, 4, 1 & 6 as compared to Divisions 5, 7, 9, 3 & 2.
sub.exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(Mean_2011)

## Source: local data frame [9 x 3]
##
##   Division Mean_2010 Mean_2011
## 1        8   3894003   3882159
## 2        4   4672332   4672687
## 3        1   5121003   5222277
## 4        6   6161967   6267490
## 5        5  10975194  11023526
## 6        7  14916843  15000139
## 7        9  15540681  15468173
## 8        3  16322489  16270159
## 9        2  32415457  32877923

We can also apply a descending argument to rank-order from highest to lowest. The following shows the same data but in descending order by applying desc() within the arrange() function.
sub.exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(desc(Mean_2011))

## Source: local data frame [9 x 3]
##
##   Division Mean_2010 Mean_2011
## 1        2  32415457  32877923
## 2        3  16322489  16270159
## 3        9  15540681  15468173
## 4        7  14916843  15000139
## 5        5  10975194  11023526
## 6        6   6161967   6267490
## 7        1   5121003   5222277
## 8        4   4672332   4672687
## 9        8   3894003   3882159
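A compact sketch of both orderings, using the 2011 means of four divisions copied from the output above into a hand-built frame named `div_means` (an illustrative name):

```r
library(dplyr)

# 2011 division means for Divisions 1-4, copied from the output above.
div_means <- data.frame(
  Division = c(1, 2, 3, 4),
  Mean_2011 = c(5222277, 32877923, 16270159, 4672687)
)

# Default ordering is ascending.
ascending <- div_means %>% arrange(Mean_2011)

# Wrap the variable in desc() for descending order.
descending <- div_means %>% arrange(desc(Mean_2011))

print(ascending$Division)
print(descending$Division)
```

arrange() also accepts several variables, in which case later variables break ties in earlier ones.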

join() functions:
Objective: Join two data sets together
Description: Often we have separate data frames that can have common and differing variables for similar observations, and we wish to join these data frames together. The multiple xxx_join() functions provide multiple ways to join data frames.

Function:
inner_join(x, y, by = NULL)
left_join(x, y, by = NULL)
semi_join(x, y, by = NULL)
anti_join(x, y, by = NULL)

Arguments:
x, y:  data frames to join
by:    a character vector of variables to join by. If NULL, the default, join will do a natural join, using all variables with common names across the two tables.

Example
Our public education expenditure data represents then-year dollars. To make any accurate assessments of longitudinal trends and comparisons we need to adjust for inflation. I have the following data frame, which provides inflation adjustment factors for base-year 2012 dollars (obviously I should use 2014 values, but I had these easily accessible and it only serves for illustrative purposes).
##    Year  Annual Inflation
## 28 2007 207.342 0.9030811
## 29 2008 215.303 0.9377553
## 30 2009 214.537 0.9344190
## 31 2010 218.056 0.9497461
## 32 2011 224.939 0.9797251
## 33 2012 229.594 1.0000000

To join to my expenditure data I obviously need to get my expenditure data in the proper form that allows me to join these two data frames. I can apply the following functions to accomplish this:
long.exp <- sub.exp %>%
  gather(Year, Expenditure, X2007:X2011) %>%            # turn to long format
  separate(Year, into = c("x", "Year"), sep = "X") %>%  # separate "X" from year value
  select(-x)                                            # remove "x" column

long.exp$Year <- as.numeric(long.exp$Year)  # convert from character to numeric

head(long.exp)

##   Division      State Year Expenditure
## 1        6    Alabama 2007     6245031
## 2        9     Alaska 2007     1634316
## 3        8    Arizona 2007     7815720
## 4        7   Arkansas 2007     3997701
## 5        9 California 2007    57352599
## 6        8   Colorado 2007     6579053

I can now apply the left_join() function to join the inflation data to the expenditure data. This aligns the data in both data frames by the Year variable and then joins the remaining inflation data to the expenditure data frame as new variables.
join.exp <- long.exp %>% left_join(inflation)

head(join.exp)

##   Year Division      State Expenditure  Annual Inflation
## 1 2007        6    Alabama     6245031 207.342 0.9030811
## 2 2007        9     Alaska     1634316 207.342 0.9030811
## 3 2007        8    Arizona     7815720 207.342 0.9030811
## 4 2007        7   Arkansas     3997701 207.342 0.9030811
## 5 2007        9 California    57352599 207.342 0.9030811
## 6 2007        8   Colorado     6579053 207.342 0.9030811

To illustrate the other joining methods we can use these two simple data frames:
Data frame x:
##     name instrument
## 1   John     guitar
## 2   Paul       bass
## 3 George     guitar
## 4  Ringo      drums
## 5 Stuart       bass
## 6   Pete      drums

Data frame y:
##     name  band
## 1   John  TRUE
## 2   Paul  TRUE
## 3 George  TRUE
## 4  Ringo  TRUE
## 5  Brian FALSE

inner_join(): Include only rows in both x and y that have a matching value

inner_join(x, y)

##     name instrument band
## 1   John     guitar TRUE
## 2   Paul       bass TRUE
## 3 George     guitar TRUE
## 4  Ringo      drums TRUE

left_join(): Include all of x, and matching rows of y

left_join(x, y)

##     name instrument band
## 1   John     guitar TRUE
## 2   Paul       bass TRUE
## 3 George     guitar TRUE
## 4  Ringo      drums TRUE
## 5 Stuart       bass <NA>
## 6   Pete      drums <NA>

semi_join(): Include rows of x that match y but only keep the columns from x

semi_join(x, y)

##     name instrument
## 1   John     guitar
## 2   Paul       bass
## 3 George     guitar
## 4  Ringo      drums

anti_join(): Opposite of semi_join

anti_join(x, y)

##     name instrument
## 1   Pete      drums
## 2 Stuart       bass
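The four joins can be reproduced with the two small frames above. This sketch rebuilds x and y by hand (storing band as a logical, which is an assumption about the original column type) and names the join key explicitly with `by = "name"` instead of relying on the natural-join default.

```r
library(dplyr)

# x and y rebuilt from the listings above; band stored as logical (assumed).
x <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

inner <- inner_join(x, y, by = "name")  # rows with a match in both
left  <- left_join(x, y, by = "name")   # all of x, NA where y has no match
semi  <- semi_join(x, y, by = "name")   # rows of x with a match, x columns only
anti  <- anti_join(x, y, by = "name")   # rows of x without a match

print(anti)
```

Stating `by` explicitly avoids the message dplyr prints when it guesses the key, and guards against accidental joins on other columns that happen to share a name. The row order of the anti_join() result may differ between dplyr versions.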

mutate() function:
Objective: Creates new variables
Description: Often we want to create a new variable that is a function of the current variables in our data frame, or even just add a new variable. The mutate() function allows us to add new variables while preserving the existing variables.

Function:
mutate(data, ...)
Same as: data %>% mutate(...)

Arguments:
data:  data frame
...:   expression(s)

Examples
If we go back to our previous join.exp data frame, remember that we joined inflation rates to our non-inflation-adjusted expenditures for public schools. The data frame looks like:
##   Year Division      State Expenditure  Annual Inflation
## 1 2007        6    Alabama     6245031 207.342 0.9030811
## 2 2007        9     Alaska     1634316 207.342 0.9030811
## 3 2007        8    Arizona     7815720 207.342 0.9030811
## 4 2007        7   Arkansas     3997701 207.342 0.9030811
## 5 2007        9 California    57352599 207.342 0.9030811
## 6 2007        8   Colorado     6579053 207.342 0.9030811

If we wanted to adjust our annual expenditures for inflation we can use mutate() to create a new inflation-adjusted cost variable, which we'll name Adj_Exp:
inflation_adj <- join.exp %>% mutate(Adj_Exp = Expenditure / Inflation)

head(inflation_adj)

##   Year Division      State Expenditure  Annual Inflation  Adj_Exp
## 1 2007        6    Alabama     6245031 207.342 0.9030811  6915249
## 2 2007        9     Alaska     1634316 207.342 0.9030811  1809711
## 3 2007        8    Arizona     7815720 207.342 0.9030811  8654505
## 4 2007        7   Arkansas     3997701 207.342 0.9030811  4426735
## 5 2007        9 California    57352599 207.342 0.9030811 63507696
## 6 2007        8   Colorado     6579053 207.342 0.9030811  7285119

Let's say we wanted to create a variable that rank-orders state-level expenditures (inflation adjusted) for the year 2010 from the highest level of expenditures to the lowest.
rank_exp <- inflation_adj %>%
  filter(Year == 2010) %>%
  arrange(desc(Adj_Exp)) %>%
  mutate(Rank = 1:length(Adj_Exp))

head(rank_exp)

##   Year Division      State Expenditure  Annual Inflation  Adj_Exp Rank
## 1 2010        9 California    58248662 218.056 0.9497461 61330774    1
## 2 2010        2   New York    50251461 218.056 0.9497461 52910417    2
## 3 2010        7      Texas    42621886 218.056 0.9497461 44877138    3
## 4 2010        3   Illinois    24695773 218.056 0.9497461 26002501    4
## 5 2010        2 New Jersey    24261392 218.056 0.9497461 25545135    5
## 6 2010        5    Florida    23349314 218.056 0.9497461 24584797    6

If you wanted to assess the percent change in cost for a particular state you can use the lag() function within the mutate() function:
inflation_adj %>%
  filter(State == "Ohio") %>%
  mutate(Perc_Chg = (Adj_Exp - lag(Adj_Exp)) / lag(Adj_Exp))

##   Year Division State Expenditure  Annual Inflation  Adj_Exp     Perc_Chg
## 1 2007        3  Ohio    18251361 207.342 0.9030811 20210102           NA
## 2 2008        3  Ohio    18892374 215.303 0.9377553 20146378 -0.003153057
## 3 2009        3  Ohio    19387318 214.537 0.9344190 20747992  0.029862103
## 4 2010        3  Ohio    19801670 218.056 0.9497461 20849436  0.004889357
## 5 2011        3  Ohio    19988921 224.939 0.9797251 20402582 -0.021432441

You could also look at what percent of all US expenditures each state made up in 2011. In this case we use mutate() to take each state's inflation-adjusted expenditure and divide by the sum of the entire inflation-adjusted expenditure column. We also apply a second function within mutate() that provides the cumulative percent in rank order. This shows that in 2011, the top 8 states with the highest expenditures represented over 50% of the total U.S. expenditures in K-12 public schools. (I remove the non-inflation-adjusted Expenditure, Annual & Inflation columns so that the columns don't wrap on the screen view.)
perc.of.whole <- inflation_adj %>%
  filter(Year == 2011) %>%
  arrange(desc(Adj_Exp)) %>%
  mutate(Perc_of_Total = Adj_Exp / sum(Adj_Exp),
         Cum_Perc = cumsum(Perc_of_Total)) %>%
  select(-Expenditure, -Annual, -Inflation)

head(perc.of.whole, 8)

##   Year Division        State  Adj_Exp Perc_of_Total  Cum_Perc
## 1 2011        9   California 58717324    0.10943237 0.1094324
## 2 2011        2     New York 52575244    0.09798528 0.2074177
## 3 2011        7        Texas 43751346    0.08154005 0.2889577
## 4 2011        3     Illinois 25062609    0.04670957 0.3356673
## 5 2011        5      Florida 24364070    0.04540769 0.3810750
## 6 2011        2   New Jersey 24128484    0.04496862 0.4260436
## 7 2011        2 Pennsylvania 23971218    0.04467552 0.4707191
## 8 2011        3         Ohio 20402582    0.03802460 0.5087437
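The lag-based percent change can be checked in isolation. This sketch copies Ohio's five Adj_Exp values from the output above into a hand-built frame named `ohio` (an illustrative name) and recomputes the change column.

```r
library(dplyr)

# Ohio's inflation-adjusted expenditures, copied from the output above.
ohio <- data.frame(
  Year = 2007:2011,
  Adj_Exp = c(20210102, 20146378, 20747992, 20849436, 20402582)
)

# With dplyr attached, lag() here is dplyr's version: it shifts the column
# down by one row, so each year is compared with the year before it.
ohio_chg <- ohio %>%
  mutate(Perc_Chg = (Adj_Exp - lag(Adj_Exp)) / lag(Adj_Exp))

print(ohio_chg)
```

The first year has no predecessor, so its change is NA, and years in which adjusted spending fell come out negative. Small differences from the printed values are expected because the inputs here are rounded.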

Additional Resources
This tutorial simply touches on the basics that these two packages can do. There are several other resources you can check out to learn more. In addition, much of what I have learned and, therefore, much of the content in this tutorial is simply a modified regurgitation of the wonderful resources provided by R Studio, Hadley Wickham, and Garrett Grolemund.
- R Studio's Data wrangling with R and RStudio webinar
- R Studio's Data wrangling GitHub repository
- R Studio's Data wrangling cheat sheet
- Hadley Wickham's dplyr tutorial at useR! 2014, Part 1
- Hadley Wickham's dplyr tutorial at useR! 2014, Part 2
- Hadley Wickham's paper on Tidy Data

Special thanks to Tom Filloon and Jason Freels for providing constructive comments while developing this tutorial.
