Table of Contents
Introduction
  Why?
  About R
  Author
Prepare environment
  Data sources
  Installing R Studio
  Summary
First steps
  Introduction to R
  googleAnalyticsR package
  Code repository
  Summary
Data visualization
  Data visualization in R
  Traffic heatmap
  Device comparison
Machine Learning
  Clustering (k-means)
Generating reports
  Introduction to R Markdown
  Create report
Additional analysis
  Anomaly detection
  Forecasting
Resources
  Blogs
  Documentation
  Online trainings
  Books
Introduction
Why?
I decided to write this book to show how much value is hidden in data. If you have a website,
you are probably collecting data about web traffic. But do you use this data to make business
decisions?
Nowadays we are swimming in a data lake. Only if you know how to use this data will you stay
on the surface :). The first step is to regularly check the standard reports in your web analytics
tool (e.g. Google Analytics).
But to stay competitive you need something more. Everybody talks about data collection,
but only a few tell you what to do with the data after collecting it. I try to describe this process
and give you some ideas on how to deal with data from Google Analytics using R.
In this book I will share my experience in this field. I hope that it will be useful, interesting,
sometimes funny, and that it will save you time :)
Target audience
I wrote this book for marketers who have worked with Google Analytics, know the basic metrics
included in the tool and know its web interface. I hope this material will help you learn how to
extend the features of Google Analytics in your daily work and how to use R.
If you are an analyst who already knows R well, I hope you will also find some inspiration in this
book, especially in learning how to connect Google Analytics as an additional data source in
R and what kinds of analysis you can perform on this data.
Terms of service
A common question is whether this great tool is really free. To be precise, according to the
Google Analytics Terms of Service:
Service is provided without charge to You for up to 10 million Hits per month per
account.
If you exceed this quota, you should consider Google Analytics 360, formerly the Google
Analytics Premium service. This paid premium version offers a data collection quota that is
many times larger.
What is a hit?
As you read above, your Google Analytics account has a limit of 10 000 000 hits per month. So
what is a hit?
According to Google Analytics help:
Hit - An interaction that results in data being sent to Analytics. Common hit types
include page tracking hits, event tracking hits, and ecommerce hits.
Each time the tracking code is triggered by a user's behavior (for example, user loads a
page on a website or a screen in a mobile app), Analytics records that activity. Each
interaction is packaged into a hit and sent to Google's servers. Examples of hit types
include:
page tracking hits
event tracking hits
ecommerce tracking hits
social interaction hits
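To get a feel for the quota, the monthly hit count can be estimated by adding up the hit types listed above. The sketch below uses made-up monthly figures; the numbers and variable names are assumptions for illustration, not measurements.

```r
# Hypothetical monthly counts per hit type (illustrative numbers only)
hits <- c(pageviews = 6500000, events = 2800000,
          ecommerce = 150000, social = 50000)

quota <- 10000000              # free-tier limit quoted from the Terms of Service

total_hits <- sum(hits)        # all hit types count toward the same quota
quota_used <- total_hits / quota

total_hits
round(quota_used * 100, 1)     # percent of the monthly quota used
total_hits > quota             # TRUE would mean the quota is exceeded
```

With these example figures the site stays just under the limit, so adding more event tracking would be the point at which to watch the quota.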
About R
What is R?
R is a programming language and software environment for statistical computing and
graphics supported by the R Foundation for Statistical Computing. The R language is widely
used among statisticians and data miners for developing statistical software and data
analysis. Polls, surveys of data miners, and studies of scholarly literature databases show
that R's popularity has increased substantially in recent years. (Source: Wikipedia)
Author
Michal Brys
Data scientist
Michal has worked in the internet industry since 2009. He is an expert in web analytics in the
e-commerce context, especially using Google Analytics and Google Tag Manager. He loves mining
big data sets and transforming information into actionable knowledge. He loves creating stories
from numbers. He graduated from AGH University of Science and Technology and the University of
Economics in Cracow. Michal is a member of Google Developers Group Cracow.
Feel free to contact the author:
about.me
michalbrys.com
Prepare environment
Preparing environment
To analyze data you will need to set up:
A Google Analytics account
R Studio
Credentials for connecting to the Google Analytics API, created in the Google Developers Console (free)
I will describe these steps precisely in this chapter.
Data sources
You can find the most popular scenarios website.
support.google.com/analytics/answer/6367342
Account details
To create a Google Analytics account, fill in the form with:
Account Name. (Note: one account may have several tracking IDs, so a single account can
serve one organization/company with many websites.)
In the next steps, create your unique tracking ID:
Insert Website Name, Website URL and Reporting Time Zone. (Note: the correct time
zone is critically important, because your data will be split into dates in reports using this
value.)
To complete the registration process, click Get Tracking ID and accept the Google Analytics
Terms of Service.
After this you will see instructions on how to install the Google Analytics Tracking Code on your
website:
Search: Analytics
Select Enable
Create credentials:
Get credentials:
Save the Client ID and Client Secret. You need them to configure the library that pulls data
from Google Analytics into R.
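One way to hand these values to R is through session options. This is a minimal sketch: the two strings are placeholders for your own credentials, and the option names are the ones I assume googleAuthR reads. Check the package documentation for your version, including whether the options should be set before or after loading the package.

```r
# Placeholders: replace with the Client ID and Client Secret you saved
options(
  googleAuthR.client_id     = "YOUR-CLIENT-ID.apps.googleusercontent.com",
  googleAuthR.client_secret = "YOUR-CLIENT-SECRET"
)

# Confirm the values are stored for this session
getOption("googleAuthR.client_id")

# Then authorize as in the examples later in this book (needs a browser):
# ga_auth()
```

Keeping the credentials in options rather than in the analysis script itself makes it easier to share the script without sharing the secret.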
Install the tracking code on your website, between the <head></head> tags, on every page you
want to track.
To do this you need access to your website's source code, or you can contact your
webmaster.
Alternatively, you can install Google Analytics via Google Tag Manager. I personally
recommend this approach because it will save you a lot of time in the future :)
Installing R Studio
Go to the R Studio website and download R Studio Desktop, a graphical interface for the R language.
To install a package, for example ggplot2, run:
install.packages("ggplot2")
After installing a package and before using it, you should load it into the current session:
library("ggplot2")
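Since a package only needs to be installed once, it can be handy to check first what is missing. A small sketch; missing_packages is a helper name made up for this example, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2")
to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing, then load as usual
# if (length(to_install) > 0) install.packages(to_install)
# library("ggplot2")
```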
Summary
In this chapter you learned:
How to create an account, and how to configure and install Google Analytics on your website.
How to download and set up R Studio.
How to get credentials to download data from Google Analytics into R.
First steps
In this chapter you will set up your environment by installing R Studio, creating a Google
Analytics account and connecting both tools via the API.
Introduction to R
Try typing some basic instructions in the console (the bottom-left window in R Studio).
Submit each instruction by pressing the [Enter] key.
Arithmetic operations
> 1+1
[1] 2
> 2*4
[1] 8
Using variables
You can assign a value to a variable using the <- (more popular) or = operator. You can find
some basic examples below.
Numeric variables
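A minimal sketch of numeric assignment; the variable names x and y are chosen here only for illustration.

```r
x <- 5        # assignment with <-
y = 3         # assignment with =, same effect at the top level

x + y         # 8
x / y         # 1.666667
class(x)      # "numeric"
```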
Text variables
> z <- "Hello world"
> z
[1] "Hello world"
Vectors
> v <- c(1,2,3,4,5)
> v
[1] 1 2 3 4 5
Data frames
More widely used than the one-dimensional vector is the tabular data structure called
data.frame, which has rows and named columns.
We'll also store the data returned by Google Analytics API queries as a data.frame.
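As a sketch of how a data frame is built and inspected, here is a small hypothetical one named visits (a made-up example, separate from the df used in the text):

```r
# Hypothetical data: three days of sessions
visits <- data.frame(
  date     = c("20160101", "20160102", "20160103"),
  sessions = c(39, 46, 47)
)

str(visits)                      # structure: 3 observations of 2 variables
visits$sessions                  # one column as a vector
visits[2, ]                      # second row, all columns
visits[visits$sessions > 40, ]   # rows that meet a condition
```

The same three access patterns (a column with $, a row with [i, ], a filter with a condition) are used on the Google Analytics results throughout the book.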
You can also select only the unique values of a column (here we have sessions for only one date, 2016-01-01):
> unique(df$date)
[1] 20160101
Levels: 20160101
Select row 1:
> df[1,]
[1] 20160101
Levels: 20160101
You will be asked to authorize R to download data from Google Analytics, and your
browser will open the authorization page. Click Agree:
All done. You can now start sending queries via the Google Analytics API.
Display results
After you successfully run your first query, you can check the results fetched from Google
Analytics. Display the first 6 rows of the result:
head(gadata)
date sessions
1 20140101 39
2 20140102 46
3 20140103 47
4 20140104 53
5 20140105 49
6 20140106 15
Congrats! You've downloaded your first data set from your Google Analytics account!
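In the output above the date column holds values such as 20140101. Assuming they arrive as text (this may depend on the package version), here is a sketch of converting them to R's Date class, which makes sorting and plotting by time easier. The raw_dates vector is a small stand-in for gadata$date.

```r
raw_dates <- c("20140101", "20140102", "20140103")

dates <- as.Date(raw_dates, format = "%Y%m%d")

class(dates)          # "Date"
weekdays(dates[1])    # day name, depends on your locale
diff(range(dates))    # time span covered by the data
```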
Source code
Complete code for this example in GitHub repository:
https://github.com/michalbrys/R-Google-Analytics/blob/master/1_hello_world.R
googleAnalyticsR package
To pull data from Google Analytics into R we'll use the googleAnalyticsR package by Mark
Edmondson. With this add-on you can use all features of Google Analytics, including the
latest ones such as Google Analytics 360 or integration with BigQuery.
Short list of googleAnalyticsR features:
First Google Analytics Reporting v4 API library for R
v4 features include: dynamic calculated metrics, pivots, histograms, date
comparisons, batching.
v4 API explorer
API metadata of possible metrics and dimensions
Multi-user login in Shiny App
Integration with BigQuery Google Analytics Premium/360 exports.
Single authentication flow can be used with other googleAuthR apps like
searchConsoleR
Automatic batching, sampling avoidance with daily walk, multi-account fetching,
multi-channel funnel
Support for googleAuthR batch. For big data calls this could be 10x quicker than
normal GA fetching.
Meta data in attributes of returned dataframe including date ranges, totals, min and
max
You can read the docs on Mark's website:
http://code.markedmondson.me/googleAnalyticsR/
or check the docs in the CRAN repository: https://cran.r-project.org/web/packages/googleAnalyticsR/vignettes/googleAnalyticsR.html
[1] "/Users/michal"
Export data
After conducting an analysis you may want to save the results to a file to use them in other
tools. To do this, use the write.csv function.
If you have data in a data frame called gadata you can use this code:
write.csv(gadata, file = "exported_data.csv")
As a result R will export the data to a .csv file. You can open it in any text editor or
spreadsheet (e.g. Microsoft Excel). Another use case is uploading data to Google Analytics as
custom dimensions or campaign cost data.
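A sketch of the full round trip, export and re-import, using a tiny stand-in for gadata and a temporary folder so that nothing in your working directory is overwritten. The row.names and colClasses arguments are optional refinements, not part of the original example.

```r
# Stand-in for the data fetched from Google Analytics
gadata <- data.frame(date = c("20140101", "20140102"),
                     sessions = c(39, 46))

path <- file.path(tempdir(), "exported_data.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(gadata, file = path, row.names = FALSE)

# Read the file back; colClasses keeps the dates as text
imported <- read.csv(path, colClasses = c("character", "numeric"))
imported
```

Without colClasses, read.csv would guess that 20140101 is a number, which is harmless here but can drop leading zeros in other identifiers.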
Code repository
You can find the R source code for all examples described in this book in my GitHub
repository:
github.com/michalbrys/R-Google-Analytics
Feel free to contribute if you find an issue in the code or want to share your own examples.
Summary
In this chapter you learned:
How to conduct basic arithmetic operations in R.
How to deal with basic data structures in R.
How to load external packages into R.
How to connect Google Analytics and R.
How to import and export data between files and R.
Min
What is the minimum number of sessions in 2014?
min(gadata$sessions)
[1] 0
date sessions
7 20140107 0
8 20140108 0
129 20140509 0
130 20140510 0
131 20140511 0
132 20140512 0
133 20140513 0
134 20140514 0
135 20140515 0
How many days had 0 sessions? Use the nrow() function to count the rows that meet this condition.
nrow(subset(gadata, gadata$sessions == 0))
[1] 9
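The rows listed above can be found by filtering on the minimum. A minimal sketch on a small made-up data frame (named demo so it does not overwrite your gadata):

```r
demo <- data.frame(
  date     = c("20140105", "20140106", "20140107", "20140108"),
  sessions = c(49, 15, 0, 0)
)

# Rows where the number of sessions equals the minimum
subset(demo, sessions == min(sessions))

# Number of days with zero sessions
nrow(subset(demo, sessions == 0))   # 2
```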
Max
When was the traffic on your website the highest? Use the max() function.
> max(gadata$sessions)
[1] 204
date sessions
59 20140228 204
You can get this row in a single call by replacing the value with max(). It is shorter but harder
to read:
subset(gadata, gadata$sessions == max(gadata$sessions))
date sessions
59 20140228 204
Mean
What is the mean number of sessions per day? To calculate it, use the mean() function.
mean(gadata$sessions)
[1] 27.6
Standard deviation
You can check how much the number of sessions per day varies. Use the sd() function.
sd(gadata$sessions)
[1] 22.12984
So the average number of sessions is 27.6 +/- 22.12984. This dataset is highly dispersed,
so in this case it is better not to rely on the average value alone.
Median
If a dataset has a high standard deviation, it is better to calculate the median (the middle value
of the sorted dataset).
median(gadata$sessions)
[1] 21
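To see why the median is the safer summary for spiky traffic, compare both statistics on a few made-up values with one outlier:

```r
sessions <- c(12, 18, 21, 25, 204)   # made-up values with one traffic spike

mean(sessions)     # 56, pulled up by the spike
median(sessions)   # 21, the middle value of the sorted data
```

A single unusual day moves the mean far away from a typical day, while the median stays put.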
Summary
If you want, you can get all of these statistics with a single function: summary .
summary(gadata)
     date              sessions
 Length:365         Min.   :  0.0
 Class :character   1st Qu.: 12.0
 Mode  :character   Median : 21.0
                    Mean   : 27.6
                    3rd Qu.: 40.0
                    Max.   :204.0
As a result you will get basic statistics for numeric variables and a description for character
variables.
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/2_eda.R
Data visualization
Data visualization in R
We'll do some exploratory data analysis by visualizing data from Google Analytics in R.
R has a wide range of visualization packages. My favourite is ggplot2 .
Package ggplot2
According to ggplot2 project site:
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to
take the good parts of base and lattice graphics and none of the bad parts. It takes care
of many of the fiddly details that make plotting a hassle (like drawing legends) as well
as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
Full documentation: docs.ggplot2.org
This is my favourite visualization package in R because of:
Nice chart design.
Flexibility.
Wide range of chart types.
Extension packages, e.g. ggthemes .
You can also check alternatives like Plotly or base R graphics.
The examples in this book are made with ggplot2 .
Using ggplot2
Download data to visualize in chart
As a first step, install (if necessary) and load the package in the current session.
install.packages("ggplot2")
library("ggplot2")
Next, build a query to fetch the date and the number of sessions:
date sessions
1 2016-01-01 199
2 2016-01-02 212
3 2016-01-03 155
4 2016-01-04 210
5 2016-01-05 192
6 2016-01-06 180
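If you want to try the plots in this chapter before connecting to the API, the six rows printed above can be typed in by hand. This is only a stand-in for the real query result, and it assumes the date column is stored as a Date.

```r
# Hand-typed copy of the six rows shown above
gadata <- data.frame(
  date     = as.Date("2016-01-01") + 0:5,
  sessions = c(199, 212, 155, 210, 192, 180)
)

head(gadata)
```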
Scatter plot
Plot data over time (scatter plot)
ggplot(gadata, aes(x=date, y=sessions)) +
geom_point()
As a result you will get a basic scatter plot of sessions over time:
As you can see, this plot isn't very readable because of the x-axis labels. You can fix this by
rotating them 90 degrees.
Add the line:
theme(axis.text.x = element_text(angle = 90, hjust = 1))
You can also make the point size depend on the number of sessions by adding:
size = sessions
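Put together, the pieces above might look like this. The sketch assumes ggplot2 is installed and uses a hand-typed stand-in for gadata; size = sessions goes inside aes() so that it is mapped to the data rather than set to a constant.

```r
library("ggplot2")

# Stand-in for the query result
gadata <- data.frame(
  date     = as.Date("2016-01-01") + 0:5,
  sessions = c(199, 212, 155, 210, 192, 180)
)

p <- ggplot(gadata, aes(x = date, y = sessions, size = sessions)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x labels

print(p)
```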
Line chart
Plot data over time (line chart) with some styling:
ggplot(gadata,aes(x=date,y=sessions,group=1)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# the theme() line rotates the x-axis labels
And now we can plot the data points with an added trend line:
ggplot(data = gadata, aes(x = date, y = sessions)) +
geom_point() +
geom_smooth() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Box plot
To do some exploratory data analysis, you can visualize your traffic by day of the
week. Is your website traffic seasonal? Which days are the busiest? Let's check by
creating a box plot that illustrates the distribution of the number of sessions for every day
of the week:
Build query to download data:
gadata <- google_analytics(id = ga_id,
start="2016-01-01", end="2016-06-30",
metrics = "sessions",
dimensions = c("dayOfWeek","date"),
max = 5000)
So in this case, the highest traffic was on Thursday. Fridays are also not bad :)
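The plotting step for this query could be sketched as follows. The data is generated rather than downloaded, and base R's boxplot() is used to keep the example free of dependencies; a ggplot2 equivalent is given in a comment. In Google Analytics, dayOfWeek runs from 0 (Sunday) to 6 (Saturday).

```r
# Stand-in for the query result: four weeks of made-up daily sessions
dates <- as.Date("2016-01-03") + 0:27
demo <- data.frame(
  date      = dates,
  dayOfWeek = format(dates, "%w"),   # "0" = Sunday ... "6" = Saturday
  sessions  = 100 + 10 * as.integer(format(dates, "%w")) + (0:27 %% 4)
)

# One box per day of week
boxplot(sessions ~ dayOfWeek, data = demo,
        xlab = "Day of week (0 = Sunday)", ylab = "Sessions")

# ggplot2 equivalent:
# ggplot(demo, aes(x = dayOfWeek, y = sessions)) + geom_boxplot()

# Median sessions per day of week
medians <- tapply(demo$sessions, demo$dayOfWeek, median)
medians
```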
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/3_data_visualization.R
Traffic heatmap
We will build a more advanced data visualization: a user engagement heatmap. The darker
the color, the higher the user engagement (avgSessionDuration) at that time of day.
Inspired by Todd Moy.
# traffic heatmap
# based on https://github.com/toddmoy/Google-Analytics-Heatmap/blob/master/traffic_heatmap.R
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("RColorBrewer")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("RColorBrewer")
# authorize connection with Google Analytics servers
ga_auth()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000
gadata <- google_analytics(id = ga_id,
start="2012-01-01", end="2016-06-30",
metrics = c("avgSessionDuration"),
dimensions = c("dayOfWeekName", "hour"),
max = 5000)
# order data
gadata$dayOfWeekName <- factor(gadata$dayOfWeekName, levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
gadata <- gadata[order(gadata$dayOfWeekName),]
# convert data frame to xtab
heatmap_data <- xtabs(avgSessionDuration ~ dayOfWeekName + hour, data=gadata)
# plot heatmap
heatmap(heatmap_data,
col=colorRampPalette(brewer.pal(9,"Blues"))(100),
revC=TRUE,
scale="none",
Rowv=NA, Colv=NA,
main="avgSessionDuration by Day and Hour",
xlab="Hour")
In this case, Wednesday morning is the most engaging time of day for users :)
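The xtabs() step is what turns the long query result (one row per day and hour) into the day-by-hour matrix that heatmap() expects. A small sketch with made-up durations for two days and three hours:

```r
demo <- data.frame(
  dayOfWeekName      = rep(c("Monday", "Tuesday"), each = 3),
  hour               = rep(c("08", "09", "10"), times = 2),
  avgSessionDuration = c(60, 75, 90, 40, 120, 55)
)

# Rows become days, columns become hours, cells hold the metric
m <- xtabs(avgSessionDuration ~ dayOfWeekName + hour, data = demo)
m

m["Tuesday", "09"]   # 120
```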
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/6_heatmap.R
Device comparison
Let's check how engaged users are on different device types. To do this, we'll plot two charts:
one showing how many sessions came from each device type, and one showing the
avgSessionDuration (in seconds) for each device type.
# device comparison
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
# authorize connection with Google Analytics servers
ga_auth()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000
gadata <- google_analytics(id = ga_id,
start="2015-01-01", end="2016-06-30",
metrics = c("sessions", "avgSessionDuration"),
dimensions = c("date", "deviceCategory"),
max = 5000)
In this case, the longest sessions came from mobile devices.
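Before plotting, the daily rows need to be summarized per device type. A sketch on made-up data: sessions are summed, while avgSessionDuration is averaged with sessions as weights, since a plain mean of daily averages would give quiet days as much influence as busy ones. The plotting calls are left as comments and assume ggplot2.

```r
demo <- data.frame(
  date               = rep(c("20160101", "20160102"), each = 3),
  deviceCategory     = rep(c("desktop", "mobile", "tablet"), times = 2),
  sessions           = c(100, 50, 10, 300, 150, 30),
  avgSessionDuration = c(60, 120, 90, 80, 100, 70)
)

# One summary row per device type
by_device <- do.call(rbind, lapply(split(demo, demo$deviceCategory), function(d) {
  data.frame(
    deviceCategory     = d$deviceCategory[1],
    sessions           = sum(d$sessions),
    avgSessionDuration = weighted.mean(d$avgSessionDuration, d$sessions)
  )
}))
by_device

# ggplot(by_device, aes(x = deviceCategory, y = sessions)) + geom_col()
# ggplot(by_device, aes(x = deviceCategory, y = avgSessionDuration)) + geom_col()
```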
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/7_device_comparsion.R
Machine Learning
Clustering (k-means)
The power of R lies in its wide range of packages with ready-to-use advanced algorithms. In
this example we'll use k-means for custom user segmentation.
Unsupervised learning: k-means. k-means clustering aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean
(Source: Wikipedia)
Because this example needs a custom installation of Google Analytics tracking (content
grouping, fingerprint), I've prepared a special dataset for this purpose. You can find the
complete code below.
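As a self-contained illustration of the technique (not the dataset mentioned above), the sketch below runs kmeans() from base R on made-up data: two invented engagement features for 20 users in two clearly separated groups. The feature names and values are assumptions for illustration only.

```r
set.seed(42)   # k-means starts from random centers; fix the seed to reproduce

users <- data.frame(
  pagesPerSession = c(rnorm(10, mean = 2, sd = 0.3), rnorm(10, mean = 8, sd = 0.3)),
  sessionDuration = c(rnorm(10, mean = 30, sd = 5),  rnorm(10, mean = 300, sd = 5))
)

# Scale the features so that neither dominates the distance calculation
fit <- kmeans(scale(users), centers = 2, nstart = 10)

users$segment <- fit$cluster
table(users$segment)                         # users per segment
aggregate(. ~ segment, data = users, mean)   # average profile of each segment
```

Scaling matters here because session duration is measured in seconds and would otherwise outweigh pages per session simply because its numbers are larger.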
Results
Results visualized with the plotly package:
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/5_users_segmentation.R
Generating reports
For every analyst, periodic reports can be time-consuming work. We can automate this
process in R and prepare reporting templates. After that you can rerun these reports with a
different time range and save them to, e.g., a .pdf file. Sounds interesting?
Introduction to R Markdown
You can use Markdown as follows:
R Markdown options
---
title: "Monthly report"
output: pdf_document
---
Chunks of code
```{r}
# R Code
```
If you don't want to display the chunk's code in the output file, use the echo = FALSE option.
```{r, echo=FALSE}
# R Code
```
Basic formatting
Headers
# Header 1
## Header 2
### Header 3
will produce
Header 1
Header 2
Header 3
Lists
* element 1
* element 2
* element 3
will produce
element 1
element 2
element 3
1. element 1
2. element 2
3. element 3
will produce
1. element 1
2. element 2
3. element 3
Formatting
*italic*
**bold**
***bold+italic***
will produce
italic bold bold+italic
More resources
Full documentation:
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
Cheat sheet:
www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
Create report
To generate a basic report template, use this code. The report will contain a title and the
sessions-over-time scatter plot from chapter 2 (Data visualization in R).
You can choose the output format of your report: HTML , PDF or Word .
Click OK and delete the sample code.
Result
As a result you'll get a complete HTML file with the report. You can also generate a PDF file.
For recurring reporting you only need to change the dates :)
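For a monthly report, the date range can even be computed instead of typed. A sketch of a helper (previous_month is a name made up here) that returns the first and last day of the month before a given date. The commented query at the end shows how the result might be passed on, following the query form used elsewhere in this book.

```r
previous_month <- function(today = Sys.Date()) {
  first_of_this_month <- as.Date(format(today, "%Y-%m-01"))
  end   <- first_of_this_month - 1            # last day of the previous month
  start <- as.Date(format(end, "%Y-%m-01"))   # first day of the previous month
  list(start = start, end = end)
}

period <- previous_month(as.Date("2016-07-15"))
period$start
period$end

# Hypothetical use in a report:
# gadata <- google_analytics(id = ga_id,
#                            start = as.character(period$start),
#                            end = as.character(period$end),
#                            metrics = "sessions", dimensions = "date", max = 5000)
```

Called without an argument, the helper uses today's date, so the same report can be rerun every month without editing it.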
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/8_rmarkdown_report.Rmd
Additional analysis
Anomaly detection
Use: https://github.com/twitter/AnomalyDetection
Forecasting
Forecast future web traffic using the Holt-Winters method. Inspired by Richard Fergie.
# forecasting using Holt-Winters algorithm
# based on http://www.eanalytica.com/r-for-web-analysts/
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("forecast")
# install.packages("reshape2")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("forecast")
library("reshape2")
# authorize connection with Google Analytics servers
ga_auth()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000
gadata <- google_analytics(id = ga_id,
start="2016-05-01", end="2016-06-30",
metrics = "sessions",
dimensions = "date",
max = 5000)
plot(forecastmodel)
forecast <- forecast(forecastmodel, h=26) # 26 days in future; the generic dispatches to the HoltWinters method
plot(forecast, xlim=c(0,13))
forecastdf <- as.data.frame(forecast)
totalrows <- nrow(gadata) + nrow(forecastdf)
forecastdata <- data.frame(day=c(1:totalrows),
  actual=c(gadata$sessions,rep(NA,nrow(forecastdf))),
  forecast=c(rep(NA,nrow(gadata)-1),tail(gadata$sessions,1),forecastdf$"Point Forecast"),
  forecastupper=c(rep(NA,nrow(gadata)-1),tail(gadata$sessions,1),forecastdf$"Hi 80"),
  forecastlower=c(rep(NA,nrow(gadata)-1),tail(gadata$sessions,1),forecastdf$"Lo 80")
)
ggplot(forecastdata, aes(x=day)) +
geom_line(aes(y=actual),color="black") +
geom_line(aes(y=forecast),color="blue") +
geom_ribbon(aes(ymin=forecastlower,ymax=forecastupper), alpha=0.4, fill="green") +
xlim(c(0,90)) +
xlab("Day") +
ylab("Sessions")
Result
As a result you'll get a chart with predictions of your web traffic.
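The listing above uses forecastmodel, which has to be a fitted Holt-Winters model. A minimal sketch of that step with base R's HoltWinters(), run on generated data and assuming weekly seasonality (frequency = 7). The smoothing parameters are fixed here only to keep the example deterministic; leave them out to let R estimate them.

```r
# Made-up daily sessions: 8 weeks with a weekly pattern and a slight trend
days     <- 1:56
weekly   <- c(20, 35, 40, 42, 38, 30, 15)   # one value per day of the week
sessions <- 100 + 0.5 * days + rep(weekly, times = 8) + (days * 7) %% 5

# A time series with a period of 7 observations
sessions_ts <- ts(sessions, frequency = 7)

forecastmodel <- HoltWinters(sessions_ts, alpha = 0.3, beta = 0.1, gamma = 0.3)

# Point forecasts for the next 14 days
predicted <- predict(forecastmodel, n.ahead = 14)
predicted
```

The frequency argument is the important design choice: it tells the model how many observations make up one seasonal cycle, which for daily web traffic is usually a week.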
Source code
Complete code for this example in GitHub repository:
https://github.com/michalbrys/R-Google-Analytics/blob/master/4_forecasting.R
Resources
Blogs
R Bloggers
Mark Edmondson
Richard Fergie
Michal Brys - blog
...
Documentation
R project - official website
ggplot2 - official website
googleAnalyticsR - R package
Google Analytics - for developers
Online trainings
To learn more about R, I recommend the Coursera MOOC:
R Programming by Johns Hopkins University
Books
A list of books where you can find inspiration for further analysis, with links to free online
versions:
Cookbook for R
www.cookbook-r.com
R for Data Science
r4ds.had.co.nz
Think Stats
greenteapress.com/thinkstats
greenteapress.com/thinkstats2 (2nd edition)