
Statistics and Machine Learning I

Week 5: From data to functions


Coursework 5: Evidence for Global Warming

Luis Da Silva (more info: http://luisdasilva.me)

November 11, 2018

1 Introduction
In this coursework we explore the concept of Gaussian Processes
(GPs) and apply them to data on sea ice cover in the northern polar region
to investigate whether there is evidence for global warming. Our main
objectives are:

• Briefly describe the theory behind Gaussian Processes.

• Fit a GP regression model to sea ice cover data.

• Discuss whether there is a change in sea ice cover between 1990 and
2004.

2 Gaussian Processes
By Gaussian Process we refer to a stochastic process in which we assume
that the vector of targets comes from a multivariate normal distribution; thus,
instead of choosing a parametric function, we choose a mean vector (µ) and a
covariance matrix (C) [4]. This is a generalisation of the normal probability
distribution in the sense that, to describe how the data are connected (i.e. the
underlying generative function), we do not need to compute the infinitely many
possible functions, only their properties [1].

Since we have to choose a mean and a covariance function in order to compute
the GP, this choice is called the GP prior, and it reflects our current knowledge
(or beliefs) about the data. A common (and arguably reasonable) choice for the mean
function is µ(x_n) = 0, which reflects no previous knowledge about the data. For
the covariance function, a popular choice is the Radial Basis Function.

2.1 Radial Basis Function as a Kernel

A Radial Basis Function (RBF) is a function whose value depends only on the
distance (usually the Euclidean distance) from a predefined point (usually the
origin) [6]. We can build the covariance matrix we need by implementing:

c(x_n, x_m) = α exp(−γ (x_n − x_m)²)    (1)

Alternatively, by setting α = A² and γ = 1/(2l²), we can rewrite equation 1 as:

k_RBF(x_n, x_m) = A² exp(−(x_n − x_m)² / (2l²))    (2)

where A may be interpreted as an amplitude (variance) measure and l as a
length-scale measure. These parameters are explained further in section 3.3.
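To make equation 2 concrete, here is a minimal numpy sketch (not part of the
coursework code; the input grid and the values of A and l below are arbitrary
choices) that builds the covariance matrix and draws a few functions from a
zero-mean GP prior:

import numpy as np

def rbf_cov(x, A=1.0, l=1.0):
    """Covariance matrix of equation 2 for a 1-D input grid x."""
    diff = x[:, None] - x[None, :]              # pairwise differences x_n - x_m
    return A**2 * np.exp(-diff**2 / (2 * l**2))

x = np.linspace(0, 10, 50)                      # arbitrary input grid
C = rbf_cov(x, A=2.0, l=1.5)                    # covariance matrix from the RBF kernel
C += 1e-10 * np.eye(len(x))                     # small jitter for numerical stability
mu = np.zeros(len(x))                           # zero-mean GP prior
samples = np.random.multivariate_normal(mu, C, size=3)  # three functions from the prior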

3 Data
Global warming is everywhere in the news these days, since it appears to be a
very harmful process the Earth is going through. Worse still, 'Most climate
scientists agree the main cause of the current global warming trend is hu-
man expansion of the "greenhouse effect"' [2]. NASA also claims there are
already observable effects of the phenomenon, so let's dig deeper into it.

The website https://www.kaggle.com has data available on sea ice cover
in the northern polar region. The dataset we are working with consists of 6
variables: Day, Month, Decade, Mean.Extent and Var.Extent; and 48 obser-
vations: 2 per month for each of 1990 and 2004.
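Before fitting, each observation's Day and Month are converted into a day-of-year
index, stored in a column called 'odate'. The snippet below mirrors the
preprocessing in the full code (section 4): it subtracts 31 December of the
previous year from each date.

# Convert (Decade, Month, Day) into a day-of-year index 'odate'
dates = pd.DataFrame({'year': data['Decade'],
                      'month': data['Month'],
                      'day': data['Day']})
i_dates = pd.DataFrame({'year': data['Decade'] - 1,
                        'month': 12,
                        'day': 31})
data['odate'] = (pd.to_datetime(dates) - pd.to_datetime(i_dates)).dt.days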

3.1 Fitting a Gaussian Process Regression Model to Global Warming data

As we are interested in finding out whether there is a difference between the ice
minima reached in 1990 and 2004, we need to fit a separate GP model for
each year. The Python package 'sklearn' has the tools needed to fit a GP model
to our data without implementing the mathematics ourselves. After
pre-loading the required libraries and data as specified in section 4, we can
run the following code to fit a GP model to our data:
# Gaussian Process Regression
A_list = {}
L_list = {}
jplot = {}
five_samples = {}
n_samples = 100000
minima = {}
for decade in [1990, 2004]:
    subdata = data[data['Decade'] == decade]
    # Set regressor
    x = subdata[['odate']]
    y = subdata['Mean.Extent']
    x_dates = np.linspace(1, 365, 365)

    iL = int((np.amax(x) - np.amin(x)) / 10)
    iA = y.std()
    iScale = iA * iA
    rbf_kernel = iScale * RBF(length_scale=iL,
                              length_scale_bounds=(iL / 20, 5 * iL))
    gp = GaussianProcessRegressor(kernel=rbf_kernel,
                                  alpha=subdata['Var.Extent'])

    # Fit model
    gp.fit(x, y)
    fitted_kernel = gp.kernel_
    fitted_params = fitted_kernel.get_params()
    A_list[decade] = math.sqrt(fitted_params["k1__constant_value"])
    L_list[decade] = fitted_params["k2__length_scale"]

    # Get samples
    y_samples = gp.sample_y(x[:np.newaxis], n_samples)
    minima[decade] = np.array([min(sample) for sample in
                               np.transpose(y_samples)])

    # Plot
    date_col_vec = x_dates[:, np.newaxis]
    y_mean, y_std = gp.predict(date_col_vec,
                               return_std=True)

    fig = plt.figure(figsize=[8, 5])
    plt.plot(x_dates, y_mean, color=colors[0])
    plt.fill_between(x_dates, y_mean - y_std,
                     y_mean + y_std, alpha=0.3,
                     color=colors[1])

    y_samples = gp.sample_y(date_col_vec, 5)
    plt.plot(x_dates, y_samples, color=colors[-1])

    jplot[decade] = [x_dates, y_samples, y_mean, y_std]

    pdf = PdfPages('Coursework/{}_samples.pdf'.format(str(decade)))
    pdf.savefig(fig)
    pdf.close()

# Plot them both together
fig = plt.figure(figsize=[8, 5])

plt.plot(jplot[1990][0], jplot[1990][1], color='blue')
plt.fill_between(jplot[1990][0], jplot[1990][2] - jplot[1990][3],
                 jplot[1990][2] + jplot[1990][3], alpha=0.3,
                 color='blue')

plt.plot(jplot[2004][0], jplot[2004][1], color='red')
plt.fill_between(jplot[2004][0], jplot[2004][2] - jplot[2004][3],
                 jplot[2004][2] + jplot[2004][3], alpha=0.3,
                 color='red')

blue_patch = mpatches.Patch(color='blue', label='1990')
red_patch = mpatches.Patch(color='red', label='2004')
plt.legend(handles=[blue_patch, red_patch])

plt.xlabel('Day of the year')
plt.ylabel('Ice extension')

pdf = PdfPages('Coursework/joint_samples.pdf')
pdf.savefig(fig)
pdf.close()

[Figure 1: GP models per decade, 1990 (blue) and 2004 (red); x-axis: day of the year, y-axis: ice extension.]

This gives us figure 1 as output. We can already see a difference between the
1990 and 2004 distributions: 2004 appears to lie almost always below 1990.

[Figure 2: Ice minima distribution. KDE and histogram for 1990 and 2004; x-axis: ice minima, y-axis: density.]

3.2 2004 versus 1990 ice minima distribution

By running the above code, we also collected 100,000 samples of the global
minimum from each GP model. Figure 2 shows the distribution of these minima,
estimated with a univariate Kernel Density Estimate (KDE), a 'non parametric
way to estimate probability density from a random variable' [3], and with a
histogram. The shape of these distributions is no surprise, since they were
taken from a GP model and any random subset of it follows a normal distribution,
but we do see that the KDE for 2004 is shifted to the left of the KDE for 1990;
is this difference real?

To shed some light on this question, let's look at the means of both
distributions: µ̂_1990 ≈ 6.63 while µ̂_2004 ≈ 5.33. There seems to be a difference
of about 1.3, but to see whether that difference is statistically significant we
have to test the null hypothesis that µ_1990 = µ_2004 or, equivalently,
µ_1990 − µ_2004 = 0. As we cannot assume equal variances, we may test this
hypothesis with Welch's t-test. In Python, that is:

sp.stats.ttest_ind(minima[1990], minima[2004],
                   equal_var=False)

This gives us a statistic of 1478.31 (p-value ≈ 0), allowing us to
reject the null hypothesis and conclude that there is a difference between the
minima of 1990 and 2004 (or at least between the models fitted to those
years).

This raises the question: how big is the difference? To answer it, let's plot
the distribution of the difference in the minima and obtain a 95% confidence
interval from it:
# Comparing 2004 and 1990 minima distribution
minima['comparison'] = minima[1990] - minima[2004]
ci = sp.stats.norm.interval(0.95)
dif = [minima['comparison'].mean() + interval * minima['comparison'].std()
       for interval in ci]

fig = plt.figure(figsize=[8, 5])
plt.hist(minima['comparison'], alpha=0.5, bins='fd')
plt.axvline(x=dif[0], color='red')
plt.axvline(x=dif[1], color='red')

plt.xlabel('Ice difference between 1990 and 2004')
plt.ylabel('Counts')

# PDF
pdf = PdfPages('Coursework/dif_minima_dist.pdf')
pdf.savefig(fig)
pdf.close()

As figure 3 shows, the difference lies somewhere between 0.55 and 2.05,
which implies between 8.32% and 30.98% less ice.
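For reference, these percentages simply come from dividing the interval
endpoints by the 1990 mean minimum, as in the last print statement of the
full code in section 4:

# Fraction of the 1990 mean minimum spanned by the interval endpoints
pct_less_ice = np.array(dif) / minima[1990].mean()   # roughly [0.083, 0.310]
print(pct_less_ice)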

[Figure 3: Minima difference distribution with 95% confidence interval; x-axis: ice difference between 1990 and 2004, y-axis: counts.]

3.3 Parameters A and l

While the meaning of the parameters A and l might seem fuzzy, we can interpret
them as measures of how flexible our function is. A is usually initialised from
the sample standard deviation and measures how far we would expect our functions
to stray from the mean; the higher this parameter, the more 'explosive' the
functions become. On the other hand, l can be thought of as a measure of how
strongly the function is constrained, so lowering its value leads to a more
complex function.

The whole point of fitting these parameters is to maximise the likelihood of the
observed data under the resulting function. With that objective in mind, figure 4a
gives some insight into how the log likelihood changes as we vary A and l
together. As this 3D surface is difficult to visualise on a 2D screen/sheet,
figure 4b shows how the likelihood changes as we move A for a fixed value of l
(chosen to maximise the likelihood); the black dot marks the likelihood maximum.
Similarly, figure 4c shows the effect of l on the likelihood for a fixed value of A.
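As a rough sketch of how such a likelihood grid can be evaluated with sklearn
(assuming gp is the fitted GaussianProcessRegressor from section 3.1; the grid
sizes below are arbitrary and smaller than those used in section 4, and note
that log_marginal_likelihood expects the natural logarithm of the kernel
hyperparameters, where the kernel's constant term equals A²):

A_grid = np.linspace(1, 20, 50)     # candidate amplitudes (arbitrary grid)
l_grid = np.linspace(1, 200, 50)    # candidate length-scales (arbitrary grid)

# theta = [log(constant_value), log(length_scale)], with constant_value = A**2
loglik = np.array([[gp.log_marginal_likelihood(np.log([A**2, l]))
                    for l in l_grid]
                   for A in A_grid])

# Indices of the (A, l) pair that maximises the log marginal likelihood
best_A, best_l = np.unravel_index(loglik.argmax(), loglik.shape)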

[Figure 4: Effect of A and l on the log likelihood. (a) Simultaneous A and l variation; (b) A variation with fixed l; (c) l variation with fixed A.]

4 Full Code

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from statsmodels.nonparametric.kde import KDEUnivariate  # used for the KDE plot of the minima
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.patches as mpatches
from mpl_toolkits.mplot3d import Axes3D
import math
import pathlib

pathlib.Path('Coursework').mkdir(exist_ok=True)

data = pd.read_csv("All_GP_training.csv")
data.head()

dates = pd.DataFrame({'year': data['Decade'],
                      'month': data['Month'],
                      'day': data['Day']})
i_dates = pd.DataFrame({'year': data['Decade'] - 1,
                        'month': 12,
                        'day': 31})
data['odate'] = (pd.to_datetime(dates) - pd.to_datetime(i_dates)).dt.days
data.head()

colors = plt.get_cmap(name='Paired').colors

# Gaussian Process Regression
A_list = {}
L_list = {}
jplot = {}
five_samples = {}
n_samples = 100000
minima = {}
for decade in [1990, 2004]:
    subdata = data[data['Decade'] == decade]
    # Set regressor
    x = subdata[['odate']]
    y = subdata['Mean.Extent']
    x_dates = np.linspace(1, 365, 365)

    iL = int((np.amax(x) - np.amin(x)) / 10)
    iA = y.std()
    iScale = iA * iA
    rbf_kernel = iScale * RBF(length_scale=iL,
                              length_scale_bounds=(iL / 20, 5 * iL))
    gp = GaussianProcessRegressor(kernel=rbf_kernel,
                                  alpha=subdata['Var.Extent'])

    # Fit model
    gp.fit(x, y)
    fitted_kernel = gp.kernel_
    fitted_params = fitted_kernel.get_params()
    A_list[decade] = math.sqrt(fitted_params["k1__constant_value"])
    L_list[decade] = fitted_params["k2__length_scale"]

    # Get samples
    y_samples = gp.sample_y(x[:np.newaxis], n_samples)
    minima[decade] = np.array([min(sample) for sample in
                               np.transpose(y_samples)])

    # Plot
    date_col_vec = x_dates[:, np.newaxis]
    y_mean, y_std = gp.predict(date_col_vec,
                               return_std=True)

    fig = plt.figure(figsize=[8, 5])
    plt.plot(x_dates, y_mean, color=colors[0])
    plt.fill_between(x_dates, y_mean - y_std,
                     y_mean + y_std, alpha=0.3,
                     color=colors[1])

    y_samples = gp.sample_y(date_col_vec, 5)
    plt.plot(x_dates, y_samples, color=colors[-1])

    jplot[decade] = [x_dates, y_samples, y_mean, y_std]

    pdf = PdfPages('Coursework/{}_samples.pdf'.format(str(decade)))
    pdf.savefig(fig)
    pdf.close()

# Plot them both together
fig = plt.figure(figsize=[8, 5])

plt.plot(jplot[1990][0], jplot[1990][1], color='blue')
plt.fill_between(jplot[1990][0], jplot[1990][2] - jplot[1990][3],
                 jplot[1990][2] + jplot[1990][3], alpha=0.3,
                 color='blue')

plt.plot(jplot[2004][0], jplot[2004][1], color='red')
plt.fill_between(jplot[2004][0], jplot[2004][2] - jplot[2004][3],
                 jplot[2004][2] + jplot[2004][3], alpha=0.3,
                 color='red')

blue_patch = mpatches.Patch(color='blue', label='1990')
red_patch = mpatches.Patch(color='red', label='2004')
plt.legend(handles=[blue_patch, red_patch])

pdf = PdfPages('Coursework/joint_samples.pdf')
pdf.savefig(fig)
pdf.close()

# Plot minima distribution
fig = plt.figure(figsize=[8, 5])
color = ['blue', 'orange']
count = 0

for decade in [1990, 2004]:
    kde = KDEUnivariate(tuple(minima[decade]))
    kde.fit()
    plt.hist(minima[decade], alpha=0.25,
             label='Histogram {}'.format(decade),
             bins='fd', density=True,
             color=color[count])
    plt.plot(kde.support, kde.density,
             label='KDE {}'.format(decade),
             color=color[count])
    count += 1

plt.xlabel('Ice minima distribution')
plt.ylabel('Density')
plt.legend()

# PDF
pdf = PdfPages('Coursework/minima_dist.pdf')
pdf.savefig(fig)
pdf.close()

# Run t-test on means
sp.stats.ttest_ind(minima[1990], minima[2004], equal_var=False)

# Comparing 2004 and 1990 minima distribution
minima['comparison'] = minima[1990] - minima[2004]
ci = sp.stats.norm.interval(0.95)
dif = [minima['comparison'].mean() + interval * minima['comparison'].std()
       for interval in ci]

fig = plt.figure(figsize=[8, 5])
plt.hist(minima['comparison'], alpha=0.5, bins='fd')
plt.axvline(x=dif[0], color='red')
plt.axvline(x=dif[1], color='red')

# PDF
pdf = PdfPages('Coursework/dif_minima_dist.pdf')
pdf.savefig(fig)
pdf.close()

print(minima[1990].mean(),
      minima[2004].mean(),
      minima[1990].mean() - minima[2004].mean(),
      dif,
      np.array(dif) / minima[1990].mean())

# ### Log likelihood

# Compute the log likelihood over a grid of A and l values
A_vals = np.linspace(1, 20, 200)
l_vals = np.linspace(1, 200, 200)
logA2 = np.log(A_vals**2)
logl = np.log(l_vals)
vals = []
for a2 in logA2:
    rows = []
    for ll in logl:
        rows.append(gp.log_marginal_likelihood([a2, ll]))
    vals.append(rows)

# Get indexes of the maximum value
ma = np.amax(vals)
for val in range(len(vals)):
    try:
        idx = {'A': val, 'l': vals[val].index(ma)}
        break
    except ValueError:
        pass
idx

# 3D plot of the log likelihood over A and l
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# plot_surface needs 2-D coordinate grids matching vals[i][j] = (A_vals[i], l_vals[j])
A_mesh, l_mesh = np.meshgrid(A_vals, l_vals, indexing='ij')
surf = ax.plot_surface(A_mesh, l_mesh, np.array(vals), cmap=cm.coolwarm,
                       linewidth=0)
fig.colorbar(surf, shrink=0.5, aspect=5)
pdf = PdfPages('Coursework/3dloglikelihood.pdf')
plt.xlabel('A')
plt.ylabel('l')
pdf.savefig(fig)
pdf.close()

# Plot l given maximum at A
fig = plt.figure(figsize=[8, 5])

plt.plot(l_vals, vals[idx['A']])
plt.plot(l_vals[idx['l']], vals[idx['A']][idx['l']], 'o', color='black')

pdf = PdfPages('Coursework/lLogLikelihooh.pdf')
plt.xlabel('l')
plt.ylabel('Log Likelihood')
pdf.savefig(fig)
pdf.close()

# Plot A given maximum at l
vals_t = np.transpose(vals)

fig = plt.figure(figsize=[8, 5])

plt.plot(A_vals, vals_t[idx['l']])
plt.plot(A_vals[idx['A']], vals_t[idx['l']][idx['A']],
         'o', color='black')
pdf = PdfPages('Coursework/ALogLikelihooh.pdf')
plt.xlabel('A')
plt.ylabel('Log Likelihood')
pdf.savefig(fig)
pdf.close()

References
[1] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for
Machine Learning. MIT Press, first edition, 2006.

[2] NASA Climate. URL: https://climate.nasa.gov/causes/ (accessed: 31/10/2018).

[3] Kernel density estimation. URL: https://en.wikipedia.org/wiki/Kernel_density_estimation
(accessed: 31/10/2018).

[4] Simon Rogers and Mark Girolami. A First Course in Machine Learning. CRC Press,
2nd edition, 2017.

[5] Dr. Mark Muldoon. Statistics and Machine Learning 1, lecture 5 slides
(dated 29/10/2018).

[6] Radial basis function. URL: https://en.wikipedia.org/wiki/Radial_basis_function
(accessed: 30/10/2018).
