Akhil Ilango
January 2019
## 1. Practice of Computation

Here, we have population data with a discrete uniform distribution. The domain of this distribution is S = {0, 1, 2, ..., K}. All values are equally likely to be observed, each with probability 1/(K + 1), since there are K + 1 terms. Let's take K = 100. We would like to compute the mean and variance of the given data.
K <- 100
x1 <- seq(0,K)
cat('Mean by R function:', mean(x1), '\n')
## Mean by R function: 50
A1 <- sum(x1**2)
A2 <- K*(K+1)*(2*K+1)/6 # the formula is for natural numbers
A1==A2
## [1] TRUE
• using R function
We use the function mean. µ̂ = mean(x).
• using distribution formula
We know that for a discrete uniform distribution x ∼ U[a, b], we have µ = (a + b)/2. If these values are given to us, we can use them directly. Since a and b are the end points, one way to recover them from the data is by taking the minimum and maximum values: a = min(x), b = max(x). Note that here it is the true value and not an estimate, because we are using information about the shape of the true distribution.
• using the sample estimate
An unbiased estimate for any given sample is given by
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Overall, we see that all three methods give us the same answer.
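The agreement can be verified directly; a minimal sketch (variable names as in the code above):

```r
# three ways to compute the mean of x1 = 0, 1, ..., K
K <- 100
x1 <- 0:K
m1 <- mean(x1)                  # R function
m2 <- (min(x1) + max(x1)) / 2   # distribution formula (a + b) / 2
m3 <- sum(x1) / length(x1)      # sample estimate
c(m1, m2, m3)                   # all three equal 50
```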
Second, we compare the variance obtained via three ways:
• using R function
We use the function var: σ̂² = var(x).
• using Formula 1
Here, we use the first formula given in the question
K
1 X 2
σ̂ 2 = (xi − µ̂)
K i=1
(K + 1)(K + 2)
σ̂ 2 =
12
PN
. This is obtained by using the general formula for sum of squares of a sequence i=1 i2 =
N (N +1)(2N +1)
6 .
Overall, we see that all three methods give us the same answer.
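A quick numerical check (a sketch; note that R's var uses the n − 1 denominator, which here equals K):

```r
K <- 100
x1 <- 0:K
v1 <- var(x1)                     # R function, denominator n - 1 = K
v2 <- sum((x1 - mean(x1))^2) / K  # Formula 1
v3 <- (K + 1) * (K + 2) / 12      # closed form via the sum-of-squares identity
c(v1, v2, v3)                     # all equal 858.5
```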
## 2. Simulation

Since this exercise requires sampling many times on multiple occasions, let us define a function. This can help keep our code compact.
# defining a function for sampling
ex2 <- function(data, iterations, n_sample, how){
  set.seed(1)
  mu <- rep(0, iterations)
  # method 1: explicit loop
  # for (i in 1:iterations){
  #   x <- sample(data, n_sample, replace = how)
  #   mu[i] <- mean(x)
  # }
  # method 2: vectorised with replicate
  x <- as.matrix(replicate(iterations, {sample(data, n_sample, replace = how)}))
  mu <- colMeans(x)
  return(mu)
}
In this function, we use set.seed(1) to define a reference point for random number generation. This gives us the same random numbers every time we run the exercise, which is useful for reproducibility of results. The function takes as arguments the data, the number of iterations, the size of each sample, and a logical variable indicating whether to sample with or without replacement. It then samples from the data in each iteration, computes the sample mean of each sample, and returns these values.
(a)
In this exercise, we are concerned with the variance of sample mean and the variance of population. From
a given population data, we draw many samples to study this.
cat('(Variance of population)/(Sample size):', var(data)/n_sample)
The variables data and n_sample are defined as given in the question. We take the number of iterations to be 100. To obtain the estimates of the sample mean, we call the function ex2 that we defined earlier. We know that for a sample mean
$$E[\hat{\mu}] = E\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N} E(x_i) = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu$$
$$\operatorname{Var}[\hat{\mu}] = \operatorname{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(x_i) = \frac{1}{N^2}\sum_{i=1}^{N}\sigma^2 = \frac{\sigma^2}{N}$$
Therefore, the variance of the sample mean is not the same as the variance of the population; it is smaller by a factor of N. This is also the result we get in the computation.
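This can be illustrated with a small simulation (a sketch with illustrative values for data and n_sample):

```r
set.seed(1)
data <- 0:100
n_sample <- 10
# 10^4 sample means, each from a sample of size n_sample
mu_hat <- replicate(10^4, mean(sample(data, n_sample, replace = TRUE)))
sigma2 <- (length(data)^2 - 1) / 12   # population variance of U{0,...,100}: 850
c(var(mu_hat), sigma2 / n_sample)     # simulated variance vs sigma^2 / N
```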
(b)
set.seed(1)
n_sample <- 500
x <- sample(data,n_sample,replace = TRUE)
summary(x)
[Figure: Histogram of x (n=500) — Frequency vs x, x ranging from 30 to 100]
[Figure: Histogram of x (n=5000) — Frequency vs x, x ranging from 30 to 100]
##
## For 100 iterations:
##
## For 10^4 iterations:
As argued in the previous exercise, the variance of the sample mean is not equal to the variance of the data. We also observe that the estimator is asymptotically consistent:
$$\operatorname{Var}[\hat{\mu}] = \operatorname{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{\sigma^2}{N} \xrightarrow{\;p\;} 0 \quad \text{as } N \to \infty$$
We do not obtain a normal distribution for the sample itself: we are sampling with replacement with equal probability of drawing any of the values, so the histogram of x is approximately discrete uniform. This can be seen more clearly as the sample size increases.
(c)
n_sample <- 10
iterations <- 100
mu_rep <- ex2(data, iterations, n_sample, TRUE)
hist(mu_rep, main='Histogram of mu_rep (100 iterations)')
[Figure: Histogram of mu_rep (100 iterations) — Frequency vs mu_rep]
qqnorm(scale(mu_rep))
abline(0,1)
[Figure: Normal Q–Q Plot — Sample Quantiles vs Theoretical Quantiles]
abline(v = mean(mu_rep),
col = 3,
lwd = 2)
abline(v = median(mu_rep),
col = 4,
lwd = 2)
abline(v = c(mean(mu_rep)+2*sqrt(var(mu_rep)),mean(mu_rep)-2*sqrt(var(mu_rep))),
col = 5,
lty = 2,
lwd = 2)
legend('topright',  # placement assumed; the original call is truncated in this extract
       legend = c('Density plot', 'Mean', 'Median', '2 sigma'),
       col = c(2, 3, 4, 5),
       lwd = c(2, 2, 2, 2))
[Figure: Density plot of mu_rep with vertical lines marking the Mean, Median, and mean ± 2 sigma, plus a legend]
qqnorm(scale(mu_rep))
abline(0,1)
[Figure: Normal Q–Q Plot — Sample Quantiles vs Theoretical Quantiles]
Since our sampling is independent across iterations and we draw from the same distribution (uniformly from the same data values), our procedure satisfies the IID assumption. Therefore, applying the Lindeberg–Lévy Central Limit Theorem, we have the approximate result
$$\hat{\mu} \sim N\!\left(\mu, \frac{\sigma^2}{N}\right)$$
We see this result more clearly as we increase the number of iterations. This shows that the estimator is asymptotically normal.
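One rough way to see this numerically is to check the third and fourth moments of the standardized sample means (a sketch; the setup mirrors the exercise, and the moment comparison is an addition here):

```r
set.seed(1)
data <- 0:100
mu_rep <- replicate(10^4, mean(sample(data, 10, replace = TRUE)))
z <- as.numeric(scale(mu_rep))        # standardize the sample means
c(skewness = mean(z^3),               # ~ 0 for a normal distribution
  excess_kurtosis = mean(z^4) - 3)    # ~ 0 for a normal distribution
```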
## 3.
This is an exercise to practice using R to do some calculations.
C <- 25000
r <- 0.05/12
n <- 360 #600
A <- (r*C*(1+r)^n)/((1+r)^n-1)
cat('Monthly payment (30 years): $',A,'\n')
C <- 25000
r <- 0.05/12
n <- 600
A <- (r*C*(1+r)^n)/((1+r)^n-1)
cat('Monthly payment (50 years): $',A,'\n')
## Monthly payment (50 years): $ 113.5347
This exercise also introduces the use of functions. Here, we have defined a function named payment, which takes values of n as input and gives the monthly payment as output. Writing functions is useful when the formula is complicated and the same calculation has to be performed many times.
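The payment formula itself comes from equating the principal to the present value of the n monthly payments (a standard geometric-series step):

$$C = \sum_{t=1}^{n} \frac{A}{(1+r)^t} = A\,\frac{1-(1+r)^{-n}}{r} \quad\Longrightarrow\quad A = \frac{r\,C\,(1+r)^n}{(1+r)^n - 1}$$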
n <-15*12
r <- 6/1200
C <-200000
A <- r*((1+r)**n)*C/(((1+r)**n) -1)
cat('Monthly payment (15 years): $',A,'\n')
n <- (20:50)*12
payment <- function(n){ r*((1+r)**n)*C/(((1+r)**n) -1)}
plot(20:50, payment(n), type="l",xlab = 'years', ylab = 'monthly payment (in $)')
[Figure: monthly payment (in $) vs years — the payment declines from roughly $1400 at 20 years toward $1100 at 50 years]
## 4. Plotting
mu_x <- 1.7
sig_x <- 0.3
mu_e <- 0
sig_e <- 0.8
N <- 6000
x <- rnorm(n = N, mean = mu_x, sd = sig_x)
e <- rnorm(n = N, mean = mu_e, sd = sig_e)
y <- 12+3*x+e
First, we visualize the scatterplot for the generated data. Then, we plot two families of datapoints to explore
the usage of plots in R. The first half of x has been considered as one family and the other half as the second
family. We use different symbols, colors and a legend to highlight the difference.
[Figure: Scatter Plot of y vs x]
# plotting families: draw each half with its own symbol and color
plot(x[1:3000], y[1:3000], type = 'p', pch = 1, cex = 0.5, col = 'black',
     xlab = 'x', ylab = 'y', main = 'Scatter Plot',
     xlim = range(x), ylim = range(y))
points(x[3001:6000], y[3001:6000], pch = 2, cex = 0.5, col = 'red')
[Figure: Scatter Plot of y vs x with a legend distinguishing FamilyBlack (circles) and FamilyRed (triangles)]
Here, we have made a scatterplot using the plot function. type gives the type of plot (e.g., 'p' gives a scatter plot, 'l' gives lines, etc.). The argument pch gives the plotting symbol; different values give different symbols (e.g., pch=1 gives circles, pch=2 gives triangles). This can be useful to differentiate families among the plotted points.
Similarly, the argument cex sets the symbol size and col gives the color of the symbols. The value can be given as a number (which has a pre-defined meaning in R, e.g., col=1 gives black, col=2 gives red) or as a quoted color name.
xlab, ylab and main give the x-axis label, the y-axis label and the plot title, respectively.
A legend is defined in the second plot to help interpret it; adding a legend is good practice.
4.2
We use the lm function in R to fit a linear regression model to the data. The output is an object, i.e., a collection of related pieces of information. We extract the required pieces (e.g., coefficient estimates, coefficient standard errors) using the $ operator as shown below. The contents of such an object can also be inspected in the variables window in R.
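As a small sketch of this extraction (illustrative data; the component names follow base R's lm and summary):

```r
# fit on a tiny illustrative data set
fit <- lm(y ~ x, data = data.frame(x = 1:4, y = c(2.1, 3.9, 6.2, 7.8)))
fit$coefficients            # named vector: (Intercept) and x
summary(fit)$coefficients   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$r.squared      # R-squared of the fit
```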
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.88087 -0.53367 0.00344 0.54541 2.72196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.07827 0.05895 204.88 <2e-16 ***
## x 2.95517 0.03425 86.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7961 on 5998 degrees of freedom
## Multiple R-squared: 0.5538, Adjusted R-squared: 0.5538
## F-statistic: 7445 on 1 and 5998 DF, p-value: < 2.2e-16
[Figure: Scatter Plot of y vs x with the fitted regression line]
We can set the line type with lty (e.g., lty=1 gives a solid line, lty=2 a dashed line) and the line width with lwd. We draw the line in black and the points in grey for good visualization.
4.3
Let us define a function that takes a sample from the given population and plots the regression line over the scatter plot. Here, the arguments are the population data and the sample size. The plot also includes the 95% confidence interval. We combine the concepts of sub-plots and loops. We use the par function to set the style of our subplots; importantly, mfrow = c(m, n) specifies a matrix of plots with m rows and n columns.
set.seed(1)
iterations <- 9
i <- 1
n <- 20
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2 , cex.main = 1, mar = c(2, 2, 1, 0),
oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0));
# mar, oma, mgp to define margins.
# Please See https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/par
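The helper ex3 called in the loops below is not shown in this extract; the following is a hedged sketch of what it presumably does (draw a sample of size n, fit OLS, and plot the fit with a 95% confidence band; it reads the loop counter i for the panel title):

```r
# hypothetical reconstruction of ex3 -- the original definition is missing here
ex3 <- function(x, y, n){
  idx <- sample(seq_along(x), n)            # sample n points from the population
  xs <- x[idx]; ys <- y[idx]
  fit <- lm(ys ~ xs)                        # OLS fit on the subsample
  ci  <- predict(fit, interval = 'confidence', level = 0.95)
  ord <- order(xs)
  plot(xs, ys, pch = 16, cex = 0.4, col = 'grey60', axes = FALSE,
       xlab = '', ylab = '', xlim = c(0.9, 2.4), ylim = c(15, 19),
       main = paste('Sample', i))
  lines(xs[ord], ci[ord, 'fit'], lwd = 2)   # regression line
  lines(xs[ord], ci[ord, 'lwr'], lty = 2)   # lower 95% band
  lines(xs[ord], ci[ord, 'upr'], lty = 2)   # upper 95% band
}
```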
[Figure: For n = 20 — 3×3 grid of scatter plots (Sample 1 to Sample 9), each with fitted regression line and 95% confidence band; shared x axis and y axis labels]
4.4
We see that the variance of the OLS estimates increases as the sample size decreases. This can be seen by plotting the 95% confidence bands around the estimates. Below, we plot the case where we sample 500 datapoints instead of 20; the difference is evident from comparing the two sets of graphs. Note that the axis ranges are the same in both, so the graphs are visually comparable.
n <- 500
i <- 1
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2 , cex.main = 1, mar = c(2, 2, 1, 0),
oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0));
while (i <= iterations){
ex3(x,y,n)
if (i %in% c(7, 8, 9)){ axis(1, col = "grey40", col.axis = "grey20",
at = seq(0.9, 2.4, 0.3)) }
if (i %in% c(1, 4,7)){ axis(2, col = "grey40", col.axis = "grey20",
at = seq(15, 19, 1)) }
box(col = "grey60")
i <- i+1
}
mtext("x axis", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("y axis", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 500", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
[Figure: For n = 500 — 3×3 grid of scatter plots (Sample 1 to Sample 9), each with fitted regression line and 95% confidence band; shared x axis and y axis labels]
We also see that the estimates deviate from the true values in small samples even though, theoretically, this should not be the case if the same assumptions hold. One explanation could be a violation of the exogeneity or homoskedasticity assumption in the small sample.
Let us compare the distribution of the dependent variable. For this, let us define another function.
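The helpers ex34 and ex34q used below are likewise not shown in this extract; these are plausible sketches, assuming they draw a subsample and plot the density and the normal Q-Q plot of y respectively (reading the loop counter i for the panel title):

```r
# hypothetical reconstructions -- the original definitions are missing here
ex34 <- function(x, y, n){
  idx <- sample(seq_along(x), n)
  plot(density(y[idx]), axes = FALSE, xlab = '', ylab = '',
       main = paste('Sample', i))            # density of y in the subsample
}
ex34q <- function(x, y, n){
  idx <- sample(seq_along(x), n)
  qqnorm(as.numeric(scale(y[idx])), axes = FALSE, xlab = '', ylab = '',
         main = paste('Sample', i))          # normal Q-Q plot of the subsample
  abline(0, 1)                               # 45-degree reference line
}
```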
set.seed(1)
i <- 1
n <- 20
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2 , cex.main = 1, mar = c(1, 1, 1, 0),
oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0));
while (i <= iterations){
ex34(x,y,n)
box(col = "grey60")
i <- i+1
}
mtext("y", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Probability Density", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 20", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
[Figure: For n = 20 — 3×3 grid of density plots of y (Sample 1 to Sample 9)]
set.seed(1)
i <- 1
n <- 20
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2 , cex.main = 1, mar = c(1, 1, 1, 0),
oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0));
while (i <= iterations){
ex34q(x,y,n)
box(col = "grey60")
i <- i+1
}
mtext("Theoretical quantiles", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Sample quantiles", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 20", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
[Figure: For n = 20 — 3×3 grid of normal Q–Q plots of y (Sample 1 to Sample 9), sample quantiles vs theoretical quantiles]
set.seed(1)
i <- 1
n <- 500
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2 , cex.main = 1, mar = c(1, 1, 1, 0),
oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0));
while (i <= iterations){
ex34(x,y,n)
box(col = "grey60")
i <- i+1
}
mtext("y", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Probability Density", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 500", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
[Figure: For n = 500 — 3×3 grid of density plots of y (Sample 1 to Sample 9)]
set.seed(1)
i <- 1
n <- 500
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2 , cex.main = 1, mar = c(1, 1, 1, 0),
oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0));
while (i <= iterations){
ex34q(x,y,n)
box(col = "grey60")
i <- i+1
}
mtext("Theoretical quantiles", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Sample quantiles", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 500", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
[Figure: For n = 500 — 3×3 grid of normal Q–Q plots of y (Sample 1 to Sample 9), sample quantiles vs theoretical quantiles]