Statistics Using R Tutorial

R tutorial
2. Measures of Central Tendency

1. Suppose the yearly number of whales beached in Chennai during the period 1990 to 1999
is 74, 122, 235, 111, 292, 111, 211, 133, 156, 79. What is the mean, the variance, the
standard deviation?
Solution:
> whale = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
# Mean
> mean (whale)
[1] 152.4
# Median
> median(whale)
[1] 127.5
# Mode
> y<-table(whale)
> y
whale
 74  79 111 122 133 156 211 235 292
  1   1   2   1   1   1   1   1   1
> names(table(whale))[table(whale)==max(table(whale))]
[1] "111"
# Variance
> var(whale)
[1] 5113.378
# Standard Deviation
> sqrt(var(whale))
[1] 71.50789
> sqrt( sum( (whale - mean(whale))^2 /(length(whale)-1)))
[1] 71.50789
> std = function(x) sqrt(var(x))
> std(whale)
[1] 71.50789
> sd(whale)
[1] 71.50789
# Quartile
> quantile(whale)
    0%    25%    50%    75%   100%
 74.00 111.00 127.50 197.25 292.00
> summary(whale)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   74.0   111.0   127.5   152.4   197.2   292.0
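Base R has no built-in function for the statistical mode (the built-in mode() reports an object's storage type instead), which is why the solution above goes through table(). A minimal sketch of a reusable helper follows; the name stat_mode is hypothetical and not part of R.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)                              # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])
}

whale <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
stat_mode(whale)   # 111, the only value that occurs twice
```

If several values tie for the highest count, this sketch returns all of them.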
3. Binomial, Normal and Poisson Distributions
Binomial Distribution
The binomial distribution is a discrete probability distribution. It describes the outcome
of n independent trials in an experiment. Each trial is assumed to have only two outcomes,
either success or failure. If the probability of a successful trial is p, then the probability of
having x successful outcomes in an experiment of n independent trials is as follows.
$$ f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n $$
Problem
2. Suppose there are twelve multiple choice questions in an English class quiz. Each
question has five possible answers, and only one of them is correct. Find the probability
of having four or fewer correct answers if a student attempts to answer every question at
random.
Solution
Since only one out of five possible answers is correct, the probability of answering a question
correctly at random is 1/5 = 0.2. We can find the probability of having exactly 4 correct
answers by random attempts as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1328756
To find the probability of having four or fewer correct answers by random attempts, we apply the
function dbinom with x = 0,…,4 and add up the results.
> dbinom(0, size=12, prob=0.2) +
dbinom(1, size=12, prob=0.2) +
dbinom(2, size=12, prob=0.2) +
dbinom(3, size=12, prob=0.2) +
dbinom(4, size=12, prob=0.2)
[1] 0.9274
Alternatively, we can use the cumulative probability function for binomial distribution pbinom.
> pbinom(4, size=12, prob=0.2)
[1] 0.92744
Answer
The probability of four or fewer questions answered correctly at random in a twelve question
multiple choice quiz is 92.7%.
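Because dbinom is vectorized over its first argument, the term-by-term sum can also be written in one call. The sketch below checks that this sum agrees with pbinom and writes out the binomial formula once by hand; the variable names are illustrative only.

```r
# Sum P(X = 0), ..., P(X = 4) in a single vectorized call.
p_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))

# Cumulative probability P(X <= 4) from the distribution function.
p_cdf <- pbinom(4, size = 12, prob = 0.2)

all.equal(p_sum, p_cdf)   # TRUE: both compute the same probability

# The binomial formula evaluated by hand for exactly 4 correct answers.
p_4 <- choose(12, 4) * 0.2^4 * (1 - 0.2)^(12 - 4)
all.equal(p_4, dbinom(4, size = 12, prob = 0.2))   # TRUE
```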
The Binomial Distribution
The binomial distribution applies when counting the number of outcomes of a given type
from a prespecified number n of independent trials, each with two possible outcomes and the same
probability p of the outcome of interest.
In R, the function dbinom returns this probability. There are three required arguments: the
value(s) for which to compute the probability (j), the number of trials (n), and the success
probability for each trial (p). For example, here we find the complete distribution when n = 5 and
p = 0.1
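A minimal sketch of that computation, assuming the three arguments are passed as the vector of counts j = 0, ..., n, the number of trials and the success probability (the variable names are illustrative):

```r
n <- 5      # number of trials
p <- 0.1    # success probability for each trial
j <- 0:n    # every possible number of successes

probs <- dbinom(j, size = n, prob = p)

# One row per outcome; the probabilities of a complete distribution sum to 1.
data.frame(j = j, probability = round(probs, 5))
sum(probs)
```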
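# Density of the Student t distribution (dt) with 10 and 50 degrees of freedom
> x <- seq(-20,20,by=.5)
> y <- dt(x,df=10)
> plot(x,y)
> y <- dt(x,df=50)
> plot(x,y)
# Binomial probabilities (dbinom) for 50 trials with p = 0.2 and p = 0.6
> x <- seq(0,50,by=1)
> y <- dbinom(x,50,0.2)
> plot(x,y)
> y <- dbinom(x,50,0.6)
> plot(x,y)
# Binomial probabilities for 100 trials with p = 0.6
> x <- seq(0,100,by=1)
> y <- dbinom(x,100,0.6)
> plot(x,y)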
Cumulative Probability Distributions
> pbinom(24,50,0.5)
[1] 0.4438624
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,51,0.5)
[1] 0.5
> pbinom(26,51,0.5)
[1] 0.610116
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,50,0.25)
[1] 0.999962
> pbinom(25,500,0.25)
[1] 4.955658e-33
Next we have the inverse cumulative probability distribution function:
> qbinom(0.5,51,1/2)
[1] 25
> qbinom(0.25,51,1/2)
[1] 23
> pbinom(23,51,1/2)
[1] 0.2879247
> pbinom(22,51,1/2)
[1] 0.200531
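The two pbinom calls above show how qbinom picks its answer: it returns the smallest x whose cumulative probability reaches the requested value. A short sketch that reproduces this search by hand (variable names are illustrative):

```r
p   <- 0.25
x   <- 0:51
cdf <- pbinom(x, size = 51, prob = 0.5)   # cumulative probability at every x

# First x at which the cumulative probability is at least p.
smallest <- min(x[cdf >= p])

smallest == qbinom(p, size = 51, prob = 0.5)   # TRUE
```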
Finally random numbers can be generated according to the binomial distribution:
> rbinom(5,100,.2)
[1] 22 16 17 22 25
> rbinom(5,100,.7)
[1] 66 65 67 67 65
Normal Distribution
There are four functions that can be used to generate the values associated with the normal
distribution. You can get a full list of them and their options using the help command:
> help(Normal)
The first function we look at is dnorm. Given a set of values it returns the height of the
probability distribution at each point. If you only give the points it assumes you want to use a
mean of zero and standard deviation of one. There are options to use different values for the
mean and standard deviation, though:
> dnorm(0)
[1] 0.3989423
> dnorm(0)*sqrt(2*pi)
[1] 1
> dnorm(0,mean=4)
[1] 0.0001338302
> dnorm(0,mean=4,sd=10)
[1] 0.03682701
> v <- c(0,1,2)

> dnorm(v)
[1] 0.39894228 0.24197072 0.05399097
> x <- seq(-20,20,by=.1)
> y <- dnorm(x)
> plot(x,y)
> y <- dnorm(x,mean=2.5,sd=0.1)

> plot(x,y)
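The second function we look at is pnorm. Given a number or a vector of numbers, it returns the
probability that a normally distributed random number is less than that number, i.e. the
cumulative distribution function:
> pnorm(0)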
[1] 0.5
> pnorm(1)
[1] 0.8413447
> pnorm(0,mean=2)
[1] 0.02275013
> pnorm(0,mean=2,sd=3)
[1] 0.2524925
> v <- c(0,1,2)
> pnorm(v)
[1] 0.5000000 0.8413447 0.9772499
> x <- seq(-20,20,by=.1)
> y <- pnorm(x)
> plot(x,y)
> y <- pnorm(x,mean=3,sd=4)
> plot(x,y)
If you wish to find the probability that a number is larger than the given number you can use
the lower.tail option:
> pnorm(0,lower.tail=FALSE)
[1] 0.5
> pnorm(1,lower.tail=FALSE)
[1] 0.1586553
> pnorm(0,mean=2,lower.tail=FALSE)
[1] 0.9772499
The next function we look at is qnorm, which is the inverse of pnorm. The idea behind qnorm is
that you give it a probability, and it returns the number whose cumulative distribution matches
the probability. For example, if you have a normally distributed random variable with mean zero
and standard deviation one, then if you give the function a probability it returns the associated
Z-score:
> qnorm(0.5)
[1] 0
> qnorm(0.5,mean=1)
[1] 1
> qnorm(0.5,mean=1,sd=2)
[1] 1
> qnorm(0.5,mean=2)
[1] 2
> qnorm(0.5,mean=2,sd=2)
[1] 2
> qnorm(0.25,mean=2,sd=2)
[1] 0.6510205
> qnorm(0.333)
[1] -0.4316442
> qnorm(0.333,sd=3)
[1] -1.294933
> qnorm(0.75,mean=5,sd=2)
[1] 6.34898
> v = c(0.1,0.3,0.75)
> qnorm(v)
[1] -1.2815516 -0.5244005 0.6744898
> x <- seq(0,1,by=.05)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)

> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)
3. Assume that the test scores of a college entrance exam fit a normal distribution.
Furthermore, the mean test score is 72, and the standard deviation is 15.2. What is the
percentage of students scoring 84 or more in the exam?
Solution
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are
interested in the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)

[1] 0.2149176
Answer
The percentage of students scoring 84 or more in the college entrance exam is 21.5%.
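The same upper-tail probability can be reached in two other ways: by standardizing the score to a Z-score first, or by subtracting the lower tail from one. The sketch below compares the three routes; the variable names are illustrative.

```r
mu    <- 72     # mean test score
sigma <- 15.2   # standard deviation

# Route 1: upper tail directly, as in the solution above.
p_direct <- pnorm(84, mean = mu, sd = sigma, lower.tail = FALSE)

# Route 2: standardize, then use the standard normal distribution.
z     <- (84 - mu) / sigma
p_std <- pnorm(z, lower.tail = FALSE)

# Route 3: complement of the lower tail.
p_comp <- 1 - pnorm(84, mean = mu, sd = sigma)

all.equal(p_direct, p_std)    # TRUE
all.equal(p_direct, p_comp)   # TRUE
```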
Poisson Distribution
The Poisson distribution is the probability distribution of independent event occurrences in
an interval. If λ is the mean occurrence per interval, then the probability of
having x occurrences within a given interval is:
$$ f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots $$
Problem
If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution
The probability of having sixteen or fewer cars crossing the bridge in a particular minute is
given by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.898709
Hence the probability of having seventeen or more cars crossing the bridge in a minute is in
the upper tail of the probability distribution.
> ppois(16, lambda=12, lower.tail=FALSE) # upper tail
[1] 0.101291
Answer
If there are twelve cars crossing a bridge per minute on average, the probability of having
seventeen or more cars crossing the bridge in a particular minute is 10.1%.
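As a cross-check, the upper tail can also be obtained from the individual Poisson probabilities returned by dpois, and a single probability can be evaluated directly from the Poisson formula. A minimal sketch with illustrative variable names:

```r
lambda <- 12   # mean number of cars per minute

# Upper tail P(X >= 17) from the distribution function.
p_upper <- ppois(16, lambda = lambda, lower.tail = FALSE)

# The same probability as one minus the sum of P(X = 0), ..., P(X = 16).
p_sum <- 1 - sum(dpois(0:16, lambda = lambda))
all.equal(p_upper, p_sum)   # TRUE

# The Poisson formula evaluated by hand for exactly 17 cars.
x    <- 17
p_17 <- lambda^x * exp(-lambda) / factorial(x)
all.equal(p_17, dpois(x, lambda = lambda))   # TRUE
```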
Student t Distribution
Assume that a random variable Z has the standard normal distribution, and another random
variable V has the Chi-Squared distribution with m degrees of freedom. Assume further
that Z and V are independent. Then the following quantity follows a Student t
distribution with m degrees of freedom.
$$ t = \frac{Z}{\sqrt{V/m}} $$
[Figure: graph of the Student t distribution with 5 degrees of freedom]
Problem
Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom.
Solution
We apply the quantile function qt of the Student t distribution against the decimal values 0.025
and 0.975.
> qt(c(.025, .975), df=5) # 5 degrees of freedom

[1] -2.570582 2.570582
Answer
The 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom are
-2.5706 and 2.5706, respectively.
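The definition of the Student t distribution can be checked by simulation: draw Z from the standard normal distribution and V from the Chi-Squared distribution, form Z / sqrt(V/m), and compare the empirical percentiles with qt. The sketch below is only an illustration; the seed and the number of draws are arbitrary choices.

```r
set.seed(1)        # arbitrary seed, used only for reproducibility
m     <- 5         # degrees of freedom
n_sim <- 100000    # number of simulated draws (arbitrary)

Z <- rnorm(n_sim)             # standard normal draws
V <- rchisq(n_sim, df = m)    # independent Chi-Squared draws
t_sim <- Z / sqrt(V / m)      # should follow a Student t distribution

empirical <- quantile(t_sim, c(.025, .975))
exact     <- qt(c(.025, .975), df = m)
rbind(empirical, exact)       # the two rows should be close to each other
```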
F Distribution
If V1 and V2 are two independent random variables having the Chi-Squared
distribution with m1 and m2 degrees of freedom respectively, then the following quantity
follows an F distribution with m1 numerator degrees of freedom and m2 denominator degrees
of freedom, i.e., (m1, m2) degrees of freedom.
$$ F = \frac{V_1/m_1}{V_2/m_2} $$
[Figure: graph of the F distribution with (5, 2) degrees of freedom]
Problem
Find the 95th percentile of the F distribution with (5, 2) degrees of freedom.
Solution
We apply the quantile function qf of the F distribution against the decimal value 0.95.
> qf(.95, df1=5, df2=2)
[1] 19.29641
Answer
The 95th percentile of the F distribution with (5, 2) degrees of freedom is 19.296.
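The quantile function qf and the distribution function pf are inverses of each other, which gives a quick sanity check on the answer. A short sketch with illustrative variable names:

```r
q95 <- qf(.95, df1 = 5, df2 = 2)   # 95th percentile

# Feeding the percentile back into pf recovers the probability 0.95.
p_lower <- pf(q95, df1 = 5, df2 = 2)

# The area to the right of the percentile is the remaining 0.05.
p_upper <- pf(q95, df1 = 5, df2 = 2, lower.tail = FALSE)

c(p_lower, p_upper)
```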
Lower Tail Test of Population Mean with Known Variance
The null hypothesis of the lower tail test of the population mean can be expressed as follows:
$$ \mu \ge \mu_0 $$
where μ0 is a hypothesized lower bound of the true population mean μ.
Let us define the test statistic z in terms of the sample mean x̄, the sample size n and the population
standard deviation σ:
$$ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} $$
Then the null hypothesis of the lower tail test is to be rejected if z ≤ −zα, where zα is
the 100(1 − α) percentile of the standard normal distribution.

Problem
Suppose the manufacturer claims that the mean lifetime of a light bulb is more than 10,000 hours.
In a sample of 30 light bulbs, it was found that they only last 9,900 hours on average. Assume the
population standard deviation is 120 hours. At .05 significance level, can we reject the claim by the
manufacturer?
Solution
The null hypothesis is that μ ≥ 10000. We begin with computing the test statistic.
> xbar = 9900 # sample mean
> mu0 = 10000 # hypothesized value
> sigma = 120 # population standard deviation
> n = 30 # sample size
> z = (xbar-mu0)/(sigma/sqrt(n))
> z # test statistic
[1] -4.5644
We then compute the critical value at .05 significance level.
> alpha = .05
> z.alpha = qnorm(1-alpha)
> -z.alpha # critical value
[1] -1.6449
Answer
The test statistic -4.5644 is less than the critical value of -1.6449. Hence, at .05 significance level,
we reject the claim that mean lifetime of a light bulb is above 10,000 hours.
Alternative Solution
Instead of using the critical value, we apply the pnorm function to compute the lower tail p-value of
the test statistic. As it turns out to be less than the .05 significance level, we reject the null
hypothesis that μ ≥ 10000.
> pval = pnorm(z)
> pval # lower tail p-value
[1] 2.5052e-06
Upper Tail Test of Population Mean with Known Variance
The null hypothesis of the upper tail test of the population mean can be expressed as
follows:
$$ \mu \le \mu_0 $$
where μ0 is a hypothesized upper bound of the true population mean μ.
Let us define the test statistic z in terms of the sample mean x̄, the sample size n and
the population standard deviation σ:
$$ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} $$
Then the null hypothesis of the upper tail test is to be rejected if z ≥ zα, where zα is
the 100(1 − α) percentile of the standard normal distribution.
Problem
Suppose the food label on a cookie bag states that there is at most 2 grams of saturated fat in a
single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per
cookie is 2.1 grams. Assume that the population standard deviation is 0.25 grams. At .05
significance level, can we reject the claim on the food label?
Solution
The null hypothesis is that μ ≤ 2. We begin with computing the test statistic.
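> xbar = 2.1 # sample mean
> mu0 = 2 # hypothesized value
> sigma = 0.25 # population standard deviation
> n = 35 # sample size
> z = (xbar-mu0)/(sigma/sqrt(n))
> z # test statistic
[1] 2.3664
We then compute the critical value at .05 significance level.
> alpha = .05
> z.alpha = qnorm(1-alpha)
> z.alpha # critical value
[1] 1.6449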
Answer
The test statistic 2.3664 is greater than the critical value of 1.6449. Hence, at .05 significance
level, we reject the claim that there is at most 2 grams of saturated fat in a cookie.
Two-Tailed Test of Population Mean with Unknown Variance
The null hypothesis of the two-tailed test of the population mean can be expressed as follows:
$$ \mu = \mu_0 $$
where μ0 is a hypothesized value of the true population mean μ.
Let us define the test statistic t in terms of the sample mean x̄, the sample size n and the sample
standard deviation s:
$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} $$
Then the null hypothesis of the two-tailed test is to be rejected if t ≤ −tα∕2 or t ≥ tα∕2, where tα∕2 is
the 100(1 − α∕2) percentile of the Student t distribution with n − 1 degrees of freedom.

Problem
Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg. In a
sample of 35 penguins taken at the same time this year in the same colony, the mean penguin weight
is 14.6 kg. Assume the sample standard deviation is 2.5 kg. At .05 significance level, can we reject
the null hypothesis that the mean penguin weight does not differ from last year?
Solution
The null hypothesis is that μ = 15.4. We begin with computing the test statistic.
> xbar = 14.6 # sample mean
> mu0 = 15.4 # hypothesized value
> s = 2.5 # sample standard deviation
> n = 35 # sample size
> t = (xbar-mu0)/(s/sqrt(n))
> t # test statistic
[1] -1.8931
We then compute the critical values at .05 significance level.
> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)
[1] -2.0322 2.0322
Answer
The test statistic -1.8931 lies between the critical values -2.0322 and 2.0322. Hence, at .05
significance level, we do not reject the null hypothesis that the mean penguin weight does not differ
from last year.
Alternative Solution
Instead of using the critical value, we apply the pt function to compute the two-tailed p-value of the
test statistic. It doubles the lower tail p-value as the sample mean is less than the hypothesized
value. Since it turns out to be greater than the .05 significance level, we do not reject the null
hypothesis that μ = 15.4.
> pval = 2 * pt(t, df=n-1) # lower tail
> pval # two-tailed p-value
[1] 0.066876
Two-Tailed Test of Population Mean with Known Variance
The null hypothesis of the two-tailed test of the population mean can be expressed as follows:
$$ \mu = \mu_0 $$
where μ0 is a hypothesized value of the true population mean μ.
Let us define the test statistic z in terms of the sample mean x̄, the sample size n and the population
standard deviation σ:
$$ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} $$
Then the null hypothesis of the two-tailed test is to be rejected if z ≤ −zα∕2 or z ≥ zα∕2, where zα∕2 is
the 100(1 − α∕2) percentile of the standard normal distribution.
Problem
Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg. In a
sample of 35 penguins taken at the same time this year in the same colony, the mean penguin weight
is 14.6 kg. Assume the population standard deviation is 2.5 kg. At .05 significance level, can we
reject the null hypothesis that the mean penguin weight does not differ from last year?
Solution
The null hypothesis is that μ = 15.4. We begin with computing the test statistic.
> xbar = 14.6 # sample mean
> mu0 = 15.4 # hypothesized value
> sigma = 2.5 # population standard deviation
> n = 35 # sample size
> z = (xbar-mu0)/(sigma/sqrt(n))
> z # test statistic
[1] -1.8931
We then compute the critical values at .05 significance level.
> alpha = .05
> z.half.alpha = qnorm(1-alpha/2)
> c(-z.half.alpha, z.half.alpha)
[1] -1.9600 1.9600
Answer
The test statistic -1.8931 lies between the critical values -1.9600 and 1.9600. Hence, at .05
significance level, we do not reject the null hypothesis that the mean penguin weight does not differ
from last year.
Alternative Solution
Instead of using the critical value, we apply the pnorm function to compute the two-tailed p-value of
the test statistic. It doubles the lower tail p-value as the sample mean is less than the hypothesized
value. Since it turns out to be greater than the .05 significance level, we do not reject the null
hypothesis that μ = 15.4.
> pval = 2 * pnorm(z) # lower tail
> pval # two-tailed p-value
[1] 0.058339
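The three known-variance tests (lower tail, upper tail, two-tailed) share the same test statistic and differ only in which tail supplies the p-value, so they can be wrapped in one function. The sketch below is an illustration, not a built-in: the name z_test and its argument names are hypothetical, with the alternative labels borrowed from t.test. For the two-tailed case it uses 2 * pnorm(-abs(z)), which matches 2 * pnorm(z) above whenever z is negative.

```r
# Hypothetical helper: z-test of a population mean from summary statistics.
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))              # test statistic
  p <- switch(alternative,
              less      = pnorm(z),                      # lower tail
              greater   = pnorm(z, lower.tail = FALSE),  # upper tail
              two.sided = 2 * pnorm(-abs(z)))            # both tails
  list(statistic = z, p.value = p)
}

bulbs    <- z_test(9900, 10000, 120, 30, "less")     # light bulb example
cookies  <- z_test(2.1, 2, 0.25, 35, "greater")      # cookie example
penguins <- z_test(14.6, 15.4, 2.5, 35)              # penguin example
penguins
```

In each case the null hypothesis is rejected at the .05 level when the returned p.value is below 0.05.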
Lower Tail Test of Population Mean with Unknown Variance
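The null hypothesis of the lower tail test of the population mean can be expressed as follows:
$$ \mu \ge \mu_0 $$
where μ0 is a hypothesized lower bound of the true population mean μ.
Let us define the test statistic t in terms of the sample mean x̄, the sample size n and the sample
standard deviation s:
$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} $$
Then the null hypothesis of the lower tail test is to be rejected if t ≤ −tα, where tα is
the 100(1 − α) percentile of the Student t distribution with n − 1 degrees of freedom.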

Problem
Suppose the manufacturer claims that the mean lifetime of a light bulb is more than 10,000 hours.
In a sample of 30 light bulbs, it was found that they only last 9,900 hours on average. Assume the
sample standard deviation is 125 hours. At .05 significance level, can we reject the claim by the
manufacturer?
Solution
The null hypothesis is that μ ≥ 10000. We begin with computing the test statistic.
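> xbar = 9900 # sample mean
> mu0 = 10000 # hypothesized value
> s = 125 # sample standard deviation
> n = 30 # sample size
> t = (xbar-mu0)/(s/sqrt(n))
> t # test statistic
[1] -4.3818
We then compute the critical value at .05 significance level.
> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> -t.alpha # critical value
[1] -1.6991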
Answer
The test statistic -4.3818 is less than the critical value of -1.6991. Hence, at .05 significance level,
we can reject the claim that mean lifetime of a light bulb is above 10,000 hours.
Alternative Solution
Instead of using the critical value, we apply the pt function to compute the lower tail p-value of the
test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis
that μ ≥ 10000.
> pval = pt(t, df=n-1)
> pval # lower tail p-value
[1] 7.035e-05
Upper Tail Test of Population Mean with Unknown Variance
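The null hypothesis of the upper tail test of the population mean can be expressed as follows:
$$ \mu \le \mu_0 $$
where μ0 is a hypothesized upper bound of the true population mean μ.
Let us define the test statistic t in terms of the sample mean x̄, the sample size n and the sample
standard deviation s:
$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} $$
Then the null hypothesis of the upper tail test is to be rejected if t ≥ tα, where tα is
the 100(1 − α) percentile of the Student t distribution with n − 1 degrees of freedom.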

Problem
Suppose the food label on a cookie bag states that there is at most 2 grams of saturated fat in a
single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per
cookie is 2.1 grams. Assume that the sample standard deviation is 0.3 grams. At .05 significance
level, can we reject the claim on the food label?
Solution
The null hypothesis is that μ ≤ 2. We begin with computing the test statistic.
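> xbar = 2.1 # sample mean
> mu0 = 2 # hypothesized value
> s = 0.3 # sample standard deviation
> n = 35 # sample size
> t = (xbar-mu0)/(s/sqrt(n))
> t # test statistic
[1] 1.9720
We then compute the critical value at .05 significance level.
> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> t.alpha # critical value
[1] 1.6909
Answer
The test statistic 1.9720 is greater than the critical value of 1.6909. Hence, at .05 significance level,
we can reject the claim that there is at most 2 grams of saturated fat in a cookie.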
Alternative Solution
Instead of using the critical value, we apply the pt function to compute the upper tail p-value of the
test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis
that μ ≤ 2.
> pval = pt(t, df=n-1, lower.tail=FALSE)
> pval # upper tail p-value
[1] 0.028393
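The unknown-variance tests follow the same pattern as the known-variance ones, with the sample standard deviation s in place of σ and the Student t distribution with n − 1 degrees of freedom in place of the standard normal. The sketch below collects them in one function working from summary statistics; the name t_test_summary and its arguments are hypothetical (the built-in t.test needs the raw data instead).

```r
# Hypothetical helper: t-test of a population mean from summary statistics.
t_test_summary <- function(xbar, mu0, s, n,
                           alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  t_stat <- (xbar - mu0) / (s / sqrt(n))             # test statistic
  df     <- n - 1                                    # degrees of freedom
  p <- switch(alternative,
              less      = pt(t_stat, df),
              greater   = pt(t_stat, df, lower.tail = FALSE),
              two.sided = 2 * pt(-abs(t_stat), df))
  list(statistic = t_stat, df = df, p.value = p)
}

bulbs    <- t_test_summary(9900, 10000, 125, 30, "less")   # light bulb example
cookies  <- t_test_summary(2.1, 2, 0.3, 35, "greater")     # cookie example
penguins <- t_test_summary(14.6, 15.4, 2.5, 35)            # penguin example
cookies
```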
One sample t-test
An intelligence test was given to 10 subjects, with the results listed below. The average result of the
population which received the same test is equal to 75. You want to check whether the sample mean differs
significantly from the population average (at the 5% significance level, i.e. a 95% confidence level),
assuming that the variance of the population is not known.
65, 78, 88, 55, 48, 95, 66, 57, 79, 81
Unlike the one sample Z-test, the Student's t-test for a single sample has a built-in function in R that we
can apply immediately: t.test(a, mu), applied below.
> a = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)
> t.test (a, mu=75)
        One Sample t-test
data:  a
t = -0.78303, df = 9, p-value = 0.4537
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
 60.22187 82.17813
sample estimates:
mean of x
     71.2
The function t.test on one sample outputs the computed value of t, along with the degrees of
freedom, the p-value, the confidence interval and the sample average (mean of x).
To make your statistical decision, you can proceed in two ways. We can compare the value of t with the
tabulated value of Student's t with 9 degrees of freedom. If we do not have tables, we can calculate the
tabulated value of t in the following way:
> qt(0.975, 9)
[1] 2.262157
The function qt(p, df) returns the quantile of the t distribution for the probability p and the degrees of
freedom df. We chose a confidence level of 95% (significance level 0.05), which means that each tail holds
2.5%, corresponding to p = 1 − 0.025 = 0.975. Comparing the tabulated t with the computed t, the absolute
value of the computed t (0.783) is smaller than 2.262, which means that we do not reject the null hypothesis
of equality of the averages: our sample mean does not differ significantly from the mean of the population.
Alternatively we could consider the p-value. At a significance level of 0.05, remember this rule: if the
p-value is greater than 0.05 we do not reject the null hypothesis H0; if the p-value is less than 0.05 we
reject the null hypothesis H0 in favor of the alternative hypothesis H1. Here the p-value is 0.4537, so H0
is not rejected.
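To see what t.test does internally for this example, the statistic and the two-tailed p-value can be computed by hand from the sample mean, the sample standard deviation and the sample size, then compared with the function's output. A minimal sketch; the names t_manual and p_manual are illustrative.

```r
a   <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)
mu0 <- 75            # population average under the null hypothesis
n   <- length(a)

# Test statistic and two-tailed p-value computed by hand.
t_manual <- (mean(a) - mu0) / (sd(a) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# The same quantities extracted from the built-in test.
res <- t.test(a, mu = mu0)
all.equal(unname(res$statistic), t_manual)   # TRUE
all.equal(res$p.value, p_manual)             # TRUE

# Decision by critical value: TRUE means H0 is not rejected.
abs(t_manual) < qt(0.975, df = n - 1)
```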
Paired-Samples T-Tests
To conduct a paired-samples test, we need either two vectors of data, y1 and y2, or we need
one vector of data with a second that serves as a binary grouping variable. The test is then run using
the syntax t.test(y1, y2, paired=TRUE).
For instance, let's say that we work at a large health clinic and we're testing a new drug, Procardia,
that's meant to reduce hypertension. We find 1000 individuals with a high systolic blood pressure
(x̄ = 145 mmHg, SD = 9 mmHg), we give them Procardia for a month, and then measure their
blood pressure again. We find that the mean systolic blood pressure has decreased to 138 mmHg with
a standard deviation of 8 mmHg.
[Figure: kernel density plot of blood pressure before and after treatment]
Here, we would conduct a t-test on simulated data using:
> set.seed(2820)
> preTreat <- c(rnorm(1000, mean = 145, sd = 9))
> postTreat <- c(rnorm(1000, mean = 138, sd = 8))
> t.test(preTreat, postTreat, paired = TRUE)
Paired t-test
data: preTreat and postTreat

t = 19.751, df = 999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.703959 8.183011
sample estimates:
mean of the differences
7.443485
We see that the difference in means is statistically significant, with
t = 19.751, p-value < 2.2e-16.
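A paired t-test is equivalent to a one-sample t-test on the pairwise differences against a mean of zero. The sketch below regenerates the simulated data with the same seed and compares the two calls; the object names paired and diffs are illustrative.

```r
set.seed(2820)
preTreat  <- rnorm(1000, mean = 145, sd = 9)
postTreat <- rnorm(1000, mean = 138, sd = 8)

# Paired test on the two vectors.
paired <- t.test(preTreat, postTreat, paired = TRUE)

# One-sample test on the differences, with hypothesized mean 0.
diffs <- t.test(preTreat - postTreat, mu = 0)

all.equal(unname(paired$statistic), unname(diffs$statistic))   # TRUE
all.equal(paired$p.value, diffs$p.value)                       # TRUE
```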

Regression analysis
• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
• One of these variables is called the predictor variable, whose value is gathered through
experiments.
• The other variable is called the response variable, whose value is derived from the predictor
variable.
• In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1.
• Mathematically, a linear relationship represents a straight line when plotted as a graph.
• A non-linear relationship, where the exponent of any variable is not equal to 1, creates a
curve.
• The general mathematical equation for a linear regression is:
y = ax + b
• Following is the description of the parameters used:
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients (a is the slope and b is the intercept).
Steps to Establish a Regression
• A simple example of regression is predicting the weight of a person when their height is
known.
• To do this we need to have the relationship between height and weight of a person.
The steps to create the relationship are:
• Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and create the mathematical equation using
these.
• Get a summary of the relationship model to know the average error in prediction,
also called the residual.
• To predict the weight of new persons, use the predict() function in R.
Create Relationship Model & get the Coefficients in R
> getwd()
[1] "C:/Program Files/R/R-3.3.1/bin"
> x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
Apply the lm() function
> relation <- lm(y~x)

> print(relation)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
Get the Summary of the Relationship
> print(summary(relation))
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
Predict the weight of new persons
> a <- data.frame(x=170)

> result <- predict(relation,a)
> print(result)
1
76.22869
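For a single predictor, the coefficients reported by lm() can be reproduced with the least-squares formulas: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this and repeats the prediction for a height of 170 by hand; the names slope, intercept and predicted are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # heights in cm
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weights in kg

# Least-squares estimates for y = a*x + b, computed by hand.
slope     <- cov(x, y) / var(x)             # a
intercept <- mean(y) - slope * mean(x)      # b

# Compare with the coefficients of the fitted model.
relation <- lm(y ~ x)
all.equal(unname(coef(relation)), c(intercept, slope))   # TRUE

# Prediction for a height of 170 cm, as predict() computes it.
predicted <- intercept + slope * 170
all.equal(predicted, unname(predict(relation, data.frame(x = 170))))   # TRUE
```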
Visualize the Regression Graphically

> plot(y,x,col="blue",main="Height & Weight Regression",
cex = 1.3,pch=16,xlab="Weight in Kg",ylab="Height in cm")
> abline(lm(x~y)) # add the fitted line of height on weight
