
Lecture 1

Definitions:
1. Statistics: the purpose is to summarize and analyze data in order to support decisions and to quantify the risk of a wrong decision.
Statistical problems arise in daily life, in chemical engineering and in many other fields.
Example: measuring the blood pressure of patients at a hospital.
2. Mean = the average of the values x1, ..., xn: xbar = (x1 + ... + xn)/n
3. The true population average is never observed directly, as it is impossible to measure every single object. An estimate can be obtained from research or experiments, but we cannot conclude whether two values really differ without a confidence interval.
4. Confidence interval example (comparing an estimate of 1.65 with the value 1.7):
1.65 alone: we cannot say whether it differs from 1.7
1.65 +/- 0.5: the interval contains 1.7, so it is not different from 1.7
1.65 +/- 0.025: the interval excludes 1.7, so it is significantly different from 1.7
5. Basic statistical concepts with R:
+ Sample
+ Simulated data: we only simulate data when we don't have real data to work on in the lectures
+ Distribution
+ Standard deviation
6. Distribution: the way the data are spread out (similar to spreading a deck of cards)
6.1 Drawing the distribution of the weights of the class. There are 2 ways to do it:
1. Normal: assume the whole class is homogeneous -> only 1 curve
2. Distinguish between male and female -> more than 1 curve
6.2 Draw the distribution of 20 measured waists, average 85, standard deviation 2.
Use the rnorm function: a <- rnorm(20, 85, 2)
R reports statistics for this sample (with an approximate average of 84.89 in that particular run)
sort(a) arranges a from smallest to largest
hist(a) gives the histogram of a
(a 2-dimensional graph with x = waist measurement, y = frequency)
A quartile cuts off 25% of the data:
+ 1st quartile (25%) = 83: 5 people have a waist measurement < 83
+ 2nd quartile (50%) = 85.26 = the median (close to the mean here)
+ 3rd quartile (75%) = 87: 5 people have a waist measurement > 87
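The steps above can be sketched in R. The 84.89 in the notes came from one particular random draw; the seed below is added only so the run is reproducible:

```r
# Lecture 1 example: 20 simulated waist measurements with
# average 85 and standard deviation 2 (simulated, not real data)
set.seed(1)                        # fixed seed so the run is reproducible
a <- rnorm(20, mean = 85, sd = 2)  # 20 draws from N(85, 2^2)
mean(a)                            # close to 85, but not exactly 85
sort(a)                            # values from smallest to largest
hist(a)                            # histogram: x = waist, y = frequency
quantile(a)                        # 0%, 25% (Q1), 50% (median), 75% (Q3), 100%
```

With a different seed (or no seed) the sample average moves around 85 from run to run, which is exactly the point of item 7 below.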

7. The deviation of the estimate depends on the sample size: if the sample size increases, the confidence interval shrinks, and vice versa. Example:
If the number of women included in the survey is 30,000:
average waist measurement in the 1st survey: 84.9 cm
average waist measurement in the 2nd survey would be around: 84.91 cm
If the number of women included in the survey is 300,000:
average waist measurement in the 1st survey: 84.9 cm
average waist measurement in the 2nd survey would be around: 84.9001 cm
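This claim can be checked by simulation (84.9 cm and sd 2 come from the example above; the helper name sd_of_mean is made up for this sketch):

```r
# Sketch: the spread of the survey average shrinks as n grows,
# so two large surveys agree much more closely than two small ones.
set.seed(2)
sd_of_mean <- function(n, reps = 200) {
  # simulate `reps` surveys of size n and measure how much the
  # survey averages vary from one survey to the next
  means <- replicate(reps, mean(rnorm(n, mean = 84.9, sd = 2)))
  sd(means)
}
sd_of_mean(30)     # noticeable spread between repeated surveys
sd_of_mean(30000)  # tiny spread: repeated surveys almost coincide
```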

Lecture 2
1. There are 3 ways to display a distribution:
- Plot
- Box plot
- Histogram: if we merge two columns, it is the surface (area) of the bar that represents the population, not its height
(The lecture illustrated this with a small histogram example whose bar heights, around 36 to 42, did not survive legibly in these notes.)

2. The density function of the normal distribution:
f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
It depends on the average mu and the standard deviation sigma:
+ maximum at x = mu
+ symmetric around mu
A normal distribution describes the behaviour of random variability.
About 95% of the values lie within roughly 2 standard deviations of the mean.

Probability:
pnorm(0, mean = 0, sd = 1) -> P = 0.5
pnorm(0, mean = 5, sd = 1) -> P < 0.5
(pnorm gives the area under the curve to the left of the value; area = probability)
The standard deviation is the key to making a decision: compare the distance of a value from the mean with the sd.
Example: is a student's score of 7/10 excellent compared with the class mean of 5/10?
We need the sd: if sd = 3, this student is normal;
if sd = 0.5, this is a brilliant individual.
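The pnorm calls and the score example can be run directly:

```r
# pnorm(q, mean, sd): area (probability) under the normal curve left of q
pnorm(0, mean = 0, sd = 1)   # 0.5: 0 is exactly the mean
pnorm(0, mean = 5, sd = 1)   # far below 0.5: 0 is 5 sd below the mean

# the 7/10 vs class-mean 5/10 example: how many sd above the mean?
(7 - 5) / 3     # sd = 3   -> about 0.67 sd above: an ordinary score
(7 - 5) / 0.5   # sd = 0.5 -> 4 sd above the mean: a brilliant individual
```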

3. The concept of centering: centering the data places it around 0 by subtracting the average from each value:
x'_i = x_i - xbar
New average = 0.
The goal is to read each number relative to the average.

4. The concept of standardizing (gift 1): to compare 2 things with different meanings by expressing each in units of its standard deviation:
z_i = (x_i - xbar) / s
z_i has no dimension, and the standardized data have variance = 1.

5. Notion of estimation
An estimator is the procedure used to get an estimation. There are 2 types of estimator:
- Guessing: quick but biased
- Doing an experiment (gift 2):
+ We expect the result to be right on average: E(Xbar) = mu (expectation)
+ Although we never know the real mu
+ Unbiased
An estimation is a single number -> we can't conclude anything without a confidence interval.

Lecture 3
1. Centering the data:
Centering the data means subtracting the average from each measured value. Hence the new average is 0. Assume there are 5 values x1, ..., x5 which are centered. Then the new average can be calculated:
(1/5) * sum(xi - xbar) = (1/5) * (sum(xi) - 5*xbar) = xbar - xbar = 0
Centering helps us read each number relative to the average.


2. Standardizing the data:
POWER 1: Standardizing is used to compare two things with different meanings by expressing each in units of its standard deviation:
z_i = (x_i - xbar) / s
The standardized value z_i can be positive or negative and carries no dimension (the units of x and of the standard deviation cancel each other). This property lets us compare variables which are not a priori comparable.
For example: is A taller than B is big?
Assume that: + A (female) is 1.8 m tall
+ B (male) weighs 70 kg
Comparing these measurements with the distributions of Vietnamese female height and male weight, and using the tools of centering and standardizing, we eliminate the two units, metres (m) for A and kilograms (kg) for B:
+ A is 2.5 standard deviations above the mean height
+ B is 1.5 standard deviations above the mean weight
As 2.5 > 1.5, it can be concluded that A is taller than B is big.
The standardized data can be represented on a box plot with variance = 1 and mean = 0.
Knowing that for a normal distribution, 95% of realizations lie within 2 (more precisely 1.96) standard deviations of the average, a case beyond that (absolute standardized value > 2) is an individual that is quite peculiar regarding the variability of interest.

Considering the example above, we cannot conclude whether A is special from the raw measurement of 1.8 m alone; we can do so only if her standardized value lies beyond 2 on the standardized box plot.
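The A-vs-B comparison can be sketched numerically. The population means and standard deviations below are made-up illustration values chosen to reproduce the 2.5 and 1.5 of the notes, not real statistics:

```r
# Hypothetical population parameters (illustration only)
height_mean <- 1.55; height_sd <- 0.10   # assumed VN female height (m)
weight_mean <- 62.5; weight_sd <- 5.0    # assumed VN male weight (kg)

# standardize: subtract the mean, divide by the sd -> dimensionless
z_A <- (1.80 - height_mean) / height_sd  # A's height in sd units: 2.5
z_B <- (70.0 - weight_mean) / weight_sd  # B's weight in sd units: 1.5
z_A > z_B                                # TRUE: A is "more tall" than B is "big"
```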
3. Estimation:
An estimator is the procedure used to get an estimation; the result (the number) given by a certain estimator is the estimation. Choosing a good estimator means choosing one with the key property of being unbiased: the expectation E(Xbar) of the estimator should equal the value we are looking for.
For example, there are 2 estimators of the waist of French males: guessing from observation, or doing an experiment measuring the waists of 1000 males.
The value given by the guessing method is a random variable which may be biased.
POWER 2: The experiment measuring the waists of 1000 males gives the estimator
Xbar = (1/n) * sum(Xi), with E(Xbar) = E((1/n) * sum(Xi)) = mu.
Xbar is unbiased because its expectation equals mu, so it is a good estimator. However, we still don't know how good the estimator is until we also know the variability of Xbar (its variance).
In a real-life situation the experiment can be done only once, so the choice of estimator should be considered carefully before making any decision.

Lecture 4
1. The condition for estimating is that all individuals must follow the same distribution (the population must be homogeneous).
Example: the weights of Vietnamese females vs the weights of French females.
If the standard deviation is big -> people in that population differ more from each other, but they still belong to the same group.
If the standard deviation is small -> they are very similar to each other.
Random variables: {X1, ..., Xn}
Realizations: {x1, ..., xn}
2. POWER 3: The distribution of the mean = the property of the mean.
In a real-life situation there is only 1 mean that you can access by doing the experiment.
The mean of a sample from almost any original distribution is approximately normally distributed, because all the averages move around the same area -> normal distribution:

Xbar ~ N(mu, sigma^2/n)
E(Xbar) = mu: the average of the averages is the true average.
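POWER 3 can be illustrated by simulation: even when the original distribution is flat (uniform), the sample means pile up in a bell shape around the true mean. The uniform distribution here is my own choice of a clearly non-normal example:

```r
# Means of samples from a NON-normal (uniform) distribution
# still look approximately normal: the distribution of the mean.
set.seed(3)
means <- replicate(2000, mean(runif(50, min = 0, max = 1)))
mean(means)   # close to the true mean 0.5 ("average of the averages")
sd(means)     # close to sigma/sqrt(n) = sqrt(1/12)/sqrt(50)
hist(means)   # bell-shaped even though runif itself is flat
```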

3. Sampling distribution
If we don't have enough data from real experiments, we simulate data.
For example, repeat the simulation rnorm(20, 85, 2) 10 times -> the average of the 10 averages is close to 85, and the variance of the 10 averages estimates Var(Xbar).
Each individual is an independent variable. For 2 independent variables X and Y:
Variability does not change when you shift the data by any amount: V(X + c) = V(X)
V(X + Y) = V(X) + V(Y)
V(X - Y) = V(X) + V(Y)
V(aX) = a^2 * V(X)
V(Xbar) = V((1/n) * sum(Xi)) = (1/n^2) * n * sigma^2 = sigma^2 / n
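These variance rules can be checked on data. Note that var() scales exactly, while the additivity rules hold only approximately on finite independent samples:

```r
# Checking the variance rules numerically
set.seed(4)
x <- rnorm(10000); y <- rnorm(10000)   # two independent samples
var(3 * x)                              # exactly 9 * var(x): V(aX) = a^2 V(X)
var(x + y); var(x - y)                  # both close to var(x) + var(y)
var(replicate(500, mean(rnorm(25))))    # close to 1/25: V(Xbar) = sigma^2/n
```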

Finding a confidence interval:

Step 1: estimation of mu = xbar

Step 2: estimation of the standard deviation.
E((1/n) * sum(xi - xbar)^2) = ((n-1)/n) * sigma^2, not sigma^2:
we always underestimate the variance, because we can't do the experiment on every individual. To balance this underestimation, we use the corrected formula
S^2 = (1/(n-1)) * sum(xi - xbar)^2

Step 3: estimation of the standard deviation of the estimator: S / sqrt(n)

Step 4: confidence interval:
xbar +/- 1.96 * S / sqrt(n)

The meaning of the confidence interval: 95% of intervals built this way contain the true value.
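The four steps can be sketched on simulated data (real measurements would be used in practice). Note that R's var() and sd() already divide by n-1, so the correction of Step 2 is built in:

```r
# The four CI steps on a simulated sample
set.seed(5)
x    <- rnorm(20, mean = 85, sd = 2)   # stand-in for 20 real measurements
xbar <- mean(x)                        # step 1: estimate mu
s    <- sd(x)                          # step 2: estimate sigma (divides by n-1)
se   <- s / sqrt(length(x))            # step 3: sd of the estimator
c(xbar - 1.96 * se, xbar + 1.96 * se)  # step 4: approximate 95% CI
```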

Lecture 5
1. Student distribution vs Normal distribution
Normal:
+ we divide by sigma/sqrt(n), the same number every time
+ (Xbar - mu) / (sigma/sqrt(n)) ~ N(0, 1), with sigma known
Student:
+ S is itself a random variable, so we no longer divide by a constant value: the denominator changes a little each time
+ (Xbar - mu) / (S/sqrt(n)) ~ T(n-1), with sigma unknown and sample size n
+ n - 1 is the degrees of freedom
+ it looks like the normal distribution but is not (heavier tails)
Comparing the Student distribution with the Normal: as we have less information about the real value, both the estimated standard deviation and the heavier tails make the confidence interval larger.
The more information we have, the more accurate we are.
2. Using R to illustrate the idea
Using the earlier example of estimating the average height of Vietnamese females from the data given by the girls in the class, follow the steps: mean -> standard deviation -> standard deviation of the estimator -> quantile.
Notice that the code for the Student quantile is qt(0.95, n-1), not qt(0.9, n-1), even though the desired confidence level is 0.9. This is because a 90% interval leaves 5% in each tail: 0.95 = 0.9 + 0.1/2.
We obtain the confidence interval [158.8; 162.6].
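The quantile step can be sketched as follows. The class heights themselves are not in these notes, so the vector below is invented; with the real class data the interval [158.8; 162.6] would come out instead:

```r
# t-based confidence interval with an INVENTED height vector
h  <- c(156, 158, 160, 161, 162, 163, 165, 158, 160, 167)  # hypothetical data
n  <- length(h)
se <- sd(h) / sqrt(n)          # standard deviation of the estimator
q  <- qt(0.95, df = n - 1)     # 90% CI: 5% in each tail, hence 0.95
c(mean(h) - q * se, mean(h) + q * se)
```

Note that qt(0.95, n-1) is always a bit larger than the normal quantile qnorm(0.95), which is what widens the Student interval.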
3. POWER 4: HYPOTHESIS TESTING - P VALUE
The p-value checks whether your target value is compatible with the confidence interval.
Hypothesis testing gives us the probability of observing what we have observed, assuming that the hypothesis (the proposed value of the true mean) is right. Thus the p-value helps us decide whether to accept or reject the assumed hypothesis.
Using the earlier example with 2 hypotheses, 155 and 160.5:
the t reported in the output is the number of standard deviations (of the estimator) between the hypothesis and the observed mean.
The CI changes only when the confidence level changes.
Testing 4 hypotheses, 0; 160.5; 157.8 and 157.9:
the new CI0.95 = [157.89; 163.10].
We only accept a hypothesis whose p-value is > 0.05, as such values belong to the 95% region around the mean.
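Testing a hypothesized mean is one t.test call. The same invented height vector is used here, since the real class data is not in the notes:

```r
# One-sample t-tests against two hypothesized means (invented data)
h <- c(156, 158, 160, 161, 162, 163, 165, 158, 160, 167)  # hypothetical
t.test(h, mu = 155)    # small p-value: reject mu = 155
t.test(h, mu = 160.5)  # large p-value: cannot reject mu = 160.5
# the reported t is (mean(h) - mu) / (sd(h) / sqrt(length(h)))
```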

Lecture 6+7
1) One-sample test. Assume Xi ~ N(mu, sigma^2).
Estimator: Xbar.
Standardize: t = (Xbar - mu0) / (S / sqrt(n)) ~ T(n-1) under the hypothesis mu = mu0 -> P VALUE
2) Comparing two means (muF and muM)
Notation: the data set is built so that a row is an individual and a column is a variable.
Example 1: in our experiment, we want to check whether the target 0 (the difference between the two means muF and muM) belongs to the confidence interval or not.
The hypothesis is muF = muM.
Height (F, n = 26): 156, 178, 158, 158, 158, 158, 160, 160, 167, 163, 163, 163, 157, 153, 170, 165, 162, 162, 162, 162, 162, 160, 167, 150, 154, 145
Height (M, n = 9): 177, 169, 173, 170, 171, 168, 175, 170, 174
Using t.test to check, the output is: CI = [-14.69; -8.087]; nF = 26; nM = 9; p-value = 8.379e-08.
Because the p-value is smaller than 0.05, I reject the hypothesis: the two means are significantly different.
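The test can be reproduced from the heights listed above (by default t.test performs a Welch two-sample test, which matches the output quoted in the notes):

```r
# Two-sample t-test on the class height data
height_F <- c(156, 178, 158, 158, 158, 158, 160, 160, 167, 163, 163, 163,
              157, 153, 170, 165, 162, 162, 162, 162, 162, 160, 167, 150,
              154, 145)                                     # nF = 26
height_M <- c(177, 169, 173, 170, 171, 168, 175, 170, 174)  # nM = 9
t.test(height_F, height_M)  # CI about [-14.69; -8.09], p-value about 8.4e-08
```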
3) Comparing two variances
The F test compares two variances through their ratio: the ratio of two sample variances follows a Fisher distribution.
Ratio of variances = sF^2 / sM^2, here about 4.5.
Using var.test, the output is: CI = [1.15; 12.55]; p-value = 0.031.
According to the confidence interval, 1 is outside [1.15; 12.55], so the variances differ.
In this case, the p-value is the probability of observing a ratio of sample variances equal to 4.5, assuming that the ratio of the true variances equals 1.
At the risk level of 0.05 (we accept being wrong in 5% of situations), I reject the hypothesis according to which the ratio of true variances is equal to 1.
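This F test can also be reproduced from the same height vectors:

```r
# F test comparing the two variances (Fisher distribution)
height_F <- c(156, 178, 158, 158, 158, 158, 160, 160, 167, 163, 163, 163,
              157, 153, 170, 165, 162, 162, 162, 162, 162, 160, 167, 150,
              154, 145)
height_M <- c(177, 169, 173, 170, 171, 168, 175, 170, 174)
var(height_F) / var(height_M)   # the ratio the F test is built on (~4.5)
var.test(height_F, height_M)    # CI excludes 1 -> variances differ
```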
4) Alternative hypotheses
The hypothesis is muF = muM.
The first alternative is muF < muM (alternative = "less"): the output p-value is very small. The computer rejects the hypothesis of equality: the difference is likely smaller than 0.
The second alternative is muF > muM (alternative = "greater"): the output p-value is close to 1. The computer keeps the hypothesis of equality, since there is no evidence that the difference is greater than 0.
5) Comparing more than two means
- within variability: the variance inside the same group
- between variability: the variance between the means of the groups

Lecture 8
1. Salary based on major:
# read the data file (path as used in class): first row = column names, tab-separated
toto <- read.table("C:\\Users\\Administrator\\Desktop\\Statistics\\29.09.txt", header=TRUE, sep="\t")
toto
summary(toto)
library(FactoMineR)   # provides AovSum
AovSum(Salary~Major, data=toto)
Analysis of variance with the contrasts sum (the sum of the coefficients is 0)
F-test:

              SS        df   MS        F value   Pr(>F)
Major         13705851   2   6852926   4.6805    0.01701 *
Residuals     43923917  30   1464131

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


According to the global test, since the F value is much greater than 1 and the p-value < 0.05, we reject the null hypothesis and conclude that Major has a significant effect on Salary.
T-test:

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   1493.042   226.379       6.5953   < 2e-16 ***
Major - CE      40.292   364.127       0.1107   0.91263
Major - CI    -732.965   297.975      -2.4598   0.01988 *
Major - P      692.673   293.441       2.3605   0.02495 *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


According to the local tests, CI is significantly below the average salary and P significantly above it (p < 0.05), while CE does not differ significantly from the average.
2. The chicken eggs experiment (the impact of 2 or more factors)

The question we are going to answer here is which Treatment (1, 2 or 3) should be applied on the production line so as to obtain the largest weight of product.

From the first analysis, we observe that the Treatment factor has no significant effect on the Weight (p > 0.05), so it is impossible to decide which treatment to choose. Also, the residual is quite big (938). So we must ask whether the identified sources of variability are sufficient.
Actually, Weight depends on BOTH Gender and Treatment. In the second analysis it is clear that the Gender factor had been ignored in the first model and lumped into its residual (roughly 938 = 636 for Gender + 301 residual). Therefore the most important step in running these kinds of analyses is to identify all the sources of variability.
The next step is to decide which treatment to choose, based on the F test. It is clear that the 2nd treatment would decrease the weight of the product. Comparing the p-values of the 1st and 3rd treatments, both are > 0.05. From the parameter estimates (column 1), the 1st treatment gives a higher weight (1.18), although its p-value is smaller than that of the 3rd. So the first treatment is the one we choose in this situation.
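The "missing source" effect can be sketched with base R's aov (the lecture used FactoMineR's AovSum; aov gives the same sums of squares). The data frame below is invented for illustration, with both Gender and Treatment built into the simulated weights:

```r
# Two-factor analysis: omitting a real factor inflates the residual
set.seed(6)
d <- data.frame(
  Treatment = rep(c("T1", "T2", "T3"), each = 20),
  Gender    = rep(c("M", "F"), times = 30)   # balanced within treatments
)
# simulate weights where BOTH factors matter (assumed effect sizes)
d$Weight <- 50 + ifelse(d$Gender == "M", 5, 0) +
            c(T1 = 1, T2 = -2, T3 = 0)[d$Treatment] + rnorm(60)
summary(aov(Weight ~ Treatment, data = d))           # Gender hides in residuals
summary(aov(Weight ~ Treatment + Gender, data = d))  # residual SS drops sharply
```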

Lecture 9
The behaviour of one variable can differ depending on another. To observe how one depends on the other, we use the covariance. The covariance is a joint variance that describes the behaviour of one variable with respect to another:
Cov(X, Y) = E[(X - E(X)) * (Y - E(Y))]
Applying this equation to Cov(X, X) gives back the variance V(X).
We use the correlation coefficient rho = Cov(X, Y) / (sigma_X * sigma_Y) to observe the relation between 2 variables:
+ if rho is at -1, then when one variable increases, the other decreases
+ if rho is at +1, then when one variable increases, the other also increases
+ the 2 variables are uncorrelated (orthogonal) when rho = 0
To check these situations, we again use a t-test, and we can only conclude whether rho is near 0 or not from the p-value, not from the raw value of the number.
What interests us is the relationship between the 2 factors. We use the t-test to compare rho with the hypothesis rho = 0:
+ if p > 0.05 -> rho may be 0 -> we reject the claim that these factors are dependent
+ if p < 0.05 -> rho != 0 -> these factors are dependent
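These ideas map directly onto cov, cor and cor.test in R. The data below is simulated with a built-in dependence, so the test should detect it:

```r
# Covariance, correlation and the associated test on simulated data
set.seed(7)
x <- rnorm(50)
y <- 2 * x + rnorm(50)   # y depends on x -> positive correlation
cov(x, y)                # joint variability of x and y
cov(x, x)                # equals var(x): Cov(X,X) = V(X)
cor(x, y)                # correlation: covariance scaled to [-1, 1]
cor.test(x, y)           # small p-value -> rho significantly different from 0
```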
