Definition:
1. Statistics: the purpose is to summarize and analyze data in order to support decisions and
to determine the risk of making a wrong decision.
Statistical problems arise in daily life, chemical engineering and other fields.
Example: measuring the blood pressure of patients at the hospital.
2. Mean = average of the observations x1, ..., xn
3. The true average is not known(*), as it is impossible to measure every single individual.
However, it can be estimated from research or experiments. Without a confidence interval,
we cannot conclude whether the estimate is different from a given value.
4. Confidence interval example:
1.65 is different from 1.7
1.65 ± 0.5 is the same as 1.7 (the interval contains 1.7)
1.65 ± 0.025 is significantly different from 1.7 (the interval does not contain 1.7)
5. Basic statistical concepts with R:
+ Sample:
+ Simulating data: we only simulate data when we do not have real data to work on in the
lectures
+ Distribution
+ Standard deviation
6. Distribution: the way data is distributed (similar to spreading cards)
6.1 Draw the distribution of the weight of the class:
There are 2 ways to do it:
1. Normal: assume the whole class is homogeneous → only 1 curve
2. Distinguish between Male and Female → more than 1 curve
6.2 Draw the distribution of 20 waist measurements, average 85, standard deviation 2
Use the rnorm function: a = rnorm(20,85,2), of the form rnorm(n, mean, sd)
The program gives the statistics of this sample (with an approximate average
of 84.89(*))
sort(a) = arrange a from smallest to largest
hist(a) = give the histogram of a
(2-dimensional graph with x = waist measurement, y = frequency)
Quartile = each step of 25% of the sorted data
+ 1st quartile: 25% = 83 (5 people have a waist measurement < 83)
+ 2nd quartile: 50% = 85.26 = Median ≈ Mean
+ 3rd quartile: 75% = 87 (5 people have a waist measurement > 87)
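The steps of 6.2 can be sketched in R as follows. This is only a minimal illustration: the seed is an arbitrary choice made here so the run is repeatable, so the numbers will not match the 84.89 or the quartiles quoted in the notes.

```r
# Simulate 20 waist measurements (cm) with average 85 and standard deviation 2.
# The seed is an arbitrary assumption, used only to make the run repeatable.
set.seed(1)
a <- rnorm(20, mean = 85, sd = 2)

sorted_a <- sort(a)                            # smallest to largest
q <- quantile(a, probs = c(0.25, 0.50, 0.75))  # 1st, 2nd (median) and 3rd quartile

print(mean(a))
print(q)

# Histogram: x = waist measurement, y = frequency
hist(a, xlab = "Waist measurement (cm)", ylab = "Frequency",
     main = "20 simulated waist measurements")
```

Running it again with another seed gives a slightly different average each time, which is the point of the (*) remark.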
7. The variability of the average depends on the sample size. If the sample size increases, the
confidence interval narrows, and vice versa. Example:
If the number of women included in the survey is 30000:
average waist measurement of the 1st study: 84.9 cm
average waist measurement of the 2nd study should be close to: 84.91 cm
If the number of women included in the survey is 300000:
average waist measurement of the 1st study: 84.9 cm
average waist measurement of the 2nd study should be close to: 84.9001 cm
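The effect of the sample size can be checked with a small simulation. The sketch below assumes a standard deviation of 2 cm (borrowed from 6.2, not stated for this survey) and an arbitrary seed; it compares the standard deviation of the average, sd / sqrt(n), for the two survey sizes.

```r
# Standard deviation of the average = sd / sqrt(n).
# sd_waist = 2 cm is an assumed value, not a figure from the survey.
sd_waist <- 2
n <- c(30000, 300000)
se <- sd_waist / sqrt(n)
print(data.frame(n = n, sd_of_average = se))

# Two simulated studies per sample size: the two averages are
# typically closer to each other when n is larger.
set.seed(1)
for (size in n) {
  m1 <- mean(rnorm(size, mean = 84.9, sd = sd_waist))
  m2 <- mean(rnorm(size, mean = 84.9, sd = sd_waist))
  cat("n =", size, " study 1:", round(m1, 4), " study 2:", round(m2, 4), "\n")
}
```

Multiplying the sample size by 10 divides the standard deviation of the average by sqrt(10), about 3.16.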
Lecture 2
1.
-
36
+ Maximum at
+
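3. The concept of centering: centering the data places it around the middle by subtracting
the average from each value
$x_i \rightarrow x_i - \bar{x}$
New average = 0
The goal is to know each number relative to the average
Dividing the centered value by the standard deviation $s$ standardizes it:
$x_i \rightarrow (x_i - \bar{x}) / s$ (no dimension)
Variance = 1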
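Centering and standardizing can be sketched in R as below. The five heights are hypothetical values chosen for illustration only.

```r
# Hypothetical heights in metres (illustrative values, not class data)
x <- c(1.60, 1.65, 1.70, 1.75, 1.80)

centered <- x - mean(x)                 # new average = 0
standardized <- (x - mean(x)) / sd(x)   # no dimension, variance = 1

print(mean(centered))
print(var(standardized))

# scale() performs both steps at once
same <- as.numeric(scale(x))
```

A standardized value reads directly as "how many standard deviations away from the average".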
5. Notion of estimation
Estimator: the method used to get the estimation. There are 2 types of estimator:
- Guessing: quick but biased
-
(gift 2)
+ We expect the result: $E(\bar{X}) = \mu$ (expectation)
Lecture 3
1. Centering the data:
Centering the data is subtracting the average from each measured value.
Hence, the new average is 0. Assume that there are 5
values x which are centered. Then the new average can be
calculated:
$\frac{1}{5}\sum_{i=1}^{5}(x_i - \bar{x}) = \frac{1}{5}\sum_{i=1}^{5}x_i - \bar{x} = \bar{x} - \bar{x} = 0$
Considering the example above, the measurement of 1.8 m height alone does not allow any
conclusion on whether A is special or not, unless its standardized value is above the usual
threshold (>2) on the box-plot of the standardized data.
3. Estimation:
The estimator is the method used to get an estimation. The result/number given by a
certain estimator is the estimation. Choosing a good estimator means choosing one
that is unbiased. This means that the expectation E(X) of the estimator should
be equal to the value we are looking for.
For example: There are 2 estimators for the waist of French males: guessing
from observation, or doing the experiment of measuring the waists of 1000 males.
The value given by the guessing method is a random variable which might be biased.
POWER 2: The experiment on measuring the waists of 1000 males gives:
$E(\bar{X}) = \mu$, so the mean is an unbiased
estimator. However, we still do not know how good the estimator is until we have
the variance of the mean, which provides its variability.
In a real-life situation, the experiment can only be done once. Thus, it should be
considered carefully before making any decision on how to estimate.
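Unbiasedness can be illustrated by pretending the experiment could be repeated many times. The sketch below is a simulation under assumed values (true waist 85 cm, standard deviation 10 cm, arbitrary seed), none of which come from the lecture.

```r
set.seed(1)            # arbitrary seed
true_mean <- 85        # assumed true average waist (cm); unknown in real life
true_sd   <- 10        # assumed standard deviation (cm)

# Repeat the "measure 1000 males" experiment 2000 times and keep each mean
means <- replicate(2000, mean(rnorm(1000, mean = true_mean, sd = true_sd)))

print(mean(means))     # close to true_mean: the estimator is unbiased
print(sd(means))       # variability of the estimator, about true_sd / sqrt(1000)
```

In practice only one of these 2000 means is ever observed, which is why the variability of the estimator has to be known as well.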
Lecture 4
1. The condition for estimating is that all individuals must follow the same distribution
(the population must be homogeneous)
Example: the weight of Vietnamese females vs the weight of French females
If the standard deviation is big → people in that environment differ more from each other,
but they still belong to the same group
If the standard deviation is small → they are very similar to each other
Random variables: {X1, ..., Xn}
Realizations: {x1, ..., xn}
2. POWER 3: The distribution of the mean = the property of the mean.
In a real-life situation, there is only 1 mean, the one you obtain by doing the experiment.
Whatever the original distribution, the mean approximately follows a normal distribution,
because all the averages move around the same area → normal distribution
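$\bar{X} \sim N(\mu, \sigma^2/n)$
$\mu$: average of the averages
$X_i \sim f(\mu, \sigma)$
$E(\bar{X}) = \mu$
$V(\bar{X}) = \sigma^2/n$
$E\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)$ is not $\sigma^2$ but
actually $\frac{n-1}{n}\sigma^2$, because we cannot do the
experiment on every individual. In order to balance the underestimation, we use this formula:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
95% confidence interval: $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$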
The meaning of the confidence interval: 95% of the intervals built in this way contain the true value
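This meaning can be checked by simulation: build many intervals of the form mean ± 1.96 × sigma / sqrt(n) and count how often they contain the true value. The values of mu, sigma and n below, as well as the seed, are assumptions chosen only for the illustration.

```r
set.seed(1)                      # arbitrary seed
mu <- 85; sigma <- 2; n <- 50    # assumed true mean, standard deviation, sample size

covered <- replicate(5000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- 1.96 * sigma / sqrt(n)
  # TRUE when the interval around this sample mean contains the true mean
  (mean(x) - half_width <= mu) && (mu <= mean(x) + half_width)
})

print(mean(covered))             # proportion of intervals containing mu
```

The proportion printed should be near 0.95; any single interval either contains the true value or does not.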
Lecture 5
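1. Student distribution vs Normal distribution
Normal:
+ $\bar{X} - \mu$ is divided by $\sigma/\sqrt{n}$, the same number every time
+ $(\bar{X} - \mu)/(\sigma/\sqrt{n}) \sim N(0, 1)$
Student:
+ $S$ is itself a random variable, so $\bar{X} - \mu$ is not divided by a constant
value but by $S/\sqrt{n}$, which changes a little bit each time
+ $(\bar{X} - \mu)/(S/\sqrt{n}) \sim t_{n-1}$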
Using the former example of calculating the average height of Vietnamese females based
on the data given by the girls in class.
Follow the steps: Mean → Standard deviation → Standard deviation of the estimation →
Quantile
Notice that the code for the quantile of the Student distribution is qt(0.95,n-1) and not qt(0.9,n-1)
for a wanted confidence level of 0.9. This is because the central 90% area leaves 5% in each
tail, so the upper bound sits at the 95% quantile: 0.95 = 0.9 + 0.1/2.
So we obtain the confidence interval [158.8;162.6].
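The four steps can be sketched in R as below. The heights are hypothetical values, not the class data, so the interval will differ from [158.8;162.6]; the cross-check with t.test() is added here only to show that the manual steps agree with the built-in function.

```r
# Hypothetical heights in cm (illustrative, not the class data)
heights <- c(156, 158, 160, 162, 163, 166, 157, 161, 159, 164)

n  <- length(heights)
m  <- mean(heights)          # step 1: mean
s  <- sd(heights)            # step 2: standard deviation
se <- s / sqrt(n)            # step 3: standard deviation of the estimation
q  <- qt(0.95, df = n - 1)   # step 4: quantile for a 90% confidence level

ci <- c(m - q * se, m + q * se)
print(ci)

# Cross-check with the built-in function
ci_check <- t.test(heights, conf.level = 0.90)$conf.int
print(ci_check)
```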
3. POWER 4: HYPOTHESIS TESTING - P VALUE
The p-value is used to check whether your target value lies in the confidence interval.
Hypothesis testing gives us the probability of observing what we have observed (or something
more extreme), assuming that the hypothesis (the value of the true mean) is right. Thus, the
p-value helps us decide whether to accept or reject the assumed hypothesis.
Using the former example with 2 hypotheses, 155 and 160.5:
The t presented in the result is the number of standard deviations between the
hypothesis and the observed mean.
The CI only changes when the confidence level changes.
Given 4 hypotheses:
We only accept the hypotheses whose p-value is > 0.05, as these are the values that lie
within the 95% confidence interval around the mean.
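Testing the two hypotheses 155 and 160.5 can be sketched with t.test(). The heights below are hypothetical values used only for illustration, so the printed t and p-values are not those of the lecture.

```r
# Hypothetical heights in cm (illustrative, not the class data)
heights <- c(156, 158, 160, 162, 163, 166, 157, 161, 159, 164)

test_155  <- t.test(heights, mu = 155)     # hypothesis: true mean = 155
test_1605 <- t.test(heights, mu = 160.5)   # hypothesis: true mean = 160.5

# t = number of standard deviations (of the estimation) between
# the hypothesis and the observed mean
t_manual <- (mean(heights) - 155) / (sd(heights) / sqrt(length(heights)))

print(c(t = unname(test_155$statistic),  p = test_155$p.value))
print(c(t = unname(test_1605$statistic), p = test_1605$p.value))
```

With these values the first hypothesis is rejected and the second is not, because 160.5 lies inside the confidence interval while 155 does not.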
Lecture 6+7
2
1) Assuming that
)
Estimator
Standardize:
)
)
P VALUE
2) Comparing two means
Notion: The data set is built so that a row is for an individual and a column is for a
variable
Example 1: In our experiment, we want to check whether the target 0 (the difference
between the two means of F and M) belongs to the confidence interval or not.
The hypothesis is F = M
Height  Gender
156     F
178     F
158     F
158     F
158     F
158     F
160     F
160     F
167     F
163     F
163     F
163     F
157     F
153     F
170     F
165     F
162     F
162     F
162     F
162     F
162     F
160     F
167     F
150     F
154     F
145     F
177     M
169     M
173     M
170     M
171     M
168     M
175     M
170     M
174     M
p-value = 8.379e-08
Because the p-value is smaller than 0.05, I reject the hypothesis: the two means are
significantly different.
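The comparison of the two means can be sketched as below, using the heights listed in Example 1 with one row per individual and one column per variable. The exact call used in the lecture is not shown in the notes; t.test() with its default settings (Welch test, unequal variances) is an assumption.

```r
# Heights from Example 1: 26 females followed by 9 males
d <- data.frame(
  Height = c(156, 178, 158, 158, 158, 158, 160, 160, 167, 163, 163, 163, 157,
             153, 170, 165, 162, 162, 162, 162, 162, 160, 167, 150, 154, 145,
             177, 169, 173, 170, 171, 168, 175, 170, 174),
  Gender = factor(rep(c("F", "M"), times = c(26, 9)))
)

# Hypothesis: mean(F) = mean(M), i.e. the difference is 0
res <- t.test(Height ~ Gender, data = d)

print(res$p.value)
print(res$conf.int)   # confidence interval of the difference F - M
```

The hypothesis is rejected when 0 falls outside this confidence interval, which is the same decision as the p-value being below 0.05.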
3) Comparing two variances
The F test compares two variances through their ratio. The ratio of variances follows the
Fisher distribution.
Ratio of variances =
At a risk level of 0.05 (we accept being wrong in 5% of situations), I reject the hypothesis
according to which the ratio of the true variances is equal to 1.
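A minimal sketch of this F test in R, using the female and male heights listed in Example 1. The notes do not show the call that was used; var.test() is the standard base R function for a ratio of two variances and is assumed here.

```r
# Heights from Example 1
f <- c(156, 178, 158, 158, 158, 158, 160, 160, 167, 163, 163, 163, 157,
       153, 170, 165, 162, 162, 162, 162, 162, 160, 167, 150, 154, 145)
m <- c(177, 169, 173, 170, 171, 168, 175, 170, 174)

ratio <- var(f) / var(m)   # ratio of the two sample variances
res <- var.test(f, m)      # hypothesis: ratio of true variances = 1

print(ratio)
print(res$p.value)
```

The F value reported by var.test() is exactly this ratio; its p-value is read on the Fisher distribution with (26 - 1, 9 - 1) degrees of freedom.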
4) Alternative hypothesis
The hypothesis is F = M
The first alternative is F < M (alternative="less"): the p-value is very small. The
computer rejects the hypothesis of equality in favour of a difference smaller than 0.
The second alternative is F > M (alternative="greater"): the p-value is close to 1.
The computer does not reject equality: between "equal" and "greater than 0", it keeps the
less wrong hypothesis, which is equality.
5) Comparing more than two means
- within variability is the variance inside a same group
- between variability is the variance between the means of the groups
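The two kinds of variability can be computed by hand and compared with the analysis-of-variance table. The three groups below are hypothetical values made up for the illustration.

```r
# Hypothetical data: 3 groups of 4 values each
d <- data.frame(
  value = c(10, 12, 11, 13,  15, 17, 16, 18,  11, 10, 12, 13),
  group = factor(rep(c("A", "B", "C"), each = 4))
)

grand_mean <- mean(d$value)
group_mean <- tapply(d$value, d$group, mean)
group_size <- tapply(d$value, d$group, length)

# Between variability: group means around the grand mean
ss_between <- sum(group_size * (group_mean - grand_mean)^2)
# Within variability: values around the mean of their own group
ss_within <- sum((d$value - ave(d$value, d$group))^2)

k <- nlevels(d$group)
n <- nrow(d)
f_value <- (ss_between / (k - 1)) / (ss_within / (n - k))

tab <- anova(lm(value ~ group, data = d))
print(f_value)
print(tab)
```

The F value is large when the between variability dominates the within variability, which is the evidence that at least one group mean differs.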
Lecture 8
1. Salary based on major:
# Read the tab-separated data file (the first row holds the column names)
toto <- read.table("C:\\Users\\Administrator\\Desktop\\Statistics\\29.09.txt",header=TRUE,sep="\t")
toto            # print the data
summary(toto)   # summary of each column
library(FactoMineR)
AovSum(Salary~Major,data=toto)   # analysis of variance of Salary by Major
Analysis of variance with the contrasts sum (the sum of the coefficients is 0)
Ftest
                 SS  df       MS  F value   Pr(>F)
Major      13705851   2  6852926   4.6805  0.01701 *
Residuals  43923917  30  1464131
Std. Error
t value
Pr(>|t|)
6.5953
0.1107
0.91263
-2.4598
0.01988 *
2.3605
0.02495 *
The question we are going to answer here is which Treatment (1, 2 or 3) should be
applied on the production line in order to obtain the largest weight of product.
From the first test, we can observe that the factor Treatment does not affect the Weight
obtained (P > 0.05), so it is impossible to decide which treatment to choose. Also, the
residual is quite big (938). Thus we ask whether the identified sources of variability are
enough or not.
Actually, Weight depends on BOTH Gender and Treatment. From the second test, it is clear
that the factor Gender was ignored and counted in the residual in the former code:
(938 ≈ 636 for Gender + 301). Therefore, the most important step in running these kinds of
tests is to identify all the sources of variability.
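How an ignored factor ends up in the residual can be sketched with simulated data. Everything below (effect sizes, number of animals, seed) is hypothetical and is not the data set of the lecture; base R lm() and anova() are used instead of AovSum so that the sketch needs no extra package.

```r
set.seed(1)   # arbitrary seed
# Hypothetical design: 3 treatments x 2 genders, 10 animals per cell
d <- expand.grid(rep = 1:10,
                 Treatment = factor(1:3),
                 Gender = factor(c("F", "M")))
# Assumed effects: a small treatment effect and a large gender effect
d$Weight <- 1 + 0.1 * (d$Treatment == "1") + 0.8 * (d$Gender == "M") +
  rnorm(nrow(d), sd = 0.2)

one <- anova(lm(Weight ~ Treatment, data = d))            # Gender ignored
two <- anova(lm(Weight ~ Treatment + Gender, data = d))   # Gender identified

res_one   <- one["Residuals", "Sum Sq"]
res_two   <- two["Residuals", "Sum Sq"]
gender_ss <- two["Gender", "Sum Sq"]

print(c(residual_without_gender = res_one,
        gender = gender_ss,
        residual_with_gender = res_two))
```

The residual of the first model equals the Gender sum of squares plus the residual of the second model, which mirrors the decomposition 938 ≈ 636 + 301 in the notes.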
The next step is to decide which treatment to choose based on the F test. It is clear that the 2nd
treatment would decrease the weight of the product. Comparing the p-values of the 1st and 3rd
treatments, they are both > 0.05. From the estimates of the parameters (column 1), the 1st
treatment gives the higher weight of chicken (1.18), although its p-value is smaller than that of
the last one. So the first treatment is the one we choose in this situation.
Lecture 9
The behaviour of one individual differs depending on the group. To observe how one variable
depends on another, we use the covariance. The covariance is a joint variance that describes
the behaviour of one variable with regard to another.
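$V(X) = E[(X - E(X))^2]$
$Cov(X,Y) = E[(X - E(X))(Y - E(Y))]$, we can also apply this equation to Cov(X,X), which gives V(X)
$Cov(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$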
We use the correlation coefficient to observe the relation between 2 variables.
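Covariance and correlation can be computed by hand and compared with the built-in cov() and cor(). The two variables below are hypothetical heights and weights chosen for illustration.

```r
# Hypothetical paired measurements (illustrative values)
x <- c(150, 160, 165, 170, 180)   # heights in cm
y <- c(50, 56, 61, 66, 75)        # weights in kg
n <- length(x)

# Sample covariance: joint variability of x and y around their means
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation coefficient: covariance made dimensionless, between -1 and 1
r <- cov_manual / (sd(x) * sd(y))

print(c(cov_manual = cov_manual, cov_builtin = cov(x, y)))
print(c(r_manual = r, r_builtin = cor(x, y)))
print(c(cov_xx = cov(x, x), var_x = var(x)))   # Cov(X,X) is the variance of X
```

Unlike the covariance, the correlation coefficient does not depend on the units of the two variables.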