Definition
Statistics is a standard method for collecting, organizing, summarizing, presenting, analyzing, and interpreting data in order to draw conclusions and make decisions based on the analysis of those data. Statistics are used extensively by engineers, managers, governments, business people, and others throughout the world.
Secondary data
Are the data suitable? Are the data adequate? Are the data reliable?
Primary data
Questioning, observation
Presentation of data
Classification
Frequency distribution
Class limits
Tabulation of data
Functions of Statistics
1. Presents facts in a definite form
2. Simplifies a mass of figures
3. Facilitates comparison
4. Helps in formulating and testing hypotheses
5. Helps in prediction
6. Helps in the formulation of suitable policies
Population: for example, students in this College. Sample: for example, any one of the classes.
If you have to declare a single value to represent a population or a sample, what do you use? The most common value is the mean, also called the average or the expected value. Another common value is the mode or the most likely (most common) value. Another value is the median or the middle of the data set.
Mean: This is the mathematical average of a set of numbers.
Median: This is the middle value of a set of data that has been arranged from lowest to highest.
Mode: The value that occurs the most in a set of data.
We can use expenditure as a good way of discussing these three measures. Suppose we wanted to know the average expenditure of NIFT students.
The mean is the sum of all of the values in the data set divided by the number of values.
The equation for calculating the mean is the same for both samples and populations.
Mean
$$\bar{x} = \frac{\sum x}{n}$$
Sample Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Where: $\bar{x}$ (x-bar) is the mean, $x_i$ are the data points, and $n$ is the sample size.
Population Mean
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Where: $\mu$ is the population mean, $x_i$ are the data points, and $N$ is the total number of observations in the population.
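As a minimal sketch of the definition above, the mean can be computed in a few lines of Python. The function name `mean` is an illustrative choice, and the data are the NIFT expenditure figures used in the examples that follow.

```python
# NIFT expenditure sample from the worked examples in these notes
expenditure = [5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500,
               12000, 13000, 15000, 15000, 17000, 30000, 110000]

def mean(values):
    """Sum of all the values divided by the number of values."""
    return sum(values) / len(values)

sample_mean = mean(expenditure)
print(sample_mean)  # 18100.0
```

The same function serves for a sample or a population, since only the symbol changes and not the arithmetic.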
The Mean
If the data have been sorted (ascending or descending), the median is the middle value (for an odd number of points) or the average of the two middle values (for an even number of points). The median is used to characterize data sets with a few extreme values that distort the relevance of the mean, such as house values or family incomes.
Median = the value at position $\frac{n+1}{2}$ in the sorted data
The Median
This is the middle value: 5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500, 12000, 13000, 15000, 15000, 17000, 30000, 110000. The median here is 11500. In cases where there are two middle values, we average the two.
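The rule for odd and even numbers of points can be sketched as follows. This is an illustrative helper (the name `median` is our own), not a prescribed method.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: a single middle value exists
        return ordered[mid]
    # Even number of points: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

expenditure = [5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500,
               12000, 13000, 15000, 15000, 17000, 30000, 110000]

print(median(expenditure))   # 11500
print(median([1, 2, 3, 4]))  # 2.5
```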
If the data is discrete, or has been grouped into discrete intervals, the mode is that value that occurs the most often. In other words it is the value most likely to occur.
The Mode
This is the most frequent value: 5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500, 12000, 13000, 15000, 15000, 17000, 30000, 110000. The mode here is 6000. Sometimes there is no mode, or even two modes!
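Because a data set can have more than one mode, a sketch of the calculation is easier to write if it returns a list. The helper name `modes` is an assumption made for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

expenditure = [5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500,
               12000, 13000, 15000, 15000, 17000, 30000, 110000]

print(modes(expenditure))      # [6000]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (two modes)
```

If every value occurs exactly once, the function returns all of the values, which is usually read as "no mode".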
So, given these values: 5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500, 12000, 13000, 15000, 15000, 17000, 30000, 110000
What is the best measure of central tendency for this random sample of NIFT students? Mean? ... 18100. Median? ... 11500. Mode? ... 6000.
Range: the distance between the lowest and the highest values in the set. For example, the time to drive to Churchgate is 2 hours plus or minus 15 minutes, or 105 to 135 minutes. Thus the range is 30 minutes.
Range
The highest value minus the lowest value. From our last example, the range would be: 110000 - 5000 = 105000
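A short sketch of the range calculation, using both examples from this section. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Highest value minus the lowest value."""
    return max(values) - min(values)

expenditure = [5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500,
               12000, 13000, 15000, 15000, 17000, 30000, 110000]

print(value_range(expenditure))  # 105000
print(value_range([105, 135]))   # 30 (the Churchgate drive, in minutes)
```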
The Variance of a population is the sum of the squares of the differences between the mean and the individual data points divided by the number of data points. The Variance of a sample is the sum of the squared differences divided by the number of data points less one.
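The two definitions differ only in the divisor, which a small sketch makes visible. The function names and the eight-value data set below are hypothetical, chosen only because the arithmetic is easy to follow.

```python
def population_variance(values):
    """Sum of squared differences from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical data: mean is 5 and the squared differences sum to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0  (32 / 8)
print(sample_variance(data))      # 32 / 7, a little larger
```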
Standard Deviation
This is a measure of the typical distance of your values from the mean. It is the square root of the variance.
Population
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
Sample "s"
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
The expression under the square root sign is the variance. It is important that you recognize the difference between these two equations!
Standard Deviation
Let's return to our NIFT random sample: 5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500, 12000, 13000, 15000, 15000, 17000, 30000, 110000
Follow these steps while we calculate the standard deviation as a class on the board:
1. Calculate the mean, which is 18100
2. Find the distance that each value has from the mean
3. Square each distance
4. Add up these squared distances and divide by the sample size minus 1
5. Take the square root of this number
Standard Deviation
X         Mean (x-bar)   X - x-bar   (X - x-bar)²
5000      18100          -13100      171,610,000
6000      18100          -12100      146,410,000
6000      18100          -12100      146,410,000
6000      18100          -12100      146,410,000
6000      18100          -12100      146,410,000
8000      18100          -10100      102,010,000
11000     18100          -7100       50,410,000
11500     18100          -6600       43,560,000
12000     18100          -6100       37,210,000
13000     18100          -5100       26,010,000
15000     18100          -3100       9,610,000
15000     18100          -3100       9,610,000
17000     18100          -1100       1,210,000
30000     18100          11900       141,610,000
110000    18100          91900       8,445,610,000
Standard Deviation
We sum the (X - x-bar)² values, divide the sum by n - 1, and take the square root of the result. This is the standard deviation. What is the square root?
Approx. 26,219
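The board calculation can be retraced step by step in code. This is a sketch of the sample formula (divisor n - 1) applied to the NIFT sample. The variable names are our own.

```python
import math

expenditure = [5000, 6000, 6000, 6000, 6000, 8000, 11000, 11500,
               12000, 13000, 15000, 15000, 17000, 30000, 110000]

n = len(expenditure)
mean = sum(expenditure) / n                    # step 1: the mean, 18100.0
deviations = [x - mean for x in expenditure]   # step 2: distance from the mean
squared = [d ** 2 for d in deviations]         # step 3: square each distance
sample_variance = sum(squared) / (n - 1)       # step 4: sum, divide by n - 1
sample_sd = math.sqrt(sample_variance)         # step 5: square root

print(round(sample_sd))  # 26219
```

The single value of 110000 contributes most of the sum of squares, which is why the standard deviation is larger than the mean itself.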
The difference in the divisors (N versus n - 1) results in s being slightly larger than σ. This accounts for the fact that s (from a sample) is an estimate of σ (of a population), which adds a degree of error to the value. Note: for large n the difference is trivial.
A Valuable Tool
The standard deviation is a rather recent invention and was originally devised by Gauss to explain the error observed in measured star positions. Today it is used in everything from Quality Control to Measuring Risk in financial investments.
Remember that grouped data is a collection of data that has been placed into categories. Thus we need to calculate the mean and standard deviation differently, but the idea is the same.
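Class interval    Frequency (f)    d = (X - A)/h
0 - 50            90               -2
50 - 100          150              -1
100 - 150         100              0
150 - 200         80               1
200 - 250         70               2
250 - 300         10               3
Total             500

$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} \times h = 125 + \frac{-80}{500} \times 50 = \text{Rs. } 117$$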
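The step-deviation calculation of the grouped mean can be sketched as below. Two assumptions are made and marked in the code: the class midpoints stand in for X, and the frequency of the 100 - 150 class is taken as 100 so that the frequencies add up to the stated total of 500.

```python
lower_limits = [0, 50, 100, 150, 200, 250]
# Assumption: the third frequency is 100, so that the total is 500
frequencies = [90, 150, 100, 80, 70, 10]
h = 50    # class width
A = 125   # assumed mean (midpoint of the 100 - 150 class)

# Assumption: each class is represented by its midpoint
midpoints = [lower + h / 2 for lower in lower_limits]
d = [(x - A) / h for x in midpoints]             # -2, -1, 0, 1, 2, 3

total_f = sum(frequencies)                        # 500
sum_fd = sum(f * di for f, di in zip(frequencies, d))  # -80.0

grouped_mean = A + h * sum_fd / total_f
print(grouped_mean)  # 117.0
```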
Advantages:
1) Familiar and intuitively clear to most people
2) Every data set has one and only one mean
3) Useful for performing statistical procedures
Disadvantages:
1) May be affected by extreme values
2) Tedious to compute
3) Difficult to compute for data sets with open-ended classes
Value (x)    d = (X - A)/h
15           -3
25           -2
35           -1
45           0
55           1
65           2
$$\text{Median} = L + \frac{N/2 - C.F.}{F} \times h$$
where L is the lower limit of the median class, N is the total frequency, C.F. is the cumulative frequency of the class preceding the median class, F is the frequency of the median class, and h is the class width.
N/2 = 103/2 = 51.5. This value lies in the class interval 40-50 (as seen from the cumulative frequency column). Hence L = 40.
Median = 40.34
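A sketch of the grouped median formula follows. The frequency table behind the 40.34 result is not reproduced in these notes, so the example uses a small hypothetical table instead. The function name and its arguments are also illustrative.

```python
def grouped_median(lower_limits, frequencies, h):
    """Median = L + ((N/2 - C.F.) / F) * h for equal-width classes."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("frequency table is empty")
    half = total / 2
    cumulative = 0  # C.F.: cumulative frequency before the current class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:
            # This is the median class: N/2 falls inside it
            return lower + (half - cumulative) / f * h
        cumulative += f

# Hypothetical table: classes 0-10, 10-20, 20-30, 30-40
lower_limits = [0, 10, 20, 30]
frequencies = [5, 8, 12, 5]   # N = 30, so N/2 = 15

# The median class is 20-30 (C.F. = 13, F = 12), so Median = 20 + (2/12) * 10
print(grouped_median(lower_limits, frequencies, 10))
```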
[Figure: two distribution curves with the positions of the mean, median, and mode marked]
$$S.D. = \sqrt{\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2} \times h$$
Class (x_i)    No. of Students (f)    d = (x - A)/h    f × d    f × d²
61 - 70        2                      -2               -4       8
71 - 80        10                     -1               -10      10
81 - 90        20                     0                0        0
91 - 100       17                     1                17       17
101 - 110      1                      2                2        4
Total          50                                      5        39
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} \times h = 85.5 + \frac{5}{50} \times 10 = 86.5$$
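$$S.D. = \sqrt{\frac{\sum_{i=1}^{n} f_i d_i^2}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}\right)^2} \times h = \sqrt{\frac{39}{50} - \left(\frac{5}{50}\right)^2} \times 10 \approx 8.77$$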
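The grouped mean and standard deviation for the student table can be checked with a short sketch. It assumes classes 61 - 70 through 101 - 110 represented by their midpoints, with A = 85.5 and h = 10. It evaluates the step-deviation formula with divisor N and, as an assumed variant in the manner of the sample formula, with divisor N - 1.

```python
import math

# Assumption: class midpoints for 61-70, 71-80, 81-90, 91-100, 101-110
midpoints = [65.5, 75.5, 85.5, 95.5, 105.5]
frequencies = [2, 10, 20, 17, 1]
A, h = 85.5, 10   # assumed mean and class width

d = [(x - A) / h for x in midpoints]                        # -2, -1, 0, 1, 2
n = sum(frequencies)                                         # 50
sum_fd = sum(f * di for f, di in zip(frequencies, d))        # 5.0
sum_fd2 = sum(f * di ** 2 for f, di in zip(frequencies, d))  # 39.0

grouped_mean = A + h * sum_fd / n                            # 86.5

# Divisor N, as in the S.D. formula for grouped data
sd_population = h * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

# Assumed variant: divisor N - 1, following the sample formula
sd_sample = h * math.sqrt((sum_fd2 - sum_fd ** 2 / n) / (n - 1))

print(grouped_mean)             # 86.5
print(round(sd_population, 2))  # 8.77
print(round(sd_sample, 2))      # 8.86
```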
Aside from being a measure of dispersion, the standard deviation:
1. Determines where the values of a frequency distribution lie in relation to the mean (standard scores)
2. Measures the percentage of items within specific ranges
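Coefficient of Variation
1. Measure of relative dispersion
2. Always a %
3. Shows variation relative to mean
4. Used to compare 2 or more groups
Population
$$CV = \frac{\sigma}{\mu}(100)$$
Sample
$$CV = \frac{s}{\bar{x}}(100)$$
Example: $\mu_a = 40$, $\sigma_a = 5$; $\mu_b = 160$, $\sigma_b = 15$
Solution
$$CV_a = \frac{\sigma_a}{\mu_a}(100) = \frac{5}{40}(100) = 12.5\%, \qquad CV_b = \frac{\sigma_b}{\mu_b}(100) = \frac{15}{160}(100) = 9.375\%$$
For a sample, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$ and $CV = \frac{s}{\bar{x}}(100)$.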
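The comparison of the two groups can be sketched in a few lines, using the means (40 and 160) and standard deviations (5 and 15) given above. The function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(5, 40)     # group a
cv_b = coefficient_of_variation(15, 160)   # group b

print(cv_a)  # 12.5
print(cv_b)  # 9.375
```

Group b has the larger standard deviation in absolute terms, but group a shows more variation relative to its mean.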