Types of Statistics
Descriptive statistics are used to describe a data set.
Inferential statistics allow you to infer something about the parameters of the population based on the statistics of the sample and the various tests we perform on the sample.
The three most commonly used measures of central tendency are the following.
Mean
The sum of the values divided by the number of values -- often called the "average."
Example: The mean of the five numbers 7, 12, 24, 20, and 19 is (7 + 12 + 24 + 20 + 19)/5 = 16.4.
Median
The middle value when the values are arranged in order of magnitude.
• If you have an even number of values, the median is the arithmetic mean (see above) of the two middle values.
Example: The median of the same five numbers (7, 12, 24, 20, 19) is 19.
Mode
The most frequently occurring value (or values).
Example: For individuals having the following ages -- 18, 18, 19, 20, 20, 20, 21, and 23 -- the mode is 20.
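As a quick illustration, all three measures can be computed with Python's standard-library statistics module; the data sets below are the ones from the examples above.

```python
# Mean, median, and mode via the standard-library statistics module.
import statistics

values = [7, 12, 24, 20, 19]
print(statistics.mean(values))    # (7 + 12 + 24 + 20 + 19) / 5 = 16.4
print(statistics.median(values))  # sorted: 7, 12, 19, 20, 24 -> middle value 19

ages = [18, 18, 19, 20, 20, 20, 21, 23]
print(statistics.mode(ages))      # 20 occurs most often
```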
Dispersion
Measures of central tendency (mean, mode, and median) locate the distribution within the range of possible values; measures of dispersion describe the spread of values.
The dispersion of values within variables is especially important in social and political research.
Range
The range is the simplest measure of dispersion. The range can be thought of in two ways:
As a quantity: the difference between the highest and lowest scores in a distribution.
As an interval: the lowest and highest scores taken together (e.g., the five numbers above range from 7 to 24).
Variance
In calculating the variance of data points, we square the difference between each point
and the mean because if we summed the differences directly, the result would always be
zero. For example, suppose three friends work on campus and earn $5.50, $7.50, and $8
per hour, respectively. The mean of these values is (5.50 + 7.50 + 8)/3 = $7 per hour. If
we summed the differences of each wage from the mean, we would get (5.50 - 7) +
(7.50 - 7) + (8 - 7) = -1.50 + .50 + 1 = 0. Instead, we square the terms to obtain a sum of
squared differences equal to 2.25 + .25 + 1 = 3.50. This figure is a measure of dispersion
in the set of scores.
The mean is the number that minimizes the sum of squared differences of the scores from it.
In other words, if we used any number other than the mean as the value from which each
score is subtracted, the resulting sum of squared differences would be greater. (You can
try it yourself -- see if any number other than 7 can be plugged into the preceding
calculation and yield a sum of squared differences less than 3.50.)
The standard deviation is simply the square root of the variance. In some sense, taking the
square root of the variance "undoes" the squaring of the differences that we did when we
calculated the variance.
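The wage example above can be checked in a few lines of Python; note that dividing the 3.50 figure by the number of scores gives the variance proper, and its square root gives the standard deviation.

```python
# Verifying the worked example: raw deviations sum to zero,
# squared deviations sum to 3.50.
wages = [5.50, 7.50, 8.00]
mean = sum(wages) / len(wages)              # $7.00 per hour

deviations = [w - mean for w in wages]      # -1.50, 0.50, 1.00
sum_sq = sum(d * d for d in deviations)     # 2.25 + 0.25 + 1.00 = 3.50

variance = sum_sq / len(wages)              # population variance: 3.50 / 3
std_dev = variance ** 0.5                   # square root "undoes" the squaring
```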
Population
σ = √( Σ(xᵢ − μ)² / N )
Sample
s = √( Σ(xᵢ − x̄)² / (n − 1) )
In these equations, μ is the population mean, x̄ is the sample mean, N is the total number of scores in the population, and n is the number of scores in the sample.
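A minimal sketch of the population/sample distinction, assuming the usual convention that the population variance divides by N while the sample variance divides by n − 1 (Python's statistics module implements both):

```python
# Population formulas divide by N; sample formulas divide by n - 1.
import statistics

data = [5.50, 7.50, 8.00]

pop_var = statistics.pvariance(data)   # sum of squared deviations / N
samp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
pop_sd = statistics.pstdev(data)       # square root of the population variance
samp_sd = statistics.stdev(data)       # square root of the sample variance
```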
Coefficient of Variation (C.V.)
To compare the variation (dispersion) of two different series, relative measures of standard
deviation must be calculated. This is known as the coefficient of variation or the coefficient of s.d.
Its formula is
C.V. = (σ / x̄) × 100
Remark: It is given as a percentage and is used to compare the consistency or variability of two
or more series. The higher the C.V., the higher the variability; the lower the C.V., the higher the
consistency of the data.
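The remark can be illustrated with two made-up series that share the same mean but differ in spread; coefficient_of_variation is a hypothetical helper implementing C.V. = (s.d. / mean) × 100.

```python
# Comparing the consistency of two series via the coefficient of variation.
import statistics

def coefficient_of_variation(data):
    return statistics.pstdev(data) / statistics.mean(data) * 100

series_a = [48, 50, 52, 50, 50]   # tightly clustered around a mean of 50
series_b = [20, 80, 50, 10, 90]   # same mean of 50, far more spread out

# The series with the higher C.V. is the less consistent one.
print(coefficient_of_variation(series_a))   # about 2.5
print(coefficient_of_variation(series_b))   # about 63.2
```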
Correlational vs. experimental research. Most empirical research belongs clearly to one of
those two general categories. In correlational research we do not (or at least try not to) influence
any variables but only measure them and look for relations (correlations) between some set of
variables, such as blood pressure and cholesterol level. In experimental research, we manipulate
some variables and then measure the effects of this manipulation on other variables; for example,
a researcher might artificially increase blood pressure and then record cholesterol level. Data
analysis in experimental research also comes down to calculating "correlations" between
variables, specifically, those manipulated and those affected by the manipulation. However,
experimental data may potentially provide qualitatively better information: Only experimental
data can conclusively demonstrate causal relations between variables. For example, if we find
that whenever we change variable A, variable B also changes, then we can conclude that "A
influences B." Data from correlational research can only be "interpreted" in causal terms based
on some theories that we have, but correlational data cannot conclusively prove causality.
Dependent vs. independent variables. Independent variables are those that are manipulated
whereas dependent variables are only measured or registered. This distinction appears
terminologically confusing to many because, as some students say, "all variables depend on
something." However, once you get used to this distinction, it becomes indispensable. The terms
dependent and independent variable apply mostly to experimental research where some variables
are manipulated, and in this sense they are "independent" from the initial reaction patterns,
features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on
the manipulation or experimental conditions. That is to say, they depend on "what the subject
will do" in response. Somewhat contrary to the nature of this distinction, these terms are also
used in studies where we do not literally manipulate independent variables, but only assign
subjects to "experimental groups" based on some pre-existing properties of the subjects. For
example, if in an experiment males are compared with females regarding their white cell count
(WCC), gender could be called the independent variable and WCC the dependent variable.
Measurement scales. Variables differ in "how well" they can be measured, i.e., in how much
measurable information their measurement scale can provide. There is obviously some
measurement error involved in every measurement, which determines the "amount of
information" that we can obtain. Another factor that determines the amount of information that
can be provided by a variable is its "type of measurement scale." Specifically, variables are
classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.
a. Nominal variables allow for only qualitative classification. That is, they can be measured
only in terms of whether the individual items belong to some distinctively different
categories, but we cannot quantify or even rank order those categories. For example, all
we can say is that 2 individuals are different in terms of variable A (e.g., they are of
different race), but we cannot say which one "has more" of the quality represented by the
variable. Typical examples of nominal variables are gender, race, color, city, etc.
b. Ordinal variables allow us to rank order the items we measure in terms of which has less
and which has more of the quality represented by the variable, but still they do not allow
us to say "how much more." A typical example of an ordinal variable is the
socioeconomic status of families. For example, we know that upper-middle is higher than
middle but we cannot say that it is, for example, 18% higher. Also this very distinction
between nominal, ordinal, and interval scales itself represents a good example of an
ordinal variable. For example, we can say that nominal measurement provides less
information than ordinal measurement, but we cannot say "how much less" or how this
difference compares to the difference between ordinal and interval scales.
c. Interval variables allow us not only to rank order the items that are measured, but also to
quantify and compare the sizes of differences between them. For example, temperature,
as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say
that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an
increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
d. Ratio variables are very similar to interval variables; in addition to all the properties of
interval variables, they feature an identifiable absolute zero point, thus they allow for
statements such as x is two times more than y. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio scale,
not only can we say that a temperature of 200 degrees is higher than one of 100 degrees,
we can correctly state that it is twice as high. Interval scales do not have the ratio
property. Most statistical data analysis procedures do not distinguish between the interval
and ratio properties of the measurement scales.
Correlations
The correlation is one of the most common and most useful statistics. A correlation is a single
number that describes the degree of relationship between two variables.
Purpose (What is Correlation?) Correlation is a measure of the relation between two or more
variables. The measurement scales used should be at least interval scales, but other correlation
coefficients are available to handle other types of data. Correlation coefficients can range from
-1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of
+1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.
In statistics, Spearman's rank correlation coefficient, named after Charles Spearman and often
denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of correlation – that is, it
assesses how well an arbitrary monotonic function could describe the relationship between two
variables, without making any assumptions about the frequency distribution of the variables.
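A small sketch of Pearson's r computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / (Sx · Sy), on made-up data; values near +1, −1, and 0 match the interpretation given above.

```python
# Pearson's correlation coefficient computed from its definition.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly linear: close to +1.0
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfectly negative: close to -1.0
```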
Linear Regression
Correlation gives us an idea of the magnitude and direction of the relationship between correlated
variables. Now it is natural to think of a method that helps us estimate the value of one variable
when the other is known. Also, correlation does not imply causation. The fact that the variables x
and y are correlated does not necessarily mean that x causes y or vice versa. For example, you
would find that the number of schools in a town is correlated with the number of accidents in the
town. The reason for the accidents is not school attendance; both quantities simply increase with
a third variable, the town's population. A statistical procedure called regression is concerned
with causation in a relationship among variables. It assesses the contribution of one or more
causing variables (independent variables) to the variable being caused (the dependent variable).
When there is only one independent variable, the relationship is expressed by a straight line.
This procedure is called simple linear regression.
Regression can be defined as a method that estimates the value of one variable when that of
other variable is known, provided the variables are correlated. The dictionary meaning of
regression is "to go backward." It was used for the first time by Sir Francis Galton in his research
paper "Regression towards mediocrity in hereditary stature."
Lines of Regression: In a scatter plot, we have seen that if the variables are highly correlated then
the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight line
such that all points are close to it from both sides. Such a line can be taken as an ideal
representation of the variation. This line is called the line of best fit if it minimizes the distances of
all data points from it.
This line is called the line of regression. Prediction is now easy because all we need to do is
extend the line and read off the value. Thus, to obtain a line of regression, we need to have a line
of best fit. But statisticians don't measure the distances by dropping perpendiculars from points
onto the line. They measure deviations (or errors, or residuals, as they are called) (i) vertically
and (ii) horizontally. Thus we get two lines of regression, as shown in figures (1) and (2).
Measuring deviations vertically gives the regression line of y on x; its form is y = a + b x.
Measuring deviations horizontally gives the regression line of x on y; its form is x = a' + b' y.
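The two lines can be sketched with ordinary least squares on made-up data; fit_line is a hypothetical helper, and fitting y on x versus x on y generally yields different lines.

```python
# Two regression lines: minimizing vertical deviations gives y on x,
# minimizing horizontal deviations gives x on y.
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

a_yx, b_yx = fit_line(x, y)   # regression of y on x: y = a + b*x
a_xy, b_xy = fit_line(y, x)   # regression of x on y: x = a' + b'*y

# Prediction: extend the y-on-x line and read off the value at a new x.
y_at_6 = a_yx + b_yx * 6
```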
Regression can be used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships.
Correlation and linear regression are not the same. Consider these differences:
• Correlation quantifies the degree to which two variables are related. Correlation does not
find a best-fit line (that is regression). You simply are computing a correlation coefficient
(r) that tells you how much one variable tends to change when the other one does.
• With correlation you don't have to think about cause and effect. You simply quantify how
well two variables relate to each other. With regression, you do have to think about cause
and effect as the regression line is determined as the best way to predict Y from X.
• With correlation, it doesn't matter which of the two variables you call "X" and which you
call "Y". You'll get the same correlation coefficient if you swap the two. With linear
regression, the decision of which variable you call "X" and which you call "Y" matters a
lot, as you'll get a different best-fit line if you swap the two. The line that best predicts Y
from X is not the same as the line that predicts X from Y.
• Correlation is almost always used when you measure both variables. It rarely is
appropriate when one variable is something you experimentally manipulate. With linear
regression, the X variable is often something you experimentally manipulate (time,
concentration...) and the Y variable is something you measure.
Probable Error
The probable error is used in interpreting Karl Pearson's coefficient of correlation 'r'; it indicates
the extent to which 'r' can be relied upon, noting that 'r' depends on the random sampling and its
conditions. It is given by
P.E. = 0.6745 × (1 − r²) / √n
where n is the number of pairs of observations.
i. If the value of r is less than the P.E., then there is no evidence of correlation, i.e., r is not
significant.
ii. If r is more than 6 times the P.E., 'r' is practically certain, i.e., significant.
iii. By adding and subtracting the P.E. to and from 'r', we get the upper and lower limits within
which 'r' of the population can be expected to lie.
Symbolically: ρ = r ± P.E.
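The three rules can be sketched in code, assuming the standard formula P.E. = 0.6745 × (1 − r²) / √n; the values of r and n below are made up.

```python
# Probable error of Pearson's r and the rules of thumb built on it.
import math

def probable_error(r, n):
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

r, n = 0.8, 64
pe = probable_error(r, n)        # 0.6745 * 0.36 / 8, about 0.030

significant = r > 6 * pe         # rule (ii): r is practically certain
lower, upper = r - pe, r + pe    # rule (iii): limits for the population's r
```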
Hypothesis Testing
1. The first step is to specify the null hypothesis. For a one-tailed test, the null
hypothesis is either that a parameter is greater than or equal to zero or that a parameter is
less than or equal to zero. If the prediction is that μ1 is larger than μ2, then the null
hypothesis (the reverse of the prediction) is μ2 - μ1 ≥ 0. This is equivalent to μ1 ≤ μ2.
2. The second step is to specify the α level also known as the significance level. Typical
values are 0.05 and 0.01.
3. The third step is to compute the probability value (also known as the p value). This is the
probability of obtaining a sample statistic as different or more different from the
parameter specified in the null hypothesis given that the null hypothesis is true.
4. Next, compare the probability value with the α level. If the probability value is lower,
then you reject the null hypothesis. Keep in mind that rejecting the null hypothesis is not an
all-or-none decision. The lower the probability value, the more confidence you can have
that the null hypothesis is false. However, if your probability value is higher than the
conventional α level of 0.05, most scientists will consider your findings inconclusive.
Failure to reject the null hypothesis does not constitute support for the null hypothesis. It
just means you do not have sufficiently strong data to reject it.
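The four steps can be sketched as a one-tailed z-test for a single mean (a t-test would be more exact for a sample this small; the numbers are made up).

```python
# Steps 1-4 of hypothesis testing as a one-tailed z-test.
import math
import statistics

mu0 = 100      # step 1: H0 says the true mean is <= 100
alpha = 0.05   # step 2: the significance level

sample = [104, 101, 99, 107, 103, 98, 105, 102, 106, 100]
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)
z = (xbar - mu0) / (s / math.sqrt(n))

# Step 3: one-tailed p value from the standard normal CDF.
p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Step 4: reject H0 only if the p value falls below alpha.
reject = p_value < alpha
```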