
Statistics

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or
explanation, and presentation of data, as well as to prediction and forecasting based on
data. It is applicable to a wide variety of academic disciplines, from the natural and
social sciences to the humanities, government and business.

Types of Statistics
Descriptive Statistics are used to describe the data set

Examples: graphing, calculating averages, looking for extreme scores

Inferential Statistics allow you to infer something about the parameters of the
population based on the statistics of the sample and the various tests we perform on the
sample.

Examples: Chi-Square, T-Tests, Correlations, ANOVA

Measures of Central Tendency


A measure of central tendency summarises data using the value that is most typical. The three
most commonly used measures of central tendency are the mean, the median and the mode,
described below.
Mean
The sum of the values divided by the number of values--often called the "average."

• Add all of the values together.


• Divide by the number of values to obtain the mean.

Example: The mean of 7, 12, 24, 20, 19 is (7 + 12 + 24 + 20 + 19) / 5 = 16.4.


Median
The value which divides the values into two equal halves, with half of the values being
lower than the median and half higher than the median.

• Sort the values into ascending order.


• If you have an odd number of values, the median is the middle value.

• If you have an even number of values, the median is the arithmetic mean (see above) of
the two middle values.
Example: The median of the same five numbers (7, 12, 24, 20, 19) is 19.
Mode
The most frequently-occurring value (or values).

• Calculate the frequencies for all of the values in the data.


• The mode is the value (or values) with the highest frequency.

Example: For individuals having the following ages -- 18, 18, 19, 20, 20, 20, 21, and 23, the
mode is 20.
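As a small illustrative sketch (using Python's standard-library statistics module and the example
values above), all three measures can be computed directly:

import statistics

values = [7, 12, 24, 20, 19]             # data from the mean and median examples
ages = [18, 18, 19, 20, 20, 20, 21, 23]  # data from the mode example

print(statistics.mean(values))    # 16.4
print(statistics.median(values))  # 19
print(statistics.mode(ages))      # 20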

Dispersion
Measures of central tendency (mean, median and mode) locate the distribution within the range
of possible values, while measures of dispersion describe the spread of values.

The dispersion of values within variables is especially important in social and political research
because:

• Dispersion or "variation" in observations is what we seek to explain.


• Researchers want to know WHY some cases lie above average and others below average
for a given variable:
o TURNOUT in voting: why do some states show higher rates than others?
o CRIMES in cities: why are there differences in crime rates?
o CIVIL STRIFE among countries: what accounts for differing amounts?
• Much of statistical explanation aims at explaining DIFFERENCES in observations -- also
known as
o VARIATION, or the more technical term, VARIANCE.

Range
The range is the simplest measure of dispersion: the difference between the highest and lowest
scores in a distribution.

Example: For the values 7, 12, 24, 20, 19, the range is 24 - 7 = 17.

Variance and Standard Deviation


By far the most commonly used measures of dispersion in the social sciences are
variance and standard deviation. Variance is the average squared difference of scores
from the mean score of a distribution. Standard deviation is the square root of the
variance.

In calculating the variance of data points, we square the difference between each point
and the mean because if we summed the differences directly, the result would always be
zero. For example, suppose three friends work on campus and earn $5.50, $7.50, and $8
per hour, respectively. The mean of these values is $(5.50 + 7.50 + 8)/3 = $7 per hour. If
we summed the differences of each wage from the mean, we would get (5.50 - 7) + (7.50 -
7) + (8 - 7) = -1.50 + .50 + 1 = 0. Instead, we square the terms and sum them to obtain
2.25 + .25 + 1 = 3.50; dividing this sum of squared differences by the number of scores
gives the variance. This sum is a measure of dispersion in the set of scores.

The sum of squared differences of the scores from the mean is the minimum possible sum.
In other words, if we used any number other than the mean as the value from which each
score is subtracted, the resulting sum of squared differences would be greater. (You can
try it yourself -- see if any number other than 7 can be plugged into the preceding
calculation and yield a sum of squared differences less than 3.50.)
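A quick sketch in Python (the list of candidate values is arbitrary, chosen only for illustration)
shows that the mean gives the smallest sum of squared differences:

wages = [5.50, 7.50, 8.00]   # hourly wages from the example above

def sum_of_squared_differences(data, centre):
    # Sum of squared differences of each score from a chosen centre value
    return sum((value - centre) ** 2 for value in data)

# The mean (7) yields the minimum, 3.50; every other candidate gives more
for candidate in [5, 6, 6.5, 7, 7.5, 8]:
    print(candidate, sum_of_squared_differences(wages, candidate))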

The standard deviation is simply the square root of the variance. In some sense, taking the
square root of the variance "undoes" the squaring of the differences that we did when we
calculated the variance.

Variance and standard deviation of a population are designated by σ² and σ, respectively.

Variance and standard deviation of a sample are designated by s² and s, respectively.

                Variance                          Standard Deviation

Population      σ² = Σ(Xᵢ - μ)² / N               σ = √( Σ(Xᵢ - μ)² / N )

Sample          s² = Σ(xᵢ - x̄)² / (n - 1)         s = √( Σ(xᵢ - x̄)² / (n - 1) )

In these equations, μ is the population mean, x̄ is the sample mean, N is the total
number of scores in the population, and n is the number of scores in the sample.
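A brief sketch using Python's standard-library statistics module and the wage figures above shows
how the population and sample formulas differ only in their divisor:

import statistics

wages = [5.50, 7.50, 8.00]          # hourly wages from the example above

# Population variance and standard deviation (divide by N)
print(statistics.pvariance(wages))  # 3.50 / 3 ≈ 1.17
print(statistics.pstdev(wages))     # ≈ 1.08

# Sample variance and standard deviation (divide by n - 1)
print(statistics.variance(wages))   # 3.50 / 2 = 1.75
print(statistics.stdev(wages))      # ≈ 1.32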

Coefficient of Variation (C.V.)

The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a
percentage:

C. V. = (standard deviation / mean) × 100

To compare the variation (dispersion) of two different series, a relative measure of standard
deviation such as this must be calculated; it is therefore also known as the co-efficient of
standard deviation.

Remark: Because it is given as a percentage, it is used to compare the consistency or variability
of two or more series. The higher the C. V., the higher the variability; the lower the C. V., the
higher the consistency of the data.
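As a hedged sketch (the two series below are invented purely for illustration), the C.V. makes
series measured on different scales comparable:

import statistics

def coefficient_of_variation(data):
    # C.V. = (standard deviation / mean) * 100
    return statistics.pstdev(data) / statistics.mean(data) * 100

exam_scores = [48, 50, 52, 49, 51]        # hypothetical series A
monthly_sales = [110, 140, 95, 160, 120]  # hypothetical series B

print(coefficient_of_variation(exam_scores))    # small C.V. -> more consistent
print(coefficient_of_variation(monthly_sales))  # larger C.V. -> more variable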

What are variables? Variables are things that we measure, control, or
manipulate in research. They differ in many respects, most notably in the role they
are given in our research and in the type of measures that can be applied to them.

Correlational vs. experimental research. Most empirical research belongs clearly to one of
those two general categories. In correlational research we do not (or at least try not to) influence
any variables but only measure them and look for relations (correlations) between some set of
variables, such as blood pressure and cholesterol level. In experimental research, we manipulate
some variables and then measure the effects of this manipulation on other variables; for example,
a researcher might artificially increase blood pressure and then record cholesterol level. Data
analysis in experimental research also comes down to calculating "correlations" between
variables, specifically, those manipulated and those affected by the manipulation. However,
experimental data may potentially provide qualitatively better information: Only experimental
data can conclusively demonstrate causal relations between variables. For example, if we find
that whenever we change variable A, variable B also changes, then we can conclude that "A
influences B." Data from correlational research can only be "interpreted" in causal terms based
on some theories that we have, but correlational data cannot conclusively prove causality.

Dependent vs. independent variables. Independent variables are those that are manipulated
whereas dependent variables are only measured or registered. This distinction appears
terminologically confusing to many because, as some students say, "all variables depend on
something." However, once you get used to this distinction, it becomes indispensable. The terms
dependent and independent variable apply mostly to experimental research where some variables
are manipulated, and in this sense they are "independent" from the initial reaction patterns,
features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on
the manipulation or experimental conditions. That is to say, they depend on "what the subject
will do" in response. Somewhat contrary to the nature of this distinction, these terms are also
used in studies where we do not literally manipulate independent variables, but only assign
subjects to "experimental groups" based on some pre-existing properties of the subjects. For
example, if in an experiment, males are compared with females regarding their white cell count
(WCC), Gender could be called the independent variable and WCC the dependent variable.

Measurement scales. Variables differ in "how well" they can be measured, i.e., in how much
measurable information their measurement scale can provide. There is obviously some
measurement error involved in every measurement, which determines the "amount of
information" that we can obtain. Another factor that determines the amount of information that
can be provided by a variable is its "type of measurement scale." Specifically, variables are
classified as (a) nominal, (b) ordinal, (c) interval or (d) ratio.

a. Nominal variables allow for only qualitative classification. That is, they can be measured
only in terms of whether the individual items belong to some distinctively different
categories, but we cannot quantify or even rank order those categories. For example, all
we can say is that 2 individuals are different in terms of variable A (e.g., they are of
different race), but we cannot say which one "has more" of the quality represented by the
variable. Typical examples of nominal variables are gender, race, color, city, etc.
b. Ordinal variables allow us to rank order the items we measure in terms of which has less
and which has more of the quality represented by the variable, but still they do not allow
us to say "how much more." A typical example of an ordinal variable is the
socioeconomic status of families. For example, we know that upper-middle is higher than
middle but we cannot say that it is, for example, 18% higher. Also this very distinction
between nominal, ordinal, and interval scales itself represents a good example of an
ordinal variable. For example, we can say that nominal measurement provides less
information than ordinal measurement, but we cannot say "how much less" or how this
difference compares to the difference between ordinal and interval scales.
c. Interval variables allow us not only to rank order the items that are measured, but also to
quantify and compare the sizes of differences between them. For example, temperature,
as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say
that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an
increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
d. Ratio variables are very similar to interval variables; in addition to all the properties of
interval variables, they feature an identifiable absolute zero point, thus they allow for
statements such as x is two times more than y. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio scale,
not only can we say that a temperature of 200 degrees is higher than one of 100 degrees,
we can correctly state that it is twice as high. Interval scales do not have the ratio
property. Most statistical data analysis procedures do not distinguish between the interval
and ratio properties of the measurement scales.

Correlations
The correlation is one of the most common and most useful statistics. A correlation is a single
number that describes the degree of relationship between two variables.

Purpose (What is Correlation?) Correlation is a measure of the relation between two or more
variables. The measurement scales used should be at least interval scales, but other correlation
coefficients are available to handle other types of data. Correlation coefficients can range from
-1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of
+1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.

How to Interpret the Values of Correlations. As mentioned before, the correlation


coefficient (r) represents the linear relationship between two variables. If the correlation
coefficient is squared, then the resulting value (r², the coefficient of determination) will represent
the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of
the relationship). In order to evaluate the correlation between variables, it is important to know
this "magnitude" or "strength" as well as the significance of the correlation.

In statistics, Spearman's rank correlation coefficient, named after Charles Spearman and often
denoted by the Greek letter ρ (rho) or as rₛ, is a non-parametric measure of correlation – that is, it
assesses how well an arbitrary monotonic function could describe the relationship between two
variables, without making any assumptions about the frequency distribution of the variables.
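A minimal sketch using SciPy (assuming scipy is installed; the rankings are invented) computes
Spearman's ρ from ranked data:

from scipy.stats import spearmanr

# Hypothetical rankings of the same eight items by two judges
judge_1 = [1, 2, 3, 4, 5, 6, 7, 8]
judge_2 = [2, 1, 4, 3, 6, 5, 8, 7]

rho, p_value = spearmanr(judge_1, judge_2)
print(rho)      # close to +1: the two rankings largely agree
print(p_value)  # significance of the observed rank correlation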

Linear regression is a form of regression analysis in which the relationship between
one or more independent variables and another variable, called the dependent variable, is modeled
by a least squares function, called the linear regression equation. This function is a linear
combination of one or more model parameters, called regression coefficients. A linear regression
equation with one independent variable represents a straight line. The results are subject to
statistical analysis.
From algebra, any straight line can be described as:
Y = a + bX, where a is the intercept and b is the slope

Linear Regression

Correlation gives us an idea of the magnitude and direction of the relationship between correlated
variables. It is therefore natural to think of a method that helps us estimate the value of one
variable when the other is known. Note also that correlation does not imply causation. The fact
that the variables x and y are correlated does not necessarily mean that x causes y or vice versa.
For example, you would find that the number of schools in a town is correlated with the number
of accidents in the town. The reason for these accidents is not school attendance; both of these
quantities simply increase with what is known as the population. A statistical procedure called
regression is concerned with causation in a relationship among variables. It assesses the
contribution of one or more variables, called causing or independent variables, to the variable
which is being caused (the dependent variable). When there is only one independent variable, the
relationship is expressed by a straight line. This procedure is called simple linear regression.

Regression can be defined as a method that estimates the value of one variable when that of
other variable is known, provided the variables are correlated. The dictionary meaning of
regression is "to go backward." It was used for the first time by Sir Francis Galton in his research
paper "Regression towards mediocrity in hereditary stature."

Lines of Regression: In a scatter plot, we have seen that if the variables are highly correlated then
the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight line
such that all points are close to it on both sides. Such a line can be taken as an ideal
representation of the variation. This line is called the line of best fit if it minimizes the distances
of all data points from it.

This line is called the line of regression. Prediction is now easy because all we need to do is to
extend the line and read off the value. Thus to obtain a line of regression, we need to have a line
of best fit. But statisticians don't measure the distances by dropping perpendiculars from points
onto the line. They measure deviations (or errors, or residuals as they are called) (i) vertically
and (ii) horizontally. Thus we get two lines of regression, (1) and (2) below.

(1) Line of regression of y on x

Its form is y = a + b x

It is used to estimate y when x is given

(2) Line of regression of x on y

Its form is x = a + b y

It is used to estimate x when y is given.

They are obtained (i) graphically, from a scatter plot, or (ii) mathematically, by the method of
least squares.
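The following sketch (using NumPy's least-squares fit; the data pairs are invented) obtains both
regression lines and shows that they are, in general, different:

import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

# (1) Line of regression of y on x: y = a + b*x, fitted by least squares
b_yx, a_yx = np.polyfit(x, y, 1)

# (2) Line of regression of x on y: x = a' + b'*y
b_xy, a_xy = np.polyfit(y, x, 1)

print(f"y on x: y = {a_yx:.2f} + {b_yx:.2f} x")  # used to estimate y from x
print(f"x on y: x = {a_xy:.2f} + {b_xy:.2f} y")  # used to estimate x from y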

Regression can be used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships.

What is the difference between correlation and linear regression?



Correlation and linear regression are not the same. Consider these differences:
• Correlation quantifies the degree to which two variables are related. Correlation does not
find a best-fit line (that is regression). You simply are computing a correlation coefficient
(r) that tells you how much one variable tends to change when the other one does.
• With correlation you don't have to think about cause and effect. You simply quantify how
well two variables relate to each other. With regression, you do have to think about cause
and effect as the regression line is determined as the best way to predict Y from X.
• With correlation, it doesn't matter which of the two variables you call "X" and which you
call "Y". You'll get the same correlation coefficient if you swap the two. With linear
regression, the decision of which variable you call "X" and which you call "Y" matters a
lot, as you'll get a different best-fit line if you swap the two. The line that best predicts Y
from X is not the same as the line that predicts X from Y.
• Correlation is almost always used when you measure both variables. It rarely is
appropriate when one variable is something you experimentally manipulate. With linear
regression, the X variable is often something you experimentally manipulate (time,
concentration...) and the Y variable is something you measure.

Probable Error

The probable error is used to help in the interpretation of Karl Pearson's coefficient of
correlation 'r'. It indicates how far the value of 'r' can be relied upon, but note that 'r' depends
on the random sampling and its conditions. It is given by

P. E. = 0.6745 × (1 - r²) / √n

where n is the number of pairs of observations.

i. If the value of r is less than the P. E., then there is no evidence of correlation, i.e. r is not
significant.
ii. If r is more than 6 times the P. E., then 'r' is practically certain, i.e. significant.
iii. By adding and subtracting the P. E. to and from 'r', we get the upper and lower limits within
which 'r' of the population can be expected to lie.

Symbolically, ρ = r ± P. E.

where ρ is the correlation (coefficient) of the population.
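A short sketch in plain Python (the values of r and n are hypothetical) applies the probable-error
rules above:

import math

def probable_error(r, n):
    # P.E. = 0.6745 * (1 - r^2) / sqrt(n)
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

r, n = 0.8, 25            # hypothetical correlation from 25 pairs of observations
pe = probable_error(r, n)

print(pe)                 # probable error of r
print(r > 6 * pe)         # True -> r is practically certain (significant)
print(r - pe, r + pe)     # limits within which the population correlation may lie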

Steps in hypothesis testing


1. The first step is to specify the null hypothesis. For a two tailed test, the null hypothesis is
typically that a parameter equals zero although there are exceptions. A typical null
hypothesis is μ1 - μ2 = 0 which is equivalent to μ1 = μ2. For a one-tailed test, the null
hypothesis is either that a parameter is greater than or equal to zero or that a parameter is
less than or equal to zero. If the prediction is that μ1 is larger than μ2, then the null
hypothesis (the reverse of the prediction) is μ2 - μ1 ≥ 0. This is equivalent to μ1 ≤ μ2.
2. The second step is to specify the α level also known as the significance level. Typical
values are 0.05 and 0.01.
3. The third step is to compute the probability value (also known as the p value). This is the
probability, given that the null hypothesis is true, of obtaining a sample statistic that differs
from the parameter specified in the null hypothesis by as much as or more than the statistic
actually observed.
4. Next, compare the probability value with the α level. If the probability value is lower, then
you reject the null hypothesis. Keep in mind that rejecting the null hypothesis is not an
all-or-none decision. The lower the probability value, the more confidence you can have
that the null hypothesis is false. However, if your probability value is higher than the
conventional α level of 0.05, most scientists will consider your findings inconclusive.
Failure to reject the null hypothesis does not constitute support for the null hypothesis. It
just means you do not have sufficiently strong data to reject it.
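As a hedged illustration of the four steps (using SciPy's independent-samples t-test; the two
groups of scores are invented):

from scipy.stats import ttest_ind

# Step 1: null hypothesis is mu1 - mu2 = 0 (two-tailed test)
group_1 = [23, 25, 28, 30, 27, 26, 29]  # hypothetical scores for group 1
group_2 = [20, 22, 24, 23, 21, 25, 22]  # hypothetical scores for group 2

# Step 2: choose the significance level
alpha = 0.05

# Step 3: compute the probability (p) value
t_statistic, p_value = ttest_ind(group_1, group_2)

# Step 4: compare the p value with the α level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")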
