Section-2

Data Analysis
Short Questions:
Question 1: What is data?
Answer: Data is the substrate for the decision-making process. Data is a measure of some
observable characteristic of a set of objects of interest. Statistics is a vast
area of applied mathematics wherein data are collected, classified, presented and analyzed
for a specific purpose.
Question 2: What is the role of statistics in business decisions?
Answer: Statistics plays an important role in business, because it provides the
quantitative basis for arriving at decisions in all matters connected with the operations of
a business. Statistics helps a business to plan production according to the tastes of the
consumers.
Statistics in business can also serve as a tool of management to evaluate performance of
machines and personnel. It also enables the businessman to judge the efficiency of new
production methods by studying relationship between costs and methods of production.
Question 3: Define Frequency Table.
Answer: Frequency is the number of occurrences of a data item. A table that summarizes the
number of cases for each value (or class) of a variable of interest is called a
frequency table.
Question 4: What is Central Tendency?
Answer: In a series of statistical data, the parameter that reflects a central value of the
series is called the central tendency. Central tendency refers to a single value that represents
the whole set of data.
Question 5: Define Average and discuss various types of averages.
Answer: An average can be defined as a central value around which other values of series
tend to cluster. An average is computed to give a concise picture of a large group. By the
use of average complex groups of large numbers are presented in a few significant words or
figures. Averages help in obtaining a picture of universe with the help of sample. Although
sample and the universe differ in size, still their average may be very much identical.
Averages may be classified into three broad types:
1) Mathematical Averages:
a) Arithmetical mean
b) Geometric mean
c) Harmonic average

Santoshsahni833@gmail.com

2) Positional Averages:
a) Mode
b) Median
3) Commercial Averages:
a) Moving average
b) Progressive average
c) Quadratic average
Question 6: What do you understand by the term Range in statistics?
Answer: Range: The range of a data set is the difference between the largest value and the
smallest value.
For example, consider the runs scored by two batsmen A and B. The minimum and maximum
runs in each series give some idea of the variability in the scores.
To obtain a single number for this, we find the difference between the maximum and minimum
values of each series. This difference is called the range of the data. In the case of batsman A,
Range = 117 - 0 = 117, and for batsman B, Range = 60 - 46 = 14. Clearly, Range of A >
Range of B. Therefore, the scores are scattered or dispersed in the case of A, while for B they
are close to each other.
Thus, Range of a series = Maximum value - Minimum value.
Question 7: Define Mean Deviation.
Answer: Mean deviation, also known as average deviation, is the mean of
the absolute amounts by which the individual items deviate from the mean. The following
procedure is usually applied:
1) Calculate the absolute deviation of each item from the mean, ignoring any negative signs.
2) Add all the deviations.
3) Divide the sum of the deviations by the total number of items.
Symbolically, these steps may be summarized as follows:
For a sample of size n, the mean deviation is defined by
$MD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
where $\bar{x}$ is the arithmetic mean of the variable x.
Question 8: What is Skewness?
Answer:
Skewness: Skewness is a measure of the lack of symmetry or degree of distortion from
symmetry exhibited by a normal distribution.
Negative skew: The left tail is longer; the mass of the distribution is concentrated on the
right of the figure. It has a few relatively low values. The distribution is said to be
left-skewed. In such a distribution, the mean is typically lower than the median, which in turn
is lower than the mode (i.e., mean < median < mode), in which case the skewness coefficient is
lower than zero. Example (observations): 1, 1000, 1001, 1002, 1003

Positive skew: The right tail is longer; the mass of the distribution is concentrated on the
left of the figure. It has a few relatively high values. The distribution is said to be
right-skewed. In such a distribution, the mean is typically greater than the median, which in
turn is greater than the mode (i.e., mean > median > mode), in which case the skewness
coefficient is greater than zero. Example (observations): 1, 2, 3, 4, 100
In a skewed (unbalanced, lopsided) distribution, the mean is farther out in the long tail than
is the median. If there is no skewness or the distribution is symmetric like the bell-shaped
normal curve then the mean = median = mode.

Question 9: Discuss Merits and Demerits of Standard Deviation.


Answer:
Merits
(1) The standard deviation is the best measure of variation because of its mathematical
characteristics. It is based on every item of the distribution. Also it is amenable to algebraic
treatment and is less affected by fluctuations of sampling than most other measures of
dispersion.
(2) It is possible to calculate the combined standard deviation of two or more groups. This
is not possible with any other measure.
(3) For comparing the variability of two or more distributions coefficient of variation is
considered to be most appropriate and this is based on mean and standard deviation.
(4) Standard deviation is most prominently used in further statistical work.
Limitations
(1) As compared to other measures it is difficult to compute. However, this does not reduce
the importance of the measure, because of the high degree of accuracy of the results it gives.
(2) It gives more weight to extreme items and less to those which are near the mean. It is
because of the fact that the squares of the deviations which are big in size would be
proportionately greater than the squares of those deviations which are comparatively small.
Question 10: Calculate the arithmetic mean for the following data:

Serial number        1      2      3      4      5
Height of student    14.4   15.2   15.0   15.8   15.5

Answer: Calculation of arithmetic mean

Here n = 5 and the sum of the heights is
14.4 + 15.2 + 15.0 + 15.8 + 15.5 = 75.9

$\bar{X} = \frac{\sum X}{n}$
$\bar{X}$ = 75.9 / 5
= 15.18
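Question11: Find the mean of the first n natural numbers.
Answer: Since $\bar{X} = \frac{\sum x_i}{n}$
Sum of the first n natural numbers = $\sum x_i$
$\sum x_i = 1 + 2 + 3 + \dots + n$
$= \frac{n(n+1)}{2}$
$\bar{X} = \frac{n(n+1)}{2n}$
$\bar{X} = \frac{n+1}{2}$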

Question12: Find the arithmetic mean of the given xi and frequencies.

xi    4    7    10   13   16   19
f     7    10   15   20   25   30

Answer: Calculation of arithmetic mean

xi       f       f*xi
4        7       28
7        10      70
10       15      150
13       20      260
16       25      400
19       30      570
Total    107     1478

$\sum f = 107$, $\sum fx = 1478$

A.M. = $\frac{\sum fx}{\sum f}$
A.M. = 1478 / 107
= 13.81
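
The calculation above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `frequency_mean` is a hypothetical helper, not something defined in these notes.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of a discrete frequency distribution: sum(f*x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    total_f = sum(freqs)
    return total_fx / total_f


# Data from Question 12
x = [4, 7, 10, 13, 16, 19]
f = [7, 10, 15, 20, 25, 30]

print(round(frequency_mean(x, f), 2))  # 13.81
```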
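Question13: Find the mode of the given data.

Family size        1-3   3-5   5-7   7-9   9-11
No. of families    7     8     2     2     1

Answer: The modal class is 3-5, since it has the highest frequency. Here
l = 3 (lower limit of the modal class)
h = 2 (class size)
f1 = 8 (frequency of the modal class)
f0 = 7 (frequency of the preceding class)
f2 = 2 (frequency of the succeeding class)
Mode = $l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$
= 3 + [(8 - 7) / (2*8 - 7 - 2)] * 2
= 3 + 2/7
= 3.29
(Answer)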
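
As a quick check of the grouped-data mode formula, here is a minimal Python sketch applied to the family-size table of Question 13. The function name `grouped_mode` and its parameter names are assumptions made for illustration.

```python
def grouped_mode(lower, width, f_prev, f_modal, f_next):
    """Mode of grouped data: l + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    return lower + (f_modal - f_prev) / (2 * f_modal - f_prev - f_next) * width


# Question 13: modal class 3-5, with l = 3, h = 2, f0 = 7, f1 = 8, f2 = 2
print(round(grouped_mode(3, 2, 7, 8, 2), 4))  # 3.2857
```

The exact value is 3 + 2/7, which is about 3.29 when rounded to two decimal places.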
Question14: Find the mode of the given data.

Age x              5-15   15-25   25-35   35-45   45-55   55-65
No. of people f    6      11      21      23      14      5

Answer: Calculation of mode

Age x     No. of people f    cf
5-15      6                  6
15-25     11                 17
25-35     21                 38
35-45     23                 61
45-55     14                 75
55-65     5                  80

The modal class is 35-45, since it has the highest frequency (23).
Where
l = 35
h = 10
f0 = 21
f1 = 23
f2 = 14
Mode = $l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$
= 35 + [(23 - 21) / (2*23 - 21 - 14)] * 10
= 35 + 20/11
= 36.82

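Question15: Find the M.D. about the mean for the given data.
6, 7, 10, 12, 13, 4, 8, 12
Answer:
$\bar{X} = \frac{\sum x_i}{n}$ = (6 + 7 + 10 + 12 + 13 + 4 + 8 + 12) / 8
= 72 / 8
= 9

xi                   6    7    10   12   13   4    8    12
xi - $\bar{X}$       -3   -2   1    3    4    -5   -1   3
|xi - $\bar{X}$|     3    2    1    3    4    5    1    3

$\sum |x_i - \bar{X}|$ = 22

M.D. = $\frac{\sum |x_i - \bar{X}|}{n}$
= 22 / 8
= 2.75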
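
The three steps of the mean-deviation procedure translate directly into code. The sketch below is illustrative only; `mean_deviation` is a hypothetical helper name.

```python
def mean_deviation(data):
    """Mean of the absolute deviations of the items from their arithmetic mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


# Data from Question 15
print(mean_deviation([6, 7, 10, 12, 13, 4, 8, 12]))  # 2.75
```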
Question16: Why Study Dispersion?
Answer:

A measure of location, such as the mean or the median, only describes the center of the
data, but it does not tell us anything about the spread of the data.
For example, if your nature guide told you that the river ahead averaged 3 feet in depth,
would you want to wade across on foot without additional information? Probably not.
You would want to know something about the variation in the depth.
A second reason for studying the dispersion in a set of data is to compare the spread in
two or more distributions.

Question17: Write a short note on Properties of the Median.


Answer:
1. There is a unique median for each data set.
2. It is not affected by extremely large or small values and is therefore a valuable measure
of central tendency when such values occur.
3. It can be computed for ratio-level, interval-level, and ordinal-level data.
4. It can be computed for an open-ended frequency distribution if the median does not lie
in an open-ended class.
Question18: Discuss Merits and Demerits of arithmetic Mean.
Answer: Merits: Arithmetic mean is widely used in practice because of the following
reasons:
1. It is the simplest to understand and the easiest to compute. Neither the arranging of
data as required for calculating median nor grouping of data as required for calculating
mode is needed while calculating mean.
2. It is affected by the value of every item in the series.
3. It is defined by a rigid mathematical formula with the result that everyone who
computes the average gets the same answer.
Demerits:
1. Arithmetical mean is not always a good measure of central tendency, as, for instance, in
extremely asymmetrical distributions.
2. Since the value of mean depends upon each and every item of the series, extreme
items, i.e., very small and very large items, unduly affect the value of the average.

Question19: Discuss Merits and Demerits of Median.


Answer:
Merits
1. It is especially useful in case of open-end classes since only the position and not the
values of items must be known.
2. In a markedly skewed distribution such as income distribution or price distribution where
the arithmetic mean would be distorted by extreme values the median is especially
useful.
3. The value of median can be determined graphically whereas the value of mean can not
be graphically ascertained.
Demerits
1. For calculating median it is necessary to arrange the data; other averages do not need
any arrangement.
2. The value of the median is affected more by sampling fluctuations than the value
of the arithmetic mean.
Question20: Discuss Merits and Demerits of Mode.
Answer:
Merits
1. It is not affected by extremely large or small items.
2. The value of mode can also be determined graphically whereas the value of mean
cannot be graphically ascertained.
Demerits
1. The value of mode is not based on each and every item of the series.
2. It is not a rigidly defined measure. There are several formulae for calculating the mode,
all of which usually give somewhat different answers.

Long Questions:
Question1: Write short notes on the following.
a. Arithmetical Mean
b. Weighted Average
c. Geometric Mean
d. Harmonic Mean
Answer:

a. Arithmetical Mean: Arithmetic mean or simple mean (represented by putting a bar
above the variable name) is the quantity obtained by dividing the sum of the values of
items (X) in a variable by their number (n), i.e., the number of items.
$\bar{X} = \frac{\sum X}{n}$
b. Weighted Average: In calculating the simple arithmetic mean it is assumed that all items
are equal in importance. This may not always be the case. When items vary in importance
they should be assigned weights in order of their relative importance. To calculate the
weighted arithmetic mean, the value of each item is multiplied by its weight, the products
are summed, and the total is divided by the sum of the weights and not by the number of
items. The result is the weighted arithmetic average.
$\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \dots + w_n X_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum wX}{\sum w}$
Here w1, w2, w3, ... stand for the respective weights of each of the items.
c. Geometric Mean: The geometric mean is defined as the positive nth root of the product of
the n items of a series. If there are two items, we take the square root; if there are three
items, we take the cube root, and so on.
G.M. = $\sqrt[n]{X_1 \times X_2 \times \dots \times X_n}$
d. Harmonic Mean: The harmonic mean is based on the reciprocals of the numbers
averaged. It is defined as the reciprocal of the arithmetic mean of the reciprocals of the
individual observations.
H.M. = $\frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + \dots + \frac{1}{X_n}} = \frac{n}{\sum \frac{1}{X}}$
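
The four averages defined above can be compared side by side. The following is a small Python sketch; the function names and the sample values (2 and 8, with weights 3 and 1) are made up for illustration. `math.prod` requires Python 3.8 or later.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    # Sum of (weight * value) divided by the sum of the weights
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def geometric_mean(xs):
    # Positive nth root of the product of the n items
    return math.prod(xs) ** (1 / len(xs))


def harmonic_mean(xs):
    # Reciprocal of the arithmetic mean of the reciprocals
    return len(xs) / sum(1 / x for x in xs)


values = [2, 8]            # hypothetical items
weights = [3, 1]           # hypothetical weights

print(arithmetic_mean(values))          # 5.0
print(weighted_mean(values, weights))   # 3.5
print(geometric_mean(values))           # 4.0
print(harmonic_mean(values))            # 3.2
```

For positive values that are not all equal, the arithmetic mean is the largest of the three unweighted averages and the harmonic mean the smallest, as the output shows.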
Question2: Write short notes on the following.
a. Mean
b. Median
c. Mode

Answer:
a. Mean: Arithmetic mean or simple mean (represented by putting a bar above the
variable name) is the quantity obtained by dividing the sum of the values of items
(X) in a variable by their number (n), i.e., the number of items.
$\bar{X} = \frac{\sum X}{n}$
b. Median: The median is the value of that item in the set of data which divides the data into
two equal parts, one part consisting of all the values less than it and the other of all the
values greater than it.
Defined in another way, the median is that value of the central tendency which divides the total
frequency into two halves.
When n is odd,
The middle position number = $\frac{n+1}{2}$
When n is even, there are two middle positions,
The middle position numbers = $\frac{n}{2}$ and $\frac{n}{2} + 1$
and the median is the arithmetic mean of the items at these two positions.

c. Mode: A third type of central value or centre of the distribution is the value of
greatest frequency or, more precisely, of greatest frequency density. Graphically, it is the
value on the X-axis below the peak, or highest point, of the frequency curve. This is called
the mode.
Mode = $L_1 + \frac{d_1}{d_1 + d_2} \times C$
where
L1 = lower boundary of the class containing the largest frequency
d1 = difference between the largest frequency and the frequency of the preceding class
d2 = difference between the largest frequency and the frequency of the next class
C = class interval

Question3: Write short notes on the following.


a. Quartiles
b. Deciles
c. Percentiles
d. Moving Average
e. Quadratic Average
Answer:
a. Quartiles: Quartiles are another set of positional measures. Just as the
median divides the data into two equal parts, the quartiles divide the entire set of data
into four equal parts. Three points of division are needed for this, so a data set has three
quartiles, denoted Q1, Q2 and Q3; Q2 is the median.
The general idea remains the same. The data values are arranged either in ascending or
descending order.

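b. Deciles: In a manner similar to the median and quartiles, the data set can be divided into 10
equal parts when arranged either in ascending or descending order. Each point of division is
called a decile. Thus, there are nine deciles, represented as D1, D2, D3, ..., D9.
The interpretation of a decile is similar to that of the median and the quartiles.
c. Percentiles: The data set can also be divided into 100 equal parts, in which case each point
of division is called a percentile. The 99 percentiles are represented by
P1, P2, P3, ..., P99.
A general formula for all the positional measures of central tendency for a
frequency class distribution is given by:
$T_i = L_{T_i} + \frac{\frac{iN}{k} - C}{f} \times h$
where $L_{T_i}$ is the lower boundary of the class containing the measure, N is the total
frequency, k is the number of equal parts (4 for quartiles, 10 for deciles, 100 for
percentiles), C is the cumulative frequency of the preceding class, f is the frequency of the
class containing the measure, and h is the class width.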
d. Moving Average: The moving average is an arithmetic average of data over a period and
is updated regularly by replacing the first item in the average by the new item as it comes
in. It is useful for eliminating the irregularities of a time series and is generally computed
to study the trend.
Example: Suppose the prices for 12 months are given and a three-monthly average is to be
computed. Then the first item in the 3-month moving average would be the average
[(a1+a2+a3)/3], the second item would be the average of the next three
months [(a2+a3+a4)/3] and so on. The last item would be the average [(a10+a11+a12)/3]. When
the next month comes in, a10 would be dropped and a13 would be added, giving
[(a11+a12+a13)/3], and so on.
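
The 3-month moving average described in the example can be sketched in Python as follows. The monthly prices are hypothetical values chosen only to show the mechanics, and `moving_average` is an illustrative helper name.

```python
def moving_average(series, window=3):
    """Average of each run of `window` consecutive items, sliding one item at a time."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical prices a1..a12 for 12 months
prices = [10, 12, 14, 13, 15, 17, 16, 18, 20, 19, 21, 23]

averages = moving_average(prices, window=3)
print(len(averages))   # 10 averages for 12 months
print(averages[0])     # (a1 + a2 + a3) / 3 = 12.0
print(averages[-1])    # (a10 + a11 + a12) / 3 = 21.0
```

Twelve monthly values yield ten 3-month averages, because each average needs three consecutive months.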
e. Quadratic Average: The quadratic mean or average is obtained by taking the square root
of the average of the squares of the items of a series.
$Q_m = \sqrt{\frac{a^2 + b^2 + c^2 + \dots}{n}}$
where Qm = quadratic mean,
a2, b2, c2, ... = squares of the different values, and n = number of items.
The quadratic average is useful when some items have negative values and others positive
values, because in such cases the mean is not very representative. It is also used in
averaging deviations, rather than original values, when the standard deviation is computed.
Question4: Write short notes on the following.
a. Standard Deviation
b. Variance
c. Coefficient of Variation
d. Quartile Deviation
Answer:
a. Standard deviation: The standard deviation of a sample (SD) is similar to the mean
deviation in that it considers the deviation of each X value from the mean. However,
instead of using the absolute values of the deviations, it uses the squares of the
deviations. These are added, divided by n, and the square root extracted.
The formula for the standard deviation SD is
SD = $\sqrt{\frac{\sum (X - \bar{X})^2}{n}}$
b. Variance: Variance is the square of SD and is represented by:
Variance = V = $\frac{\sum (X - \bar{X})^2}{n}$
c. Coefficient of variation: To get an indication of the variation relative
to the mean, we divide the standard deviation by the mean to get the coefficient of
variation. This enables us to compare more easily two groups which have different standard
deviations and means.
Coefficient of variation = $\frac{SD}{\bar{X}}$, usually expressed as a percentage: $\frac{SD}{\bar{X}} \times 100$
d. Quartile deviation: Half of the interquartile range is called the quartile deviation or
semi-interquartile range. Symbolically,
Q.D. = $\frac{Q_3 - Q_1}{2}$
The value of Q.D. gives the average magnitude by which the two quartiles deviate from the
median.
If the distribution is approximately symmetrical, then Md ± Q.D. covers approximately 50% of the
observations and, thus, we can write Q1 = Md - Q.D. and Q3 = Md + Q.D.
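
The standard deviation, variance, and coefficient of variation defined above (all with n in the denominator) can be sketched in Python. The data set is a hypothetical example chosen so that the results are whole numbers; the function names are illustrative.

```python
import math


def variance(data):
    """Mean of the squared deviations from the arithmetic mean (divisor n)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


def standard_deviation(data):
    return math.sqrt(variance(data))


def coefficient_of_variation(data):
    """Standard deviation relative to the mean, expressed as a percentage."""
    mean = sum(data) / len(data)
    return 100 * standard_deviation(data) / mean


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean = 5

print(variance(data))                  # 4.0
print(standard_deviation(data))        # 2.0
print(coefficient_of_variation(data))  # 40.0
```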
Question5: Write a short note on measures of skewness and kurtosis, and on kurtosis vs.
skewness.
Answer: Definition of skewness: For univariate data Y1, Y2, ..., YN, the formula for
skewness is:
skewness = $\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$
where $\bar{Y}$ is the mean, s is the standard deviation, and N is the number of data points. The
skewness for a normal distribution is zero, and any symmetric data should have a skewness
near zero. Negative values for the skewness indicate data that are skewed left and positive
values for the skewness indicate data that are skewed right. By skewed left, we mean that
the left tail is long relative to the right tail. Similarly, skewed right means that the right tail
is long relative to the left tail. Some measurements have a lower bound and are skewed
right. For example, in reliability studies, failure times cannot be negative.
Definition of kurtosis: For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:
kurtosis = $\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^4 / N}{s^4}$
where $\bar{Y}$ is the mean, s is the standard deviation, and N is the number of data points.
The kurtosis for a standard normal distribution is three. For this reason, some sources use
the following definition of kurtosis:
kurtosis = $\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^4 / N}{s^4} - 3$
This definition is used so that the standard normal distribution has a kurtosis of zero. In
addition, with the second definition positive kurtosis indicates a "peaked" distribution and
negative kurtosis indicates a "flat" distribution.
Which definition of kurtosis is used is a matter of convention. When using software to
compute the sample kurtosis, you need to be aware of which convention is being followed.
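
To make the two conventions concrete, here is a minimal Python sketch that computes skewness and both versions of kurtosis from the moment definitions. It assumes the standard deviation s is computed with N (not N - 1) in the denominator; the function name `shape_measures` is hypothetical. The two skewed data sets are the example observations used earlier for negative and positive skew.

```python
import math


def shape_measures(data):
    """Return (skewness, kurtosis, excess kurtosis) using divisor N throughout."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((y - mean) ** 2 for y in data) / n)
    skewness = (sum((y - mean) ** 3 for y in data) / n) / s ** 3
    kurtosis = (sum((y - mean) ** 4 for y in data) / n) / s ** 4
    return skewness, kurtosis, kurtosis - 3


symmetric = [1, 2, 3, 4, 5]
left_skewed = [1, 1000, 1001, 1002, 1003]
right_skewed = [1, 2, 3, 4, 100]

for name, values in [("symmetric", symmetric),
                     ("left-skewed", left_skewed),
                     ("right-skewed", right_skewed)]:
    skew, kurt, excess = shape_measures(values)
    print(name, round(skew, 3), round(kurt, 3), round(excess, 3))
```

The symmetric set has zero skewness, while the other two give a negative and a positive coefficient respectively, in line with the definitions above.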

Examples
[Figure: histograms for 10,000 random numbers generated from a normal, a double
exponential, a Cauchy, and a Weibull distribution.]

Skewness and kurtosis: A fundamental task in many statistical analyses is to
characterize the location and variability of a data set. A further characterization of the
data includes skewness and kurtosis.
Skewness is a measure of symmetry, or more precisely,
the lack of symmetry. A distribution, or data set, is
symmetric if it looks the same to the left and right of the
center point.
Kurtosis is a measure of whether the data are peaked or
flat relative to a normal distribution. That is, data sets
with high kurtosis tend to have a distinct peak near the
mean, decline rather rapidly, and have heavy tails. Data
sets with low kurtosis tend to have a flat top near the
mean rather than a sharp peak. A uniform distribution
would be the extreme case.
The histogram is an effective graphical technique for
showing both the skewness and kurtosis of data set.

Question6: Find the A.M. of the given wages and frequencies
by (1) the assumed mean method and (2) the step deviation method.

Wages x     800   820   860   900   920   980   1000
No. of f    7     14    19    25    20    10    5

Answer: Calculation of A.M.

Let A = 900

Wages x   No. f   D = x - A   f*D      u = D/20   f*u
800       7       -100        -700     -5         -35
820       14      -80         -1120    -4         -56
860       19      -40         -760     -2         -38
900       25      0           0        0          0
920       20      20          400      1          20
980       10      80          800      4          40
1000      5       100         500      5          25
Total     100                 -880                -44

Method (1)
A.M. = $A + \frac{\sum fD}{\sum f}$
A.M. = 900 + (-880 / 100)
= 891.2
Method (2)
A.M. = $A + \frac{\sum fu}{\sum f} \times 20$
A.M. = 900 + (-44 / 100) * 20
= 891.2
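
Both shortcut methods can be verified numerically. The sketch below uses the wages and frequencies from the question, with A = 900 and a step of 20; the function names are hypothetical and chosen for illustration only.

```python
def assumed_mean_method(values, freqs, assumed):
    """A.M. = A + sum(f * D) / sum(f), where D = x - A."""
    total_f = sum(freqs)
    total_fd = sum(f * (x - assumed) for x, f in zip(values, freqs))
    return assumed + total_fd / total_f


def step_deviation_method(values, freqs, assumed, step):
    """A.M. = A + (sum(f * u) / sum(f)) * step, where u = (x - A) / step."""
    total_f = sum(freqs)
    total_fu = sum(f * (x - assumed) / step for x, f in zip(values, freqs))
    return assumed + total_fu / total_f * step


wages = [800, 820, 860, 900, 920, 980, 1000]
workers = [7, 14, 19, 25, 20, 10, 5]

print(round(assumed_mean_method(wages, workers, 900), 1))        # 891.2
print(round(step_deviation_method(wages, workers, 900, 20), 1))  # 891.2
```

Both methods are algebraic rearrangements of the direct formula, so they give the same mean whatever value of A is assumed.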
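Question6: Find the quartiles of the data given below.

Length c    118-126   127-135   136-144   145-153   154-162   163-171   172-180
Leaves f    3         5         9         12        5         4         2

Answer: Calculation of quartiles

Length c    Class boundaries   Leaves f   Cumulative frequency
118-126     117.5-126.5        3          3
127-135     126.5-135.5        5          8
136-144     135.5-144.5        9          17
145-153     144.5-153.5        12         29
154-162     153.5-162.5        5          34
163-171     162.5-171.5        4          38
172-180     171.5-180.5        2          40

The jth quartile is given by
$Q_j = l_1 + \frac{\frac{jN}{4} - C}{f} \times (l_2 - l_1)$
where l1 and l2 are the lower and upper boundaries of the quartile class, f is its frequency,
N is the total frequency, and C is the cumulative frequency of the preceding class.

For Q2 (2N/4 = 20, which falls in the class 144.5-153.5)
l1 = 144.5
l2 = 153.5
f = 12
N = 40
C = 17
So, Q2 = 144.5 + [(20 - 17) / 12] * 9
= 144.5 + 2.25
= 146.75
For Q1 (N/4 = 10, which falls in the class 135.5-144.5)
l1 = 135.5
l2 = 144.5
f = 9
N = 40
C = 8
So, Q1 = 135.5 + [(10 - 8) / 9] * 9
= 137.5
For Q3 (3N/4 = 30, which falls in the class 153.5-162.5)
l1 = 153.5
l2 = 162.5
f = 5
N = 40
C = 29
So, Q3 = 153.5 + [(30 - 29) / 5] * 9
= 155.3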
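
The quartile calculation for grouped data can be sketched as a small Python function that walks through the cumulative frequencies. The name `grouped_quantile` and its argument layout are assumptions for illustration; the data are the class boundaries and frequencies of this question.

```python
def grouped_quantile(boundaries, freqs, fraction):
    """Positional measure of grouped data by linear interpolation within a class.

    boundaries: list of (lower, upper) class boundaries
    fraction:   0.25 for Q1, 0.5 for Q2 (median), 0.75 for Q3
    """
    target = fraction * sum(freqs)
    cumulative = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) * (upper - lower) / f
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


boundaries = [(117.5, 126.5), (126.5, 135.5), (135.5, 144.5), (144.5, 153.5),
              (153.5, 162.5), (162.5, 171.5), (171.5, 180.5)]
leaves = [3, 5, 9, 12, 5, 4, 2]

for q in (0.25, 0.5, 0.75):
    print(q, round(grouped_quantile(boundaries, leaves, q), 2))
```

The same function gives the median of any grouped distribution when called with a fraction of 0.5.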

Question7: Find the M.D. about the mean for the given data.

xi    2    5    6    8    10   12
fi    2    8    10   7    8    5

Answer: Calculation of M.D.

xi      fi     fi*xi    |xi - $\bar{X}$|    fi*|xi - $\bar{X}$|
2       2      4        5.5                 11
5       8      40       2.5                 20
6       10     60       1.5                 15
8       7      56       0.5                 3.5
10      8      80       2.5                 20
12      5      60       4.5                 22.5
Total   40     300                          92

N = $\sum f_i$ = 40
$\bar{X} = \frac{\sum f_i x_i}{N}$
= 300 / 40
= 7.5
M.D. about the mean = $\frac{\sum f_i |x_i - \bar{X}|}{N}$
= 92 / 40
M.D. = 2.3
Question8: Find the median of the given data.

Height in cm      Number of students
Less than 140     4
Less than 145     11
Less than 150     29
Less than 155     40
Less than 160     46
Less than 165     51

Answer: Calculation of median

Class interval   Frequency f   Cumulative frequency cf
0-140            4             4
140-145          7             11
145-150          18            29
150-155          11            40
155-160          6             46
160-165          5             51

Since n = 51, n/2 = 25.5. The cumulative frequency first exceeds 25.5 in the class
145-150, so this is the median class.

Median = $l + \frac{\frac{n}{2} - cf}{f} \times h$
f -> frequency of the median class
l -> lower limit of the median class
cf -> cumulative frequency of the preceding class
h -> class size

Median = 145 + [(25.5 - 11) / 18] * 5
= 145 + (14.5 / 18) * 5
= 145 + 72.5 / 18
= 145 + 4.03
= 149.03
Question9: Find the M.D. about the median for the following data.

xi    3    6    9    12   13   15   21   22
fi    3    4    5    2    4    5    4    3

Answer: Calculation of M.D.

xi      fi     cf     |xi - M|    fi*|xi - M|
3       3      3      10          30
6       4      7      7           28
9       5      12     4           20
12      2      14     1           2
13      4      18     0           0
15      5      23     2           10
21      4      27     8           32
22      3      30     9           27
Total   30                        149

Since N = 30, which is even, the median is the A.M. of the 15th and 16th observations.
Both of these observations lie in the cumulative frequency 18, whose value is 13.
Median M = (13 + 13) / 2
= 13

$\sum f_i |x_i - M|$ = 149
M.D. = $\frac{\sum f_i |x_i - M|}{N}$
M.D. = 149 / 30
= 4.97
Question10: Find the M.D. about the mean for the following data.

Marks obtained     10-20   20-30   30-40   40-50   50-60   60-70   70-80
No. of students    2       3       8       14      8       3       2

Answer: Calculation of M.D.

Marks obtained   fi    Mid-point xi   fi*xi   |xi - $\bar{X}$|   fi*|xi - $\bar{X}$|
10-20            2     15             30      30                 60
20-30            3     25             75      20                 60
30-40            8     35             280     10                 80
40-50            14    45             630     0                  0
50-60            8     55             440     10                 80
60-70            3     65             195     20                 60
70-80            2     75             150     30                 60
Total            40                   1800                       400

$\bar{X} = \frac{\sum f_i x_i}{\sum f_i}$
= 1800 / 40
= 45
M.D. = $\frac{\sum f_i |x_i - \bar{X}|}{\sum f_i}$
M.D. = 400 / 40
= 10
(Answer)
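Question11: Calculate Karl Pearson's coefficient of skewness for the following
distribution.

Monthly Salary (in Rs.)      Number of salesmen
400 but less than 600        4
600 but less than 800        10
800 but less than 1000       19
1000 but less than 1200      12
1200 but less than 1400      4
1400 but less than 1600      1

Answer: Calculation of Karl Pearson's coefficient of skewness

Salary (Rs.)   Mid-point m   f      d = (m - 900)/200   fd     fd2
400-600        500           4      -2                  -8     16
600-800        700           10     -1                  -10    10
800-1000       900           19     0                   0      0
1000-1200      1100          12     +1                  +12    12
1200-1400      1300          4      +2                  +8     16
1400-1600      1500          1      +3                  +3     9
Total                        N=50                       5      63

Coeff. of Sk. = $\frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$

Mean: $\bar{X} = A + \frac{\sum fd}{N} \times i$
A = 900, $\sum fd$ = 5, N = 50, i = 200
$\bar{X}$ = 900 + (5 / 50) * 200
= 920
Mode: Mode = $L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i$
The mode lies in the class 800-1000
L = 800, $\Delta_1$ = 19 - 10 = 9, $\Delta_2$ = 19 - 12 = 7
Mode = 800 + [9 / (9 + 7)] * 200
= 800 + 112.5 = 912.5
S.D. = $\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times i$
= $\sqrt{\frac{63}{50} - \left(\frac{5}{50}\right)^2}$ * 200
= $\sqrt{1.25}$ * 200
= 223.61
Coeff. of Sk. = (920 - 912.5) / 223.61
= +0.034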
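
The whole computation (mean, mode, standard deviation, and the coefficient) can be reproduced with a short Python sketch. It works directly from the class mid-points rather than the step deviations, which gives the same results; the function name `pearson_skewness` is hypothetical, and the code assumes equal class widths and a modal class that is neither the first nor the last class.

```python
import math


def pearson_skewness(lower_limits, width, freqs):
    """Return (mean, mode, sd, coefficient) for a grouped frequency distribution."""
    n = sum(freqs)
    mids = [lo + width / 2 for lo in lower_limits]
    mean = sum(f * m for m, f in zip(mids, freqs)) / n
    sd = math.sqrt(sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / n)

    k = freqs.index(max(freqs))     # index of the modal class
    d1 = freqs[k] - freqs[k - 1]    # excess over the preceding class
    d2 = freqs[k] - freqs[k + 1]    # excess over the succeeding class
    mode = lower_limits[k] + d1 / (d1 + d2) * width

    return mean, mode, sd, (mean - mode) / sd


lower_limits = [400, 600, 800, 1000, 1200, 1400]
salesmen = [4, 10, 19, 12, 4, 1]

mean, mode, sd, sk = pearson_skewness(lower_limits, 200, salesmen)
print(mean, mode, round(sd, 2), round(sk, 3))
```

The small positive coefficient indicates that the salary distribution is only slightly skewed to the right.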

Question12: The median of the following data is 525. Find the values of x and y, if
the total frequency is 100.

Class interval   Frequency
0-100            2
100-200          5
200-300          x
300-400          12
400-500          17
500-600          20
600-700          y
700-800          9
800-900          7
900-1000         4

Answer:

Class interval   f     cf
0-100            2     2
100-200          5     7
200-300          x     7+x
300-400          12    19+x
400-500          17    36+x
500-600          20    56+x
600-700          y     56+x+y
700-800          9     65+x+y
800-900          7     72+x+y
900-1000         4     76+x+y

It is given that n = 100
So, 76 + x + y = 100, i.e., x + y = 24     ... (1)

The median is 525, which lies in the class 500-600.
So l = 500, f = 20, cf = 36 + x, h = 100.
Using the formula:
Median = $l + \frac{\frac{n}{2} - cf}{f} \times h$, we get
525 = 500 + [(50 - 36 - x) / 20] * 100
525 - 500 = (14 - x) * 5
25 = 70 - 5x
5x = 70 - 25 = 45
So, x = 9

Therefore, from (1), we get 9 + y = 24
y = 24 - 9
y = 15

Correlation Analysis
Short Questions:
Question1: What is correlation analysis?
Answer: Correlation is a measure of degree of association between two (or more) variables
in a data set. Thus, if it is known that two variables are highly correlated then one can
predict the value of one variable on the basis of the value of the other variable.
Two variables, say X and Y, are said to be correlated if:
a. Both increase and decrease together. In this case the variables are said to be positively
correlated.
b. One increases while the other decreases. In this case the variables are said to be negatively
correlated.
Question2: What is a scatter diagram?
Answer: The simplest device for determining the relationship between two variables is a special
type of dot chart called a scatter diagram. When this method is used, the given data are
plotted on graph paper in the form of dots, i.e., for each pair of X and Y values we put a
dot and thus obtain as many points as the number of observations. By looking at the scatter of
the various points we can form an idea as to whether the variables are related or not. The
more the plotted points scatter over a chart, the less relationship there is between the two
variables. The more nearly the points come to falling on a line, the higher the degree of
relationship. If all the points lie on a straight line rising from the lower left-hand corner to the
upper right-hand corner, correlation is said to be perfectly positive. On the other hand, if all the
points lie on a straight line falling from the upper left-hand corner to the lower
right-hand corner of the diagram, correlation is said to be perfectly negative.
Question3: State Karl Pearson Coefficient of Linear Correlation.
Answer: The greater the covariance, the greater the correlation between
the two variables. Therefore, covariance can be treated as a measure of correlation between
two variables. However, the magnitude of covariance depends on the units of
measurement. The following expression, derived from covariance, does not depend on the
units of measurement and hence is called the Karl Pearson coefficient of linear correlation or
simply the coefficient of correlation, and is denoted by r.
$r = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$
Here x = (X - $\bar{X}$), y = (Y - $\bar{Y}$), $\sigma_x$ and $\sigma_y$ are the standard deviations of X and Y, and
N = Number of paired observations.
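
As an illustration of the formula, the following Python sketch computes r from deviations about the actual means. The paired observations are hypothetical and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Karl Pearson coefficient: sum(xy) / sqrt(sum(x^2) * sum(y^2)),
    where x and y are deviations from the respective means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical paired observations
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

print(round(pearson_r(X, Y), 4))  # 0.7746
```

Here sum(xy) = 6, sum(x^2) = 10 and sum(y^2) = 6, so r = 6 / sqrt(60), a fairly strong positive correlation.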


Question4: Write Properties of Coefficient of Correlation.
Answer: The Karl Pearson Coefficient of Linear Correlation possesses a number of very
interesting properties as described below.
1. The coefficient of linear correlation always lies between -1 and +1 inclusive.
-1 <= r <= 1
The value 1 suggests perfect positive linear correlation while -1 implies perfect negative
linear correlation. A value of 0 indicates that no linear correlation exists between the variables.
2. The coefficient of correlation is not affected by linear transformation of the variables.
Thus if rXY is the correlation between variables X and Y, and rAB is the correlation
between A and B, then
rAB = rXY
where,
A = aX + b and B = cY + d (with a and c of the same sign)
3. If two variables are not related then they are also not correlated. However, if they are
uncorrelated they may still be related. This follows directly from the fact that the coefficient of
correlation measures the strength of the linear relationship. If the variables are related but not
linearly, then the coefficient of correlation may turn out to be 0 even though they are
related otherwise.
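
Properties 2 and 3 can be demonstrated numerically. The sketch below uses hypothetical data and an illustrative helper `pearson_r`; the constants in the linear transformation are arbitrary positive values.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# Property 2: A = 2X + 3 and B = 5Y - 1 leave r unchanged
A = [2 * x + 3 for x in X]
B = [5 * y - 1 for y in Y]
print(round(pearson_r(X, Y), 4), round(pearson_r(A, B), 4))

# Property 3: Y = X^2 is an exact relationship, but it is not linear
U = [-2, -1, 0, 1, 2]
V = [u * u for u in U]
print(pearson_r(U, V))  # 0.0
```

In the second case V is completely determined by U, yet r is zero because the relationship is symmetric about U = 0 and has no linear component.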
Question5: What is Regression Analysis?
Answer: The statistical tool with the help of which we are in a position to estimate (or predict)
the unknown values of one variable from known values of another variable is called
regression. With the help of regression analysis, we are in a position to find out the
average probable change in one variable given a certain amount of change in another.
Question6: State Spearman's Rank Correlation.
Answer: This measure is especially useful when quantitative measures for certain factors
(such as in the evaluation of leadership ability or the judgment of female beauty) cannot be
fixed, but the individuals in the group can be arranged in order, thereby obtaining for each
individual a number indicating his (her) rank in the group. In any event, the rank correlation
coefficient is applied to a set of ordinal rank numbers, with 1 for the individual ranked first
in quantity, or quality, and so on, to n for the individual ranked last in the group of n
individuals (or n pairs of individuals). Spearman's rank correlation coefficient is defined as:
$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$
where R denotes the rank coefficient of correlation, D refers to the difference of ranks
between paired items in the two series, and N is the number of paired items.
Question7: What is difference between Regression & Correlation?
Answer: Following are the points of difference between correlation and regression:
1. Whereas the correlation coefficient is a measure of the degree of covariability between X and
Y, the objective of regression analysis is to study the nature of the relationship between
the variables so that we may be able to predict the value of one on the basis of
the other. The variable used as the basis of prediction is called the independent variable
and the variable that is to be predicted is referred to as the dependent variable.
2. The cause and effect relation is more clearly indicated through regression analysis than by
correlation. Correlation is merely a tool for ascertaining the degree of relationship
between two variables and, therefore, we cannot say that one variable is the cause and
the other the effect.
Question8: What is relationship between Regression and Correlation?
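Answer: The two coefficients of regression are related to the coefficient of correlation in the
following way.
$b_{xy} \times b_{yx} = r^2$
Or, $r = \pm\sqrt{b_{xy} \times b_{yx}}$
where r takes the sign common to the two regression coefficients.
Hence, the coefficient of correlation is the geometric mean of the two coefficients of regression.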
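
This relationship can be checked numerically. In the sketch below, the regression coefficient of Y on X is taken as sum(xy)/sum(x^2) and that of X on Y as sum(xy)/sum(y^2), with x and y measured as deviations from the means; the data and the function name are hypothetical.

```python
import math


def regression_and_correlation(xs, ys):
    """Return (b_yx, b_xy, r) computed from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = sum_xy / sum_xx          # regression coefficient of Y on X
    b_xy = sum_xy / sum_yy          # regression coefficient of X on Y
    r = sum_xy / math.sqrt(sum_xx * sum_yy)
    return b_yx, b_xy, r


X = [1, 2, 3, 4, 5]   # hypothetical observations
Y = [2, 4, 5, 4, 5]

b_yx, b_xy, r = regression_and_correlation(X, Y)
print(round(b_yx * b_xy, 4), round(r ** 2, 4))  # both 0.6
```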
Question9: What is Partial and Multiple Correlation?
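Answer: When three or more variables are studied, it is a problem of either multiple or
partial correlation. In multiple correlation, three or more variables are studied
simultaneously. For example, when we study the relationship between the yield of rice per acre
and both the amount of rainfall and the amount of fertilizer used, it is a problem of multiple
correlation. In partial correlation, we study the relationship between two variables while
the effect of the other variables is held constant, for example the relationship between yield
and rainfall for a fixed amount of fertilizer.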

Long Questions:
Question1: Define the following:
a. Positive and negative correlation
b. Linear and non-linear correlation
Answer:
a. Positive and Negative Correlation: Whether correlation is positive (direct) or negative
(inverse) depends upon the direction of change of the variables. If both the
variables are varying in the same direction, i.e., if as one variable is increasing the other,
on average, is also increasing, correlation is said to be positive. If, on the other hand, the
variables are varying in opposite directions, i.e., as one variable is increasing, the other is
decreasing or vice versa, correlation is said to be negative.

b. Linear and Non-linear Correlation: The distinction between linear and non-linear
correlation is based upon the constancy of the ratio of change between the variables. If the
amount of change in one variable tends to bear a constant ratio to the amount of
change in the other variable, then the correlation is said to be linear.
Correlation is called non-linear or curvilinear if the amount of change in one variable does not
bear a constant ratio to the amount of change in the other variable.

[Figure: scatter diagrams illustrating linear correlation and non-linear correlation]

Question2: Write a short note on Karl Pearsons coefficient of correlation.


Answer: Of the several mathematical methods of measuring correlation, Karl Pearson's
method, popularly known as the Pearsonian coefficient of correlation, is the most widely used in
practice. The Pearsonian coefficient of correlation is denoted by the symbol r. It is one of
the very few symbols used universally for describing the degree of correlation
between two series. The formula for computing the Pearsonian r is:

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$$

where $x = (X - \bar{X})$ and $y = (Y - \bar{Y})$.

This method is to be applied only when the deviations of items are taken from actual means
and not from assumed means.
The value of the coefficient of correlation obtained by the above formula always lies
between -1 and +1. When r = +1, there is perfect positive correlation between the
variables. When r = -1, there is perfect negative correlation between the variables.
When r = 0, there is no linear relationship between the variables.
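The limiting values of r can be checked with a minimal Python sketch of the deviation formula. The function name pearson_r and the two small series are assumptions made for illustration only.

```python
import math

def pearson_r(X, Y):
    """Pearson's r from deviations taken about the actual means."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    x = [xi - mean_x for xi in X]   # x = X - mean of X
    y = [yi - mean_y for yi in Y]   # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Illustrative series: Y rises (or falls) in exact proportion to X.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

print(round(r_pos, 6))  # 1.0  (perfect positive correlation)
print(round(r_neg, 6))  # -1.0 (perfect negative correlation)
```

The sketch does not guard against a series with no variation, where the denominator would be zero and r is undefined.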
Question 3: Two judges in a beauty competition rank the 12 entries as follows:

X:    1    2    3    4    5    6    7    8    9   10   11   12
Y:   12    9    6   10    3    5    4    7    8    2   11    1

What degree of agreement is there between the judgment of the two judges?
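Answer: Calculation of rank correlation coefficient

X (R1)    Y (R2)    D = (R1 - R2)    D²
   1        12          -11          121
   2         9           -7           49
   3         6           -3            9
   4        10           -6           36
   5         3           +2            4
   6         5           +1            1
   7         4           +3            9
   8         7           +1            1
   9         8           +1            1
  10         2           +8           64
  11        11            0            0
  12         1          +11          121
                               ΣD² = 416

$$R = 1 - \frac{6 \sum D^2}{N^3 - N}$$

Here $\sum D^2 = 416$ and $N = 12$, so

$$R = 1 - \frac{6 \times 416}{12^3 - 12} = 1 - \frac{2496}{1716} = 1 - 1.4545 = -0.4545$$

Since R is negative, the rankings of the two judges tend to run in opposite directions, i.e.,
the judges disagree to a moderate degree.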
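The arithmetic of this answer can be re-checked with a short Python sketch, assuming the rank correlation formula R = 1 - 6ΣD²/(N³ - N) for rankings without ties. The function and variable names are illustrative.

```python
def spearman_rank_r(r1, r2):
    """Rank correlation R = 1 - 6*sum(D^2) / (N^3 - N), assuming no tied ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - (6 * sum_d2) / (n ** 3 - n)

judge_x = list(range(1, 13))                         # ranks given by judge X
judge_y = [12, 9, 6, 10, 3, 5, 4, 7, 8, 2, 11, 1]    # ranks given by judge Y

sum_d2 = sum((a - b) ** 2 for a, b in zip(judge_x, judge_y))
R = spearman_rank_r(judge_x, judge_y)

print(sum_d2)        # 416
print(round(R, 4))   # -0.4545
```

A value near -0.45 points to moderate disagreement between the two rankings.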
Question 4: Write a short note on regression lines.

Answer: If we take the case of two variables X and Y, we shall have two regression lines:
the regression of X on Y and of Y on X. The regression line of Y on X gives the most
probable values of Y for given values of X, and the regression line of X on Y gives the most
probable values of X for given values of Y. Thus we have two regression lines. However,
when there is either perfect positive or perfect negative correlation between the two
variables, the two regression lines will coincide, i.e., we will have only one line. The farther
the two regression lines are from each other, the lesser is the degree of correlation, and the nearer
the two regression lines are to each other, the higher is the degree of correlation. If the variables
are independent, r is zero and the lines of regression are at right angles, i.e., parallel to OX
and OY.
It should be noted that the regression lines intersect each other at the point of the averages of
X and Y, i.e., if from the point where the two regression lines intersect a
perpendicular is drawn to the X-axis, we get the mean value of X, and if from that point
a horizontal line is drawn to the Y-axis, we get the mean value of Y.
Regression equation of Y on X
The regression equation of Y on X is expressed as follows:

$$Y_c = a + bX$$

To determine the values of a and b, the following two normal equations are to be solved
simultaneously:

$$\sum Y = Na + b \sum X$$
$$\sum XY = a \sum X + b \sum X^2$$

Regression equation of X on Y
The regression equation of X on Y is expressed as follows:

$$X_c = a + bY$$

To determine the values of a and b, the following two normal equations are to be solved
simultaneously:

$$\sum X = Na + b \sum Y$$
$$\sum XY = a \sum Y + b \sum Y^2$$
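Solving the two normal equations simultaneously can be sketched in Python as below. This assumes the standard pair of normal equations for a straight line (ΣY = Na + bΣX and ΣXY = aΣX + bΣX²). The function name fit_line and the sample data are hypothetical.

```python
def fit_line(X, Y):
    """Solve the normal equations for the line Yc = a + bX.

    sum(Y)  = N*a + b*sum(X)
    sum(XY) = a*sum(X) + b*sum(X^2)
    """
    n = len(X)
    sum_x, sum_y = sum(X), sum(Y)
    sum_xy = sum(x * y for x, y in zip(X, Y))
    sum_x2 = sum(x * x for x in X)
    # Eliminate a between the two equations to obtain b, then back-substitute.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical data, for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_line(X, Y)   # regression of Y on X
a_xy, b_xy = fit_line(Y, X)   # regression of X on Y: swap the roles of X and Y

print(round(a_yx, 2), round(b_yx, 2))   # 2.2 0.6   -> Yc = 2.2 + 0.6X
print(round(a_xy, 2), round(b_xy, 2))   # -1.0 1.0  -> Xc = -1.0 + 1.0Y
```

The same function serves both regression lines because the normal equations for X on Y are those for Y on X with the two variables interchanged.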
Question 5: From the following data obtain the regression equation of X on Y, and
also that of Y on X.

X:    6    2   10    4    8
Y:    9   11    5    8    7
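Answer: Calculation of regression equations

   X     x = (X - 6)     x²       Y     y = (Y - 8)     y²       xy
   6          0           0       9         +1           1        0
   2         -4          16      11         +3           9      -12
  10         +4          16       5         -3           9      -12
   4         -2           4       8          0           0        0
   8         +2           4       7         -1           1       -2
ΣX = 30    Σx = 0    Σx² = 40   ΣY = 40   Σy = 0    Σy² = 20   Σxy = -26

Mean of X = 30/5 = 6, mean of Y = 40/5 = 8.

Regression equation of X on Y

$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}), \qquad r \frac{\sigma_x}{\sigma_y} = \frac{\sum xy}{\sum y^2} = \frac{-26}{20} = -1.3$$

X - 6 = -1.3(Y - 8)
X - 6 = -1.3Y + 10.4, or X = 16.4 - 1.3Y

Regression equation of Y on X

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}), \qquad r \frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sum x^2} = \frac{-26}{40} = -0.65$$

Y - 8 = -0.65(X - 6)
Y - 8 = -0.65X + 3.9, or Y = 11.9 - 0.65X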
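The two regression coefficients for this data can be recomputed directly with a short Python sketch. It assumes deviations are taken about the actual means, so that the coefficient of X on Y is Σxy/Σy² and the coefficient of Y on X is Σxy/Σx². Note that the Y on X line must be centred on the mean of X (6) and uses Σx² = 40 in the denominator. Variable names are illustrative.

```python
X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]

n = len(X)
mean_x = sum(X) / n   # 30 / 5 = 6
mean_y = sum(Y) / n   # 40 / 5 = 8

x = [xi - mean_x for xi in X]   # deviations of X from its mean
y = [yi - mean_y for yi in Y]   # deviations of Y from its mean

sum_xy = sum(a * b for a, b in zip(x, y))   # -26
sum_x2 = sum(a * a for a in x)              # 40
sum_y2 = sum(b * b for b in y)              # 20

b_xy = sum_xy / sum_y2          # slope of X on Y
b_yx = sum_xy / sum_x2          # slope of Y on X
a_xy = mean_x - b_xy * mean_y   # intercept of X on Y
a_yx = mean_y - b_yx * mean_x   # intercept of Y on X

print(round(a_xy, 2), round(b_xy, 2))   # 16.4 -1.3   -> X = 16.4 - 1.3Y
print(round(a_yx, 2), round(b_yx, 2))   # 11.9 -0.65  -> Y = 11.9 - 0.65X
```

As a consistency check, the product of the two slopes, (-1.3)(-0.65) = 0.845, equals r², which must not exceed 1.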
Question 6: Calculate Karl Pearson's coefficient of correlation from the
following data:

X:    6    8   12   15   18   20   24   28   31
Y:   10   12   15   15   18   25   22   26   28
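Answer:

   X     x = (X - 18)     x²       Y     y = (Y - 19)     y²       xy
   6         -12         144      10         -9           81     +108
   8         -10         100      12         -7           49      +70
  12          -6          36      15         -4           16      +24
  15          -3           9      15         -4           16      +12
  18           0           0      18         -1            1        0
  20          +2           4      25         +6           36      +12
  24          +6          36      22         +3            9      +18
  28         +10         100      26         +7           49      +70
  31         +13         169      28         +9           81     +117
ΣX = 162    Σx = 0    Σx² = 598  ΣY = 171   Σy = 0    Σy² = 338  Σxy = 431

Mean of X = 162/9 = 18, mean of Y = 171/9 = 19.

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} = \frac{431}{\sqrt{598 \times 338}} = \frac{431}{449.58} = +0.959$$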
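The column totals and the final value of r can be verified with a brief Python sketch that rebuilds each column of the working table from the raw data. It assumes deviations are taken from the actual means (18 and 19). Variable names are illustrative.

```python
import math

X = [6, 8, 12, 15, 18, 20, 24, 28, 31]
Y = [10, 12, 15, 15, 18, 25, 22, 26, 28]

n = len(X)
mean_x = sum(X) / n   # 162 / 9 = 18
mean_y = sum(Y) / n   # 171 / 9 = 19

# One tuple per table row: (X, x, x^2, Y, y, y^2, xy)
rows = [
    (Xi, Xi - mean_x, (Xi - mean_x) ** 2,
     Yi, Yi - mean_y, (Yi - mean_y) ** 2,
     (Xi - mean_x) * (Yi - mean_y))
    for Xi, Yi in zip(X, Y)
]

sum_x2 = sum(row[2] for row in rows)   # 598
sum_y2 = sum(row[5] for row in rows)   # 338
sum_xy = sum(row[6] for row in rows)   # 431

r = sum_xy / math.sqrt(sum_x2 * sum_y2)
print(round(r, 3))   # 0.959
```

A value this close to +1 indicates a very high degree of positive correlation between X and Y.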
Question 7: What is the utility of the study of correlation?
Answer: The study of correlation is of immense use in practical life for the following
reasons:
1. Most variables show some kind of relationship. For example, there is a relationship
between price and supply, income and expenditure, etc. With the help of correlation
analysis we can measure in one figure the degree of relationship existing between the
variables.
2. Once we know that two variables are closely related, we can estimate the value of one
variable given the value of the other.
3. Correlation analysis contributes to the understanding of economic behavior, aids in locating the critically
important variables on which others depend, may reveal to the economist the
connections by which disturbances spread, and may suggest to him the paths through which
stabilizing forces become effective.
In business, correlation analysis enables the executive to estimate costs, sales, prices
and other variables on the basis of some other series with which these costs, sales, or
prices may be functionally related. Some guesswork can be removed from decisions
when the relationship between a variable to be estimated and the one or more other
variables on which it depends is close and reasonably invariant.
However, it should be noted that the coefficient of correlation is one of the most widely used
and also one of the most widely abused of statistical measures. It is abused in the sense
that one sometimes overlooks the fact that r measures nothing but the strength of the
linear relationship and that it does not necessarily imply a cause-effect relationship.
4. Progressive development in the methods of science and philosophy has been
characterized by an increase in the knowledge of relationships or correlations. Nature has
been found to be a multiplicity of interrelated forces.
