
QBA Term Report

SECTION A

A New Way to Compute Pearson's r Without Reliance on Cross-Products


Submitted by: Irfan Junejo, Kantesh Rathi, Vali Mohammad
Instructor: Sir Rizwan Ahmed

Table of Contents
INTRODUCTION, BY IRFAN JUNEJO
THE TEACHING/COMPUTING STRATEGY, BY KANTESH RATHI
COMMENTS ABOUT THE FORMULA, BY VALI MOHAMMAD
EXAMPLE, BY IRFAN JUNEJO
CONCLUSION, BY VALI MOHAMMAD
INDEX
REFERENCES


INTRODUCTION
BY IRFAN JUNEJO

The article under review is taken from Teaching Statistics, an international journal for teachers published under the banner of the Teaching Statistics Trust. The Trust has been a registered charity since 1979 and has published the journal three times a year since then. In a recent article in the same journal entitled Correlation: From Picture to Formula, Peter Holmes¹ (2001) points out that scatter diagrams are very useful when introducing students to correlation because they make it easier to judge the relation between the X and Y variables. A scatter diagram is a tool for determining the potential relation between two variables, i.e. how one variable changes with the other. It does not give the exact relation, but it does indicate whether the variables are connected. For example, the scatter diagram below shows no relation between X and Y because the data points do not form any distinguishable pattern.
[Scatter diagram: Data Points plotted on X - Axis against Y - Axis, with no distinguishable pattern]

In the next pair of graphs, however, the diagram on the left shows a positive relation between the two variables, because as the value of one variable increases the other also increases, whereas the scatter diagram on the right represents a negative relation.

¹ Circulation Manager for Teaching Statistics


[Two scatter diagrams of Data Points on X - Axis against Y - Axis: positive relation on the left, negative relation on the right]

In this manner the scatter diagram helps to indicate the relation between the two variables, which is explained further under the next heading. It also helps in making a rough guess of the value of r, which always lies between -1 and +1. This r is the coefficient of correlation. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton. The coefficient of correlation r can also be denoted by ρ. The diagram that follows shows how a scatter diagram helps students make a fairly rough guess of the value of r.

Figure 1 - http://en.wikipedia.org/wiki/Correlation_and_dependence

Holmes states that a typical student can make reasonably fair predictions about the value of r but has difficulty connecting the formula that quantifies Pearson's r with the understanding that comes from observing the scatter diagram. Holmes tries to bridge the gap between Pearson's formula and the scatter diagram in a step-by-step fashion.

In the book Comprehending Behavioral Statistics, Dr. Russell Hurlbert² also tries to bridge the same gap between the scatter diagram and r. Hurlbert first demonstrates how a tic-tac-toe grid can be superimposed on the data of the scatter diagram.

Figure 2

Following this superimposition, Hurlbert argues that the data in the four corners (1, 3, 7 and 9) of the grid have the most significant impact on the sign and magnitude of r. Lastly, Hurlbert computes the Z-score cross products and states that Pearson's correlation is equal to the mean of these ZxZy values. Here is how to compute the ZxZy values:

X       Zx       Y       Zy       ZxZy
25      0.15     80      0.00     0.00
14     -1.53     98      1.10    -1.68
33      1.38     50     -1.84    -2.53
28      0.61     82      0.12     0.08
20     -0.61     90      0.61    -0.38

ΣX = 120, Mean = 24, SD = 6.54
ΣY = 400, Mean = 80, SD = 16.3
ΣZxZy = -4.51
r = -0.90

The Z scores are computed by subtracting the mean³ from each cell value and dividing the result by the standard deviation⁴. For example, the Z score for the first class size is (25 - 24)/6.54 = 0.15.
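The Z-score calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `z_scores` is hypothetical, and the sketch assumes the population standard deviation (dividing by N), which is what reproduces the SD values of 6.54 and 16.3 in the table.

```python
from math import sqrt

# X and Y values taken from the table above.
x = [25, 14, 33, 28, 20]
y = [80, 98, 50, 82, 90]

def z_scores(values):
    """Return Z scores using the population standard deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

zx = z_scores(x)
zy = z_scores(y)

# Pearson's r as the mean of the Z-score cross products.
r = sum(a * b for a, b in zip(zx, zy)) / len(x)

print([round(z, 2) for z in zx])
print([round(z, 2) for z in zy])
print(round(r, 3))
```

Working with unrounded Z scores gives r close to -0.904, which agrees with the tabulated -0.90 to two decimal places.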

² Professor of psychology, University of Nevada
³ For a data set, the mean is the sum of the observations divided by the number of observations.
⁴ Standard deviation shows how much variation there is from the "average" (mean).


For the value of r we sum the ZxZy values and divide the sum by the total number of observations, i.e. 5 in this case. Both Holmes and Hurlbert try to connect the scatter diagram with the correlation coefficient by making use of Z scores. Although the sum of the cross products forms the numerator of r, the article under review argues that there is a better way to show that the formula for r truly does quantify the qualitative understanding one gets from looking at the scatter diagram. The advantage of this alternative approach is that it does not rely on cross products of Z scores; instead it involves the creation of separate direct and indirect components of each score. These components, it is argued, are far more in accord with the intuitive feel that one gets when looking at a scatter diagram.

THE TEACHING/COMPUTING STRATEGY


BY KANTESH RATHI

The best way to show the direct and indirect influence of each data point is straightforward, and it gives a close understanding of the nature and strength of the relationship between the two (mutually dependent) variables under investigation.

The procedure can be understood in four steps.

First, convert all the given scores on X and on Y into Z scores. This conversion does not affect the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC and typically denoted by r), which is a measure of the correlation (linear dependence) between two variables X and Y and takes a value between -1 and +1 inclusive. Students will appreciate this important feature of the coefficient if asked whether measuring temperature in Centigrade or Fahrenheit, or height in meters, centimeters, feet or inches, should affect the value of the correlation coefficient.

Second, draw a scatter diagram of the Z scores. Inside this scatter diagram, draw a line at a 45 degree angle that rises from lower left to upper right and passes through the centroid (the point of means, which for Z scores is the origin). This line represents a positive (direct) relationship; label it D. Then draw another line through the centroid at 90 degrees to D. This second line represents a negative (indirect) relationship; label it I.

[Two scatter diagrams, labeled Negative and Positive]

Third, determine the projection of each data point onto the D and I lines, measure the distance from each projection to the centroid, and denote the distance along D as d (direct) and the distance along I as i (indirect). The distance d indicates the data point's direct influence on r and the distance i indicates its indirect influence on r.

Finally, once the d and i distances are known, square them, sum the squares, and put the sums into the following formula to obtain the Pearson product-moment correlation coefficient:

$$ r = \frac{\sum d^2 - \sum i^2}{\sum d^2 + \sum i^2} $$
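The four steps can be sketched in Python as follows. Several things here are assumptions rather than part of the report: the data are borrowed from the table in the Introduction, the helper names `z_scores` and `project` are hypothetical, the direction taken as positive along I is arbitrary, and r is taken to be the difference of the summed squared distances divided by their total.

```python
from math import sqrt

# Assumption: reuse the five X, Y pairs from the Introduction's table.
x = [25, 14, 33, 28, 20]
y = [80, 98, 50, 82, 90]

def z_scores(values):
    """Step 1: convert raw scores to Z scores (population SD, divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def project(zx_k, zy_k):
    """Steps 2 and 3: signed distances from the centroid (the origin in
    Z-score space) to the point's projections on the lines D and I."""
    d = (zx_k + zy_k) / sqrt(2)  # along D, unit vector (1, 1)/sqrt(2)
    i = (zy_k - zx_k) / sqrt(2)  # along I, unit vector (-1, 1)/sqrt(2); sign choice is arbitrary
    return d, i

pairs = [project(a, b) for a, b in zip(z_scores(x), z_scores(y))]

# Step 4: square the distances, sum them, and form the ratio.
sum_d2 = sum(d ** 2 for d, _ in pairs)
sum_i2 = sum(i ** 2 for _, i in pairs)
r = (sum_d2 - sum_i2) / (sum_d2 + sum_i2)

print(round(r, 3))
```

Because both distances are squared before they are summed, reversing the positive direction on I would leave r unchanged.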


COMMENTS ABOUT THE FORMULA
BY VALI MOHAMMAD

Looking at the formula above, we can see that r will be positive when the d distances are large and the i distances are small. This happens when the data points form a compact band running from the lower left to the upper right of the scatter diagram, so that the i distances are minute and much smaller than the d distances. On the other hand, r will be negative when the data points cluster along the line I, in other words perpendicular to D (as can be seen from the figure above).

Another unique feature is that Σd and Σi will both equal zero no matter what the data are or what the relationship between X and Y is. It would therefore be a waste of time to calculate Σd and Σi, just as it is useless to calculate the sum of the unsquared deviation scores when measuring dispersion in the univariate case. Anyone wondering whether r could be found from the d's and i's themselves, by dividing the difference between their sums, has the answer in this statement: the distances must be squared before they are summed.

Looking at the diagram above, some people may form the opinion that it is quite difficult to find the values of d² and i², as these are measured along the perpendicular axes D and I rather than along the vertical and horizontal axes labeled zy and zx. A closer look shows, however, that d² and i² are simple functions of zx and zy:

$$ d^2 = \frac{(z_x + z_y)^2}{2}, \qquad i^2 = \frac{(z_x - z_y)^2}{2} $$

With the formulas above we can easily calculate d² and i² and plug them into the formula for r. Keep in mind that r requires d² and i², not d and i, so there is no need to spend time on the unsquared distances. It is nevertheless possible to calculate d and i first and then square them. To do so, we must first determine the sign (positive or negative) that each d or i possesses, using a set of rules. The sign of d for any data point is positive if that data point's Z scores meet any one of these three conditions: (a) both zx and zy are positive, (b) zy is positive, zx is negative, and zy > |zx|, or (c) zx is positive, zy is negative, and zx > |zy|. If none of these conditions holds, d is negative. A similar set of rules can be applied to determine the sign of the i values. So it is much easier to calculate d² and i² directly than to calculate d and i.
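The claim that the unsquared distances always sum to zero, and the sign rules for d, can both be checked with a short derivation. It is a sketch under one assumption not stated in the text: that the signed distances are the projections d = (zx + zy)/√2 and i = (zy - zx)/√2, with the positive direction on I chosen arbitrarily. Since Z scores always sum to zero,

$$
\sum d = \frac{1}{\sqrt{2}}\left(\sum z_x + \sum z_y\right) = 0,
\qquad
\sum i = \frac{1}{\sqrt{2}}\left(\sum z_y - \sum z_x\right) = 0 .
$$

Under the same assumption the sign of d is simply the sign of the sum of the two Z scores,

$$
d > 0 \iff z_x + z_y > 0 ,
$$

and conditions (a), (b) and (c) are exactly the three ways in which that sum can be positive.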

EXAMPLE
BY IRFAN JUNEJO


The above table contains five pairs of X and Y scores together with their deviation scores x and y (each value minus its mean), their Z scores, and the values of d² and i². Based on the data in this table, Pearson's r is computed three times. First, r is computed with the traditional formula involving the cross products of Z scores. Next, r is computed using the cross products of deviation scores. Lastly, r is computed using the formula given in the research paper.

It is evident that all three approaches yield the same value for the coefficient of correlation.
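The agreement between the three approaches can be checked numerically. The sketch below is illustrative only: the example's own table is not reproduced in this text, so the five X, Y pairs from the Introduction are used as a stand-in, and the variable names are hypothetical. It also assumes d² = (zx + zy)²/2 and i² = (zx - zy)²/2.

```python
from math import sqrt, isclose

# Assumption: stand-in data taken from the Introduction's table.
x = [25, 14, 33, 28, 20]
y = [80, 98, 50, 82, 90]
n = len(x)

# Deviation scores: each value minus its mean.
mean_x, mean_y = sum(x) / n, sum(y) / n
dev_x = [v - mean_x for v in x]
dev_y = [v - mean_y for v in y]

# Z scores, using the population standard deviation (divide by N).
sd_x = sqrt(sum(v ** 2 for v in dev_x) / n)
sd_y = sqrt(sum(v ** 2 for v in dev_y) / n)
zx = [v / sd_x for v in dev_x]
zy = [v / sd_y for v in dev_y]

# 1) Mean of the Z-score cross products.
r_z = sum(a * b for a, b in zip(zx, zy)) / n

# 2) Cross products of deviation scores.
r_dev = sum(a * b for a, b in zip(dev_x, dev_y)) / sqrt(
    sum(a ** 2 for a in dev_x) * sum(b ** 2 for b in dev_y)
)

# 3) Squared direct and indirect distances.
d2 = [(a + b) ** 2 / 2 for a, b in zip(zx, zy)]
i2 = [(a - b) ** 2 / 2 for a, b in zip(zx, zy)]
r_di = (sum(d2) - sum(i2)) / (sum(d2) + sum(i2))

print(round(r_z, 3), round(r_dev, 3), round(r_di, 3))
```

The three values differ only by floating-point rounding, which is why the comparison is best made with a tolerance rather than exact equality.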

CONCLUSION
BY VALI MOHAMMAD

In conclusion, the formula above is a good and clear way to calculate r, because students can rely on their intuition and direct skills rather than getting involved in intensive calculations with Z scores and deviations. Not only is the formula a solid basis for finding r, but the scatter diagram is also genuinely easy to follow once understood. The total variability in any scatter diagram may be calculated by:
1) Finding the linear distance between each data point and the center
2) Squaring the distances
3) Adding those values up
This method resembles the Pythagorean theorem and also appears in the denominator of the formula. To understand the denominator further, we can take d and i to be two opposing forces pulling the N data points away from the center of the graph, so that the denominator is the total force acting on the points. The numerator, on the other hand, can be seen as the net force that remains once the opposing forces have offset each other. This ratio of net force to total force is one way of looking at how the formula actually works.
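The net-force and total-force reading can be written out as a short derivation. It is a sketch that assumes d² = (zx + zy)²/2 and i² = (zx - zy)²/2, and Z scores computed with the population standard deviation (dividing by N), so that Σzx² = Σzy² = N. By the Pythagorean theorem the squared distance of each point from the center splits into its two components, which gives the denominator:

$$
d^2 + i^2 = z_x^2 + z_y^2
\quad\Rightarrow\quad
\sum d^2 + \sum i^2 = \sum z_x^2 + \sum z_y^2 = 2N .
$$

The numerator is what remains after the two components offset each other:

$$
d^2 - i^2 = 2 z_x z_y
\quad\Rightarrow\quad
\sum d^2 - \sum i^2 = 2 \sum z_x z_y .
$$

Dividing one by the other recovers the traditional mean of the Z-score cross products:

$$
\frac{\sum d^2 - \sum i^2}{\sum d^2 + \sum i^2} = \frac{2\sum z_x z_y}{2N} = \frac{\sum z_x z_y}{N} = r .
$$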


INDEX
C
centroid 7
Correlation 3

F
Francis Galton 4

H
Hurlbert 5, 6

K
Karl Pearson 4

P
Pearson 6
Pythagoras 10

S
scatter diagram 3, 4, 5, 6, 8, 10

T
Teaching Statistics 3
Teaching Statistics Trust 3

Z
Z scores 5, 6, 10

REFERENCES

1. http://en.wikipedia.org/wiki/Karl_Pearson
2. http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
3. http://wps.prenhall.com/wps/media/objects/2497/2557809/MEDIA/Ch3/learnmorech3.pdf
4. Research Methods and Statistics: A Critical Thinking Approach by Sherri L. Jackson
5. Statistics for People Who (Think They) Hate Statistics: Excel 2007 Edition by Neil J. Salkind
