
Displaying the data

For small samples the raw data can be listed in their entirety in three columns: one for some sort of
identifier; one for the obtained values for X; and one for the corresponding obtained values for Y. If X
and Y are both continuous variables, a scatterplot of Y against X should be used in addition to or
instead of that three-column list. [An interesting alternative to the scatterplot is the "pair-link" diagram
used by Stanley (1964) and by Campbell and Kenny (1999) to connect corresponding X and Y scores.]
If X is a categorical independent variable, e.g., the type of treatment to which participants are randomly
assigned in a true experiment, and Y is a continuous dependent variable, a scatterplot is also appropriate, with values of
X on the horizontal axis and with values of Y on the vertical axis.

For large samples a list of the raw data would usually be unmanageable, and the scatterplot might be
difficult to display with even the most sophisticated statistical software because of coincident or
approximately coincident data points. (See, for example, Cleveland, 1995; Wilkinson, 2001.) If X and
Y are both naturally continuous and the sample is large, some precision might have to be sacrificed by
displaying the data according to intervals of X and Y in a two-way frequency contingency table (cross-
tabulation). Such tables are also the method of choice for categorical variables for large samples.
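The cross-tabulation just described can be sketched in a few lines of plain Python. This is a minimal illustration, not a full statistical routine, and the variable names and data are invented for the example:

```python
from collections import Counter

def crosstab(x, y):
    """Build a two-way frequency table (cross-tabulation) from paired observations."""
    counts = Counter(zip(x, y))
    rows = sorted(set(x))
    cols = sorted(set(y))
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical data: treatment group (X) and outcome category (Y)
treatment = ["A", "A", "B", "B", "A", "B", "A", "B"]
outcome   = ["hi", "lo", "lo", "lo", "hi", "hi", "hi", "lo"]

table = crosstab(treatment, outcome)
for r, row in table.items():
    print(r, row)
```

For continuous X and Y, the same function can be applied after binning each variable into intervals, which is exactly the sacrifice of precision mentioned above.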

How small is small and how large is large? That decision must be made by each individual researcher.
If a list of the raw data gets to be too cumbersome, if the scatterplot gets too cluttered, or if cost
considerations such as the amount of space that can be devoted to displaying the data come into play,
the sample can be considered large.

Summarizing the data

For continuous variables it is conventional to compute the means and standard deviations of X and Y
separately, the Pearson product-moment correlation coefficient between X and Y, and the
corresponding regression equation(s), if the objective is to determine the direction and the magnitude of
the degree of linear relationship between the two variables. Other statistics such as the medians and the
ranges of X and Y, the residuals (the differences between the actual values of Y and the values of Y on
the regression line for the various values of X), and the like, may also be of interest. If a curvilinear
relationship is of equal or greater concern, the fitting of a quadratic or exponential function might be
considered.
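As a sketch of the regression-and-residuals idea, the following plain-Python fragment fits a least-squares line and computes the residuals; the data are invented for illustration. One handy check is that least-squares residuals always sum to (essentially) zero:

```python
def fit_line(x, y):
    """Least-squares regression of y on x: returns intercept a and slope b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Toy data (invented for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b, sum(residuals))
```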

[Note: There are several ways to calculate Pearson's r, all of which are mathematically equivalent.
Rodgers & Nicewander (1988) provided thirteen of them. In an unpublished paper, I (Knapp, 1990)
added six more formulas, including a rather strange-looking one I derived several years prior to that in
an article (Knapp, 1979) on estimating covariances using the incidence sampling technique developed
by Sirotnik & Wellington (1974).]
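Two of the mathematically equivalent formulas for Pearson's r can be compared directly: the deviation-score form (covariance over the product of the sums-of-squares terms) and the raw-score "computational" form. A minimal sketch, with made-up data:

```python
import math

def r_deviation(x, y):
    """Pearson r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_raw_score(x, y):
    """The 'computational' raw-score formula for Pearson r."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.0, 3.0, 3.5, 6.0, 8.0]
# The two formulas are algebraically identical, so they agree to rounding error.
print(r_deviation(x, y), r_raw_score(x, y))
```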

For categorical variables there is a wide variety of choices. If X and Y are both ordinal variables with a
small number of categories (e.g., for Likert-type scales), Goodman and Kruskal's (1979) gamma is an
appropriate statistic. If the data are already in the form of ranks or easily convertible into ranks, one or
more rank-correlation coefficients, e.g., Spearman's rho or Kendall's tau, might be preferable for
summarizing the direction and the strength of the relationship between the two variables.
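Spearman's rho is simply Pearson's r computed on the ranks, which makes it easy to sketch in plain Python (this version assumes no tied scores; the data are invented for illustration):

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# A perfectly monotonic (but nonlinear) relationship: rho is exactly 1.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman_rho(x, y))
```

Note that Pearson's r on these same raw data would be less than 1, because the relationship is monotonic but not linear; rho captures the monotonicity.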

If X and Y are both nominal variables, indexes such as the phi coefficient (which is mathematically
equivalent to Pearson's r for dichotomous variables), relative risk, or Goodman and Kruskal's (1979)
lambda may be equally defensible alternatives.
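The equivalence of phi and Pearson's r for dichotomous variables can be demonstrated numerically. The sketch below computes phi from the cell counts of a hypothetical 2x2 table and compares it with Pearson's r on the corresponding 0/1 scores:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def phi_from_table(a, b, c, d):
    """Phi coefficient from 2x2 cell counts:
             Y=1   Y=0
       X=1    a     b
       X=0    c     d
    """
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical 2x2 table: a=4, b=1, c=2, d=3, written out as 0/1 scores
x = [1] * 5 + [0] * 5
y = [1] * 4 + [0] * 1 + [1] * 2 + [0] * 3
print(phi_from_table(4, 1, 2, 3), pearson(x, y))
```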

For more on displaying data in contingency tables and the summarization of such data, see Simon
(1978), Knapp (1999), and the "Measures for ordered categories" page on Richard Darlington's
website.

Interpreting the data

Determining whether a relationship is strong or weak, statistically significant or not, etc. is part
art and part science. If the data are for a full population or for a "convenience" sample, no matter what
size it may be, the interpretation should be restricted to an "eyeballing" of the scatterplot or
contingency table and the descriptive (summary) statistics. For a probability sample, e.g., a simple
random sample or a stratified random sample, statistical significance tests and/or confidence intervals
are usually required for proper interpretation of the findings, as far as any inference from sample to
population is concerned. But sample size must be seriously taken into account for those procedures or
anomalous results could arise, such as a statistically significant relationship that is substantively
inconsequential. (Careful attention to choice of sample size in the design phase of the study should
alleviate most if not all of such problems.)

An example

The following example has been analyzed and scrutinized by many researchers. It is due to Efron and
his colleagues (see, for example, Diaconis & Efron, 1983). [LSAT = Law School Admission Test; GPA
= Grade Point Average]

Data display(s)

Law School   Average LSAT Score   Average Undergraduate GPA

     1              576                    3.39
     2              635                    3.30
     3              558                    2.81
     4              578                    3.03
     5              666                    3.44
     6              580                    3.07
     7              555                    3.00
     8              661                    3.43
     9              651                    3.36
    10              605                    3.13
    11              653                    3.12
    12              575                    2.74
    13              545                    2.76
    14              572                    2.88
    15              594                    2.96

[Scatterplot: average LSAT score on the vertical axis (560 to 680) against average undergraduate GPA
on the horizontal axis (2.70 to 3.45); each law school is plotted as a *, with a 2 marking two nearly
coincident points.]

[The 2 indicates there are two data points (for law schools #5 and #8) that are very close to one another
in the (X,Y) space. It doesn't clutter up the scatterplot very much, however. Note: Efron and his
colleagues always plotted GPA against LSAT. I have chosen to plot LSAT against GPA. Although
they were interested only in correlation and not regression, if you cared about predicting one from the
other it would make more sense to have X = GPA and Y = LSAT, wouldn't it? ]

Summary statistics
N MEAN STDEV
lsat 15 600.3 41.8
gpa 15 3.0947 0.2435

Correlation of lsat and gpa = 0.776

The regression equation is

lsat = 188 + 133 gpa   (standard error of estimate = 27.34)

Unusual Observations
Obs. gpa lsat Fit Stdev.Fit Residual St.Resid
1 3.39 576.00 639.62 11.33 -63.62 -2.56R

R denotes an obs. with a large st. resid.
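As a check, the summary statistics and regression coefficients reported above can be recomputed from the 15 data pairs in plain Python (a minimal sketch of the usual deviation-score formulas):

```python
import math

# The 15 (LSAT, GPA) pairs from the data display above
lsat = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
gpa = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96]

n = len(lsat)
m_lsat, m_gpa = sum(lsat) / n, sum(gpa) / n
sxy = sum((g - m_gpa) * (l - m_lsat) for g, l in zip(gpa, lsat))
sxx = sum((g - m_gpa) ** 2 for g in gpa)
syy = sum((l - m_lsat) ** 2 for l in lsat)

r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx                    # regression of LSAT on GPA
intercept = m_lsat - slope * m_gpa

print(round(m_lsat, 1), round(m_gpa, 4))   # the reported means
print(round(r, 3))                          # the reported correlation
print(round(intercept), round(slope))       # the reported regression coefficients
```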


Interpretation

The scatterplot looks linear and the correlation is rather high (it would be even higher without
the outlier). Prediction of average LSAT from average GPA should be generally good, but
could be off by about 50 points or so (approximately two standard errors of estimate).

If this sample of 15 law schools were to be "regarded" as a simple random sample of all law
schools, a statistical inference may be warranted. The correlation coefficient of .776 for n =
15 is statistically significant at the .05 level, using Fisher's r-to-z transformation, and the 95%
confidence interval for the population correlation extends from .437 to .922 on the r scale
(see Knapp, Noblitt, & Viragoontavan, 2000), so we can be reasonably assured that in the
population of law schools there is a non-zero linear relationship between average LSAT and
average GPA.
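The confidence interval quoted above can be sketched directly from Fisher's r-to-z transformation: transform r to z = atanh(r), attach a margin of error of z_crit / sqrt(n - 3), and transform the endpoints back to the r scale with tanh. A minimal version in plain Python:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's r-to-z."""
    z = math.atanh(r)                    # r-to-z transformation
    se = 1 / math.sqrt(n - 3)            # standard error on the z scale
    lo = z - z_crit * se
    hi = z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

lo, hi = fisher_ci(0.776, 15)
print(lo, hi)  # roughly .44 to .92, matching the interval reported above
```

Note how asymmetric the interval is around .776 on the r scale; the symmetry holds only on the z scale.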
