Tables (Crosstabs)
/***********************************************************
This example illustrates:
How to create user-defined formats
How to recode continuous variables into ordinal categories
How to generate oneway and twoway tables and basic tests
Procs used:
Proc Format
Proc Means
Proc Freq
Proc Contents
Filename: frequencies.sas
************************************************************/
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
OPTIONS NODATE PAGENO=1 FORMDLIM=" ";
PROC FORMAT;
VALUE AGEFMT 1 = "1:19-29"
2 = "2:30-39"
3 = "3:>39";
The log that results from these Proc Format commands is shown below. These formats are stored in
the Work library, and thus are temporary. In the document that follows, you will see the formats
applied within each procedure by using a format statement. These formats are not automatically attached to
variables, and have to be specified for each procedure.
4 PROC FORMAT;
5 VALUE AGEFMT 1 = "1: Age 19-29"
6 2 = "2: Age 30-39"
7 3 = "3: Age >39";
NOTE: Format AGEFMT has been output.
8
9 VALUE HIAGEFMT 1 = "1: Age >39"
10 2 = "2: Age <=39";
NOTE: Format HIAGEFMT has been output.
11
12 VALUE HICHOLFMT 1 = "1: Chol >=240"
13 2 = "2: Chol <240";
NOTE: Format HICHOLFMT has been output.
14
15 VALUE CHOLCATFMT 1 = "1: Chol <200"
16 2 = "2: Chol 200-239"
17 3 = "3: Chol >=240";
NOTE: Format CHOLCATFMT has been output.
18
19 VALUE PILLFMT 1 = "1: Pill"
20 2 = "2: No Pill";
NOTE: Format PILLFMT has been output.
21
22 VALUE WTFMT 1 = "1: Wt <120kg"
23 2 = "2: Wt 120-139kg"
24 3 = "3: Wt >=140kg";
NOTE: Format WTFMT has been output.
25
26 VALUE HIBMIFMT 1 = "1: BMI>23"
27 2 = "2: BMI<=23";
NOTE: Format HIBMIFMT has been output.
28 RUN;
Now, we create a permanent SAS data set from the raw data file, Werner2.dat. The raw data are read in, then
missing value codes are assigned appropriately, and new variables are created. Note that missing values are
assigned before the new variables are created.
libname b510 "e:\510\";
DATA B510.WERNER;
INFILE "werner2.dat";
INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16
PILL 17-20 CHOL 21-24 ALB 25-28
CALC 29-32 URIC 33-36 PAIR 37-39;
IF HT = 999 then HT = .;
IF WT = 999 then WT = .;
IF ALB = 99 then ALB = .;
IF CALC = 99 then CALC = .;
IF URIC = 99 then URIC = .;
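WTKG = WT*.39; /* pounds to kg; note that 1 lb = 0.4536 kg, and the BMI output shown below reflects the .39 factor used here */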
HTCM = HT*2.54;
BMI = WTKG/(HTCM/100)**2;
IF BMI > 23 then HIBMI = 1;
IF 0<=BMI<=23 then HIBMI = 2;
We use two methods for checking the newly created variables. The simplest one is Proc Means. This tells us
most importantly if we have included all cases in our new variables, and if we have avoided adding data where
there should be none! We will carefully examine the sample size for each original variable, and each new
variable that was created, to be sure they match. This simple check should always be done first!
TITLE "DESCRIPTIVE STATISTICS";
PROC MEANS;
RUN;
DESCRIPTIVE STATISTICS
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------
ID 188 1598.96 1057.09 3.0000000 3519.00
AGE 188 33.8191489 10.1126942 19.0000000 55.0000000
HT 186 64.5107527 2.4850673 57.0000000 71.0000000
WT 186 131.6720430 20.6605767 94.0000000 215.0000000
PILL 188 1.5000000 0.5013351 1.0000000 2.0000000
CHOL 186 236.1505376 42.5555145 155.0000000 390.0000000
ALB 186 4.1112903 0.3579694 3.2000000 5.0000000
CALC 185 9.9621622 0.4795556 8.6000000 11.1000000
URIC 187 4.7705882 1.1572312 2.2000000 9.9000000
PAIR 188 47.5000000 27.2063810 1.0000000 94.0000000
BMI 184 19.0736235 2.6285786 15.2305671 29.6996059
HIBMI 184 1.9021739 0.2978899 1.0000000 2.0000000
AGEGROUP 188 1.9255319 0.8432096 1.0000000 3.0000000
HIAGE 188 1.6808511 0.4673916 1.0000000 2.0000000
HICHOL 186 1.5322581 0.5003051 1.0000000 2.0000000
CHOLCAT 186 2.2634409 0.7783954 1.0000000 3.0000000
WTCAT 186 2.0322581 0.7490767 1.0000000 3.0000000
-------------------------------------------------------------------------------
A second way to check recodes of continuous variables into categories is illustrated below. Basically, you
should check the minimum and maximum value of the original variable in each category of the new categorical
variable to be sure the range of values is specified as you wanted it to be. Do this only after you have checked
the sample sizes by using a simple Proc Means statement, as illustrated above.
TITLE "CHECKING RECODE OF WT INTO WTCAT";
PROC MEANS DATA=B510.WERNER;
CLASS WTCAT;
VAR WT;
FORMAT WTCAT WTFMT.;
RUN;
CHECKING RECODE OF WT INTO WTCAT
The MEANS Procedure
Analysis Variable : WT
N
WTCAT Obs N Mean Std Dev Minimum Maximum
---------------------------------------------------------------------------------------------
1: Wt <120kg 49 49 109.4489796 7.0209841 94.0000000 119.0000000
2: Wt 120-139kg 82 82 128.6097561 5.9103510 120.0000000 138.0000000
3: Wt >=140kg 55 55 156.0363636 17.2969315 140.0000000 215.0000000
---------------------------------------------------------------------------------------------
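The range check described above can also be sketched outside SAS. The following is a minimal Python illustration, not the handout's own code: the cutoffs in wt_category are an assumption inferred from the WTFMT labels (<120, 120-139, >=140), and the weights list is hypothetical data chosen to sit on the category boundaries.

```python
from collections import defaultdict

def wt_category(wt):
    """Recode weight into 1 (<120), 2 (120-139), 3 (>=140).
    Cutoffs are assumed from the WTFMT labels; None stays missing."""
    if wt is None:
        return None
    if wt < 120:
        return 1
    if wt < 140:
        return 2
    return 3

def range_by_category(values, recode):
    """Return {category: (n, minimum, maximum)} of the original values,
    skipping values whose category is missing."""
    groups = defaultdict(list)
    for v in values:
        groups[recode(v)].append(v)
    return {cat: (len(vs), min(vs), max(vs))
            for cat, vs in groups.items() if cat is not None}

# Hypothetical weights, including one missing value
weights = [94, 119, 120, 138, 140, 215, None]
summary = range_by_category(weights, wt_category)
for cat in sorted(summary):
    print(cat, summary[cat])
# 1 (2, 94, 119)
# 2 (2, 120, 138)
# 3 (2, 140, 215)
```

If the minimum or maximum in any category falls outside the intended range, the recode needs to be fixed before any tables are run.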
------------------------------------------------------------------------------------------
ONEWAY FREQUENCIES
The FREQ Procedure
Cumulative Cumulative
PILL Frequency Percent Frequency Percent
---------------------------------------------------------------
1: Pill 94 50.00 94 50.00
2: No Pill 94 50.00 188 100.00
Cumulative Cumulative
WTCAT Frequency Percent Frequency Percent
--------------------------------------------------------------------
1: Wt <120kg 49 26.34 49 26.34
2: Wt 120-139kg 82 44.09 131 70.43
3: Wt >=140kg 55 29.57 186 100.00
Frequency Missing = 2
Cumulative Cumulative
AGEGROUP Frequency Percent Frequency Percent
-----------------------------------------------------------------
1: Age 19-29 74 39.36 74 39.36
2: Age 30-39 54 28.72 128 68.09
3: Age >39 60 31.91 188 100.00
Cumulative Cumulative
HIAGE Frequency Percent Frequency Percent
----------------------------------------------------------------
1: Age >39 60 31.91 60 31.91
2: Age <=39 128 68.09 188 100.00
Cumulative Cumulative
HICHOL Frequency Percent Frequency Percent
------------------------------------------------------------------
1: Chol >=240 87 46.77 87 46.77
2: Chol <240 99 53.23 186 100.00
Frequency Missing = 2
Cumulative Cumulative
CHOLCAT Frequency Percent Frequency Percent
--------------------------------------------------------------------
1: Chol <200 38 20.43 38 20.43
2: Chol 200-239 61 32.80 99 53.23
3: Chol >=240 87 46.77 186 100.00
Frequency Missing = 2
If you have a categorical variable with only two levels, you can use the binomial option to request a 95%
confidence interval for the proportion in the first level of the variable, and a test of the null hypothesis
that this proportion equals a hypothesized value. In the option (P= ) you specify the hypothesized proportion
in the first category of the tabled variable. By default, SAS reports both one-sided and two-sided asymptotic p-values.
TITLE "BINOMIAL TEST";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES HIBMI / BINOMIAL (P=.20);
FORMAT HIBMI HIBMIFMT.;
RUN;
Frequency Missing = 4
Binomial Proportion
for HIBMI = 1:BMI>23
-------------------------------------
Proportion (P) 0.0978
ASE 0.0219
95% Lower Conf Limit 0.0549
95% Upper Conf Limit 0.1408
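The binomial output above can be reproduced by hand with the usual large-sample (Wald) formulas. The sketch below is a Python illustration under stated assumptions: the count of 18 women with HIBMI = 1 is not printed in the output shown here; it is inferred from the Proc Means results (mean HIBMI of 1.9021739 with N = 184 implies 18 ones and 166 twos). The function name wald_ci is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(count, n, level=0.95):
    """Proportion, asymptotic standard error, and Wald confidence limits."""
    p = count / n
    ase = sqrt(p * (1 - p) / n)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    return p, ase, p - z * ase, p + z * ase

# 18 of 184 non-missing cases have HIBMI = 1 (count inferred, see lead-in)
p, ase, lower, upper = wald_ci(18, 184)
print(f"{p:.4f} {ase:.4f} {lower:.4f} {upper:.4f}")
# 0.0978 0.0219 0.0549 0.1408
```

These values agree with the Proportion, ASE, and 95% confidence limits in the output.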
If you wish to obtain an exact binomial test of the null hypothesis, use the exact statement.
Use the chisq option in the tables statement to get a chi-square goodness of fit test, which can be used for
categorical variables with two or more levels. By default SAS assumes that you wish to test the null hypothesis
that the proportion of cases is equal in all categories.
Use the testp= option to specify the proportions that you wish to test, if you don't want to assume equal
proportions in all categories. The total of all the proportions must be 1.0. You can also use percentages, in
which case, the total must add up to 100%. Give the appropriate proportions in the testp= option, specifying
them in order as they apply to each category.
TITLE "CHISQUARE GOODNESS OF FIT TEST";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES CHOLCAT / CHISQ TESTP=(.20 .30 .50);
FORMAT CHOLCAT CHOLCATFMT.;
RUN;
Frequency Missing = 2
Chi-Square Test
for Specified Proportions
-------------------------
Chi-Square 0.8889
DF 2
Pr > ChiSq 0.6412
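The goodness of fit statistic above can be checked by hand from the oneway frequencies for CHOLCAT (38, 61, 87) and the proportions given in testp=. The following Python sketch is an illustration only; gof_chisq is a hypothetical helper, and the closed-form p-value used here holds only when there are 2 degrees of freedom.

```python
from math import exp

def gof_chisq(observed, test_props):
    """Chi-square goodness of fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in test_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Observed CHOLCAT counts and the proportions from TESTP=(.20 .30 .50)
stat, df = gof_chisq([38, 61, 87], [0.20, 0.30, 0.50])

# For df = 2 only, the upper-tail chi-square probability is exp(-x/2)
p_value = exp(-stat / 2)
print(round(stat, 4), df, round(p_value, 4))
# 0.8889 2 0.6412
```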
If you wish to examine the relationship between two categorical variables, you can use Proc Freq. Use the
chisq option to obtain the Pearson chi-square test of independence (or of homogeneity), and use the expected
option to get the expected value in each cell. The commands below can be used to get a cross-tabulation. In this
case, we have a 2 by 2 table, because each categorical variable has two levels. We test the null hypothesis
that HIAGE and HICHOL are independent.
Note that Fisher’s exact test is produced by default for a 2 x 2 table when the chisq option is specified. Read
either the one-sided or the two-sided p-value for Fisher’s exact test; these appear at the bottom of the
corresponding panel of output.
2x2 TABLE
Table of HIAGE by HICHOL
HIAGE HICHOL
Frequency |
Expected |
Percent |
Row Pct |
Col Pct |1: Chol |2: Chol | Total
|>=240 |<240 |
------------+--------+--------+
1: Age >39 | 42 | 18 | 60
| 28.065 | 31.935 |
| 22.58 | 9.68 | 32.26
| 70.00 | 30.00 |
| 48.28 | 18.18 |
------------+--------+--------+
2: Age <=39 | 45 | 81 | 126
| 58.935 | 67.065 |
| 24.19 | 43.55 | 67.74
| 35.71 | 64.29 |
| 51.72 | 81.82 |
------------+--------+--------+
Total 87 99 186
46.77 53.23 100.00
Frequency Missing = 2
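The Expected line in each cell of the table above is the row total times the column total, divided by the overall total. A minimal Python sketch of that calculation follows, using the observed counts from the table; expected_counts is a hypothetical helper name, and the Pearson statistic is computed here only to show how the expected values are used.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts: rows are HIAGE (>39, <=39), columns are HICHOL (>=240, <240)
observed = [[42, 18], [45, 81]]
expected = expected_counts(observed)
for row in expected:
    print([round(e, 3) for e in row])
# [28.065, 31.935]
# [58.935, 67.065]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e
                 for orow, erow in zip(observed, expected)
                 for o, e in zip(orow, erow))
```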
Statistics for Table of HIAGE by HICHOL
The Cochran-Armitage test for trend is appropriate when either the row or column variable is binary (has two
levels) and the other variable is ordinal. It tests whether there is a linear trend in the proportion of subjects
having the binary characteristic. The Mantel-Haenszel test statistic tests for a linear by linear association and
can be used when both row and column variables are ordinal; it always has 1 degree of freedom. In the table
below, both the row and column variables could be considered to be ordinal, because a binary variable can be
thought of as a very simple case of an ordinal variable.
TITLE1 "3X2 TABLE";
TITLE2 "THE ROW VARIABLE IS ORDINAL";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES AGEGROUP*HICHOL / CHISQ TREND NOCOL NOPERCENT;
FORMAT AGEGROUP AGEFMT. HICHOL HICHOLFMT. ;
RUN;
H0: There is no linear trend in the proportion of women with high cholesterol, with increasing age
We are not testing whether the trend is in a positive or negative direction. To see that, simply examine the
proportions of participants with high cholesterol in each age group.
3X2 TABLE
THE ROW VARIABLE IS ORDINAL
The FREQ Procedure
Table of AGEGROUP by HICHOL
AGEGROUP HICHOL
Frequency |
Row Pct |1: Chol |2: Chol | Total
|>=240 |<240 |
-------------+--------+--------+
1: Age 19-29 | 25 | 47 | 72
| 34.72 | 65.28 |
-------------+--------+--------+
2: Age 30-39 | 20 | 34 | 54
| 37.04 | 62.96 |
-------------+--------+--------+
3: Age >39 | 42 | 18 | 60
| 70.00 | 30.00 |
-------------+--------+--------+
Total 87 99 186
Frequency Missing = 2
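To make the trend test concrete, here is a rough Python sketch of the textbook Cochran-Armitage statistic applied to the counts in the table above. It assumes equally spaced scores 1, 2, 3 for the age groups; the sign convention and scoring that SAS uses may differ, so treat this as an illustration of the idea rather than a reproduction of the SAS output. The function name cochran_armitage_z is hypothetical.

```python
from math import sqrt

def cochran_armitage_z(events, totals, scores=None):
    """Textbook Cochran-Armitage trend statistic (approximately standard normal
    under the null hypothesis of no linear trend)."""
    if scores is None:
        scores = list(range(1, len(totals) + 1))  # assumed scores 1, 2, 3, ...
    n = sum(totals)
    p = sum(events) / n  # overall proportion with the characteristic
    mean_score = sum(t * s for t, s in zip(totals, scores)) / n
    numerator = sum(e * (s - mean_score) for e, s in zip(events, scores))
    variance = p * (1 - p) * sum(t * (s - mean_score) ** 2
                                 for t, s in zip(totals, scores))
    return numerator / sqrt(variance)

# High-cholesterol counts and row totals for the three age groups
high_chol = [25, 20, 42]
row_totals = [72, 54, 60]
z = cochran_armitage_z(high_chol, row_totals)
print(round(z, 2))
```

With this scoring, a positive value indicates that the proportion with high cholesterol rises with age group, which matches the row percentages in the table (34.72, 37.04, 70.00).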
Mantel-Haenszel test for a linear association between two ordinal categorical variables:
R x C table, both row and column variables are ordinal
In the next table, both the row and column variable are ordinal. In this case the Mantel-Haenszel test is
appropriate to test for a linear by linear association between the ordinal row variable and the ordinal column
variable. The Pearson Chi-square test is appropriate for testing general association (H0: the row variable is
independent of the column variable) whether there is ordering of the row and/or column variable or not. In a
table like this, which does have ordering of both row and column variables, the Pearson Chi-square test ignores
the ordering of the variables.
TITLE "3X3 TABLE BOTH ORDINAL VARIABLES";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES AGEGROUP*WTCAT / CHISQ nocol nopercent;
FORMAT AGEGROUP AGEFMT. WTCAT WTFMT.;
RUN;
Frequency Missing = 2
Likelihood Ratio Chi-Square 4 11.4638 0.0218
Mantel-Haenszel Chi-Square 1 8.7820 0.0030
Phi Coefficient 0.2513
Contingency Coefficient 0.2437
Cramer's V 0.1777
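The three measures of association at the end of this output are all rescalings of the Pearson chi-square, so any one of them can be recovered from another. The Python sketch below starts from the printed Phi Coefficient (0.2513) and the table dimensions (3 by 3) and uses the standard definitions; measures_from_phi is a hypothetical helper name.

```python
from math import sqrt

def measures_from_phi(phi, n_rows, n_cols):
    """Contingency coefficient and Cramer's V derived from phi.

    phi = sqrt(chi2 / n)
    C   = sqrt(chi2 / (chi2 + n)) = phi / sqrt(1 + phi^2)
    V   = phi / sqrt(min(rows - 1, cols - 1))
    """
    contingency = phi / sqrt(1 + phi ** 2)
    cramers_v = phi / sqrt(min(n_rows - 1, n_cols - 1))
    return contingency, cramers_v

contingency, cramers_v = measures_from_phi(0.2513, 3, 3)
print(round(contingency, 4), round(cramers_v, 4))
# 0.2437 0.1777
```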
title;
proc contents data=b510.cars;
run;
proc format;
value originfmt 1="USA"
2="Europe"
3="Japan";
run;
Output from the SAS log is shown below. Because this format had already been defined in the current run of
SAS, there is a note in the log stating that it is already on the library. If this format were to be resubmitted with
new values, the new values would over-write the old values.
142 proc format;
143 value originfmt 1="USA"
144 2="Europe"
145 3="Japan";
NOTE: Format ORIGINFMT is already on the library.
NOTE: Format ORIGINFMT has been output.
146 run;
We now get simple descriptive statistics for each level of the variable ORIGIN, using a class statement.
proc means data=b510.cars;
class origin;
format origin originfmt.;
run;
options nolabel;
proc means data=b510.cars;
class origin;
format origin originfmt.;
run;
The MEANS Procedure
Country
of N
Origin Obs Variable Label N Mean
--------------------------------------------------------------------------------------
USA 253 MPG Miles per Gallon 248 20.1282258
ENGINE Engine Displacement (cu. inches) 253 247.7134387
HORSE Horsepower 249 119.6064257
WEIGHT Vehicle Weight (lbs.) 253 3367.33
ACCEL Time to Accelerate from 0 to 60 mph (sec) 253 14.9284585
YEAR Model Year (modulo 100) 253 75.5217391
CYLINDER Number of Cylinders 253 6.2766798
Country
of N
Origin Obs Variable Label Std Dev Minimum
---------------------------------------------------------------------------------------------
USA 253 MPG Miles per Gallon 6.3768059 10.0000000
ENGINE Engine Displacement (cu. inches) 98.7799678 85.0000000
HORSE Horsepower 39.7991647 52.0000000
WEIGHT Vehicle Weight (lbs.) 788.6117392 1800.00
ACCEL Time to Accelerate from 0 to 60 mph (sec) 2.8011159 8.0000000
YEAR Model Year (modulo 100) 3.7145843 70.0000000
CYLINDER Number of Cylinders 1.6626528 4.0000000
Country
of N
Origin Obs Variable Label Maximum
-------------------------------------------------------------------------------
USA 253 MPG Miles per Gallon 39.0000000
ENGINE Engine Displacement (cu. inches) 455.0000000
HORSE Horsepower 230.0000000
WEIGHT Vehicle Weight (lbs.) 5140.00
ACCEL Time to Accelerate from 0 to 60 mph (sec) 22.2000000
YEAR Model Year (modulo 100) 82.0000000
CYLINDER Number of Cylinders 8.0000000
We now take a look at a 3 by 5 table (the row variable has 3 levels and the column variable has 5 levels) to see
if there is any association between Country of Origin, and Number of Cylinders. The Pearson chi-square test is
perhaps appropriate here…but let’s see.
Row variable is nominal, column variable is ordinal
Table of ORIGIN by CYLINDER
Frequency|
Expected |
Percent |
Row Pct |
Col Pct | 3| 4| 5| 6| 8| Total
---------+--------+--------+--------+--------+--------+
USA | 0 | 72 | 0 | 74 | 107 | 253
| 2.4988 | 129.31 | 1.8741 | 52.474 | 66.842 |
| 0.00 | 17.78 | 0.00 | 18.27 | 26.42 | 62.47
| 0.00 | 28.46 | 0.00 | 29.25 | 42.29 |
| 0.00 | 34.78 | 0.00 | 88.10 | 100.00 |
---------+--------+--------+--------+--------+--------+
Europe | 0 | 66 | 3 | 4 | 0 | 73
| 0.721 | 37.311 | 0.5407 | 15.141 | 19.286 |
| 0.00 | 16.30 | 0.74 | 0.99 | 0.00 | 18.02
| 0.00 | 90.41 | 4.11 | 5.48 | 0.00 |
| 0.00 | 31.88 | 100.00 | 4.76 | 0.00 |
---------+--------+--------+--------+--------+--------+
Japan | 4 | 69 | 0 | 6 | 0 | 79
| 0.7802 | 40.378 | 0.5852 | 16.385 | 20.872 |
| 0.99 | 17.04 | 0.00 | 1.48 | 0.00 | 19.51
| 5.06 | 87.34 | 0.00 | 7.59 | 0.00 |
| 100.00 | 33.33 | 0.00 | 7.14 | 0.00 |
---------+--------+--------+--------+--------+--------+
Total 4 207 3 84 107 405
0.99 51.11 0.74 20.74 26.42 100.00
Frequency Missing = 1
Because the table contains a high proportion of small expected values (less than 5), SAS gives a warning
message in the output. In this case, we can use a Fisher’s exact test. Here are the commands we first try to use:
WARNING: Computing exact p-values for this problem may require much time and memory. Press the
system interrupt key to terminate exact computations.
NOTE: There were 406 observations read from the data set B510.CARS.
NOTE: PROCEDURE FREQ used (Total process time):
real time 31.02 seconds
cpu time 23.54 seconds
We now resubmit the commands, using instead the Monte Carlo option in SAS (mc). This gives a good
approximation to the Fisher’s exact test p-value, based on 10,000 randomly sampled tables.
The output for these tests is shown below. The appropriate p-value is the line labeled Pr <= P. When
reporting a p-value that is displayed as 0.0000, please use p < 0.0001.
Pr <= P 0.0000
99% Lower Conf Limit 0.0000
99% Upper Conf Limit 4.604E-04
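The upper confidence limit above is not zero even though the estimated p-value is 0.0000. One way to see where it comes from, offered here as an assumption rather than a statement of SAS internals: if none of the N sampled tables is as extreme as the observed one, an upper limit of 1 - alpha^(1/N) is a natural choice, and with N = 10,000 and alpha = 0.01 (a 99% interval) it reproduces the value displayed. The function name upper_limit_when_zero is hypothetical.

```python
def upper_limit_when_zero(n_samples, alpha):
    """Upper confidence limit for a Monte Carlo p-value estimate of zero,
    assuming the limit is taken as 1 - alpha**(1/N)."""
    return 1 - alpha ** (1 / n_samples)

# 10,000 sampled tables and a 99% interval (alpha = 0.01)
upper = upper_limit_when_zero(10_000, 0.01)
print(f"{upper:.3E}")
# 4.604E-04
```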