ABSTRACT
SAS® is the de facto standard programming language for statistical analysis in the pharmaceutical industry. Its mainstay
is the generation of tables, listings and graphs that follow the rules and instructions described in the statistical analysis
plan, using data stored in SAS datasets that are usually derived from a clinical data management database. This information is
collected on a case report form (CRF) or in an electronic data capture (EDC) system, processed through a database for query
resolution against the source documents at the site, and sent to the statisticians and SAS programmers for analysis.
INTRODUCTION
The purpose of this paper is to provide an overview of the table and listing generation process as it applies in the SAS
pharmaceutical programming arena. It is not presented as the only method for table generation. It is an attempt to show the
fundamental data flow, from data capture to presentation, and the ways SAS is used in that presentation.
For the purpose of this paper, assume the protocol describes a randomized trial in which patients are enrolled in equal
numbers into one of two arms: a compound called Treatment X or Placebo (a sugar pill). Assume also that the hypothesis
to be tested is that patients can be enrolled into this trial equally.
Figure 1 presents one particular page of a CRF. The page collects demographic, or patient characteristic, data.
A person enrolled in a clinical trial will have information such as this collected to determine the
homogeneity of the patient or subject population enrolled in the trial. A person at the investigational site will complete
this CRF page. The data will then be entered into a database to create an electronic version of the paper
information.
A Statistical Analysis Plan (SAP) is a document describing the planned analysis that will be performed on the
electronic CRF data. The following represents some sample SAP text:
The purpose of this study is to compare study drug X with placebo in demographic information for baseline testing.
Subjects will be enrolled in a 1:1 ratio in this 2 arm open-label trial to see what baseline effects, if any, occur.
Descriptive statistics will be presented for all parameters collected with no inferential analysis being performed.
Statistics for continuous parameters (age) will be presented by N, mean, median, minimum and maximum values.
Age will be calculated from the difference of the study randomization date and the date of birth. Categorical
parameters (gender, ethnicity) will have groupings presented as counts. All information collected will be listed.
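The age rule in the sample SAP text (the difference between the randomization date and the date of birth) could be sketched in SAS as follows. This is a minimal illustration, not the study's actual derivation: the variable names dob and randdt are hypothetical, the date of birth is that of one sample subject, and the randomization date is invented for the example.

```sas
data age_check;
  /* Hypothetical inputs: randdt is invented for illustration only. */
  dob    = '23JUL1953'd;
  randdt = '30SEP2004'd;

  /* Count month boundaries crossed, then subtract one month when the
     day of month of the birthday has not yet been reached. */
  months = intck('month', dob, randdt) - (day(randdt) < day(dob));

  /* Age in completed years. */
  age = floor(months / 12);

  format dob randdt date9.;
  put dob= randdt= age=;
run;
```

With these inputs the step writes age=51 to the log. Storing the result in completed years is itself an assumption; the SAP text does not say how the difference is rounded.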
As the SAP text is written, it is very common for the statistician to create mock data displays: tables and
listings demonstrating how the analysis described in the SAP will be presented. The mock describes the layout of the
data in listings and the statistics performed in the table. Figures 2 and 3 show a mock table and a mock listing
based on the CRF data to be collected and the sample SAP text stated above. In pharmaceutical SAS
programming, a listing supporting a table is almost always produced. One listing can support many tables.
Age[1] (yrs)
  n             n            n            n
  Mean          x.x          x.x          x.x
  Median        x.x          x.x          x.x
  Min, Max      x.x, x.x     x.x, x.x     x.x, x.x
Sex
  Male          n (%)        n (%)        n (%)
  Female        n (%)        n (%)        n (%)
Race
  African       n (%)        n (%)        n (%)
  Asian         n (%)        n (%)        n (%)
  Caucasian     n (%)        n (%)        n (%)
  Hispanic      n (%)        n (%)        n (%)
  Other         n (%)        n (%)        n (%)
Percentages are based on the total number of subjects in each treatment group.
[1] Based on date of collection.
Mock Listing 1
Demographics
Intent-to-Treat Subjects
             Site/
             Subject     Date of    Age
Treatment    Number      Birth      (yrs)    Gender    Ethnic Origin
Viewing the dataset in the SAS viewer with variable labels turned off shows the following captured information:
siteno  subjid  dminit  dmdob     dmdobd  dmaged  dmgndr  dmeth      dmethsp          dtrt
0001    0009    MTW     19530723          51      Male    Caucasian                   placebo
0001    0025    SST     19810828          23      Female  Asian                       x
0003    0023    SIN     19590521          45      Male    Asian
0004    0047    NAP     19650312          39      Male    Caucasian                   x
0004    0057    QAA     19841218          20      Female  Other      Angloindian
0006    0008    TSC     19721001          32      Female  Asian                       x
0008    0040    ECN     19571109          47      Female  African                     placebo
0012    0003    SAV     19580527          46      Male    Other      American Indian  placebo
0021    0065    TTM     19480531          56      Male    Hispanic                    placebo
0033    0005    ADC     19100210          94      Female  Hispanic                    x
The CRF is often annotated with the SAS variable names to aid programming. As a next step, the
programmer can annotate the mock tables and listings with the SAS variables that will be used to present
each part of the data. Mock annotation provides the following benefits:
It tells other people which variables are being presented.
It gives the programmer a tool for stating which derived (calculated) variables will need to be presented.
It records a plan of action before any SAS code is written.
Figures 4 and 5 represent the annotated mocks for the study.
Sex sex_d
  Male          n (%)        n (%)        n (%)
  Female        n (%)        n (%)        n (%)
Race ethn_d
  African       n (%)        n (%)        n (%)
  Asian         n (%)        n (%)        n (%)
  Caucasian     n (%)        n (%)        n (%)
  Hispanic      n (%)        n (%)        n (%)
  Other         n (%)        n (%)        n (%)
Percentages are based on the total number of subjects in each treatment group.
[1] Based on date of collection.
Mock Listing 1
Demographics
Intent-to-Treat Subjects DERIVED.itt=1
             Site/
             Subject     Date of    Age
Treatment    Number      Birth      (yrs)    Gender    Ethnic Origin
trt_d        sitesubj    dob_d      dmaged   dmgndr    dmeth
With this information, programming can now begin. It is important to obtain (or create) as many of these
documents as possible when programming. This gives the programmer all the information needed to generate the tables and
listings correctly the first time. The pharmaceutical industry is a regulated industry. As such, a programmer should
always be able to describe the methodology and documentation for generating summarized information.
One approach is for programmers to store their calculated fields in a dataset prior to table and listing
generation. These datasets are called derived (as in derived from raw) and allow others to see the calculations prior to their
display in the output files (tables and listings). It is easier to store an age calculation in a dataset than to duplicate it
in the programs producing the tables and listings. The following program creates a derived dataset called DERIVED.
*******************************************;
* Title: Derived Dataset for Presentation
* Program: derived.sas
* Author: Shia Thomas
* Date: September 30, 2004
********************************************;
data data.derived;
set data.testdemo;
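The program listing above shows only its opening statements. A minimal sketch of derivations that would be consistent with the variables and values shown for the derived dataset in this paper might look like the following. Everything in it is an assumption inferred from those values: the coding rules (for example, itt meaning that a treatment was assigned), the choice to treat dmdob as a character YYYYMMDD string, and the use of WORK datasets in place of the DATA libref. The mitt variable is left out because the paper does not state its rule.

```sas
/* Four sample rows transcribed from the raw dataset display. */
data testdemo;
  infile datalines dsd dlm=',' truncover;
  length siteno subjid $4 dminit $3 dmdob $8 dmgndr $6 dmeth $10
         dmethsp $20 dtrt $7;
  input siteno subjid dminit dmdob dmaged dmgndr dmeth dmethsp dtrt;
  datalines;
0001,0009,MTW,19530723,51,Male,Caucasian,,placebo
0001,0025,SST,19810828,23,Female,Asian,,x
0003,0023,SIN,19590521,45,Male,Asian,,
0004,0057,QAA,19841218,20,Female,Other,Angloindian,
;
run;

data derived;
  set testdemo;

  /* Assumed rule: a subject is intent-to-treat when a treatment is recorded. */
  itt = (not missing(dtrt));

  /* Assumed numeric codes: 1 = Treatment X, 2 = Placebo. */
  if      dtrt = 'x'       then trt_d = 1;
  else if dtrt = 'placebo' then trt_d = 2;

  /* Assumed numeric codes: 1 = Male, 2 = Female. */
  if      dmgndr = 'Male'   then sex_d = 1;
  else if dmgndr = 'Female' then sex_d = 2;

  /* Assumed numeric codes following the row order of the mock table. */
  select (dmeth);
    when ('African')   ethn_d = 1;
    when ('Asian')     ethn_d = 2;
    when ('Caucasian') ethn_d = 3;
    when ('Hispanic')  ethn_d = 4;
    when ('Other')     ethn_d = 5;
    otherwise          ethn_d = .;
  end;

  /* Convert the character date of birth to a SAS date value. */
  dob_d = input(dmdob, yymmdd8.);
  format dob_d mmddyy10.;
run;

proc print data=derived noobs;
run;
```

Keeping these rules in one derived dataset, rather than repeating them in each table and listing program, is the design choice the paper argues for: a reviewer can check each calculation once, before it reaches any output.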
The following is the SAS viewer display of the derived dataset, based on the mock annotations and the SAS program
described above:
siteno  subjid  dminit  dmdob     dmdobd  dmaged  dmgndr  dmeth      dmethsp          dtrt     itt  trt_d  sex_d  mitt  ethn_d  dob_d
0001    0009    MTW     19530723          51      Male    Caucasian                   placebo  1    2      1      1     3       7/23/1953
0001    0025    SST     19810828          23      Female  Asian                       x        1    1      2      0     2       8/28/1981
0003    0023    SIN     19590521          45      Male    Asian                                0           1      0     2       5/21/1959
0004    0047    NAP     19650312          39      Male    Caucasian                   x        1    1      1      1     3       3/12/1965
0004    0057    QAA     19841218          20      Female  Other      Angloindian               0           2      0     5       12/18/1984
0006    0008    TSC     19721001          32      Female  Asian                       x        1    1      2      0     2       10/1/1972
0008    0040    ECN     19571109          47      Female  African                     placebo  1    2      2      0     1       11/9/1957
0012    0003    SAV     19580527          46      Male    Other      American         placebo  1    2      1      1     5       5/27/1958
0021    0065    TTM     19480531          56      Male    Hispanic                    placebo  1    2      1      1     4       5/31/1948
0033    0005    ADC     19100210          94      Female  Hispanic                    x        1    1      2      0     4       2/10/1910
From the derived dataset one can now write code to produce the table and listing. The following shows the final
output from these programs. The output can be created through many of SAS’s procedures or through a DATA _NULL_
step.
Age[1] (yrs)
  n             4            4            8
  Mean          47.0         50.0         48.5
  Median        35.5         49.0         46.5
  Min, Max      23, 94       46, 56       23, 94
Sex
  Male          1 (25%)      3 (75%)      4 (50%)
  Female        3 (75%)      1 (25%)      4 (50%)
Race
  African       -            1 (25%)      1 (12.5%)
  Asian         2 (50%)      -            2 (25.0%)
  Caucasian     1 (25%)      1 (25%)      2 (25.0%)
  Hispanic      1 (25%)      1 (25%)      2 (25.0%)
  Other         -            1 (25%)      1 (12.5%)
Percentages are based on the total number of subjects in each treatment group.
[1] Based on date of collection.
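The statistics in the table above could be computed along the following lines. This is a sketch using standard procedures, not the program that produced the paper's output. The data rows are transcribed from the derived dataset display; the format names trtf, sexf and ethnf, and the reading of trt_d = 1 as Treatment X and trt_d = 2 as Placebo, are assumptions.

```sas
/* Analysis variables transcribed from the derived dataset display. */
data derived;
  input subjid $ dmaged itt trt_d sex_d ethn_d;
  datalines;
0009 51 1 2 1 3
0025 23 1 1 2 2
0023 45 0 . 1 2
0047 39 1 1 1 3
0057 20 0 . 2 5
0008 32 1 1 2 2
0040 47 1 2 2 1
0003 46 1 2 1 5
0065 56 1 2 1 4
0005 94 1 1 2 4
;
run;

/* Hypothetical display formats for the numeric codes. */
proc format;
  value trtf  1 = 'Treatment X' 2 = 'Placebo';
  value sexf  1 = 'Male'        2 = 'Female';
  value ethnf 1 = 'African' 2 = 'Asian' 3 = 'Caucasian'
              4 = 'Hispanic' 5 = 'Other';
run;

/* Continuous parameter: n, mean, median, minimum and maximum for the
   intent-to-treat subjects, overall and by treatment group. */
proc means data=derived n mean median min max maxdec=1 printalltypes;
  where itt = 1;
  class trt_d;
  var dmaged;
  format trt_d trtf.;
run;

/* Categorical parameters: counts with column percentages, which use the
   number of subjects in each treatment group as the denominator. */
proc freq data=derived;
  where itt = 1;
  tables (sex_d ethn_d) * trt_d / norow;
  format trt_d trtf. sex_d sexf. ethn_d ethnf.;
run;
```

For the trt_d = 1 group the ages are 23, 39, 32 and 94, giving a mean of 47.0 and a median of 35.5, which agrees with the first column of the table. Arranging the results into the "n (%)" layout of the mock would need a further step, for example writing the procedure results to datasets and printing them with a DATA _NULL_ step or PROC REPORT.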
Listing 1
Demographics
Intent-to-Treat Subjects
             Site/
             Subject     Date of    Age
Treatment    Number      Birth      (yrs)    Gender    Ethnic Origin
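A listing with the header shown above could be written with a DATA _NULL_ step, one of the approaches the paper mentions. The sketch below is an illustration under assumptions: the rows are transcribed from the derived dataset display, the column positions and the sort order (treatment, then site and subject) are choices made here, and sitesubj is built as site and subject number joined by a slash, which the paper's annotation names but does not define.

```sas
/* Rows transcribed from the derived dataset display. */
data derived;
  infile datalines dsd dlm=',' truncover;
  length siteno subjid $4 dmgndr $6 dmeth $10;
  input siteno subjid trt_d dob_d :yymmdd8. dmaged dmgndr dmeth itt;
  format dob_d mmddyy10.;
  datalines;
0001,0009,2,19530723,51,Male,Caucasian,1
0001,0025,1,19810828,23,Female,Asian,1
0003,0023,.,19590521,45,Male,Asian,0
0004,0047,1,19650312,39,Male,Caucasian,1
0004,0057,.,19841218,20,Female,Other,0
0006,0008,1,19721001,32,Female,Asian,1
0008,0040,2,19571109,47,Female,African,1
0012,0003,2,19580527,46,Male,Other,1
0021,0065,2,19480531,56,Male,Hispanic,1
0033,0005,1,19100210,94,Female,Hispanic,1
;
run;

/* Keep intent-to-treat subjects only (DERIVED.itt=1 in the annotated mock). */
proc sort data=derived out=listing;
  where itt = 1;
  by trt_d siteno subjid;
run;

data _null_;
  set listing;
  file print notitles;
  length trt $11 sitesubj $9;

  /* Write the titles and column headers once, before the first row. */
  if _n_ = 1 then
    put 'Listing 1' / 'Demographics' / 'Intent-to-Treat Subjects' //
        @14 'Site/' /
        @14 'Subject' @26 'Date of' @38 'Age' /
        @1 'Treatment' @14 'Number' @26 'Birth' @38 '(yrs)'
        @45 'Gender' @53 'Ethnic Origin';

  /* Assumed decoding of the numeric treatment code. */
  if      trt_d = 1 then trt = 'Treatment X';
  else if trt_d = 2 then trt = 'Placebo';

  sitesubj = catx('/', siteno, subjid);

  put @1 trt @14 sitesubj @26 dob_d mmddyy10. @38 dmaged 3.
      @45 dmgndr @53 dmeth;
run;
```

PROC REPORT or PROC PRINT would give a similar result with less positional bookkeeping; the DATA _NULL_ form is shown because it gives full control over where each field lands on the page.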
SAS programming of tables and listings in the pharmaceutical industry is a stepwise process, always dependent on
previous documents and descriptions of what is to be produced. Many companies have different processes
and documents in addition to those described in this paper. It is important to understand the processes that are
specific to a given company. In general, the flow of rules and data can be described as in Figure 8, with each step
dependent on the previous one. When the steps are not followed, there is the potential for mistakes.
[Figure 8: flow diagram with the boxes SAP/Mocks, Database, Annotated CRF, Annotated Mocks, Derived Datasets and Programming Rules]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.