A Method For Cleaning Clinical Trial Datasets

NESUG 2006
Applications
A Method for Cleaning Clinical Trial Analysis Data Sets

Carol R. Vaughn, Bridgewater Crossings, NJ
ABSTRACT
This paper presents a method for using SAS software to search SAS programs in selected directories for references to variables in clinical trial analysis data sets slated for submission to the FDA. The end product is a list of variables not used in any of the programs searched. Derived variables commonly go unused because the analyses for which they were planned were later eliminated or significantly altered. Dropping these unused variables is highly desirable, since they require unnecessary validation and serve only as clutter of dubious value. The method involves identifying the programs in the selected directories and rendering them into a searchable working SAS data set. This working data set is then searched for a reference to each analysis data set variable.
INTRODUCTION
The first step in this process is to identify the programs in the directories to be searched. This can be accomplished by working with directory information. The next step is to render the programs searchable. This can be accomplished by treating the lines of code as lines of data and reading them into a working data set. Then, the analysis data set variables must be identified in order to search for references to them. One way to achieve this is to select them into macro variables from the SAS COLUMNS dictionary table. After the program code has been searched, the analysis data set variables for which no reference is found in any of the programs are designated for possible deletion. Finally, the variables identified can be compared against metadata to confirm that deleting them is acceptable. This paper provides example code for each of these steps. The process is broken down into component steps in order to show how the functionality of each could be used for other applications as well.
WORKING WITH DIRECTORY INFORMATION

Methods for identifying the contents of a directory depend on the operating system in which SAS is running. On the UNIX operating system, this can be accomplished by using an X command within SAS to list the directory contents to a file.

    x ls -1 > dirlist.txt;

Figure 1 below is an example of a resulting text file:
Figure 1. Example Text File Resulting From X Command
On the Windows operating system, pointing to a PIPE device type in a FILENAME statement can make available details of the contents of the directory referenced.

    filename SAFT pipe 'dir "P:\Biostat\XXX1234A\3333\pg\rep\saft\"';

Figure 2 below is an example of information resulting from pointing to a PIPE device type:
Figure 2. Example Information Resulting From Pointing to a PIPE Device Type
This paper will focus on using directory information on a Windows operating system. However, similar methods could be used on a UNIX operating system. In fact, to make a program transportable between operating systems, the operating-system-dependent code could be executed conditionally, depending on the value returned by the automatic macro variable &SYSSCP.

In order to make use of directory information, it can be input into a SAS data set. When the contents of multiple directories are needed, the list of contents of each directory can be appended to a shell data set along with an identifier that tags each record with the directory to which it pertains. To identify the program files, the file extension can be isolated by scanning the line of input for the text following the last special character. If that text is SAS (compared without regard to case), the SAS program name can be isolated by scanning the line of input for the text preceding the last special character.

    %macro getdir_win(_dir_nm=,_path=);
        filename dl pipe &_path;
        data dirlist;
            length line $1000 dir_nm $5 path $50 program $50;
            infile dl length=reclen;
            input line $varying1000. reclen;
            if upcase(scan(line,-1)) = "SAS";
            dir_nm = "&_dir_nm";
            path = substr(&_path,6,length(&_path)-6);
            program = trim(left(scan(line,-2))) || "." || trim(left(scan(line,-1)));
            keep dir_nm path program;
        run;
        proc append data = dirlist base = all_dir force;
        run;
    %mend getdir_win;

    %getdir_win(_dir_nm=SAFT,_path='dir "P:\Biostat\XXX1234A\3333\pg\rep\saft\"');
    %getdir_win(_dir_nm=DER,_path='dir "P:\Biostat\XXX1234A\3333\pg\der\"');
Figure 3 below is an excerpt of the working data set ALL_DIR created from this code:
Figure 3. Example Working Data Set Resulting From Using Directory Information in a DATA Step
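The paper mentions, but does not show, conditional execution based on &SYSSCP. A minimal sketch of that idea follows. The macro name LIST_DIR, its parameter _DIR, and the use of `dir /b` and `ls -1` are illustrative assumptions rather than the author's code. The sketch assumes that &SYSSCP resolves to WIN on Windows, that any other value indicates a UNIX-like host, and that the XCMD system option is enabled so that the PIPE device type is available.

```sas
/* Sketch only: pick the directory-listing command from &SYSSCP.   */
/* LIST_DIR and _DIR are hypothetical names, not from the paper.   */
%macro list_dir(_dir=);
    %if "&sysscp" = "WIN" %then %do;
        /* Windows: bare listing, one file name per line */
        filename dl pipe "dir ""&_dir"" /b";
    %end;
    %else %do;
        /* Assumed UNIX-like host: one file name per line */
        filename dl pipe "ls -1 ""&_dir""";
    %end;

    data dirlist;
        length line $1000;
        infile dl truncover;
        input line $char1000.;
        /* keep only names whose text after the last period is SAS */
        if upcase(scan(line,-1,'.')) = "SAS";
    run;
%mend list_dir;

%list_dir(_dir=P:\Biostat\XXX1234A\3333\pg\der\);
```

Because both commands here return one name per line, the same DATA step can read either listing, which keeps the operating-system-specific part confined to the FILENAME statement.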
RENDERING PROGRAM FILES INTO A SEARCHABLE SAS DATA SET

A CALL EXECUTE can then be used to loop through each program in each directory identified in this working (ALL_DIR) data set. The code below successively creates a fileref called INF for each program, and then creates a working data set called PRG_SET using each successive infile INF. The lines of code in the programs themselves are treated as lines of data and are read into this working data set with an INPUT statement. Then, each working data set for each program in ALL_DIR is appended to the shell data set called PRG.

    data prg;
        length code $200 dir_nm $5 prg_nm $50;
        delete;
    run;

    data _null_;
        set all_dir;
        call execute("filename inf '" || trim(path) || trim(program) || "';");
        call execute("data prg_set; infile inf truncover; length prg_nm $50 dir_nm $5; input code $1-200; if code ne ''; prg_nm = '" || trim(program) || "';" || "code = upcase(code); dir_nm = '" || trim(dir_nm) || "'; run;");
        call execute("proc append data = prg_set base = prg force; run;");
    run;
The data set called PRG will contain every line of code from every program in ALL_DIR with its corresponding program name and directory reference. The code will be in all uppercase and left justified in order to aid in searching.
Figure 4 below is an excerpt of the working data set PRG created with the code:
Figure 4. Example Working Data Set Resulting From Reading in Program Contents with CALL EXECUTE
SEARCHING PROGRAM FILES

The data set PRG can then be searched for text strings. A simple use for the ability to search programs for text strings would be to search for a programmer's name.

    proc sort data=prg out=searched (keep=dir_nm prg_nm) nodupkey;
        by dir_nm prg_nm;
        where index(code,"WONG")>0;
    run;

Figure 5 below is an excerpt of the working data set SEARCHED resulting from this search:
Figure 5. Example Working Data Set Resulting From Searching for a Programmer's Name in Programs
It is sometimes desirable to search for a string of characters as a word and not merely a sequence of characters. For example, when searching for the string EVENT using the function INDEX, the string EVENTA will be identified as an occurrence. To circumvent this, the function INDEXW can be used. This function searches a character expression for a specified string as a word preceded and followed by a blank space. When searching program code, often the word being searched for will be preceded or followed by special characters such as a semicolon or equal sign. In order for INDEXW to yield the desired result it may be necessary to strip out many of these special characters in the working data set of code and replace them with spaces prior to searching with the function INDEXW. The function TRANSLATE can be used for this purpose.
    data prg;
        set prg;
        code = trim(left(translate(code," ","*+-/^=~><)(,;|!")));
    run;

Note that certain special characters are retained. For example, the special characters &, %, ', and " are retained so that macro parameters, macro names, and strings are not incorrectly identified as variable names. Figure 6 below is an excerpt of the working data set PRG resulting from this modification:
Figure 6. Example Working Data Set Resulting From Stripping Out Selected Special Characters
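The difference between INDEX and INDEXW, and the effect of the TRANSLATE step, can be seen on a single line of code. The sketch below uses a made-up line of code (an assumption for illustration only) and writes the three results to the log.

```sas
/* Illustration only: the CODE value is an invented example. */
data _null_;
    length code clean $200;
    code = "IF EVENTA=1 THEN EVENT=2;";

    /* INDEX matches the character sequence inside EVENTA */
    pos_index = index(code,"EVENT");

    /* INDEXW finds no blank-delimited word EVENT, because the */
    /* only candidate is the word EVENT=2;                     */
    pos_indexw = indexw(code,"EVENT");

    /* After the special characters are replaced with blanks,  */
    /* EVENT stands alone as a word and INDEXW finds it        */
    clean = trim(left(translate(code," ","*+-/^=~><)(,;|!")));
    pos_clean = indexw(clean,"EVENT");

    put pos_index= pos_indexw= pos_clean=;
run;
```

With this input, the log should show pos_index=4, pos_indexw=0, and pos_clean=18. INDEX reports a false match, INDEXW alone misses the real reference, and INDEXW on the translated line finds it.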
APPLICATION OF THIS FUNCTIONALITY

There are many uses for the ability to identify directory contents and render various types of files into searchable SAS data sets. The same tack could be used, for instance, to convert SAS log or lst files into searchable SAS data sets. The application that is the particular subject of this paper is the ability to search SAS programs for references to variables in clinical trial analysis data sets to be submitted to the FDA. As previously stated, the objective of such a search is to identify derived variables in the analysis data sets which are unused in programs and designate them for possible deletion. The process involves identifying the variables in the analysis data sets and searching through the code (as stored in the working data set called PRG described above) for references to each of the variables.

There is one inherent complication in this process. The program which created a derived variable will necessarily contain a reference to that variable. Therefore, it is necessary to make sure that the program that created the variable is not searched for that variable. One way to accomplish this is to name all programs which create analysis data sets following a convention which includes the name of the data set in the name of the program. In this way, the search program can be written to skip lines of code tagged with the name of the data set in which the variable exists. In the creation of the working data set ALL_DIR described above, an additional variable must be created to identify the data set, if any, that the program created. The example code below could be added to the creation of the working data set ALL_DIR if all analysis data set creation programs are found in a directory tagged DER and are the only programs in this directory.

    if upcase(dir_nm) = "DER" then ds = upcase(substr(program,1,length(program)-4));
    else ds = "NAP";
The creation of the variable DS could be achieved similarly under other naming and directory conventions.
Figure 7 below is an excerpt of the working data set ALL_DIR resulting from the addition of the variable DS:
Figure 7. Example Working Data Set Resulting From The Addition of The Variable DS
The variable DS would need to be included as a variable in the working data set PRG. This could be accomplished by adding it to the CALL EXECUTE which used ALL_DIR to create PRG. In order to identify the variables in the derived data sets, the SAS dictionary table COLUMNS can be used.

    proc sql noprint;
        create table vars as
        select distinct upper(memname) as ds, upper(name) as var
        from dictionary.columns
        where upper(libname) = "DDS";
    quit;
Figure 8 below is an excerpt of the working data set VARS created with this code:
Figure 8. Example Working Data Set Resulting From Reading in Data from DICTIONARY.COLUMNS
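The paper does not show how DS is carried into PRG. One possible way, sketched below, is to extend the earlier CALL EXECUTE step. It assumes that ALL_DIR already contains the variable DS as derived above, and the length of $50 given to DS is an assumption chosen to match PROGRAM.

```sas
/* Sketch only: the earlier PRG-building step with DS added.   */
/* Assumes ALL_DIR has PATH, PROGRAM, DIR_NM, and DS.          */
data prg;
    length code $200 dir_nm $5 prg_nm $50 ds $50;
    delete;
run;

data _null_;
    set all_dir;
    call execute("filename inf '" || trim(path) || trim(program) || "';");
    call execute("data prg_set; infile inf truncover;"
              || " length prg_nm $50 dir_nm $5 ds $50;"
              || " input code $1-200; if code ne '';"
              || " code = upcase(code);"
              || " prg_nm = '" || trim(program) || "';"
              || " dir_nm = '" || trim(dir_nm) || "';"
              /* tag every line of code with the data set, if any, */
              /* that its program creates                          */
              || " ds = '" || trim(ds) || "';"
              || " run;");
    call execute("proc append data = prg_set base = prg force; run;");
run;
```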
By counting the number of variables, and selecting variable names into macro variables, the variables can be looped through and used successively as the string in the INDEXW function. By selecting the corresponding data set names into macro variables, the comparison can be made to the name of the data set the program created.

    proc sql noprint;
        select left(put(count(var),4.0)) into :varcnt from vars;
        select var into :var1 - :var&varcnt from vars;
        select ds into :ds1 - :ds&varcnt from vars;
    quit;
The following code could be used to determine whether a variable name is found in a line of code which does not have the same data set tag (the value of the variable DS in the working data set PRG) as the analysis data set in which the variable is found (the value of the variable DS from the working data set VARS, which is held as a macro variable). If this condition is met, the line of code, plus the value of the variable VAR identifying which variable reference was found in the line of code, is output to a working data set called REF_PRG.

    data ref_prg;
        set prg;
        length var $20;
        %macro process;
            %do i = 1 %to &varcnt;
                if indexw(code,"&&var&i") > 0 and ds ne "&&ds&i" then do;
                    var = "&&var&i";
                    output;
                end;
            %end;
        %mend process;
        %process;
    run;
Figure 9 below is an excerpt of the working data set REF_PRG created by this code:
Figure 9. Example Working Data Set Resulting From Searching for Reference to Analysis Dataset Variables
Note that if a variable from the working data set VARS was not found in the program code (the working data set PRG), no record for it will be written to the working data set REF_PRG.
The unique variables which were referred to in the code (the variable VAR in the working data set REF_PRG) are then compared against all analysis data set variable names in order to determine which are never referred to in code.

    proc sort data = ref_prg (keep = var) out = used nodupkey;
        by var;
    run;

    proc sort data = vars (keep = var) out = all_vars nodupkey;
        by var;
    run;

    data unused;
        merge all_vars used (in = used);
        by var;
        if not used;
    run;
Figure 10 below is an excerpt of the working data set UNUSED created with this code:
Figure 10. Example Working Data Set Resulting From Identifying Unused Variables
Identifying the unused variables is usually not the last step. It may be necessary or desirable to retain some of the variables identified as unused. For example, at times it may be desired to retain raw Case Report Form (CRF) variables in an analysis data set even though they are never used in a program. Or, perhaps there are many derived decode variables (for example, a variable storing the values MALE and FEMALE, which are the decodes of a coded variable with values 1 and 2) which are never used in a program, but which it is desired to retain in the analysis data sets. In such cases, it is valuable to have the metadata for the analysis data sets stored in a form against which the unused variables can be compared programmatically, in order to identify which variables should be retained.
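As a rough sketch of such a comparison, the example below joins a small stand-in for UNUSED against a hypothetical metadata data set. The data set META, its column ORIGIN, the ORIGIN values, and all variable names in the sample data are assumptions made for illustration. Real metadata would have its own structure.

```sas
/* Stand-in for the UNUSED data set produced earlier */
data unused;
    length var $32;
    input var $;
    datalines;
AGE
SEXDECOD
TRTFLAG
OLDSCORE
;
run;

/* Hypothetical metadata: one row per variable with its origin */
data meta;
    length var $32 origin $10;
    input var $ origin $;
    datalines;
AGE CRF
SEXDECOD DECODE
TRTFLAG DERIVED
;
run;

/* Keep only unused variables that are neither raw CRF nor     */
/* decode variables; variables absent from META are also kept  */
/* on the list so that they can be reviewed by hand            */
proc sql;
    create table drop_candidates as
    select u.var, m.origin
    from unused as u left join meta as m
        on u.var = m.var
    where m.origin not in ("CRF","DECODE") or m.origin is missing;
quit;
```

With this sample data, DROP_CANDIDATES contains TRTFLAG and OLDSCORE, while AGE and SEXDECOD are retained because of their origin.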
PREREQUISITES/CAVEATS
This method works well as long as certain programming practices and conventions are followed:
- The line size of code in all programs should not exceed 200 characters.
- Analysis data set derivation programs should be placed in a separate directory and named following a convention which allows programmatic identification of the data set they create.
- The final subdirectory of any directory path to be searched should not be named SAS.
Also, please note that this method does not discriminate between comments in programs and actual code, nor does it differentiate between variable names and data set names.

For example, if an analysis data set contained the derived variables EVENT and BASE and these variables were never actually used, this method would still report them as referenced if a searched program merely mentioned EVENT in a comment or used BASE as a data set name.
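A hypothetical illustration of such code (not taken from the paper) is shown below. Neither statement uses the analysis variables EVENT or BASE, yet both words would be found by the INDEXW search once the lines are uppercased and the special characters are replaced with blanks.

```sas
/* Hypothetical program text, for illustration only */

* Derive the event flag in a later version of this program;

data base;              /* BASE appears here as a data set name */
    set sashelp.class;
run;
```

The comment statement contributes the word EVENT, and the DATA statement contributes the word BASE, so both variables would be written to REF_PRG.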
However, in practical use, these potential problems have not yet presented themselves as actual problems.
CONCLUSION
The ability of SAS to take information from directory details and from files other than data sets, place this information in SAS data sets, and search it for references to variables contained in SAS data sets has many applications. The basic concepts presented in this paper for determining variables not used in programs could be modified to accomplish many other tasks.
ACKNOWLEDGEMENT
I would like to thank my colleague, Jeffery Cortez, for coming up with the idea of searching programs to identify analysis data set variables not used in programs.
CONTACT INFORMATION
Your comments and questions are welcome. Contact the author at:
Carol R. Vaughn
The sanofi-aventis Group
200 Bridgewater Crossings
Bridgewater, NJ 08807
Work Phone: 908-304-6298
Email: Carol.Vaughn@sanofi-aventis.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Windows is a registered trademark of Microsoft Corporation in the United States and other countries.
UNIX is a registered trademark of The Open Group.
