Applications
ABSTRACT
This paper presents a method for using SAS software to search SAS programs in selected directories for references to variables in clinical trial analysis data sets slated for submission to the FDA. The end product is a list of variables not used in any of the programs searched. Derived variables commonly go unused because the analyses they were created for were later eliminated or significantly altered. Dropping these unused variables is highly desirable, since they require unnecessary validation and serve only as clutter of dubious value. The method involves searching selected directories and reading the programs in those directories into a searchable working SAS data set. This working data set is then searched for a reference to each analysis data set variable.
INTRODUCTION
The first step in this process is to identify the programs in the directories to be searched. This can be accomplished by working with directory information. The next step is to render the programs searchable. This can be accomplished by treating the lines of code as lines of data and reading them into a working data set. Then, the analysis data set variables must be identified in order to search for references to them. One way to achieve this is to select them into macro variables from the SAS COLUMNS dictionary table. After the program code has been searched, the analysis data set variables for which no reference is found in any of the programs are designated for possible deletion. Finally, the variables identified can be compared against metadata to confirm that deleting them is acceptable. This paper provides example code for each of these steps. The process is broken down into component steps in order to show how the functionality of each could be used for other applications as well.
NESUG 2006
On the Windows operating system, pointing to a PIPE device type in a FILENAME statement makes the details of the contents of the referenced directory available.

    filename SAFT pipe 'dir "P:\Biostat\XXX1234A\3333\pg\rep\saft\"';

Figure 2 below is an example of the information resulting from pointing to a PIPE device type:
This paper focuses on using directory information on a Windows operating system. However, similar methods could be used on a UNIX operating system. In fact, to make a program portable between operating systems, the operating-system-dependent code could be executed conditionally, depending on the value returned by the automatic macro variable &SYSSCP.

To make use of directory information, it can be read into a SAS data set. When the contents of multiple directories are needed, the list of contents of each directory can be appended to a shell data set, along with identifier information that tags it with the directory to which the contents pertain. To identify the program files, the file extension can be isolated by scanning the line of input for the text following the last special character. If that text equals SAS, the program name can be isolated by scanning the line of input for the text preceding the last special character.

    %macro getdir_win(_dir_nm=,_path=);
      filename dl pipe &_path;
      data dirlist;
        length line $1000 dir_nm $5 path $50 program $50;
        infile dl length=reclen;
        input line $varying1000. reclen;
        /* keep only directory lines whose last token is the extension SAS */
        if upcase(scan(line,-1)) = "SAS";
        dir_nm = "&_dir_nm";
        /* strip the leading dir " and the trailing " from the quoted command */
        path = substr(&_path,6,length(&_path)-6);
        program = trim(left(scan(line,-2))) || "." || trim(left(scan(line,-1)));
        keep dir_nm path program;
      run;
      proc append data = dirlist base = all_dir force; run;
    %mend getdir_win;
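The conditional execution by operating system mentioned above could look something like the following. This is only a minimal sketch: the macro name OPEN_DIRLIST and its parameter _DIR are hypothetical, the UNIX command is an assumption, and the layout of a UNIX listing differs from that of DIR, so the parsing of each line would need to be checked separately.

```sas
/* Hypothetical wrapper: OPEN_DIRLIST and _DIR are illustrative names only */
%macro open_dirlist(_dir=);
  %if &sysscp = WIN %then %do;
    /* Windows: the same DIR command used in the FILENAME example */
    filename dl pipe "dir ""&_dir""";
  %end;
  %else %do;
    /* Assumed UNIX equivalent; the listing layout is not the same as DIR */
    filename dl pipe "ls ""&_dir""";
  %end;
%mend open_dirlist;

%open_dirlist(_dir=P:\Biostat\XXX1234A\3333\pg\rep\saft\);
```

On Windows &SYSSCP resolves to WIN, so the first branch assigns the fileref DL exactly as the PIPE example does; any other value falls through to the assumed UNIX branch.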
Figure 3 below is an excerpt of the working data set ALL_DIR created from this code:
Figure 3. Example Working Data Set Resulting From Using Directory Information in a DATA Step
    ||
    "';");
    dir_nm = '" || trim(dir_nm) || "'; run;");
    call execute("proc append data = prg_set base = prg force; run;");
    run;
The data set called PRG will contain every line of code from every program in ALL_DIR with its corresponding program name and directory reference. The code will be in all uppercase and left justified in order to aid in searching.
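As an illustration of the general technique, a DATA step that reads ALL_DIR and generates one read-and-append step per program with CALL EXECUTE might be sketched as follows. This is not the paper's exact code: the variable lengths, the 200-character read width (taken from the line size convention noted under PREREQUISITES/CAVEATS), and the INFILE options are assumptions, and the sketch assumes PATH ends with a directory separator and that no path or program name contains a single quotation mark.

```sas
data _null_;
  set all_dir;
  /* Each CALL EXECUTE queues text that runs after this DATA step finishes */
  call execute("data prg_set; length code $200 program $50 dir_nm $5;");
  /* Point INFILE at the program file named on the current ALL_DIR record */
  call execute("infile '" || trim(path) || trim(program) || "' truncover;");
  /* Read each line of code as data, then upcase and left-justify it */
  call execute("input code $char200.;");
  call execute("code = upcase(left(code));");
  /* Tag every line with the program name and directory reference */
  call execute("program = '" || trim(program) || "';");
  call execute("dir_nm = '" || trim(dir_nm) || "'; run;");
  /* Accumulate the lines of all programs in PRG */
  call execute("proc append data = prg_set base = prg force; run;");
run;
```

Because the generated statements are plain SAS code rather than macro calls, they are stacked up and executed only after the DATA _NULL_ step ends, one generated DATA step and one PROC APPEND per record in ALL_DIR.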
Figure 4 below is an excerpt of the working data set PRG created with the code:
Figure 4. Example Working Data Set Resulting From Reading in Program Contents with CALL EXECUTE
Figure 5. Example Working Data Set Resulting From Searching for a Programmer's Name in Programs
It is sometimes desirable to search for a string of characters as a word and not merely a sequence of characters. For example, when searching for the string EVENT using the function INDEX, the string EVENTA will be identified as an occurrence. To circumvent this, the function INDEXW can be used. This function searches a character expression for a specified string as a word preceded and followed by a blank space. When searching program code, often the word being searched for will be preceded or followed by special characters such as a semicolon or equal sign. In order for INDEXW to yield the desired result it may be necessary to strip out many of these special characters in the working data set of code and replace them with spaces prior to searching with the function INDEXW. The function TRANSLATE can be used for this purpose.
    data prg;
      set prg;
      code = trim(left(translate(code," ","*+-/^=~><)(,;|!")));
    run;

Note that certain special characters are retained. For example, the special characters &, %, ', and " are retained so that macro parameters, macro names, and strings are not incorrectly identified as variable names. Figure 6 below is an excerpt of the working data set PRG resulting from this modification:
Figure 6. Example Working Data Set Resulting From Stripping Out Selected Special Characters
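The difference between INDEX and INDEXW, and the effect of replacing special characters with spaces first, can be seen in a small self-contained check such as the one below. The line of code being searched and the variable names are made up for illustration.

```sas
data _null_;
  length code cleaned $200;
  /* Illustrative line of program code containing both EVENTA and EVENT */
  code = "IF EVENTA=1 THEN EVENT=2;";

  /* INDEX matches EVENT inside EVENTA, so the result is greater than 0 */
  found_index = index(code, "EVENT");

  /* INDEXW on the raw line finds no blank-delimited word EVENT,
     because EVENT is immediately followed by an equal sign */
  found_word_raw = indexw(code, "EVENT");

  /* After the special characters are translated to spaces,
     EVENT stands alone as a word and INDEXW finds it */
  cleaned = trim(left(translate(code, " ", "*+-/^=~><)(,;|!")));
  found_word_clean = indexw(cleaned, "EVENT");

  put found_index= found_word_raw= found_word_clean=;
run;
```

The values written to the log show a substring match for INDEX, no match for INDEXW on the raw line, and a match for INDEXW once the line has been cleaned.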
The creation of the variable DS could be achieved similarly under other naming and directory conventions.
Figure 7 below is an excerpt of the working data set ALL_DIR resulting from the addition of the variable DS:
Figure 7. Example Working Data Set Resulting From The Addition of The Variable DS
The variable DS would need to be included as a variable in the working data set PRG. This could be accomplished by adding it to the CALL EXECUTE which used ALL_DIR to create PRG. In order to identify the variables in the derived data sets, the SAS dictionary table COLUMNS can be used.

    proc sql noprint;
      create table vars as
        select distinct upper(memname) as ds, upper(name) as var
        from dictionary.columns
        where upper(libname) = "DDS";
    quit;
Figure 8 below is an excerpt of the working data set VARS created with this code:
Figure 8. Example Working Data Set Resulting From Reading in Data from DICTIONARY.COLUMNS
By counting the number of variables, and selecting variable names into macro variables, the variables can be looped through and used successively as the string in the INDEXW function. By selecting the corresponding data set names into macro variables, the comparison can be made to the name of the data set the program created.

    proc sql noprint;
      select left(put(count(var),4.0)) into :varcnt from vars;
      select var into :var1 - :var&varcnt from vars;
      select ds into :ds1 - :ds&varcnt from vars;
    quit;
The following code could be used to determine if a variable name is found in a line of code which does not have the same data set tag name (the value of the variable DS in the working data set PRG) as the analysis data set in which the variable is found (the value of the variable DS from the working data set VARS which is held as a macro variable). If this condition is met, the line of code plus the value of variable VAR, which identifies which variable reference was found in the line of code, is output to a working data set called REF_PRG.

    data ref_prg;
      set prg;
      length var $20;
      %macro process;
        %do i = 1 %to &varcnt;
          if indexw(code,"&&var&i") > 0 and ds ne "&&ds&i" then do;
            var = "&&var&i";
            output;
          end;
        %end;
      %mend process;
      %process;
    run;
Figure 9 below is an excerpt of the working data set REF_PRG created by this code:
Figure 9. Example Working Data Set Resulting From Searching for Reference to Analysis Dataset Variables
Note that if a variable from the working data set VARS was not found in the program code (the working data set PRG), a record will not be written to the working data set REF_PRG.
The unique variables which were referred to in the code (the variable VAR in the working data set REF_PRG) are then compared against all analysis data set variable names in order to determine which are never referred to in code.

    proc sort data = ref_prg (keep = var) out = used nodupkey;
      by var;
    run;

    proc sort data = vars (keep = var) out = all_vars nodupkey;
      by var;
    run;

    data unused;
      merge all_vars used (in = used);
      by var;
      if not used;
    run;
Figure 10 below is an excerpt of the working data set UNUSED created with this code:
Figure 10. Example Working Data Set Resulting From Identifying Unused Variables
Identifying the unused variables is usually not the last step. It may be necessary or desirable to retain some of the variables identified as unused. For example, it may be desirable to retain raw Case Report Form (CRF) variables in an analysis data set even though they are never used in a program. Or there may be many derived decode variables (for example, a variable storing the values MALE and FEMALE, which are the decodes of a coded variable with values 1 and 2) which are never used in a program but which should be retained in the analysis data sets. In such cases, it is valuable to have the metadata for the analysis data sets stored in a form that allows the unused variables to be compared against it programmatically, in order to identify which variables should be retained.
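A programmatic comparison against metadata could be as simple as a merge. The sketch below assumes a hypothetical metadata data set META with one record per variable, the variable name in VAR (uppercase, as in UNUSED), and a hypothetical flag RETAIN_FL set to Y for variables to be kept regardless of use. None of these names come from an actual metadata standard.

```sas
/* META and RETAIN_FL are hypothetical names used for illustration only */
proc sort data = meta (keep = var retain_fl) out = meta_srt nodupkey;
  by var;
run;

data drop_candidates;
  /* UNUSED is already in VAR order, since it was created by a merge by VAR */
  merge unused (in = in_unused) meta_srt;
  by var;
  /* Keep only unused variables that the metadata does not flag for retention */
  if in_unused and retain_fl ne "Y";
run;
```

The resulting data set DROP_CANDIDATES would then hold the unused variables that are not protected by the metadata, which could be reviewed before anything is actually dropped.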
PREREQUISITES/CAVEATS
This method works well as long as certain programming practices and conventions are followed:
• The line size of code in all programs should not exceed 200 characters.
• Analysis data set derivation programs should be placed in a separate directory and named following a convention which allows programmatic identification of the data set they create.
• The final subdirectory of any directory path to be searched should not be named SAS.
Also, please note that this method does not discriminate between comments in programs and actual code. Nor
However, in practical use, these potential problems have not yet presented themselves as actual problems.
CONCLUSION
The ability of SAS to take information from directory details and from files other than data sets, place this information in SAS data sets, and search it for references to variables contained in SAS data sets has many applications. The basic concepts presented in this paper for determining variables not used in programs could be modified to accomplish many other tasks.
ACKNOWLEDGEMENT
I would like to thank my colleague, Jeffery Cortez, for coming up with the idea of searching programs to identify analysis data set variables not used in programs.
CONTACT INFORMATION
Your comments and questions are welcome. Contact the author at:
Carol R. Vaughn
The sanofi-aventis Group
200 Bridgewater Crossings
Bridgewater, NJ 08807
Work Phone: 908-304-6298
Email: Carol.Vaughn@sanofi-aventis.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Windows is a registered trademark of Microsoft Corporation in the United States and other countries.