
Paper PO02

Be Careful When You Merge SAS Datasets!


Yan Lei, GlaxoSmithKline, Collegeville, PA
ABSTRACT
When programming in SAS, we often need to merge two or more datasets into one. A problem arises when MERGE and IF-THEN statements are used within the same DATA step: the results may be wrong or confusing. This paper presents two examples that show clearly what happens in that situation and provides tips for avoiding this kind of mistake.

INTRODUCTION
The SAS System provides very useful tools for data manipulation, such as the DATA step, the RETAIN statement, and the LAG function. It makes joining, merging, and manipulating datasets flexible and easy. However, we should watch the results closely and make sure we use statements and functions in the right steps.
EXAMPLE 1

Consider two data sets, fname and doc, shown below:

Data set fname


Obs  SOURCE_FILE_NAME         NAME
  1  LSUR01_ITT_D.LST         LSUR01_ITT_D
  2  TDEM01_ITT_D.CEL         TDEM01_ITT_D
  3  TDEM01_ITT_D.LST         TDEM01_ITT_D
  4  TVI01_LOBELL_ITT_D.CEL   TVI01_LOBELL_ITT_D
  5  TVI01_LOBELL_ITT_D.LST   TVI01_LOBELL_ITT_D

Data set doc


Obs  NAME                 DISPLAY_NUMBER  ABBREVIATED_TITLE
  1  LSUR01_ITT_D         16.13           Listing 16.13
  2  TDEM01_ITT_D         13.1            Table 13.1
  3  TVI01_LOBELL_ITT_D   13.9.3          Table 13.9.3

The following code was written to join these two datasets by the column NAME, and to add "Cell Index for" to the title when the extension of SOURCE_FILE_NAME is 'CEL':

    libname dt 'c:\yan\sascode\';

    proc sort data=dt.fname out=xfname; by name; run;
    proc sort data=dt.doc out=xdoc; by name; run;

    data final;
       merge xfname(in=a) xdoc(in=b);
       by name;
       if a;
       if upcase(substr(source_file_name, index(source_file_name,'.')+1))='CEL'
          then abbreviated_title = 'Cell Index for ' || left(abbreviated_title);
    run;

    proc print; run;

Note that the IF-THEN statement is used in the same DATA step as MERGE. The result of this example is as follows:

Example 1 Output
                                        11:34 Thursday, September 11, 2003

Obs  SOURCE_FILE_NAME         NAME                 DISPLAY_NUMBER  ABBREVIATED_TITLE
  1  LSUR01_ITT_D.LST         LSUR01_ITT_D         16.13           Listing 16.13
  2  TDEM01_ITT_D.CEL         TDEM01_ITT_D         13.1            Cell Index for Table 13.1
  3  TDEM01_ITT_D.LST         TDEM01_ITT_D         13.1            Cell Index for Table 13.1
  4  TVI01_LOBELL_ITT_D.CEL   TVI01_LOBELL_ITT_D   13.9.3          Cell Index for Table 13.9.3
  5  TVI01_LOBELL_ITT_D.LST   TVI01_LOBELL_ITT_D   13.9.3          Cell Index for Table 13.9.3

Obviously, the titles for Obs 3 and Obs 5 are wrong, because MERGE carried abbreviated_title over from the previous row. In a one-to-many merge, a variable that comes from the "one" data set is read once per BY group and then retained for the remaining observations of that group, so a change made to it persists. For example, in Obs 3 the abbreviated_title was carried over from Obs 2 as "Cell Index for Table 13.1"; its source_file_name is "TDEM01_ITT_D.LST", which does not end in "CEL", so abbreviated_title was left unchanged. To fix the problem, I modified the code to separate MERGE and IF-THEN into two DATA steps as follows:

    libname dt 'c:\yan\sascode\';

    proc sort data=dt.fname out=xfname; by name; run;
    proc sort data=dt.doc out=xdoc; by name; run;

    data document;
       merge xfname(in=a) xdoc(in=b);
       by name;
       if a;
    run;

    data document;
       set document;
       if upcase(substr(source_file_name, index(source_file_name,'.')+1))='CEL'
          then abbreviated_title = 'Cell Index for ' || left(abbreviated_title);
    run;

The results from this modified code are as follows:

Example 1 Output
                                        11:34 Thursday, September 11, 2003

Obs  SOURCE_FILE_NAME         NAME                 DISPLAY_NUMBER  ABBREVIATED_TITLE
  1  LSUR01_ITT_D.LST         LSUR01_ITT_D         16.13           Listing 16.13
  2  TDEM01_ITT_D.CEL         TDEM01_ITT_D         13.1            Cell Index for Table 13.1
  3  TDEM01_ITT_D.LST         TDEM01_ITT_D         13.1            Table 13.1
  4  TVI01_LOBELL_ITT_D.CEL   TVI01_LOBELL_ITT_D   13.9.3          Cell Index for Table 13.9.3
  5  TVI01_LOBELL_ITT_D.LST   TVI01_LOBELL_ITT_D   13.9.3          Table 13.9.3

These are exactly the results we expected, so this is the correct solution.
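The carry-over behind Example 1 can be reproduced on a much smaller case. The following is a minimal sketch, not code from the original program: the data sets many and one, and the variables key, seq and label, are hypothetical names invented for illustration. It changes a merged variable only on the first row of each BY group and writes every row to the log.

```sas
/* Hypothetical toy data: MANY has two rows per key, ONE has one row per key */
data many;
   input key $ seq;
   datalines;
A 1
A 2
B 1
B 2
;
run;

data one;
   input key $ label $;
   datalines;
A apple
B banana
;
run;

data demo;
   merge many one;
   by key;
   /* Change the merged variable on the first row of each BY group only */
   if seq = 1 then label = upcase(label);
   /* Write each row to the log so the carry-over is visible */
   put key= seq= label=;
run;
```

Because label is read from one only once per BY group and then held for the rest of the group, the log should show the upper-case value on the seq=2 rows as well, even though the IF condition is false there. This is the same effect that attached "Cell Index for" to the .LST rows above.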
EXAMPLE 2

This example describes a misuse of LAG caused by using MERGE and IF-THEN within one DATA step. In this scenario, we want to join the datasets stmed and bsa by pid and cycle. If a pid's bsa is missing in some cycle, the bsa from the previous cycle should be used. If the previous cycle is also missing, the cycle before that should be used, and so on until a non-missing bsa is found. The datasets stmed and bsa, and the code that was written to do the job, are shown below:

    data stmed;
       input pid cycle date :date9. dose;
       format date date9.;
       datalines;
    001 1 25Jan1999 2.2
    001 1 27Jan1999 2.2
    001 1 29Jan1999 2.2
    001 2 20Feb1999 2.2
    001 2 22Feb1999 2.2
    001 2 24Feb1999 2.2
    001 3 15Mar1999 2.2
    001 3 17Mar1999 2.2
    001 3 19Mar1999 2.2
    002 1 25Jan1999 1.5
    002 1 27Jan1999 1.5
    002 1 29Jan1999 1.5
    002 2 20Feb1999 1.5
    002 2 22Feb1999 1.5
    002 2 24Feb1999 1.5
    002 3 15Mar1999 1.5
    002 3 17Mar1999 1.5
    002 3 19Mar1999 1.5
    ;
    run;

    data bsa;
       input pid cycle bsa;
       datalines;
    001 1 1.88
    001 2 .
    001 3 .
    002 1 1.67
    002 2 1.70
    002 3 .
    ;
    run;

    proc sort data=stmed; by pid cycle; run;

    data stmed(drop=date);
       merge stmed(in=a) bsa;
       by pid cycle;
       if a;
       lpid = lag(pid);
       lbsa = lag(bsa);
       if not(first.cycle and first.pid) then do;
          if bsa=. and pid=lpid then bsa=lbsa;
       end;
    run;

The computation result from this code is as follows:


The SAS System                          14:07 Friday, September 12, 2003

Obs  PID  CYCLE  DATE       DOSE  BSA   LPID  LBSA
  1    1      1  25Jan1999   2.2  1.88     .     .
  2    1      1  27Jan1999   2.2  1.88     1  1.88
  3    1      1  29Jan1999   2.2  1.88     1  1.88
  4    1      2  20Feb1999   2.2  1.88     1  1.88
  5    1      2  22Feb1999   2.2  1.88     1     .
  6    1      2  24Feb1999   2.2  1.88     1  1.88
  7    1      3  15Mar1999   2.2  1.88     1  1.88
  8    1      3  17Mar1999   2.2  1.88     1     .
  9    1      3  19Mar1999   2.2  1.88     1  1.88
 10    2      1  25Jan1999   1.5  1.67     1  1.88
 11    2      1  27Jan1999   1.5  1.67     2  1.67
 12    2      1  29Jan1999   1.5  1.67     2  1.67
 13    2      2  20Feb1999   1.5  1.70     2  1.67
 14    2      2  22Feb1999   1.5  1.70     2  1.70
 15    2      2  24Feb1999   1.5  1.70     2  1.70
 16    2      3  15Mar1999   1.5  1.70     2  1.70
 17    2      3  17Mar1999   1.5  1.70     2     .
 18    2      3  19Mar1999   1.5  1.70     2  1.70

RETAIN keeps the value of a variable for use in all subsequent observations, whereas the LAG (LAG1) function returns only the value stored by the previous call of the function. In this example, the bsa of pid 1 is missing in cycles 2 and 3, and the program is meant to use the cycle 1 bsa for both. Since the program uses LAG to fetch the bsa of the previous observation, the definition of LAG implies that the bsa of pid 1 should remain missing in cycle 3 and in part of cycle 2. In fact it is not missing, as the result shows. That is because MERGE caused LAG to act like RETAIN in this specific example. As in Example 1, the BSA of Obs 5 does not come from LBSA, which is missing there; it is carried over from the BSA of Obs 4 by MERGE. That is the source of the confusion between LAG and RETAIN. If you separate MERGE and the IF-THEN statement into two DATA steps:

    data stmed(drop=date);
       merge stmed(in=a) bsa;
       by pid cycle;
       if a;
    run;

    data stmed;
       set stmed;
       by pid cycle;
       lpid = lag(pid);
       lbsa = lag(bsa);
       if not(first.cycle and first.pid) then do;
          if bsa=. and pid=lpid then bsa=lbsa;
       end;
    run;

    proc print; run;

Then the results match the definition of LAG, as follows:

The SAS System                          14:07 Friday, September 12, 2003

Obs  PID  CYCLE  DATE       DOSE  BSA   LPID  LBSA
  1    1      1  25Jan1999   2.2  1.88     .     .
  2    1      1  27Jan1999   2.2  1.88     1  1.88
  3    1      1  29Jan1999   2.2  1.88     1  1.88
  4    1      2  20Feb1999   2.2  1.88     1  1.88
  5    1      2  22Feb1999   2.2     .     1     .
  6    1      2  24Feb1999   2.2     .     1     .
  7    1      3  15Mar1999   2.2     .     1     .
  8    1      3  17Mar1999   2.2     .     1     .
  9    1      3  19Mar1999   2.2     .     1     .
 10    2      1  25Jan1999   1.5  1.67     1     .
 11    2      1  27Jan1999   1.5  1.67     2  1.67
 12    2      1  29Jan1999   1.5  1.67     2  1.67
 13    2      2  20Feb1999   1.5  1.70     2  1.67
 14    2      2  22Feb1999   1.5  1.70     2  1.70
 15    2      2  24Feb1999   1.5  1.70     2  1.70
 16    2      3  15Mar1999   1.5  1.70     2  1.70
 17    2      3  17Mar1999   1.5     .     2     .
 18    2      3  19Mar1999   1.5     .     2     .
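The stated goal of this example, using the most recent non-missing bsa within a pid, is a carry-forward, which is what RETAIN is designed for. The following is a minimal sketch under stated assumptions rather than the original program: merged_demo is a hypothetical stand-in for an already merged dataset (one row per pid and cycle, for brevity), and last_bsa is a hypothetical helper variable.

```sas
/* Hypothetical stand-in for data that has already been merged in its own step */
data merged_demo;
   input pid cycle bsa;
   datalines;
1 1 1.88
1 2 .
1 3 .
2 1 1.67
2 2 1.70
2 3 .
;
run;

data locf_demo;
   set merged_demo;
   by pid;
   retain last_bsa;                       /* helper: last non-missing bsa seen   */
   if first.pid then last_bsa = .;        /* never carry a value across subjects */
   if bsa ne . then last_bsa = bsa;       /* remember a newly observed value     */
   else bsa = last_bsa;                   /* otherwise fill from the held value  */
   drop last_bsa;
run;

proc print data=locf_demo; run;
```

Because the carry-forward happens in a step that reads the merged data with SET, the held value is controlled explicitly by RETAIN and reset at each new pid, instead of depending on how MERGE happens to hold values within a BY group.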

CONCLUSION
When merging, we should not put MERGE and an IF-THEN statement in the same DATA step if the IF-THEN statement involves variables that come from two different datasets being merged. If it is not completely clear when MERGE and IF-THEN can safely share a DATA step and when they cannot, it is best simply to always put them in separate DATA steps. Following this recommendation avoids the merge errors shown above.

ACKNOWLEDGMENTS
The author would like to thank David Izard, Biostatistics & Data Management, GlaxoSmithKline, for his review of this paper and his valuable suggestions.

CONTACT INFORMATION
Yan Lei
GlaxoSmithKline Pharmaceuticals
1250 South Collegeville Road, UP4315
Collegeville, PA 19426-0989
Work Phone: (610) 917-6950
Email: yan_lei-1@gsk.com
