
%SASDIFF: A SAS Macro for Differential File Comparison

Ross Bettinger

ABSTRACT

Differential file comparison is a technique often used to identify changes between two files. These files may contain code developed for a software project or data collected and revised with new information. In either case, you may want to know the differences between the original file and the new file. The %SASDIFF macro is patterned after the UNIX sdiff program and may be used to reveal differences that have been introduced after an original file has been modified and saved as a new file.

KEYWORDS

Differential file comparison, sdiff, UNIX, SAS/Macro

INTRODUCTION

Suppose that you are asked to assume responsibility for a project that involves many files of SAS code. You see several versions of a file and want to know what the differences are between each version. Looking at the timestamp for each file, you can put them into sequential order. You can print out the code for each version and use PROC EYEBALL to compare one version to the next as it has evolved, but you would soon get tired of looking from one sheet of code listing to the next. What you really want is a utility program to do the comparison for you so that all you need to do is distinguish changes between the previous version and its successor.

Such a program is called a differential file comparison utility; the UNIX sdiff utility is one example. This useful utility compares sections of text from two files (orig and new) and creates a formatted listing showing text that is common to both files and text that appears in one but not the other. The %SASDIFF macro produces a listing much like that of the sdiff utility in that it shows lines of code that belong to the orig file only, or to the new file only, or that are common to both files. Perusing the output of the %SASDIFF macro, you can quickly follow the course of revisions between the orig file and the new file.

EXAMPLE OF USE

An example suffices to demonstrate the operation of the %SASDIFF macro. The text is taken from [1]. Here is the original text:
This part of the document has stayed the same from version to version. It shouldn't be shown if it doesn't change. Otherwise, that would not be helping to compress the size of the changes.

This paragraph contains text that is outdated. It will be deleted in the near future.

It is important to spell check this dokument. On the other hand, a misspelled word isn't the end of the world. Nothing in the rest of this paragraph needs to be changed. Things can be added after it.
Figure 1 Original Text

Here is the new text:


This is an important notice! It should therefore be located at the beginning of this document!

This part of the document has stayed the same from version to version. It shouldn't be shown if it doesn't change. Otherwise, that would not be helping to compress anything.

It is important to spell check this document. On the other hand, a misspelled word isn't the end of the world. Nothing in the rest of this paragraph needs to be changed. Things can be added after it.

This paragraph contains important new additions to this document.
Figure 2 New Text

The %SASDIFF macro was invoked to process the original and new files with the following code:
%SASDIFF( orig
        , diff
        , FLOW=Y
        , IGNORE_WHITE_SPACE=Y
        , IGNORE_BLANK_LINES=Y
        , IGNORE_CASE=Y
        , IGNORE_MATCHES=N
        , LINESIZE=80
        , WINDOW= 4
        )

It produced the following results:

%SASDIFF Differential File Comparison

    orig.txt                    new.txt
                              > This is an important       1
                              > notice! It should          2
                              > therefore be located at    3
                              > the beginning of this      4
                              > document!                  5
 1  This part of the            This part of the           6
 2  document has stayed the     document has stayed the    7
 3  same from version to        same from version to       8
 4  version. It shouldn't       version. It shouldn't      9
 5  be shown if it doesn't      be shown if it doesn't    10
 6  change. Otherwise, that     change. Otherwise, that   11
 7  would not be helping to     would not be helping to   12
 8  compress the size of the  | compress anything.        13
 9  changes.                  <
10  This paragraph contains   <
11  text that is outdated.    <
12  It will be deleted in     <
13  the near future.          <
14  It is important to spell    It is important to spell  14
15  check this dokument. On   | check this document. On   15
16  the other hand, a           the other hand, a         16
17  misspelled word isn't       misspelled word isn't     17
18  the end of the world.       the end of the world.     18
19  Nothing in the rest of      Nothing in the rest of    19
20  this paragraph needs to     this paragraph needs to   20
21  be changed. Things can      be changed. Things can    21
22  be added after it.          be added after it.        22
                              > This paragraph contains   23
                              > important new additions   24
                              > to this document.         25

Figure 3 %SASDIFF Differential File Comparison

Text that is unique to orig is indicated by <, text that is unique to new is indicated by >, text that is the same in both files is left unmarked (a blank appears between the two columns), and text that is changed from orig to new is marked with |.
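The marker convention can be illustrated with a short Python sketch. It does not reproduce %SASDIFF: the matching is done by difflib.SequenceMatcher from the Python standard library, which uses a different algorithm, and the helper name side_by_side is hypothetical. The sketch only shows how the four markers could be assigned once matching lines are known.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def side_by_side(orig, new):
    """Return (orig_line, marker, new_line) rows in sdiff style.

    Markers: ' ' same in both, '|' changed, '<' only in orig, '>' only in new.
    Matching is done by difflib, not by the %SASDIFF window search.
    """
    rows = []
    matcher = SequenceMatcher(a=orig, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = orig[i1:i2], new[j1:j2]
        if tag == "equal":
            rows.extend((l, " ", r) for l, r in zip(left, right))
        else:
            # 'replace', 'delete', 'insert': pair lines while both sides last,
            # then mark the leftovers as unique to one file.
            for l, r in zip_longest(left, right):
                if l is not None and r is not None:
                    rows.append((l, "|", r))
                elif l is not None:
                    rows.append((l, "<", ""))
                else:
                    rows.append(("", ">", r))
    return rows


# A few lines borrowed from the example text above.
orig = ["compress the size of the", "changes.", "It is important to spell"]
new = ["compress anything.", "It is important to spell", "This paragraph contains"]

for left, marker, right in side_by_side(orig, new):
    print(f"{left:<26}{marker} {right}")
```

Pairing the lines of a changed block position by position is a simplification chosen for brevity; a real utility may pair changed lines differently.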

The macro parameters are described in Table 1. Optional parameters are assigned default values or are taken from the SAS user environment, e.g., Linesize.
Table 1 %SASDIFF Parameters

Parameter            Default Value     Description
Original File Name   None              Name of original text file
New File Name        None              Name of new text file containing changes
FLOW                 N                 Flag to wrap long lines of text
IGNORE_WHITE_SPACE   Y                 Flag to ignore tabs, blanks, control characters
IGNORE_BLANK_LINES   Y                 Flag to ignore blank lines in file
IGNORE_CASE          Y                 Flag to ignore differences in upper or lower case
IGNORE_MATCHES       N                 Flag to ignore matching lines of text
LINESIZE             Session setting   Width of comparison table
WINDOW               10                Number of lines of text above or below line being compared
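To make the filter flags and the WINDOW parameter concrete, here is a minimal Python sketch of what they might do to the lines before and during comparison. It is an illustration under stated assumptions, not the macro's code: the function names normalize, load_keys, and find_in_window are hypothetical, whitespace is assumed to be removed entirely rather than collapsed, and the nearest match inside the window is assumed to win.

```python
def normalize(line, ignore_white_space=True, ignore_case=True):
    """Return the comparison key for one line (assumed filter semantics)."""
    if ignore_white_space:
        # Assumption: drop blanks, tabs, and other control characters entirely.
        line = "".join(ch for ch in line if not ch.isspace() and ch.isprintable())
    if ignore_case:
        line = line.upper()
    return line


def load_keys(lines, ignore_blank_lines=True, **filters):
    """Pair each kept line with its comparison key, skipping blank lines if asked."""
    kept = []
    for text in lines:
        if ignore_blank_lines and not text.strip():
            continue
        kept.append((text, normalize(text, **filters)))
    return kept


def find_in_window(position, key, other_keys, window=10):
    """Index of the nearest equal key within `window` lines of `position`, else None."""
    lo = max(0, position - window)
    hi = min(len(other_keys), position + window + 1)
    candidates = [j for j in range(lo, hi) if other_keys[j] == key]
    return min(candidates, key=lambda j: abs(j - position)) if candidates else None


orig = load_keys(["Data _NULL_ ;", "", "x = 1 ;", "run ;"])
new = load_keys(["%let N = 1 ;", "data  _null_ ;", "x = 2 ;", "run ;"])
new_keys = [key for _, key in new]

for i, (text, key) in enumerate(orig):
    print(i, repr(text), "->", find_in_window(i, key, new_keys, window=4))
```

A small window keeps the number of comparisons per line low, at the cost of missing a matching line that has moved farther than WINDOW lines from its original position.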

Another example showing differences between an original SAS program and a revised version is given in the Appendix.

DESCRIPTION OF ALGORITHM

The %SASDIFF macro builds a table that records the results of searching for orig lines that are found in new and for new lines that are found in orig. Each line of orig is compared to each line of new that lies within a window of comparison (see footnote 1). If a line in orig matches a line in new during the orig-to-new comparison, it is not tested again during the new-to-orig comparison. Lines in orig that are not found in new are marked with <, and, similarly, lines in new that are not found in orig are marked with >. Lines found in both files in the same position that are not identical (after the filters IGNORE_WHITE_SPACE, IGNORE_CASE, and IGNORE_BLANK_LINES have been applied, if requested) are marked with |. Lines that occur in the same position and are identical (after the requested filters have been applied) are considered to match and are left unmarked.

Once the comparison process is completed, sequence information in the table is used to order lines from orig and new so as to group lines found in orig only or in new only into the left or right column, respectively. Lines common to both files are juxtaposed, with matching lines indicated accordingly.

Another way of understanding this process is to realize that %SASDIFF computes the set of lines that are disjoint between orig and new and the set that is common to both files. The macro interleaves the disjoint and intersecting lines according to their order in the two text files and labels the lines in common as matching or nonmatching.

Footnote 1. The window of comparison represents a set of lines in the file being compared, e.g., new, that lie above and below the current line in the reference file, e.g., orig. It is used to limit the search to a reasonable number of lines in the comparison file per line in the reference file so as to reduce the number of operations. The maximum distance between matched lines is the optimal value of the WINDOW parameter, and it is displayed in the SAS log file.

CONCLUSION

The %SASDIFF macro can be a useful utility for program development or for QA purposes. It is an efficient tool for comparing two sets of text and displaying commonalities and differences between them.

REFERENCES

1. Wikipedia contributors, "Diff," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Diff

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Please contact the author at:

Ross Bettinger
Email: rbettinger@modernanalytics.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.


APPENDIX

The following example applies %SASDIFF to an original SAS program and its revision. The original program is shown below:
/* purpose: compute the first 100 prime numbers */
data _null_ ;
array primes( 100 ) p1-p100 ;
candidate   = 3 ;
continue    = 1 ;
n_primes    = 1 ;
primes( 1 ) = 2 ;
do i = 1 to 1000 while( continue ) ;
   do j = 2 to ceil( sqrt( candidate )) ;
      not_prime = mod( candidate, j ) = 0 ;
      if not_prime then leave ;
   end ;
   if not not_prime
   then do ;
      n_primes = n_primes + 1 ;
      primes( n_primes ) = candidate ;
   end ;
   candidate = candidate + 1 ;
   if n_primes < dim( primes )
   then continue = 1 ;
   else continue = 0 ;
end ;
put 'The first 100 primes are '
  / ( p1 -p25  )( 4. )
  / ( p26-p50  )( 4. )
  / ( p51-p75  )( 4. )
  / ( p76-p100 )( 4. ) ;
run ;

Appendix Figure 1 Prime1


The revised program, which contains improvements in the algorithm, is:


/* purpose: compute the first n prime numbers */
%let N_PRIMES = 100 ;
data _null_ ;
array primes( &N_PRIMES ) _temporary_ ;
candidate   = 3 ;
continue    = 1 ;
n_primes    = 1 ;
primes( 1 ) = 2 ;

do while( continue ) ;
   sum_factors = 0 ;
   /* if no previous primes are a factor in candidate,
    * candidate must be prime
    */
   do i = 1 to n_primes ;
      sum_factors + mod( candidate, primes( i )) = 0 ;
   end ;
   if not sum_factors
   then do ;
      n_primes + 1 ;
      primes( n_primes ) = candidate ;
   end ;
   candidate + 1 ;
   continue = n_primes < &N_PRIMES ;
end ;
put 'The first 100 primes are ' ;
do i = 1 to &N_PRIMES ;
   put ( primes( i )) ( 4. ) @ ;
   if not mod( i, 10 ) then put ;
end ;
run ;
Appendix Figure 2 Prime2


The %SASDIFF invocation was


%SASDIFF( Prime1.txt
        , Prime2.txt
        , FLOW=y
        , LINESIZE= 64
        , WINDOW= 4
        )

with the following differential file comparison results:

%SASDIFF Differential File Comparison

[Side-by-side listing: Prime1.txt, lines 1 to 28, in the left column; Prime2.txt, lines 1 to 30, in the right column; the markers <, >, and | appear between the columns.]

Appendix Figure 3 Differential File Comparison
