Chapter 4: Preparing Data for Analysis
LEARNING OBJECTIVES
To be able to label variables with explanatory names
To be able to create new variables
To be able to use SAS IF-THEN-ELSE statements
To be able to use DROP and KEEP to select variables
To be able to use the SET statement
To be able to use PROC SORT
To be able to append and merge data sets
To be able to use PROC FORMAT
Going Deeper: To be able to find first and last values in a
group
The information in this chapter is about the DATA
step.
RECALL typical SAS program flow:
4.1 LABELING VARIABLES WITH EXPLANATORY
NAMES
SAS labels are used to provide descriptive names for
variables.
The LABEL statement uses the format:
Notice that the LABEL statement is placed within the DATA step.
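As a minimal sketch of a LABEL statement inside a DATA step (the data set name MYDATA, the variable WT, the label text, and the data values are hypothetical, chosen only for illustration):

```sas
/* Hypothetical example: names, labels, and values are assumptions */
DATA MYDATA;
INPUT SBP WT;
LABEL SBP='Systolic Blood Pressure'
      WT='Weight (lbs)';
DATALINES;
120 150
140 180
;
/* The LABEL option asks PROC PRINT to show labels as column headings */
PROC PRINT DATA=MYDATA LABEL;
RUN;
```

Without the LABEL option, PROC PRINT shows the variable names SBP and WT as column headings.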
Hands On Exercise p 77
Output without labels:
CREATING VARIABLES IN THE SAS DATA STEP
Arithmetic Operators (Table 4.4) page 79.
Order of Operations
You may remember the following mnemonic from a math
class that can help you remember the order of
operations: Please excuse my dear Aunt Sally.
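A small sketch of how the order of operations plays out in SAS assignment statements (the data set name ORDER and the variables A, B, and C are hypothetical):

```sas
/* Hypothetical example: a DATA step with no INPUT runs once */
DATA ORDER;
A=2+3*4;     /* multiplication before addition: 14 */
B=(2+3)*4;   /* parentheses are evaluated first: 20 */
C=2*3**2;    /* exponentiation before multiplication: 18 */
RUN;
PROC PRINT DATA=ORDER;
RUN;
```

When in doubt, add parentheses so the intended order is explicit.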
EXERCISE - CREATING A NEW VARIABLE
Open the file named DCALC.SAS (p 79) and create a new
variable by calculation:
RUN THE CODE AND OBSERVE RESULTS
NOTE THE PLACEMENT OF CALCULATION
ADD TO THE EXERCISE
Suppose ceilings are 8 feet tall, and you want to calculate
the total living area volume (cubic feet for air
conditioning purposes).
Add the calculation
VOL= W*L*8;
to the code. Also add a label to display VOL as Volume
and add VOL to the SUM statement.
Rerun and observe the results.
DO THE EXERCISE
Your new code should look something like this:
Results (note the new Volume column)
Creating New Variables as Constant Values
Type in this program and run it:
DATA PI;
INPUT RADIUS;
PI=3.1415927;
AREA=PI*RADIUS**2;
CIRCUM=2*PI*RADIUS;
DATALINES;
10
100
1000
;
PROC PRINT;RUN;
NOTE that the value of PI is a constant used in subsequent calculations.
When you look at the output, notice that PI is a variable in the data set.
4.3 USING IF-THEN-ELSE CONDITIONAL STATEMENT
ASSIGNMENTS
Another way to create a new variable in the DATA step is
to use the IF-THEN-ELSE conditional statement construct.
Format is:
IF expression THEN statement; ELSE statement;
Thus
IF SBP GE 140 THEN HIGHBP=1; ELSE HIGHBP=0;
Comparison Operators
Logical Operators
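A brief sketch of comparison and logical operators in use. SAS accepts either the mnemonic form (GE, GT, EQ, and so on) or the symbol form (>=, >, =). The data set name BP, the variable RISK, and the data values are hypothetical:

```sas
/* Hypothetical example: RISK and the data values are assumptions */
DATA BP;
INPUT SBP AGE;
/* GE and >= are equivalent; AND requires both conditions, OR either one */
IF SBP GE 140 AND AGE GT 65 THEN RISK=2;
ELSE IF SBP >= 140 OR AGE > 65 THEN RISK=1;
ELSE RISK=0;
DATALINES;
150 70
150 40
120 30
;
PROC PRINT DATA=BP;
RUN;
```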
Do Hands-On Exercise p 83
Open the file DCONDITION.SAS, and run the program.
Note the IF statement.
RESULTS
ADD A NEW IF STATEMENT
IF AGE GT 65 THEN GROUP="SENIOR";
ELSE IF AGE GE 18 AND AGE LE 65
THEN GROUP="ADULT";
ELSE GROUP="YOUTH";
This code creates a new variable named GROUP according to the IF, ELSE IF, and ELSE statements. Note the use of an ELSE IF clause.
DO THE EXERCISE
RESULTS
Notice the new column.
Using IF to Assign Missing Values
Be Careful: Data sets often contain missing data codes to
record when data are missing. For example, for the variable AGE
you might assign an impossible value, say -9, as a missing value
code. Then a statement such as
IF AGE EQ -9 THEN AGE = .;
in your DATA step assigns the SAS missing value code . (dot) to
AGE when the value is -9. You MUST do this for SAS to know how
to handle missing values in statistical procedures. For character
variables, a missing value is a blank.
In this example, what could go wrong?
IF AGE GT 12 AND AGE LT 20 THEN TEEN=1;
ELSE TEEN = 0;
001 12
002 20
003 19
004 .
A better way
IF AGE GT 12 AND AGE LT 20 THEN TEEN=1;ELSE TEEN = 0;
IF AGE = . THEN TEEN = .;
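To see why the extra statement matters: a missing AGE fails the test AGE GT 12 AND AGE LT 20, so without the correction subject 004 would be coded TEEN=0 rather than missing. One possible variation, sketched here with a hypothetical data set name TEENS, checks for the missing value first:

```sas
/* Sketch only: TEENS is a hypothetical data set name */
DATA TEENS;
INPUT ID $ AGE;
IF AGE = . THEN TEEN = .;                        /* missing stays missing */
ELSE IF AGE GT 12 AND AGE LT 20 THEN TEEN = 1;
ELSE TEEN = 0;
DATALINES;
001 12
002 20
003 19
004 .
;
PROC PRINT DATA=TEENS;
RUN;
```

Only subject 003 is coded as a teen; subject 004 keeps a missing value for TEEN.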
Do Hands-On p 84
Uses another method to create TEEN
Using IF and IF-THEN To Subset Data Sets
Data sets can be quite large. You may have a data set that
contains some group of subjects (records) that you want
to eliminate from your analysis. In that case, you can
subset the data so it will contain only those records you
need.
One method of eliminating certain records from a data
set is to use a subsetting IF statement in the DATA step.
The syntax for this statement is as follows:
IF expression;
Subsetting IF
For example, to select records containing the (character)
value F (only females) from a data set, you could use this
statement within a DATA step:
IF GENDER EQ 'F';
Subsetting with IF DELETE
The opposite effect can be created by including
THEN DELETE at the end of the statement:
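A minimal sketch of subsetting with IF ... THEN DELETE (the data set name NOTFEMALE and the data values are hypothetical):

```sas
/* Hypothetical example: drops the records where GENDER is F */
DATA NOTFEMALE;
INPUT ID $ GENDER $;
IF GENDER EQ 'F' THEN DELETE;
DATALINES;
001 F
002 M
003 F
;
PROC PRINT DATA=NOTFEMALE;
RUN;
```

Here only the record for ID 002 remains in the resulting data set.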
Do Hands-on Example p 86
Hands on Example p 86 (DSUBSET1.SAS)
Answer
Using the statement
Using IF-THEN and DO for Program Control
Another use of the IF statement is to control the flow of
your SAS program in conjunction with a DO statement. In
this case, you can cause a group of SAS commands to be
conditionally executed by using the following type of
code:
Example of Using IF THEN - DO
Suppose you want to calculate BMI (Body Mass Index) for
subjects in a data set, but the formula is only relevant for
subjects older than 19 years. Plus, at the same time you
want to assign other values for this same group of
subjects. You could use the code:
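One way this could look, as a sketch only. The variable names WTKG (weight in kilograms), HTM (height in meters), and ISADULT, along with the data values, are hypothetical:

```sas
/* Hypothetical example: variable names and values are assumptions */
DATA BMIDATA;
INPUT ID $ AGE WTKG HTM;
IF AGE GT 19 THEN DO;
   BMI = WTKG/(HTM**2);   /* BMI = weight (kg) / height (m) squared */
   ISADULT = 1;           /* a second assignment for the same group */
END;
DATALINES;
001 25 70 1.75
002 15 50 1.60
;
PROC PRINT DATA=BMIDATA;
RUN;
```

Every statement between DO and END runs only when the IF condition is true, so BMI and ISADULT remain missing for subject 002.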
4.4 USING DROP AND KEEP TO SELECT VARIABLES
The DROP and KEEP statements in the DATA step allow you to
specify which variables to retain in a data set:
DROP variables;
KEEP variables;
For example,
DATA MYDATA;
INPUT A B C D E F G;
DROP E F;
DATALINES;
... etc ...
Do Hands on Example p 88
Open the program file DKEEP.SAS.
DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.CSV' DLM=','
FIRSTOBS=2 OBS=26;
INPUT GROUP $ AGE TIME1 TIME2 TIME3
TIME4 SOCIO;
KEEP AGE TIME1 SOCIO;
RUN;
PROC PRINT;
RUN;
Notice the KEEP statement.
Run the code, and observe the results
Notice, only the variables in the KEEP statement are in
the resulting SAS data file.
Exercise:
Change the KEEP statement to
Results of the DROP statement:
Extra: Using DROP, KEEP, and RENAME in the DATA
statement
Make this change in the program. Take out the DROP
statement and modify the DATA statement:
Notice how you can RENAME, DROP, or KEEP
within the DATA statement. Note that this
version of the DROP statement uses DROP= (with an equals sign).
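A sketch of what data set options in the DATA statement can look like. The choice of which variables to drop, the new name BASELINE, and the data values are assumptions for illustration, not the contents of the exercise file:

```sas
/* Hypothetical example: options appear in parentheses after the data set name */
DATA MYDATA (DROP=TIME2 TIME3 TIME4 RENAME=(TIME1=BASELINE));
INPUT GROUP $ AGE TIME1 TIME2 TIME3 TIME4 SOCIO;
DATALINES;
A 12 22.3 25.3 28.2 30.6 5
B 11 22.8 27.5 33.3 35.8 5
;
PROC PRINT DATA=MYDATA;
RUN;
```

The resulting data set contains GROUP, AGE, BASELINE, and SOCIO.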
4.5 USING THE SET STATEMENT TO READ AN
EXISTING DATA SET
Suppose you have a large data set that you want to
modify before using it. Don't modify your ORIGINAL data set; modify
a copy.
Another way to ENTER data
Suppose you already have a data set named OLD. You can
make a copy using
Now the NEW data set is identical to OLD. You can now
modify NEW without changing the original data set.
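A minimal sketch of copying a data set with SET. The contents of OLD are hypothetical and are created here only so the example runs on its own:

```sas
/* Hypothetical contents for OLD */
DATA OLD;
INPUT ID $ AGE;
DATALINES;
001 34
002 51
;
/* NEW starts as a copy of OLD; OLD itself is not changed */
DATA NEW;
SET OLD;
RUN;
```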
Creating a Data Set from an Existing Data Set
Using SET - Example 1
Suppose you have a data set named ALL. You want to
create two subsets, MALES and FEMALES.
DATA MALES; SET ALL; * Creates a data set with only males;
IF GENDER ='M';
RUN;
DATA FEMALES; SET ALL; * Creates a data set with only females;
IF GENDER ='F';
RUN;
Using SET - Example 2
You receive a data set from the government, and you
need to modify it before using it:
This is the original data set
This is the new (copied) data set
Do Hands-On Exercise p 90
Using a subsetting IF statement
DSUBSET3.SAS
4.6 USING PROC SORT
The SORT procedure can be used to
rearrange the observations in a SAS data set or create a
new SAS data set containing the rearranged observations.
The sorting sequence is shown in the table:
Sorting sequence information for SAS data sets
Character variables: blank !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrstuvwxyz{|}~
Numeric variables: Missing values first, then numeric values
Default: Ascending (or indicate Descending)
Syntax for PROC SORT
The syntax for PROC SORT is:
Example of PROC SORT
PROC SORT DATA=MYDATA OUT=MYSORT;
BY RECTIME;
The OUT= option specifies that a new resulting data set be created named MYSORT. PROC SORT requires a BY variable.
This example sorts the MYDATA data set by RECTIME and puts
the resulting data set into a new data set named MYSORT. The
original data set is NOT CHANGED.
Do Hands-On Examples p 92 & 93. (DSORT1.SAS, DSORT2.SAS)
RESULTS:
Do Hands-On Example p 92
Open the file DSORT1.SAS
DATA MYDATA;
INPUT GROUP RECTIME;
DATALINES;
1 4.2
2 3.6
2 3.1
1 2.1
1 2.8
2 1.5
1 1.8
;
PROC SORT DATA=MYDATA OUT=S1; BY RECTIME;
Title 'Sorting Example - Ascending';
PROC PRINT DATA=S1;
RUN;
Note that PROC SORT creates a new data set named S1, sorted BY RECTIME. In the results, note the sorted column.
Exercise for PROC SORT
Change the PROC SORT statement to sort DESCENDING,
and save the results to a SAS data file named WORK.S2.
(Change the PROC PRINT so it will print the new data set.)
PAUSE:
Run this new code, then come back to the tutorial.
Results from PROC SORT
Note that now RECTIME
is descending.
Example p 93
Open DSORT2.SAS. Note the KEEP= and RENAME= statements.
This is similar to what we recently learned for DROP, KEEP, and
RENAME in the DATA Statement. Run this code.
Compare Before and After SORT and KEEP
Data Set Before SORT and KEEP
4.7 APPENDING AND MERGING DATA SETS
Appending adds new records to an existing data set. (It is
sometimes called a vertical merge.)
Diagram: APPEND adds records to the data set (OLD1 followed by OLD2).
APPENDING DATA SETS
Appending is accomplished by including multiple data set
names in the SET statement. For example,
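A minimal sketch of appending with SET. The variable names and data values in OLD1 and OLD2, and the output name NEW, are hypothetical:

```sas
/* Hypothetical contents for OLD1 and OLD2 */
DATA OLD1;
INPUT SUBJ $ AGE;
DATALINES;
001 34
002 51
;
DATA OLD2;
INPUT SUBJ $ AGE;
DATALINES;
003 29
;
/* The records of OLD2 are placed after the records of OLD1 */
DATA NEW;
SET OLD1 OLD2;
RUN;
PROC PRINT DATA=NEW;
RUN;
```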
HANDS-ON EXAMPLE P 95
Open the file named DAPPEND1.SAS
The goal is to append data set
OLD1 with OLD2 using the
DATA step.
RESULTS OF APPENDING
Two Steps to a Merge: Sort, then Merge
The technique for merging the data sets using some key
identifier (such as patient ID) is as follows:
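The two steps can be sketched as follows. The data values are hypothetical; the key variable CASE and the other names follow the hands-on example:

```sas
/* Hypothetical values; PRE is deliberately out of order */
DATA PRE;
INPUT CASE PRETREAT;
DATALINES;
2 1.8
1 2.3
;
DATA POST;
INPUT CASE POSTREAT;
DATALINES;
1 2.9
2 2.0
;
/* Step 1: sort each data set by the key variable */
PROC SORT DATA=PRE; BY CASE; RUN;
PROC SORT DATA=POST; BY CASE; RUN;
/* Step 2: merge, using the same BY variable */
DATA PREPOST;
MERGE PRE POST; BY CASE;
RUN;
PROC PRINT DATA=PREPOST;
RUN;
```

If either data set is not sorted by the key, the MERGE with a BY statement stops with an error in the log.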
HANDS-ON EXAMPLE P 97
Open the file DMERGE1.SAS. In this example, complete the
2nd PROC SORT by adding BY CASE. Then do the MERGE.
DATA PREPOST;
MERGE PRE POST;BY CASE;
DIFF=POSTREAT-PRETREAT;
PROC PRINT DATA=PREPOST;RUN;
RESULTS
Rename While Merging
As with the SORT statement, you can RENAME, DROP, and
KEEP variables during the MERGE. You can also merge
many files at a time. The following shows the syntax for
merging four data sets and performing a RENAME, DROP,
and KEEP on the third data set:
DATA newdataset;
MERGE data1 data2
data3 (RENAME=(oldname=newname)
DROP=variables or KEEP=variables)
data4;
BY keyvar;RUN;
Note that the RENAME and DROP are occurring in the data3 data set in this example.
Hands On Example page 97 (DMERGE1.SAS)
DATA PREPOST;
MERGE PRE POST; BY CASE;
DIFF=POSTREAT - PRETREAT;
TITLE 'Merge Example';
PROC PRINT DATA=PREPOST;
RUN;
MERGE the data sets PRE and POST, and calculate DIFF.
RENAME the variable PRETREAT to BASELINE during the merge (part 4.)
Hands On Example continued
To RENAME a variable during the MERGE, use this code:
DATA PREPOST;
MERGE PRE (RENAME=(PRETREAT=BASELINE)) POST;
BY CASE;
DIFF=POSTREAT - BASELINE;
TITLE 'Merge Example';
PROC PRINT DATA=PREPOST;
RUN;
Note the RENAME of the variable PRETREAT. Also make sure you change the variable name in the calculation of DIFF.
Few-To-Many-Merge
A Few-To-Many merge is used when you have records in
one data set that you want to merge into some table that
contains (typically) a smaller number of categories.
Suppose you own an auto parts store. You sell products to
several kinds of buyers, and each kind gets a particular
discount.
You want to produce a report that shows the actual
sales price for a number of purchases.
Data for Few-To-Many Merge (Hands-On p 99)
This is the
MANY data set.
How to set up the few-to-many (match) merge
Define the Discounts (FEW) data set:
Repair Shops: 33% Discount
Consumers: 0% Discount
Other Auto Stores: 40% Discount
Define the TYPE data set (The FEW)
DATA TYPE;
FORMAT BUYERTYPE $8.;
INPUT BUYERTYPE DISCOUNT;
DATALINES;
REPAIR .33
CONSUMER 0
STORE .40
;
Note here that because you use a FORMAT statement to specify the format of BUYERTYPE, you don't have to indicate type in the INPUT statement. Otherwise, that statement would have to be
INPUT BUYERTYPE $ DISCOUNT;
Define the MANY data set (note the FORMAT statement)
DATA SALES;
DATA REPORT;
MERGE SALES TYPE; BY BUYERTYPE;
FINAL =ROUND(PRICE*(1-DISCOUNT),.01);
RUN;
PROC PRINT DATA=REPORT;RUN; * GET REPORT;
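A self-contained sketch of the whole few-to-many merge, including the sort steps that a match merge with BY depends on. The SALES records (ID, BUYERTYPE, PRICE values) are hypothetical stand-ins; the TYPE data set repeats the discount table defined above:

```sas
/* The FEW: one record per buyer type */
DATA TYPE;
FORMAT BUYERTYPE $8.;
INPUT BUYERTYPE DISCOUNT;
DATALINES;
REPAIR .33
CONSUMER 0
STORE .40
;
/* The MANY: hypothetical purchase records */
DATA SALES;
FORMAT BUYERTYPE $8.;
INPUT ID $ BUYERTYPE PRICE;
DATALINES;
001 REPAIR 100.00
002 CONSUMER 59.99
003 REPAIR 25.50
004 STORE 250.00
;
/* Both data sets must be sorted by the key before the merge */
PROC SORT DATA=TYPE; BY BUYERTYPE; RUN;
PROC SORT DATA=SALES; BY BUYERTYPE; RUN;
DATA REPORT;
MERGE SALES TYPE; BY BUYERTYPE;
FINAL=ROUND(PRICE*(1-DISCOUNT),.01);
RUN;
PROC PRINT DATA=REPORT;
RUN;
```

Each SALES record picks up the DISCOUNT from the single TYPE record with the same BUYERTYPE.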
Few-to-Many Merge Results
4.8 USING PROC FORMAT
The PROC FORMAT procedure allows you to create your
own custom formats.
These custom formats allow you to specify the
information that will be displayed for selected values of a
variable.
For example, suppose you've coded DISEASED and NOT
DISEASED as 0 and 1. You can create a format where 0
means DISEASED and 1 means NOT DISEASED so that, when
output is displayed, the words appear instead of the number
codes.
Using PROC FORMAT
The steps for using formatted values are
1. Create a FORMAT definition using PROC FORMAT.
2. Apply the FORMAT to one or more variables. You can
apply a format (once it is defined in PROC FORMAT) in a
DATA step or in a data analysis PROC statement.
For example: Choose any name for the format (similar
restrictions as for SAS variables). We name them as
FMTsomething to make the name obvious.
PROC FORMAT;
VALUE FMTMARRIED 0 = "No" 1 = "Yes";
RUN;
Numeric and Character Formats
PROC FORMAT;
VALUE fmtname1
number1=name1
number2=name2
etc;
VALUE $fmtname2
textname1=name1
textname2=name2
etc;
RUN;
The first VALUE statement defines a format for a numeric variable. For a character variable, the format name must start with a $, and the textnames must be in quotes.
Example Numeric and Character Definitions
PROC FORMAT;
VALUE FMTMARRIED 0="No"
1="Yes";
VALUE $FMTGENDER "M"="Male"
"F"="Female";
RUN;
The first VALUE statement defines a numeric format. The second defines a character format: take note of the format name $FMTGENDER, and that the values M and F are in quotes.
Ways to specify formats
Formats may also use ranges. For example, suppose that you
want to classify your AGE data using the designations Child,
Teen, Adult, and Senior. You could do this with the following
format:
Note different ways to indicate ranges.
PROC FORMAT;
Value FMTAGE LOW- 12 = 'Child'
13,14,15,16,17,18,19 = 'Teen'
20 - 59 = 'Adult'
60 - HIGH = 'Senior';
RUN;
HANDS ON EXERCISE P 102
Open the file DFORMAT1.SAS. Create the FORMAT.
Assigning Formats to Many Variables
You can also assign the same format to several variables.
If you have questionnaire data with variable names Q1,
Q5, Q7 where each question is coded as 0 and 1 for
answers Yes and No, respectively, and you have a format
called FMTYN, you could use that FORMAT in a procedure
as in the following example:
PROC PRINT;
FORMAT Q1 Q5 Q7 FMTYN. ;
RUN;
Assigns the same format (FMTYN) to three variables. Note the dot at the end of the assigned format (REQUIRED).
Format Assignments (Data Set vs PROC)
Assign formats to variables within PROC statements
Example: Assigning a FORMAT in a PROC makes the format
assignment only within that PROC (temporary)
PROC PRINT;
FORMAT GENDER $FMTGENDER. ;
RUN;
Or in DATA statements
DATA MYDATA;SET OLDDATA;
FORMAT GENDER $FMTGENDER. ;
RUN;
Assigning a FORMAT in a DATA statement makes the format permanent in that data set.
Creating Permanent Formats
In all the previous examples, formats were applied in a
PROC step and are considered temporary formats.
When you assign a format in a DATA step, you can also
store those formats in a (permanent) format catalog.
For example, to store an SAS format in a specified
permanent library location, you could use code such as
Creates a FORMAT LIBRARY
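A sketch of storing formats permanently. The library name MYSASLIB follows the later slides, but the folder path in the LIBNAME statement is an assumption; substitute the location of your own SAS library:

```sas
/* Assumed path: point MYSASLIB at your own folder */
LIBNAME MYSASLIB 'C:\SASDATA';
/* LIBRARY= stores the formats in the catalog MYSASLIB.FORMATS */
PROC FORMAT LIBRARY=MYSASLIB;
VALUE FMTMARRIED 0="No" 1="Yes";
VALUE $FMTGENDER "M"="Male" "F"="Female";
RUN;
```

Without the LIBRARY= option, PROC FORMAT stores formats in the temporary WORK library, and they are gone when the SAS session ends.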
View Formats Folder
That is, when you create an SAS format catalog, a folder
icon appears in the designated SAS Library.
In this case, it is named FORMATS and appears in the
MYSASLIB library. You can verify its existence by
examining the MYSASLIB library using SAS Explorer.
If you double click on the FORMATS folder, you will see
sub folders named with the names of the formats you
have created.
Contents of a Format Folder
Click on the FMTMARRIED Formats folder to see its
contents, the definition of the format:
Tell SAS About Your Formats
Once you have created permanent formats, you can use
them in both PROC and DATA step statements. To tell SAS
the location of a particular format, use the statement
OPTIONS FMTSEARCH=(proclib);
Using Stored SAS Formats
For example, if you have previously created and stored
the FMTMARRIED and $FMTGENDER formats in your
MYSASLIB.FORMATS folder, you could use the following
code to access those formats with PROC PRINT (or any
PROC). The OPTIONS statement tells SAS where the formats are located.
OPTIONS FMTSEARCH=(MYSASLIB.FORMATS);
PROC PRINT DATA="C:\SASDATA\SURVEY";
VAR SUBJECT MARRIED GENDER;
FORMAT MARRIED FMTMARRIED.
GENDER $FMTGENDER.;
RUN;
Discovering SAS Formats
To discover what formats are in a particular format
library, you can use the CATALOG procedure as
shown here. This code displays all of the formats stored in
the MYSASLIB.FORMATS library.
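A minimal sketch of listing a format catalog, assuming the MYSASLIB library has already been assigned with a LIBNAME statement and contains a FORMATS catalog:

```sas
/* Assumes MYSASLIB is already assigned and holds a FORMATS catalog */
PROC CATALOG CATALOG=MYSASLIB.FORMATS;
CONTENTS;   /* lists the entries (formats) stored in the catalog */
RUN;
QUIT;
```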
Hands On Example p 106 (DFORMAT3.SAS)
When Your SAS Format Library is Missing
Suppose you have a .SAS7BDAT file that uses created formats, but
you do not have the format library. If you attempt to use that data
set, you will get the following error message in the log.
If this occurs, you must use the following OPTIONS statement (above
the code where you refer to the data set) to tell SAS to access the
data set, or run the procedure without using the defined formats:
OPTIONS NOFMTERR;
In this case, the output displays the raw values of the variables
instead of the assigned format labels.
Know the Difference: Format vs Label
A common mistake is to try to use Labels as Formats
or Formats as Labels. Make sure you know the
difference:
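A short sketch that puts the two side by side: a LABEL describes the variable itself, while a FORMAT controls how its values are displayed. The data set name SURVEY matches the earlier slide, but the label text and the data values are hypothetical:

```sas
PROC FORMAT;
VALUE FMTMARRIED 0="No" 1="Yes";
RUN;
/* Hypothetical data values */
DATA SURVEY;
INPUT SUBJECT $ MARRIED;
LABEL MARRIED='Currently Married';   /* describes the variable (column heading) */
FORMAT MARRIED FMTMARRIED.;          /* describes the values (0 and 1 shown as No and Yes) */
DATALINES;
001 0
002 1
;
PROC PRINT DATA=SURVEY LABEL;
RUN;
```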
4.9 GOING DEEPER: FINDING FIRST AND LAST
VALUES
Suppose that your data set contains groups and you want to identify
the first and last person (ID) in each of those groups.
In an SAS DATA step, you identify the first and last values
by FIRST.GP and LAST.GP, where GP is the name of the
sorted grouping (or key) variable.
Do Hands on Example p 107. (DFINDFIRST.SAS)
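A sketch of FIRST. and LAST. in use. The data set names, the flag variables ISFIRST and ISLAST, and the data values are hypothetical; GP is the grouping variable as in the text:

```sas
/* Hypothetical data: five subjects in two groups */
DATA GROUPS;
INPUT ID $ GP $;
DATALINES;
001 A
002 A
003 B
004 B
005 B
;
/* The data must be sorted by the grouping variable */
PROC SORT DATA=GROUPS; BY GP; RUN;
DATA FIRSTLAST;
SET GROUPS; BY GP;
/* FIRST.GP and LAST.GP are temporary, so copy them to keep them */
ISFIRST = FIRST.GP;
ISLAST = LAST.GP;
RUN;
PROC PRINT DATA=FIRSTLAST;
RUN;
```

ISFIRST is 1 for IDs 001 and 003, and ISLAST is 1 for IDs 002 and 005.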
4.10 SUMMARY
This chapter discussed several techniques for preparing
your data for analysis. In the next chapter, we begin the
discussion of SAS procedures that perform analyses on
the data.
Continue to Chapter 5: Preparing to Use SAS Procedures
These slides are based on the book:
These slides are provided for you to use to teach SAS using this book. Feel free to
modify them for your own needs. Please send comments about errors in the slides
(or suggestions for improvements) to acelliott@smu.edu. Thanks.