
Understanding SAS Data Step Processing
Alan C. Elliott
stattutorials.com

Reading Raw Data


Using the following SAS program:
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;

Overview of SAS Data Step


Compile Phase
(Look at Syntax)

Execution Phase
(Read data, Calculate)

Output Phase
(Create Data Set)


Compile Phase
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;

SAS checks the syntax of the program.
SAS identifies the type and length of each variable.
Does any variable need conversion?
If everything is okay, SAS proceeds to the next step.

If errors are discovered, SAS attempts to interpret what you mean. If SAS can't correct the error, it prints an error message to the log.
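Because type and length are fixed during compilation, the first statement that mentions a variable decides its attributes. The sketch below is an illustration that is not part of the original slides: it assumes a LENGTH statement placed before INPUT, and the data set name NEW4 is made up for this example. With it, ID should be stored as a 4-byte character variable instead of the default 8 bytes used by list input.

```sas
DATA NEW4;                      /* NEW4 is a hypothetical name for this illustration */
LENGTH ID $ 4;                  /* compiled first, so ID becomes character with length 4 */
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc contents data=NEW4; run;   /* the variable list should report ID as Char, Len 4 */
```

Moving the LENGTH statement after INPUT would come too late to change the length, since INPUT would already have set ID's attributes.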

Create Input Buffer


SAS creates an input buffer.
The INPUT BUFFER contains data as it is read in.
DATALINES;
0001 24 37.3
0002 35 38.2
;

INPUT BUFFER (first record)
Column   1  2  3  4  5  6  7  8  9  10 11 12
Value    0  0  0  1     2  4     3  7  .  3
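One way to look at the input buffer directly is the automatic variable _INFILE_, which refers to the current contents of the buffer. The following is a minimal sketch, not taken from the original slides: it uses DATA _NULL_ so that no data set is created, and an INPUT statement with no variables so that each record is loaded into the buffer without being parsed.

```sas
DATA _NULL_;        /* _NULL_: run the step without creating a data set */
INPUT;              /* null INPUT: load the next record into the input buffer */
PUT _INFILE_;       /* write the raw buffer contents to the log */
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
```

Each data line should appear in the log exactly as it was read, before any values are moved into variables.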

Execution Phase
The PROGRAM DATA VECTOR (PDV), built during compilation, is initialized and contains information about the variables:

_N_  _ERROR_  ID  AGE  TEMPC  TEMPF
1    0            .    .      .

The PDV holds two automatic variables, _N_ and _ERROR_, and a position for each of the four variables in the DATA step.
SAS sets _N_ = 1 and _ERROR_ = 0 (no initial error) and sets the remaining variables to missing.

Buffer to PDV

INPUT BUFFER
Column   1  2  3  4  5  6  7  8  9  10 11 12
Value    0  0  0  1     2  4     3  7  .  3

SAS reads the 1st record into the PDV:

_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0        0001  24   37.3   .       (TEMPF initially missing)

If there is an executable statement, SAS processes the code TEMPF=TEMPC*(9/5)+32;

_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0        0001  24   37.3   99.14   (calculated value)

Output Phase
The values in the PDV are written to the output data set (NEW) as the first observation:

PDV
_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0        0001  24   37.3   99.14

Write data to data set (from PDV):

ID    AGE  TEMPC  TEMPF
0001  24   37.3   99.14

This is the first record in the output data set named NEW.
Note that _N_ and _ERROR_ are dropped.

Exceptions to Missing in PDV

Initial values are usually set to missing in the PDV:

_N_  _ERROR_  ID  AGE  TEMPC  TEMPF
1    0            .    .      .

Some data values are not initially set to missing in the PDV:
variables in a RETAIN statement
variables created in a SUM statement
data elements in a _TEMPORARY_ array
variables created with options in the FILE or INFILE statements

These exceptions are covered later.
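As a preview of the first two exceptions, here is a minimal sketch that is not part of the original slides. The data set name RUNNING and the variables MAXTEMP and TOTAGE are hypothetical names chosen for this illustration. MAXTEMP is listed in a RETAIN statement, so its value carries over from one iteration to the next instead of being reset to missing; TOTAGE is created by a sum statement, so it starts at 0 and is also retained.

```sas
DATA RUNNING;                    /* RUNNING is a hypothetical data set name */
INPUT ID $ AGE TEMPC;
RETAIN MAXTEMP;                  /* not reset to missing at the top of each iteration */
MAXTEMP = MAX(MAXTEMP, TEMPC);   /* MAX ignores missing, so the first value is TEMPC */
TOTAGE + AGE;                    /* sum statement: TOTAGE starts at 0 and is retained */
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print data=RUNNING; run;
```

For the two records used here, MAXTEMP should be 37.3 and then 38.2, and TOTAGE should be 24 and then 59.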



Next data record read


Once SAS has finished reading the first data record, it continues the same process and reads the second record, sending the results to the output data set (named NEW in this case):

ID    AGE  TEMPC  TEMPF
0001  24   37.3   99.14
0002  35   38.2   100.76

... and so on for all records.
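The PDV walk-through above can be observed in the SAS log. The sketch below is an addition to the slides, not the author's own program: it assumes PUT statements with the _ALL_ keyword are inserted into the original DATA step, which writes every variable in the PDV, including _N_ and _ERROR_, to the log at that point in the iteration. The quoted labels are arbitrary.

```sas
DATA NEW;
PUT 'Top of iteration: ' _ALL_;   /* PDV before INPUT: variables are missing */
INPUT ID $ AGE TEMPC;
PUT 'After INPUT:      ' _ALL_;   /* ID, AGE, TEMPC filled from the input buffer */
TEMPF=TEMPC*(9/5)+32;
PUT 'After assignment: ' _ALL_;   /* TEMPF now holds the calculated value */
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
```

The first PUT should appear one more time than the other two, because SAS begins another iteration before the INPUT statement finds that there is no more data and ends the step.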



Descriptor Information
SAS creates and maintains descriptor information about each SAS data set:
data set attributes
variable attributes
the name of the data set
the member type, the date and time that the data set was created, and the number, names, and data types (character or numeric) of the variables.

Data Set Description


proc datasets;
contents data=new;
run;

Alternate program
proc contents data=new;
run;

Contents output (abbreviated)

#  Name  Member Type  File Size  Last Modified
   NEW   DATA         5120       20Nov13:08:59:32

Description output continued

Data Set Name        WORK.NEW                        Observations          2
Member Type          DATA                            Variables             4
Engine               V9                              Indexes               0
Created              Wed, Nov 20, 2013 08:59:32 AM   Observation Length    32
Last Modified        Wed, Nov 20, 2013 08:59:32 AM   Deleted Observations  0
Protection                                           Compressed            NO
Data Set Type                                        Sorted                NO
Label
Data Representation  WINDOWS_64
Encoding             wlatin1 Western (Windows)

Description output continued
Alphabetic List of Variables and Attributes

#  Variable  Type  Len
2  AGE       Num   8
1  ID        Char  8
3  TEMPC     Num   8
4  TEMPF     Num   8
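The list above is alphabetical, and the # column gives each variable's position in the data set (the order in which the variables entered the PDV). As a small sketch beyond what the slides show, the VARNUM option of PROC CONTENTS lists the variables in that creation order instead:

```sas
proc contents data=new varnum;   /* VARNUM: list variables by position, not by name */
run;
```

For the data set NEW, this should list ID, AGE, TEMPC, and TEMPF in that order.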


Original Program
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;

Program output

Obs  ID    AGE  TEMPC  TEMPF
1    0001  24   37.3   99.14
2    0002  35   38.2   100.76

Example of Error
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;

Missing semicolon: the TEMPF=TEMPC*(9/5)+32 statement has no terminating semicolon.

proc datasets ;
contents data=new;
run;

Error found during compilation

76   DATA NEW;
77   INPUT ID $ AGE TEMPC;
78   TEMPF=TEMPC*(9/5)+32
79   DATALINES;
     --------
     22
80   0001 24 37.3
     ---
     180
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *,
              **, +, -, /, <, <=, <>, =, >, ><, >=, AND, EQ, GE,
              GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR, ^=, |, ||,
              ~=.
ERROR 180-322: Statement is not valid or it is used out of proper order.
81   0002 35 38.2
82   ;
83   run;

ERROR: No DATALINES or INFILE statement.

Summary - Compilation Phase
During compilation, SAS:
checks syntax
identifies the type and length of each new variable (is a data type conversion needed?)
creates the input buffer if there is an INPUT statement for an external file
creates the Program Data Vector (PDV)
creates descriptor information for data sets and variable attributes
Other options not discussed here: DROP; KEEP; RENAME; RETAIN; WHERE; LABEL; LENGTH; FORMAT; ARRAY; BY; ATTRIB; END=, IN=, FIRST., LAST., POINT=

Summary - Execution Phase

1. The DATA step iterates once for each observation being created.
2. Each time the DATA statement executes, _N_ is incremented by 1.
3. Newly created variables are set to missing in the PDV.
4. SAS reads a data record from a raw data file into the input buffer (there are other possibilities not discussed here).
5. SAS executes any other programming statements for the current record.
6. At the end of the DATA step statements (RUN;), SAS writes an observation to the SAS data set (OUTPUT PHASE).
7. SAS returns to the top of the DATA step (Step 3 above).
8. The DATA step terminates when there is no more data.
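Steps 2 and 6 can be made visible with a small variation on the original program. This is a sketch under stated assumptions, not the author's code: the data set name NEW2 and the variable ITERATION are hypothetical. Copying _N_ into ITERATION keeps the iteration count in the data set (since _N_ itself is dropped), and the explicit OUTPUT statement takes the place of the automatic output that otherwise happens at the end of the step.

```sas
DATA NEW2;               /* NEW2 is a hypothetical data set name */
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
ITERATION=_N_;           /* copy the automatic variable so it is kept in NEW2 */
OUTPUT;                  /* explicit output; replaces the implicit output at the end of the step */
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print data=NEW2; run;
```

ITERATION should be 1 for the first observation and 2 for the second, matching the number of times the DATA step has iterated.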

End
