Paper 9-26
ABSTRACT
The SAS® System has many powerful tools to store, analyze and present data. First, however, programmers need to get the data into SAS datasets. This presentation will delve into the intricacies of reading data from sequential (text) files using the DATA step and the INFILE and INPUT statements. Discussion will focus on the different options available when reading different types of text files. For example, when should you use the MISSOVER option, and when is the TRUNCOVER option more appropriate? This paper assumes the audience has basic knowledge of reading text files using the DATA step (Base SAS®) and is appropriate for users on any operating system, although some options may be restricted.

INTRODUCTION
Reading and understanding the SAS documentation can sometimes be a challenge. This is evident in the INFILE statement. There are no fewer than 34 different options available for this particular statement. This can get very sticky when the data file you need to read differs from the safe, easy columnar data files. So how can we make sense of the plethora of options? This paper will attempt to clarify some of the confusion. Three situations are explored. First, variable-length records, with both short values and missing data points. Next, reading in multiple files at once. Finally, obtaining data from both remote operating systems and Web sites using the FILENAME statement.

SO LITTLE TIME, SO MANY OPTIONS
When the data lines aren't complete, what option will read the data correctly and completely? INFILE has a number of options available:

FLOWOVER    The default. Causes the INPUT statement to jump to the next record if it doesn't find values for all variables.
MISSOVER    Sets all empty variables to missing when reading a short line. However, it can also skip values.
STOPOVER    Stops the DATA step when it reads a short line.
TRUNCOVER   Forces the INPUT statement to stop reading when it gets to the end of a short line. This option will not skip information.
SCANOVER    Causes the INPUT statement to search the data lines for a character string specified in the INPUT statement.
PAD         Pads short lines with blanks to the length of the LRECL= option.

Note: SCANOVER and STOPOVER will not be discussed.

The following text file was created with MS-Notepad on Windows-NT, then read into a SAS dataset using INFILE and INPUT statements. Each line should contain four data points: last and first names, employee ID and job title. The grayed-out area denotes actual line lengths. (Note: Most word processors on Windows and UNIX create variable-length lines, whereas mainframe computers create files with lines of uniform length, padded with blanks.)

LANGKAMM SARAH E0045 Mechanic
TORRES JAN E0029 Pilot
SMITH MICHAEL E0065
LEISTNER COLIN E0116 Mechanic
TOMAS HARALD
WADE KIRSTEN E0126 Pilot
WAUGH TIM E0204 Pilot
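To experiment with the options, a variable-length test file resembling the one listed above can be written from a DATA step. The following is only a sketch under stated assumptions: the fileref emp (a temporary file) is hypothetical, the column positions are borrowed from the Column Input statement used in this paper (names starting in columns 1 and 22, ID in 32, job title in 37), and the trailing blank after Mechanic is a guess at the original line lengths, which the text listing does not preserve.

```sas
/* Sketch only: fileref EMP, column positions and line lengths are assumptions. */
FILENAME emp TEMP;

DATA _NULL_;
   FILE emp;
   /* Quoted strings are written as given, so each line ends at its last value. */
   PUT @1 'LANGKAMM' @22 'SARAH'   @32 'E0045' @37 'Mechanic ';  /* assumed to reach col 45 */
   PUT @1 'TORRES'   @22 'JAN'     @32 'E0029' @37 'Pilot';
   PUT @1 'SMITH'    @22 'MICHAEL' @32 'E0065';
   PUT @1 'LEISTNER' @22 'COLIN'   @32 'E0116' @37 'Mechanic ';  /* assumed to reach col 45 */
   PUT @1 'TOMAS'    @22 'HARALD';
   PUT @1 'WADE'     @22 'KIRSTEN' @32 'E0126' @37 'Pilot';
   PUT @1 'WAUGH'    @22 'TIM'     @32 'E0204' @37 'Pilot';
RUN;

/* Write each line's length to the log to confirm the lines vary in length. */
DATA _NULL_;
   INFILE emp LENGTH=linelen;
   INPUT;
   PUT _N_= linelen=;
RUN;
```

Pointing an INFILE statement at the emp fileref instead of a physical path lets the different options be tried without touching a production file.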
Advanced Tutorials
Then two sets of code were submitted using different options on the INFILE statement. First the lines were read in with Column Input:

   DATA test;
      INFILE "d:\infile\emplist.dat" <OPTIONS>;
      INPUT lastn $ 1-21 Firstn $ 22-31
            Empid $ 32-36 Jobcode $ 37-45;
   RUN;

Then List Input was used:

   DATA test;
      INFILE "d:\infile\emplist2.dat";
      INPUT lastn $ Firstn $
            Empid $ Jobcode $;
   RUN;

FLOWOVER:
The FLOWOVER option is the default option on INFILE. Here, when the INPUT statement reaches the end of non-blank characters without having filled all variables, a new line is read into the Input Buffer and INPUT attempts to fill the rest of the variables starting from column one. The next time an INPUT statement is executed, a new line is brought into the Input Buffer. The results (printed with PROC PRINT) are below.

Column Input:
Obs  Lastn     Firstn   Empid  Jobcode

List Input:
Obs  Lastn     Firstn   Empid  Jobcode
1    LANGKAMM  SARAH    E0045  Mechanic
2    TORRES    JAN      E0029  Pilot
3    SMITH     MICHAEL  E0065  LEISTNER
4    TOMAS     HARALD   WADE   KIRSTEN
5    WAUGH     TIM      E0204  Pilot

In this example the Pilot values are placed in the appropriate places, but the INPUT statement still loops to the next line when unable to fill all variables.

MISSOVER:
When the MISSOVER option is used on the INFILE statement, the INPUT statement does not jump to the next line when reading a short line. Instead, MISSOVER sets the unfilled variables to missing.

Column Input:
Obs  Lastn     Firstn   Empid  Jobcode
1    LANGKAMM  SARAH    E0045  Mechanic
2    TORRES    JAN      E0029
3    SMITH     MICHAEL  E0065
4    LEISTNER  COLIN    E0116  Mechanic
5    TOMAS     HARALD
6    WADE      KIRSTEN  E0126
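The MISSOVER run with Column Input corresponds to replacing the <OPTIONS> placeholder in the paper's Column Input step with MISSOVER. A minimal sketch, reusing the path and column positions from that step; the dataset name test_missover is hypothetical:

```sas
/* Sketch: the paper's Column Input step with MISSOVER in place of <OPTIONS>. */
DATA test_missover;
   INFILE "d:\infile\emplist.dat" MISSOVER;
   INPUT lastn   $ 1-21
         firstn  $ 22-31
         empid   $ 32-36
         jobcode $ 37-45;   /* set to missing when the line ends before column 45 */
RUN;

PROC PRINT DATA=test_missover;
RUN;
```

Omitting MISSOVER leaves the default FLOWOVER behavior in effect, so the same step can be rerun without the option to compare the two results side by side.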
Since List Input doesn't specify explicit columns, these data lines can be correctly read using the MISSOVER option.

TRUNCOVER:
The TRUNCOVER option acts similarly to MISSOVER, and in addition, will take partial values to fill the first unfilled variable.

Column Input;
Obs  Lastn     Firstn   Empid  Jobcode
1    LANGKAMM  SARAH    E0045  Mechanic

Since List Input reads from delimiter to delimiter, TRUNCOVER can still work.

PAD:
The PAD option does not replace the FLOWOVER option. Instead, the PAD option adds blanks to short lines out to the logical record length (LRECL). In this case, PAD takes the LRECL from the file information, but you can specify LRECL= in the INFILE statement.

Column Input;
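For comparison, the Column Input step can be sketched with TRUNCOVER and with PAD. The dataset names are hypothetical, and LRECL=45 is an assumption chosen to match the last column the INPUT statement reads. Because PAD does not replace FLOWOVER, the second step pairs it with MISSOVER.

```sas
/* Sketch 1: TRUNCOVER keeps a partial value when the line ends mid-field. */
DATA test_truncover;
   INFILE "d:\infile\emplist.dat" TRUNCOVER;
   INPUT lastn $ 1-21 firstn $ 22-31 empid $ 32-36 jobcode $ 37-45;
RUN;

/* Sketch 2: PAD extends each short line with blanks to LRECL (45 is assumed), */
/* so every column range exists before MISSOVER is applied.                   */
DATA test_pad;
   INFILE "d:\infile\emplist.dat" MISSOVER PAD LRECL=45;
   INPUT lastn $ 1-21 firstn $ 22-31 empid $ 32-36 jobcode $ 37-45;
RUN;
```

In both sketches a job title that ends before column 45 should be read rather than set to missing; the difference is that TRUNCOVER gets there without padding every record first.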
MISSOVER was originally created to be used in conjunction with PAD, and the combination works well in most situations. However, this can be a CPU-intensive process when reading an extremely large file.

STOPOVER is a good tool for checking code and raw data when dealing with large, potentially messy files, since it forces the DATA step to stop the first time it finds a short line.

TRUNCOVER was developed later than the MISSOVER and PAD options, and deals admirably not only with short lines but also with short values. TRUNCOVER is also more efficient since it doesn't require the extra "padding".

One more point about variable-length files. It is possible to copy a subset of any raw data file into the DATA step and run these options on the subset. Use an INFILE DATALINES;

First, just list the files in a series of DATALINES in the DATA step.

   DATA one;
      LENGTH fil2read $ 40;
      INPUT fil2read $;
      INFILE dummy FILEVAR=fil2read END=done;
      DO WHILE (NOT done);
         INPUT lastn $ firstn $
               hiredate : mmddyy8.
               salary;
         OUTPUT;
      END;
   DATALINES;
   D:\Infile\emplist.dat
   D:\Infile\emplist1.dat
   D:\Infile\emplist2.dat
   D:\Infile\emplist3.dat
   D:\Infile\emplist4.dat
   RUN;
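A variation on the FILEVAR= step can combine it with the short-line options discussed earlier. This is a sketch under stated assumptions, not the paper's own code: the dataset name two and the variable source are hypothetical, and TRUNCOVER is added on the assumption that some of the listed files may contain short lines. The copy into source is needed because a variable named in FILEVAR= is not written to the output dataset.

```sas
/* Sketch: FILEVAR= loop with TRUNCOVER and a record of the originating file. */
DATA two;
   LENGTH fil2read source $ 40;
   INPUT fil2read $;                /* next file name from the in-stream list  */
   source = fil2read;               /* keep a copy; FILEVAR= variables are dropped */
   INFILE dummy FILEVAR=fil2read END=done TRUNCOVER;
   DO WHILE (NOT done);
      INPUT lastn $ firstn $ hiredate : mmddyy8. salary;
      OUTPUT;
   END;
DATALINES;
D:\Infile\emplist.dat
D:\Infile\emplist1.dat
;
RUN;
```

With TRUNCOVER in place, a record that is missing its trailing values yields missing variables for that observation instead of pulling values from the following line.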