
Analytical Flow

Analytics Flow using SAS Overview

[Figure: DW / Stored Data Files -> Data Query & Extraction -> Data Cleaning -> Data Transformation -> Data Analysis -> Data Transformation & Publishing Results]

Intro to SAS

What is SAS?
SAS originally stood for Statistical Analysis System, developed by SAS Institute in Cary, NC.

SAS Institute has since officially dropped the expanded name, and the product is now known simply as SAS (pronounced "sass").

Main Uses of SAS

data entry, retrieval, and management


report writing and graphics
statistical and mathematical analysis
business planning, forecasting, and decision support
operations research and project management
quality improvement
applications development.

Basics About SAS

The SAS interface is built around three main windows:
Program Editor: where you write and submit programs
Log: where SAS displays messages, including any errors in a program
Output: where results appear after you submit a program

SAS Application Interface

[Figure: screenshot of the SAS interface showing the Explorer and Results window, the Program Editor window, the Log window and the Output window]

Program Editor: used to compose and execute programs (write your code in this window)
Log Window: displays the log created by program execution
Output Window: prints the output of the program (view the values of a data set in this window)

First Steps With SAS

Many ways to Create SAS Datasets


Reading/ Input from
Multiple data sources

Excel
Txt
Oracle
Access
Teradata

Creating a dataset
Using SAS
Cards or Datalines
Infile
Column Input/ Delimiters
Formats/ Informats

SAS Dataset, Variables & Observations

SAS expects your data to be in a special form: the SAS data set
The SAS data set is a table made up of Variables and Observations
The rows are the Observations
The columns are the Variables

Example of a SAS Data Set

[Figure: a sample table with the columns labelled Variables and the rows labelled Observations]

ID, HT & WT are Numeric Variables
Name is a Character Variable
Character Variables, if blank, are represented by a space
Numeric Variables, if blank, are represented by a period (.)
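
The missing-value rules above can be seen in a small sketch. The data set name `people` and its values are made up for illustration; with list input, a lone period in the data stands for a missing value for both character and numeric variables.

```sas
/* Illustrative data only: names and values are assumptions for this sketch */
data people;
   input ID Name $ HT WT;
   datalines;
1 Asha 160 55
2 .    172 .
3 Ravi .   70
;
run;

/* In the listing, the missing Name prints as a blank
   and the missing HT / WT values print as a period */
proc print data=people;
run;
```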

Building Blocks of a SAS Program

A SAS program is made up of two building blocks:
DATA Step
PROC Step

A typical SAS program starts with a DATA Step to create a SAS data set and then passes it to a PROC Step to perform the analysis

Basic Components of SAS

Every SAS Program is constructed using DATA Steps and/or Procedures

DATA Step
Data distance;
miles = 23;
kilometer = 1.61 * miles;
Run;

PROC Step
PROC PRINT DATA = distance;
Run;

Any combination of DATA Steps and/or Procedures may be used
Using a Run statement at the end of every step is recommended throughout the program

Basic Components of SAS continued

DATA Step
SAS statements that read data, create new data sets or variables, modify data sets, and perform calculations

Procedures
SAS statements that perform statistical analysis and create and print reports and graphs

First steps with Data & SAS

DATE     AIR   new_dist   calc_date
Jan-49   112   179200     12Feb07
Feb-49   118   188800     12Feb07
Mar-49   132   211200     12Feb07

Data Value - The basic unit of information. E.g. 112.
Column/Field/Variable - A set of data values that describes a given attribute makes up a Variable. E.g. AIR, calc_date
Types of Variables (Data Types)
Character: Gurgaon, Y, 004171, AK47
Numeric: 210.01, -273, 6.023E+23
A date is stored as a number in SAS.
Observation - All the data values associated with a case, a single entity, an account, a subject or an individual make up an Observation. Each row in a dataset is an Observation. E.g. the first row above: Jan-49, 112, 179200, 12Feb07.
Data Set - A collection of data values arranged as rows and columns.
SAS Data Set - The special way that SAS organizes and stores the data. It appears as a rectangular table with columns and rows.

Rules for SAS Statements

May begin in any column of the line.


Must end with a semicolon (;).
May consist of more than one line of commands.
May continue over more than one line.
One or more blanks should be placed between items in SAS
statements. If the items are special characters such as '=', '+', '$',
the blanks are not necessary.

data mydata;
set sashelp.air;
new_dist = air * 1.6 * 1000;
calc_date = today();
format calc_date date.;
run;

Rules for SAS Names
Can be up to 32 characters long
First character must be a letter (A, b, C, . . ., z) or underscore (_).
Subsequent characters can be letters, numeric digits (0, 1, . . ., 9), or underscores.
SAS is NOT case-sensitive. SAS processes names as uppercase regardless of how they are typed.
Blanks cannot appear in SAS names.
Special characters, except for the underscore, are not allowed. In file references, the dollar sign ($), pound sign (#), and at sign (@) can be used.
SAS reserves a few names for automatic variables and variable lists. For example, _N_ and _ERROR_ .
Dates are stored as numeric data: 01 JAN 1960 is stored as 0, 02 JAN 1960 as 1

Data this_is_the_RISK_score;
THis_is_the_application_DATE = '02Jan1960'd;
run;
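
A minimal sketch of the date rule above: a date literal is just a number counted in days from 01 JAN 1960, and a format only changes how it is displayed. The variable names `d0` and `d1` are illustrative.

```sas
data _null_;
   d0 = '01JAN1960'd;   /* day zero on the SAS date scale */
   d1 = '02JAN1960'd;   /* one day later */
   put d0= d1=;         /* unformatted: the log shows d0=0 d1=1 */
   put d1= date9.;      /* same stored value, displayed as 02JAN1960 */
run;
```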

A Simple SAS Program

libname loc 'd:\mydata';
Data sample_accounts;
INPUT Account 1-7 OpenDate $ 9-17 StatusCode $ 21-22 CreditLimit 25-29;
CARDS;
1234670 11-Sep-04   Z   2000
1234671 12-Sep-04       3000
1234672 13-Sep-04   Z   2500
1234673 14-Sep-04   T   3200
1234674 15-Sep-04       8000
;
run;
data loc.sample_accounts; /* stores data permanently */
set sample_accounts;
run;

Proc Print data = loc.sample_accounts;
Title " Account Sample";
run;

Proc Freq data = loc.sample_accounts;
tables statuscode;
run;

A Simple SAS Program explained
LIBNAME Statement:
LIBNAME assigns a reference name (libref) to a data library. SAS data can be stored permanently by saving it to a library.
Syntax: Libname xx '<folder reference>';
Temporary vs. Permanent Datasets
Datasets written without a libref go to the temporary WORK library
A libref can be used to create permanent SAS datasets

A Simple Program explained.. contd..

Data Statement - Names the SAS data set
Data sample_accounts;

Input Statement - Defines names and order of variables to SAS
INPUT Account 1-7 OpenDate $ 9-17 StatusCode $ 21-22 CreditLimit 25-29;

Cards Statement - Signals that input data will follow
CARDS;
1234670 11-Sep-04   Z   2000
1234671 12-Sep-04       3000
1234672 13-Sep-04   Z   2500
1234673 14-Sep-04   T   3200

Title Statement - Puts titles on the output
Title " Account Sample";

Run Statement - Instructs SAS to execute the SAS instructions above
Run;

SAS Data Step

Getting Data Into SAS

Getting Data In

Reading free formatted data instream (LIST INPUT): values are listed in the required order, separated by blanks.
Data sample_accounts;
/* :$9. keeps all 9 characters of the date; a bare $ would stop at 8 */
INPUT Account OpenDate :$9. StatusCode $ CreditLimit ;
CARDS;
1234670 11-Sep-04 Z 2000
1234671 12-Sep-04 N 3000
1234672 13-Sep-04 Z 2500
1234673 14-Sep-04 T 3200
;
run;

Reading fixed formatted data instream (COLUMN INPUT): there are no delimiters (such as spaces, commas, or tabs) to separate fixed formatted data, so column definitions are required for every variable in the dataset.
Data sample_accounts;
INPUT Account 1-7 OpenDate $ 8-16 StatusCode $ 19-20 CreditLimit 22-26;
DATALINES;
123467011-Sep-04  Z  2000
123467112-Sep-04     3000
123467213-Sep-04  Z  2500
123467314-Sep-04  T  3200
;
run;

Getting Data In...
Reading fixed formatted data from an external file.
A text file named sample_accounts.txt is saved in the d:\Mydata directory. Contents of the file:

1234670 11-Sep-04   Z   2000
1234671 12-Sep-04       3000
1234672 13-Sep-04   Z   2500
1234673 14-Sep-04   T   3200

This file can be read into SAS by using an INFILE statement in the DATA step
Data sample_accounts;
INFILE "d:\Mydata\sample_accounts.txt";
INPUT Account 1-7 OpenDate $ 9-17 StatusCode $ 21-22 CreditLimit 25-29;
run;

Getting Data In.

Reading comma delimited data from an external file
d:\Mydata\ holds a comma separated file named sample_accounts.csv.
Data from sample_accounts.csv can be read into SAS by using the following:
Data sample_accounts;
INFILE "d:\Mydata\sample_accounts.csv" DLM =',';
/* :$9. keeps all 9 characters of the date; a bare $ would stop at 8 */
INPUT Account OpenDate :$9. StatusCode $ CreditLimit ;
RUN;

Reading Tab delimited data from an external file
Data sample_accounts;
INFILE "d:\Mydata\sample_accounts.txt" DLM='09'x;
INPUT Account OpenDate :$9. StatusCode $ CreditLimit ;
RUN;

Getting Data In..

Reading data using formatted input: informats in the INPUT statement read formatted data as it is.

DATA acctinfo;
INPUT acctnum $8. date mmddyy10. amount comma9.;
CARDS;
0074309801/15/2001$1,003.59
1028754301/17/2001$672.05
3320899201/19/2001$702.77
0345900601/19/2001$1,209.61
;
run;

Format & Informat

What is 20070201 ..?
1st Feb 2007 (YYYYMMDD)
2nd Jan 2007 (YYYYDDMM)
Price (in Rs.) of a penthouse in a posh locality. (INR 20,070,201.00)
Distance (in km) between two celestial objects. (~20 million km)
Capone's secret Swiss bank account number (020-070-201)

Sometimes raw data is not recognized as straightforward character or numeric (or date) data.
Use an Informat to read non-standard data, either in the INPUT statement or in an INFORMAT statement.
general forms - Char: $characterw. Num: numericw.d Date: datew.
Use a Format to represent data in human readable form.
general forms - Char: $characterw. Num: numericw.d Date: datew.
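
As a hedged illustration of the 20070201 question, the sketch below reads the same text with two different informats and then displays the results with formats. The variable names are made up; `8.`, `YYMMDD8.`, `COMMA12.` and `DATE9.` are standard SAS informats and formats.

```sas
data _null_;
   raw = '20070201';
   as_number = input(raw, 8.);         /* informat 8.: an ordinary number */
   as_date   = input(raw, yymmdd8.);   /* informat YYMMDD8.: a SAS date */
   put as_number= comma12.;            /* displayed with thousands separators */
   put as_date= date9.;                /* displayed as 01FEB2007 */
   put as_date=;                       /* unformatted: days since 01JAN1960 */
run;
```

The informat decides what the raw text means when it is read; the format only decides how the stored value is shown.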

Format & Informat

DATA Retail_02B;
INFORMAT Revenue COMMA5. Tdate DATE9.;
INPUT @1 City $9. @11 Street $9.
@22 Revenue @29 Cost 2.
@31 Code $1. @33 Tdate;
DaysUntilToday = TODAY() - Tdate;
Profit = Revenue - Cost;
FORMAT Tdate DATE7. Profit COMMA5.;
DATALINES;
BANGALORE Church St  2,341  23A 01JAN2004
CHENNAI   T NAGAR    6572   67B 15NOV2005
DELHI     CP         5,644  34B 29OCT2005
KANPUR    MG ROAD    552    36A 12SEP2004
;
RUN;

proc print data = Retail_02B;
run;

A format is just a representation of the data. It does not change the value of the variable.

Datasets - Defining Variables
Common Ways of Variable Definition

Using an assignment statement
Data var_test;
id ='JK'; NProducts= 6; pro_price = 4.555;
tot_cost = NProducts*pro_price; final_price = tot_cost;
run;

Using an INPUT statement
DATA acctinfo;
INPUT acctnum $8. date mmddyy10. amount comma9.;
CARDS;
0074309801/15/2001$1,003.59
;
run;

Through a LENGTH statement
Data var_test;
length id $ 10;length NProducts 4;Run;

As a result of a PROC SQL/PROC IMPORT.

Working with Datasets KEEP and DROP

KEEP and DROP control the number of variables (fields) read into and written out to datasets.

KEEP
Data target_data (keep = var1 var2 var3);
Set base_data;
Run;

Data target_data ;
Set base_data;
keep var1 var2 var3 ;Run;

DROP
Data target_data (drop = var1 var2 var3);
Set base_data;
Run;

Data target_data ;
Set base_data;
drop var1 var2 var3 ;Run;

Working with Datasets KEEP and DROP
Process only those variables you need
DROP time shift batchnum;
set payroll (DROP=salary gender);
data plan1 (DROP=salary gender);
KEEP time shift batchnum;
set payroll (KEEP=salary gender);
data plan1 (KEEP=salary gender);
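
Where the option is placed matters. The sketch below uses a made-up `payroll` data set (its values, and the derived variable `hourly`, are assumptions for illustration): `KEEP=` on the SET statement limits what is read into the step, while `DROP=` on the DATA statement limits only what is written out.

```sas
/* Hypothetical input data for this sketch */
data payroll;
   input time shift batchnum salary gender $;
   datalines;
8 1 101 5000 F
6 2 102 4200 M
;
run;

/* KEEP= on SET: only time and salary are read into the step */
data plan1;
   set payroll (keep=time salary);
   hourly = salary / time;
run;

/* DROP= on DATA: salary is available for the calculation
   but is not written to the output data set */
data plan2 (drop=salary gender);
   set payroll;
   hourly = salary / time;
run;
```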

Working with Datasets - Variable FORMATs

A format is an instruction that SAS uses to control the written appearance of data values.

data format_test;
today =today();
dollar_amt =13400.5;
format dollar_amt dollar10.2 today monyy7.;
run;
proc print data = format_test;run;

Output:
Obs    today      dollar_amt
1      MAR2006    $13,400.50

Some formats:
$REVERJw.   Writes character data in reverse order and preserves blanks
$UPCASEw.   Converts character data to uppercase
MONYYw.     Writes date values as the month and the year in the form mmmyy or mmmyyyy
QTRw.       Writes date values as the quarter of the year
DOLLARw.d   Writes numeric values with dollar signs, commas, and decimal points
WORDSw.     Writes numeric values as words

What happens to formats when we do a summary by a formatted variable?

Working with Datasets - SAS Functions

A SAS function performs a computation or manipulation on variables (arguments) and returns a value.
data function_test;
Max_var= max(100,101);
Length_var = length(Max_var);
This_month = month(today());
run;

Function Name   Description
LENGTH          Returns the length of an argument
LOWCASE         Converts all letters in an argument to lowercase
SCAN            Selects a given word from a character expression
SUBSTR          Extracts a substring from an argument
UPCASE          Converts all letters in an argument to uppercase
DATEPART        Extracts the date from a SAS datetime value
MONTH           Returns the month from a SAS date value
TODAY           Returns the current date as a SAS date value
MAX             Returns the largest value
MEAN            Returns the arithmetic mean (average)
SUM             Returns the sum of the nonmissing arguments
LOG             Returns the natural (base e) logarithm
SQRT            Returns the square root of a value
RANUNI          Returns a random variate from a uniform distribution
INPUT           Returns the value produced when a SAS expression that uses a specified informat expression is read
LAG             Returns values from a queue
PUT             Returns a value using a specified format
CEIL            Returns the smallest integer that is greater than or equal to the argument
ROUND           Rounds to the nearest round-off unit
TRUNC           Truncates a numeric value to a specified length

Working with Datasets - MERGE Statement

MERGE: to combine datasets.
DATA intial_cl;
INPUT account credit_limit;
DATALINES;
1002 2000
1003 4000
1004 3000
;
DATA new_cl;
INPUT account credit_limit;
DATALINES;
1002 3000
1004 5000
;
DATA credit_limit;
MERGE intial_cl new_cl;
BY account;
RUN;

Output 1:
Obs   account   credit_limit
1     1002      3000
2     1003      4000
3     1004      5000

DATA credit_limit;
MERGE intial_cl(in=a) new_cl(in=b) ;
BY account;
IF a and b;
RUN;

Output 2:
Obs   account   credit_limit
1     1002      3000
2     1004      5000

Working with Datasets - Conditional processing

Used when data needs to be filtered/subset or conditionally processed. Commonly used conditional statements are: WHERE, IF-ELSE, DO-END

WHERE Condition
DATA account_perf;
INPUT account current_os status_code $;
cards;
1002 300 A
1003 20 A
1004 1200 C
1005 800 Z
;run;
Data perf;
set account_perf;
where current_os >100 and status_code ne 'Z';
run;

Output:
Obs   account   current_os   status_code
1     1002      300          A
2     1004      1200         C

Working with Datasets - Program Flow Control
IF - THEN / ELSE

Example: single statements
/* declare LENGTH flag $8 beforehand, or 'NOT NINE' is truncated to the 4 characters of 'NINE' */
If answer=9 THEN flag='NINE';
ELSE flag='NOT NINE';

Example: complex condition
If answer=9 and 9<y<99 THEN flag='NINE and INSIDE';
ELSE flag='NOT NINE';

Example: block statements
If answer=9 THEN
do;
flag1='NINE';
flag2=9;
end;
ELSE
do;
flag1='NOT NINE';
flag2=0;
end;

Example: nested if/else
IF x=0 THEN
IF y ne 0 THEN put 'XY 0N0';
ELSE put 'XY 00';
ELSE put 'X n0';

Program Flow Control
DO - END

Example: block code
If answer=9 THEN
do;
flag1='NINE';
flag2=9;
end;

Example: exiting the loop
do i=1 to n by m;
...more SAS statements...
if i=10 then leave;
end;
if i=10 then put 'EXITED LOOP';

Example: iterative DO (each line is an alternative loop header)
do month='JAN','FEB','MAR';
do count=2,3,5,7,11,13,17;
do i=var1, var2, var3;
do i=1 to 10;
do i=1 to k-1, k+1 to n;
do i=n to 1 by -1;

Example: DO UNTIL
n=0;
do until(n>=5);
put n=;
n+1;
end;

Example: DO WHILE
n=0;
do while(n<5);
put n=;
n+1;
end;

Subsetting the datasets

Data city_info;
input name $4. age 2. city $7.;
cards;
abc 23 BNG
pqr 45 GGN
htr 47 GGN
gtr 56 BNG
lpq 77 JPR
gyt 23 GGN
AWQ 56 JPR
LGK 43 BNG
STY 21 BNG
LVS 46 JPR
;
run;

Goal: split City_info into one dataset per city.

Bangalore
Name  age  city
abc   23   BNG
gtr   56   BNG
LGK   43   BNG
STY   21   BNG

Gurgaon
Name  age  city
pqr   45   GGN
htr   47   GGN
gyt   23   GGN

Jaipur
Name  age  city
lpq   77   JPR
AWQ   56   JPR
LVS   46   JPR

Subsetting Solution I
Subsetting IF

Data Bangalore;
Set city_info;
if city EQ 'BNG';
run;
Data Gurgaon;
Set city_info;
if city EQ 'GGN';
run;
Data Jaipur;
Set city_info;
if city EQ 'JPR';
run;

Subsetting Solution II
Output Statement

data Bangalore Gurgaon Jaipur;


set city_info;
if city EQ 'BNG' then output Bangalore;
else if city EQ 'GGN' then output
Gurgaon;
else output Jaipur;
run;

Appending the datasets

[Figure: the three datasets Bangalore, Gurgaon and Jaipur stacked into a single dataset, City_info]

City_info
Name  age  city
abc   23   BNG
gtr   56   BNG
LGK   43   BNG
STY   21   BNG
pqr   45   GGN
htr   47   GGN
gyt   23   GGN
lpq   77   JPR
AWQ   56   JPR
LVS   46   JPR

Appending Solution
data city_info;
set Bangalore Gurgaon Jaipur;
run;

SET Statement Options


Decisions based on the end of dataset (end=)

data Bangalore;
set Bangalore end=end_of_data;
if end_of_data then total_BNG_tourist=_n_;
run;

Decisions based on the source of observation (in=)
data food_quality;
set Gurgaon (in=a) Jaipur (in=b) Bangalore(in=c);
length cafe_food $13;
if a then cafe_food = 'Very Bad';else
if b then cafe_food = 'Just Bad';else
if c then cafe_food = 'Very Very Bad';
run;

SET Statement Options


Reading a specified no of records
data new;
set old (obs=10);
run;

Begin Processing at a specified observation


data new;
set study (firstobs=5);
run;

Reading a specified block of records


This code will read record 5 to 10 i.e. 6 records

data new;
set study (firstobs=5 obs=10);
run;

Selection of Observations
Processing records which satisfy a condition
data new;
set study;
where age > 21;
run;
data new;
set study;
if age > 21;
run;

What is the difference..?
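
One commonly cited difference can be sketched as follows, using a made-up `study` data set and a hypothetical derived variable `age_next`: WHERE is applied as observations are read from the input data set, so it can only reference variables that already exist there, while a subsetting IF is evaluated inside the step and can test variables created earlier in the same step.

```sas
/* Hypothetical data for this sketch */
data study;
   input name $ age;
   datalines;
Anu 19
Raj 25
Mia 30
;
run;

/* Subsetting IF runs after the assignment, so it can test the new variable */
data new_if;
   set study;
   age_next = age + 1;
   if age_next > 21;
run;

/* WHERE filters rows as they are read from STUDY, so it can only
   reference variables that STUDY already contains (here: age) */
data new_where;
   set study;
   where age > 21;
   age_next = age + 1;
run;
```

WHERE can also be used inside procedures (as in the PROC PRINT examples), whereas the subsetting IF is a DATA step statement only.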

Merging Datasets

Merging SAS Datasets


One-to-One Merging
One-to-one merging combines observations from two or more SAS data sets
into a single observation in a new data set. To perform a one-to-one merge,
use either the SET statement or the MERGE statement without a BY
statement.

Match Merging
Match-merging combines observations from two or more SAS data sets into
a single observation in a new data set according to the values of a common
variable.

Sorting / Index creation is necessary before Match Merging.

One-to-One Merging
USING SET
Data one2one;
Set animal;
Set plant;
Run;
USING MERGE
Data one2one;
Merge animal plant;
Run;

Sample
data animal;
input zoo $ code $;
datalines;
ant a
ape a
bird b
cat c
dog d
eagle e
;
run;
data plant;
input bot $ code $;
datalines;
apple a
banana b
coconut c
celery c
dewberry d
eggplant e
;
run;

Match Merging
Match merging requires data to be pre-sorted (or grouped) by the match keys:
/* Sorting the Two Sets */
proc sort data = test1;
by acct_no;
run;
proc sort data = test2;
by acct_no;
run;
/* Merging the Two Sets */
data merge1;
merge test1(in=a) test2(in=b);
by acct_no;
if a and b;
run;

Result: A ∩ B (observations present in both datasets)

Match Merging..
Merging the Two Datasets

Outer Join: A ∪ B
data merge1;
merge test1(in=a) test2(in=b);
by acct_no;
if a or b; /*Outer Join: Default */
run;

Left Join
data merge1;
merge test1(in=a) test2(in=b);
by acct_no;
if a; /* Left Join */
run;

Match Merging
Example: B ∩ A^c (observations in B but not in A)
data merge1;
merge test1(in=a) test2(in=b);
by acct_no;
if b and NOT a;
run;

Example: (A ∩ B)^c (observations not present in both)
data merge1;
merge test1(in=a) test2(in=b);
by acct_no;
if NOT (a and b);
run;

SAS Data/ Proc Options

Obs: to process the specified number of observations (Syntax: Option obs = n)


FirstObs: to cause processing to begin at the specified observation (Syntax: Option
firstobs = n obs = n + x)
Noprint: to direct SAS not to print any results, since we might be saving the results in
a data set
Nmiss: specifies the number of missing values in the data
N: specifies the number of non-missing values in the data
Min/ Max: to specify the lowest and highest value, respectively
Mean/ Median/ Stddev/ Sum: directs SAS to produce the respective statistics for the
data
Out: specifies the output data set (Syntax: output out = dataset name)
Nodupkey: checks for and eliminates observations with duplicate BY values
Noduprec (or Nodup): checks for and eliminates duplicate records (observations identical in every variable) in the datasets
Nopercent: Suppresses the display of percentages in one-way or list format cross-tabulation tables. Suppresses the total percentage in each cell and the marginal row/column percentages in two-way cross-tabulation tables
Nocol: Suppresses the display of column percentages in each cell in two-way cross-tabulation tables
Norow: Suppresses the display of row percentages in each cell in two-way cross-tabulation tables
Nofreq: Suppresses the frequency count for each cell

SAS Procedures

PROC CONTENTS
PROC PRINT
PROC DATASETS
PROC FORMAT
PROC SORT
PROC FREQ
PROC MEANS
PROC SUMMARY
PROC TABULATE
PROC IMPORT / EXPORT
PROC SQL
PROC UPLOAD
PROC DOWNLOAD

SAS Code - For Practicing

data test1;
input Acct_no $4. TRAN_DT date9. Tran_Code $3. AMT;
cards;
261025MAY20022536.93
990208MAY200225357.1
990230MAY200225343.59
990230MAY200225530
419817MAY2002253145.91
419819MAR2002253164.9
395018MAY200225375.15
395006MAY200225380.64
395006APR200225381.04
530424APR200225324.82
530418APR200225331.62
530408APR200225521.65
530425MAY200225336.77
530407MAY200225396.02
530427MAR200225320.88
009701MAY200225371.46
217411MAY200225351.3
217410MAY200225384.7
217423MAR200225377.76
217423MAR200225551.34
217422MAR200225361.6
;

data test2;
input Acct_no $4. +4 Store $4. +4 Subdiv $1. +4 District $21-32 Region $15.;
cards;
2610    2909    B   Alaska      West
9902    1495    A   Arizona     SouthWest
4198    1241    B   Dakota      MidWest
3950    1444    A   Arkansas    South
5304    2537    B   Kansas      MidWest
0097    1555    A   Virginia    South
2054    1212    B   Washington  West
2174    2739    B   Wisconsin   MidWest
;

proc sort data = test1;
by acct_no;
run ;
proc sort data = test2;
by acct_no;
run ;
data sales1;
merge test1(in=aa) test2(in=bb);
by acct_no;
if aa and bb;
run;

PROC CONTENTS

proc contents data=sales1; run;

The CONTENTS Procedure

Data Set Name:   WORK.SALES1                        Observations:           21
Member Type:     DATA                               Variables:              8
Engine:          V8                                 Indexes:                0
Created:         17:22 Monday, February 17, 2003    Observation Length:     56
Last Modified:   17:22 Monday, February 17, 2003    Deleted Observations:   0
Protection:                                         Compressed:             NO
Data Set Type:                                      Sorted:                 NO
Label:

-----Engine/Host Dependent Information-----
Data Set Page Size:           8192
Number of Data Set Pages:     1
First Data Page:              1
Max Obs per Page:             145
Obs in First Data Page:       21
Number of Data Set Repairs:   0
File Name:                    C:\DOCUME~1\004012\LOCALS~1\Temp\SAS Temporary Files\_TD1056\sales1.sas7bdat
Release Created:              8.0202M0
Host Created:                 WIN_PRO

-----Alphabetic List of Variables and Attributes-----
#   Variable    Type   Len   Pos
4   AMT         Num    8     8
1   Acct_no     Char   4     16
7   District    Char   12    28
8   Region      Char   15    40
5   Store       Char   4     23
6   Subdiv      Char   1     27
2   TRAN_DT     Num    8     0
3   Tran_Code   Char   3     20

PROC PRINT

Prints the data to the output window or an external file

Basic Code
proc print data=sales1;
run;

Print all vars & Sales Total By Region
proc sort data=sales1 out=sales2;
by region;
run;
proc print data=sales2;
by region;
sum amt;
run;

Print Only Sales & District for MidWest
proc print data=sales1;
var amt district;
where region="MidWest";
run;

PROC SORT
Sorts the data set with respect to the variable/s mentioned
Default sorts in ascending order
Default Sorting
proc sort data=sales1 out=sales2;
by acct_no tran_dt;
run;
To Sort Acct_no in descending order
proc sort data=sales1 out=sales2;
by descending acct_no tran_dt;
run;
To delete observations with duplicate BY values (here, the same acct_no and tran_dt)
proc sort data=sales1 out=sales2 nodupkey;
by descending acct_no tran_dt;
run;

PROC DATASETS

PROC DATASETS is a utility procedure that helps to manage the SAS datasets in various libraries. In a multi-user environment like ours we are constrained by system resources such as the SAS workspace or the shared folders, so it is used to remove unnecessary files and manage the datasets.
libname mylib 'D:\myfolder';
DATA mylib.intial_cl;
INPUT account credit_limit;
DATALINES;
1002 2000
1003 4000
1004 3000
;
DATA mylib.new_cl;
INPUT account credit_limit;
DATALINES;
1002 3000
1004 5000
1005 2500
;

proc datasets library=mylib details;
change new_cl=brand_new_cl;
delete intial_cl;
copy in=mylib out =work; select brand_new_cl;
run;
quit;

Line 1: specifies the library and lists its details (filenames and attributes)
Line 2: changes the name new_cl to brand_new_cl
Line 3: deletes intial_cl from mylib
Line 4: copies the file named in the select statement (brand_new_cl) from the mylib library to the work library

Proc Means / Proc Summary

Summarizes the numeric variables
More than 30 statistics possible
Default statistics: N (number), Mean, Std. Dev, Min, Max
Others include: Nmiss, Range, Sum, Median, Kurtosis, Skewness etc.

PROC MEANS
Basic Code
proc means data=sales1;
run;
Specific Statistics for Specific Variable
proc means data=sales1 N Nmiss Sum Mean;
var amt;
run;
Statistics for each Region (1st Option)
proc sort data = sales1 out=sales2;
by region;
run ;
proc means data=sales2 N Nmiss Sum Mean;
by Region;
var amt;
run;

PROC MEANS
Statistics for each Region (2nd Option)
proc means data=sales1 N Nmiss Sum Mean;
class Region;
var amt;
run;

Statistics for each Region outputted to a data file


proc means data=sales1 N Nmiss Sum Mean;
class Region;
var amt;
output out=summ1 n(amt)= no sum(amt)=Tamt;
run;

PROC FREQ
Uni-Dimensional
proc freq data=sales1;
table district;
run;
Two-Dimensional
proc freq data=sales1;
table district*region;
run;
Two-Dimensional with only freq counts
proc freq data=sales1;
table district*region /norow nocol nopercent;
run;

Two-Dimensional freq counts without table borders
/* a FORMCHAR string made of blanks replaces the border characters */
proc freq data=sales1 formchar='           ';
table district*region
/norow nocol nopercent;
run;

PROC SUMMARY
Summary without NWAY option
Proc summary data = sales1;
class Region District Store;
var Amt;
output out = Sales_Summ sum =;
run;
Proc print data = Sales_Summ;run;
Summary with NWAY option
Proc summary data = sales1 nway missing;
class Region District Store;
var Amt;

output out = Sales_Summ_nway sum =;


run;
Proc print data = Sales_Summ_nway;run;

PROC IMPORT AND PROC EXPORT

PROC IMPORT
Imports data from Excel, Access, CSV files and delimited files into a SAS data set
proc import datafile = 'path\filename'
dbms = dbms-name
out = dataset-name replace;
run;

PROC EXPORT
Exports data from SAS datasets to Excel, Access, CSV files and delimited files
proc export data = dataset-name
dbms = dbms-name
outfile/outtable = 'path\filename' replace;
run;
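
A concrete sketch of the templates above, assuming the comma separated file d:\Mydata\sample_accounts.csv from the earlier slides exists and has no header row. The output file name sample_accounts_out.csv is a made-up example.

```sas
/* Read the CSV into a temporary data set */
proc import datafile = 'd:\Mydata\sample_accounts.csv'
            dbms = csv
            out = work.sample_accounts
            replace;
   getnames = no;   /* assumption: the file has no header row */
run;

/* Write the data set back out as a CSV file (hypothetical file name) */
proc export data = work.sample_accounts
            outfile = 'd:\Mydata\sample_accounts_out.csv'
            dbms = csv
            replace;
run;
```

With `getnames = no`, PROC IMPORT names the columns VAR1, VAR2 and so on, so a RENAME or a DATA step with INFILE may still be preferable when specific variable names and informats are needed.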

SQL Vs Data / Proc Step

Proc SQL can do much of what the Data / Proc steps do in SAS

Function                      Key Words in SQL
Data set creation             Create table / Select from
Data Manipulation             Count / Sum / Min / Max / Group by
Sorting                       Order by
Merging                       Join using Where clause
Access external data source   connect to (oracle)

SQL
Manipulations (Vertical Processing)
proc sql;
create table act_info1 as
select
acct_no as act_num,
count(amt) as cnt_trans,
sum(amt) as act_sales,
min(amt) as min_sales,
max(amt) as max_sales
from
sales1
group by
acct_no
order by
acct_no;
quit;

act_num   cnt_trans   act_sales   min_sales   max_sales
97        1           71.46       71.46       71.46
2174      5           326.7       51.3        84.7
2610      1           6.93        6.93        6.93
3950      3           236.83      75.15       81.04
4198      2           310.81      145.91      164.9
5304      6           231.76      20.88       96.02
9902      3           130.69      30          57.1

SQL Manipulations (Embedded Query)

data act_list;
input acct_no $4.;
cards;
2610
3868
9902
4258
4198
3950
;

proc sql;
create table act_info2 as
Select * from sales1 where acct_no in (select acct_no from act_list);
quit;
proc sql;
create table act_info3 as
/* a.* avoids a duplicate acct_no column from the second table */
Select a.* from sales1 a, act_list b where
b.acct_no = a.acct_no;
quit;

Uploading & Downloading datasets

Transferring data from local to remote session
Two ways to upload datasets to a remote session:
FTP
PROC UPLOAD

Getting data from remote to local session
Two ways to download datasets from a remote session:
FTP
PROC DOWNLOAD

PROC UPLOAD
libname in remote-host-SAS-data-library;
proc upload data= dataname out = in.dataname;run;
data test1;
input acct $4. sale ret;
cards;
0001 20 5
0002 10 6
0001 30 5
;
rsubmit;
libname in1 '/home/rnayakar';
proc upload data = test1 out = in1.test1;
run;
Endrsubmit;
rsubmit;
proc print data = in1.test1;
run;
Endrsubmit;

PROC DOWNLOAD
libname in remote-host-SAS-data-library;
proc download data = in.dataname out = dataname1;
run;
rsubmit;
libname in1 '/home/rnayakar';
proc download data = in1.test1 out = test2;
run;
endrsubmit;

proc print data = test2;


run;

Working in Remote Session

Working with SAS residing on remote servers (e.g. Mason Server in US) through local SAS (e.g. a local PC)

Downloading data from remote databases consumes more time and space. Working in the remote session instead lets SAS data sets be created (and manipulated) on the server more efficiently.

Working in Remote Session
Signon
Rsubmit
Endrsubmit
Script File (RLINK)
UPLOAD & DOWNLOAD

[Figure: local SAS sends commands to remote SAS over a TCP/IP socket; the interaction is invisible to the user]

How to work in remote SAS?

Working In the Remote Sessions


Sign-on to remote SAS using the script file
Script file: Set of system commands to access the SAS in different
location using the Network ( info like Access ID / PW, remote sys IP)
SAS statements to signon :
Model Server Signon Script:
options noxwait noxsync;
data _null_;
rc=system('putty -ssh -L 40000:localhost:7551 sconsas01.usc.consfin.ge.com');
run;
%let server=localhost 40000;
Signon remote=server user= _prompt_ passwd=_prompt_;

Rules for Remote Sessions


Remote SAS commands should
Start with rsubmit
End with endrsubmit
Examples of SIGNON, RSUBMIT, ENDRSUBMIT, and SIGNOFF
Statements
Signon;
rsubmit;
data a; x=1;
run;
endrsubmit;
Signoff;


Commonly Used SAS Functions

Broadly, the functions fall into the following categories:
Data type conversion (input/put)
Character
Numeric
Date/Time
Random number functions

Data Type Conversion functions

Data is found in character, numeric and date/time formats. The PUT & INPUT functions are used to convert between data types.
PUT function
numeric to character
Zip code is stored as a number in zip_n variable e.g., 65401, 4567
data new;
set data1;
zip_c = put(zip_n, 5.); /* numeric format for a numeric value: ' 4567' */
zip_c_1 = put(zip_n, z5.); /* zero-padded: '04567' */
run;
INPUT function
character to numeric
Zip code is stored as a character in zip_c variable e.g., 65401, 04567
data new;
set data2;
zip_n = input(zip_c, 8.);
run;
character to date
Date is stored as a character string in dt_c variable e.g. 12JUN2008
data new;
set data3;
dt_n = input(dt_c, date9.);
format dt_n date9.;
run;
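
A self-contained sketch of the round trip between numeric and character values, using a made-up zip code value rather than the `data1`/`data2` data sets assumed above:

```sas
data _null_;
   zip_n = 4567;
   zip_plain   = put(zip_n, 5.);        /* character, right-aligned and blank-padded */
   zip_zero    = put(zip_n, z5.);       /* character, zero-padded: 04567 */
   back_to_num = input(zip_zero, 5.);   /* numeric again: 4567 */
   put zip_plain= zip_zero= back_to_num=;
run;
```

PUT always returns a character value and takes a format; INPUT takes an informat and returns a value of the type that informat produces.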

Character Functions

Function: Compress
Use: Removes specified characters from a variable. One use is to remove unnecessary spaces from a variable.
Syntax: COMPRESS (source, characters-to-remove)

Function: Left
Use: Left justifies the variable value.
Syntax: LEFT (argument)

Function: Length
Use: Returns the number of characters in a character variable value.
Syntax: LENGTH (argument)

Function: Right
Use: Right justifies the variable value.
Syntax: RIGHT (argument)

Function: Scan
Use: Returns a portion of the variable value as defined by a delimiter. For example, the delimiter could be a space, comma, semi-colon etc.
Syntax: SCAN (argument, n, delimiters)

Function: Substr
Use: Returns a portion of the variable value based on a starting position and number of characters.
Syntax: SUBSTR (argument, position, n)

Function: Tranwrd
Use: Replaces a portion of the character string (word) with another character string or word. For example, a delimiter was supposed to be a comma but the data in some cases contains a colon. This function could be used to replace the colon with a comma.
Syntax: TRANWRD (argument, from, to)

Function: Trim
Use: Removes the trailing blanks from the right-hand side of a variable value.
Syntax: TRIM (argument)
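
The character functions above can be tried together in one short sketch. The input string and variable names are made up for illustration.

```sas
data _null_;
   length raw $20;
   raw = '  New Delhi:IN  ';
   fixed   = tranwrd(raw, ':', ',');      /* replace the colon with a comma */
   city    = scan(left(fixed), 1, ',');   /* first comma-delimited piece: New Delhi */
   country = scan(fixed, 2, ',');         /* second piece: IN */
   first3  = substr(city, 1, 3);          /* first three characters: New */
   nospace = compress(city);              /* no second argument: removes blanks */
   len     = length(city);                /* trailing blanks are not counted */
   put city= country= first3= nospace= len=;
run;
```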

Numeric Functions

Function: Sum
Use: Sums variables (it gives the required result even if one or two of the arguments are missing)
Syntax: SUM (argument1, argument2, argument3, ...)

Function: Max/Min
Use: Gives the maximum/minimum of the specified arguments
Syntax: MAX/MIN (argument1, argument2, argument3, ...)

Function: Mean
Use: Gives the average
Syntax: MEAN (argument1, argument2, argument3, ...)

Function: Round
Use: Rounds to the nearest round-off unit
Syntax: ROUND (argument, round-off-unit)

Date/Time functions

Function: Year
Use: Returns the year from a date value
Syntax: YEAR (argument)

Function: Hour
Use: Returns the hour from a time value
Syntax: HOUR (argument)

Function: Minute
Use: Returns the minute from a time value
Syntax: MINUTE (argument)

Function: Second
Use: Returns the second from a time value
Syntax: SECOND (argument)

Function: Datepart
Use: Returns the date only from a datetime value
Syntax: DATEPART (argument)

Function: Timepart
Use: Returns the time only from a datetime value
Syntax: TIMEPART (argument)

Function: MDY
Use: Returns a SAS date value from the numeric values for month, day and year
Syntax: MDY (month, day, year)

Function: HMS
Use: Returns a SAS time value from the numeric values for hour, minutes and seconds
Syntax: HMS (hour, minute, second)

Function: Today
Use: Returns the current date value.
Syntax: TODAY()

Function: Date
Use: Returns the current date value (equivalent to Today).
Syntax: DATE()

Function: Datetime
Use: Returns the current datetime value.
Syntax: DATETIME()
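
A short sketch combining several of the date/time functions above. The values are made up, and DHMS (which builds a datetime from a date plus hour, minute and second) is an extra function not listed in the table, used here only to create a datetime to take apart.

```sas
data _null_;
   d  = mdy(6, 12, 2008);      /* 12 June 2008 as a SAS date */
   t  = hms(14, 30, 0);        /* 14:30:00 as a SAS time */
   dt = dhms(d, 14, 30, 0);    /* datetime built from the date and a time of day */
   y  = year(d);               /* 2008 */
   h  = hour(t);               /* 14 */
   d2 = datepart(dt);          /* the date portion of the datetime */
   t2 = timepart(dt);          /* the time portion of the datetime */
   put d= date9. t= time8. dt= datetime18. y= h=;
   put d2= date9. t2= time8.;
run;
```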

Random numbers and Sample Selection

Random number functions
RANNOR(seed)
returns a random variate from a standard normal distribution
RANBIN(seed, n, p)
returns a random variate from a binomial distribution
RANUNI(seed)
returns a random variate from a uniform distribution

Selecting a given % of observations from a large dataset
Sample selection using the Data step and a random number function
data dev;
set total;
if ranuni(12345) <= 0.5;
run;
data devnew valnew;
set totalnew;
if ranuni(12345) <= 0.4 then output devnew;
else output valnew;
run;

Random numbers and Sample Selection

Sample selection using the SURVEYSELECT procedure
proc surveyselect data = sample out=sample_dev samprate=50;
run;
proc surveyselect data = prescreen_resp out= new_sample
sampsize=10000 method=srs;
run;
proc surveyselect data = prescreen_resp out= biased_sample
sampsize=(83296,4384);
strata resp;
run;
Challenge 4
a) Create 100 normal random numbers with mean = 100 variance = 16
b) Simulate a coin tossing experiment with 100 trials
1 Attractive award in store

BY-Group Processing
Key Words: By and Retain
Important Note: Dataset has to be sorted by the Key variable

By Processing - Example
proc sort data = sales3;
by acct_no tran_code tran_dt_key descending amt_tran;
run ;

data act_sales3 (drop = store subdiv district region ) ;
retain first_purchase ;
set sales3 ;
by acct_no ;
if first.acct_no then
do;
first_purchase = amt_tran ;
sale_am = 0 ;
sale_no = 0 ;
end;
sale_am + amt_tran * (tran_code = 253) ;
sale_no + (tran_code = 253) ;
if last.acct_no then output ;
run ;
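
The same pattern in a self-contained form that can be run as is. The data set `sales_demo` and its values are made up; the variable names follow the example above.

```sas
/* Hypothetical transactions for this sketch */
data sales_demo;
   input acct_no $ amt_tran;
   datalines;
A1 10
A1 25
B2 40
B2 5
B2 15
;
run;

/* BY-group processing needs the data sorted by the key */
proc sort data=sales_demo;
   by acct_no;
run;

data acct_totals;
   set sales_demo;
   by acct_no;
   retain first_purchase;
   if first.acct_no then do;   /* first row of each account: reset the totals */
      first_purchase = amt_tran;
      sale_am = 0;
      sale_no = 0;
   end;
   sale_am + amt_tran;         /* sum statement: the value carries across rows */
   sale_no + 1;
   if last.acct_no then output;   /* write one row per account */
   drop amt_tran;
run;

proc print data=acct_totals;
run;
```

FIRST.acct_no and LAST.acct_no are temporary flags that SAS creates for each BY variable; they are not written to the output data set.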

Efficiency and Other Tips

Document programs with comments

Assign descriptive and meaningful variable names


Make important data sets permanent
Use Keep or Drop statements to retain desired variables
Use if then else statements
Avoid creating unnecessary (intermediate) data sets
Avoid unnecessary sorting
Use length statement to reduce variable size
Use data _null_ when you do not need any output dataset
