
SAS TRAINING SESSION 3

ADVANCED TOPICS USING SAS

Sun Li
Centre for Academic Computing
lsun@smu.edu.sg

OUTLINE

Using arrays in SAS

Introduction to SAS Macro language


(7 steps to get started)

Advanced procedures

Bayesian Analysis example: PROC GENMOD

Cox Hazard Regression: PROC PHREG

Mixed Models: PROC MIXED

Panel Regression: PROC PANEL

USING ARRAYS IN SAS

Using arrays in SAS

Recoding variables

Computing new variables

Collapsing over variables

Identify patterns across variables using arrays

Reshaping data between long and wide formats using arrays (self-study)

ARRAY array_name(n) variable_list;


<DO - END loop>;

Note: SAS arrays are usually processed inside a DO-END loop.



DATA faminc;
input famid faminc1-faminc12 ;
datalines;
1 3281 3413 3114 2500 2700 3500 3114 -999 3514 1282 2434 2818
2 4042 3084 3108 3150 -999 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 -999 6100 6200 6186 6132 -999 4231 6039 6215
;
**recoding variables;
DATA recode_missing;
set faminc;
array inc[12] faminc1 - faminc12;
do i = 1 to 12;
if inc[i]=-999 then inc[i]=.;
end;
drop i;
RUN;


DATA tax_manual;
set recode_missing;
taxinc1 = faminc1 * .10 ;
taxinc2 = faminc2 * .10 ;
taxinc3 = faminc3 * .10 ;
taxinc4 = faminc4 * .10 ;
taxinc5 = faminc5 * .10 ;
taxinc6 = faminc6 * .10 ;
taxinc7 = faminc7 * .10 ;
taxinc8 = faminc8 * .10 ;
taxinc9 = faminc9 * .10 ;
taxinc10= faminc10 * .10 ;
taxinc11= faminc11 * .10 ;
taxinc12= faminc12 * .10 ;
RUN;

**computing new variables;


DATA tax_array;
set recode_missing;
array inc(12) faminc1-faminc12;
array tax(12) taxinc1-taxinc12;
do month = 1 to 12;
tax[month] = inc[month]*0.1;
end;
RUN;



DATA quarter_manual;
set faminc;
incq1 = faminc1 + faminc2 + faminc3;
incq2 = faminc4 + faminc5 + faminc6;
incq3 = faminc7 + faminc8 + faminc9;
incq4 = faminc10 + faminc11 + faminc12;
RUN;

**collapsing over variables;


DATA quarter_array;
set faminc;
array Afaminc(12) faminc1-faminc12;
array Aquarter(4) incq1-incq4;
do q = 1 to 4;
Aquarter[q] = Afaminc[3*q-2] + Afaminc[3*q-1] + Afaminc[3*q];
end;
RUN;
/* example:
For q=1: Aquarter[1] = Afaminc[3*1-2] + Afaminc[3*1-1] + Afaminc[3*1]
                     = Afaminc[1] + Afaminc[2] + Afaminc[3]
For q=2: Aquarter[2] = Afaminc[3*2-2] + Afaminc[3*2-1] + Afaminc[3*2]
                     = Afaminc[4] + Afaminc[5] + Afaminc[6] */



To identify the months in which income was less than half of the previous
month's income, loop over months 2-12 and store the result in the dummy
variables lowinc2-lowinc12. Then create a new variable, ever, to indicate
whether a family's income was ever less than half of the previous month's
income in any month of the year.
**identify patterns across variables using arrays;
DATA pattern;
set faminc;
length ever $ 4;
array Afaminc(12) faminc1-faminc12;
array Alowinc(2:12) lowinc2-lowinc12;
do m = 2 to 12;
if Afaminc[m] < (Afaminc[m-1] / 2) then Alowinc[m] = 1;
else Alowinc[m] = 0;
end;
sum_low = sum(of lowinc:);
if sum_low > 0 then ever='Yes';
if sum_low = 0 then ever='No';
drop m sum_low;
RUN;



Self-study:
Reshape SAS data set from wide to long format, and from long to wide
format.
**reshaping from wide to long;
DATA long_array;
set faminc;
array Afaminc(12) faminc1 - faminc12;
do month = 1 to 12;
faminc = Afaminc[month];
output;
end;
RUN;

FIRST.variable indicates the first observation for each unique value of the BY variable;
LAST.variable indicates the last observation for each unique value of the BY variable.

**reshaping from long to wide;


PROC SORT data=long_array;
by famid;
RUN;
DATA wide_array;
set long_array;
by famid;
retain faminc1-faminc12;
array Afaminc(12) faminc1-faminc12;
if first.famid then do;
do i = 1 to 12;
Afaminc[i] = .;
end;
end;
Afaminc(month) = faminc;
if last.famid then output;
drop month faminc i;
RUN;
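
As a cross-check on the array-based reshaping above, the same round trip can be sketched with PROC TRANSPOSE, which is a different technique from the one shown in the slides. The output data set names long_t and wide_t and the variable name source_var are hypothetical, chosen only for this illustration; the sketch assumes faminc is already sorted by famid, as it is in the datalines above.

```sas
/* wide to long: one row per family and month (PROC TRANSPOSE, not arrays) */
PROC TRANSPOSE data=faminc out=long_t(rename=(col1=faminc)) name=source_var;
  by famid;
  var faminc1-faminc12;
RUN;

/* long to wide: source_var holds the original names faminc1-faminc12 */
PROC TRANSPOSE data=long_t out=wide_t(drop=_name_);
  by famid;
  id source_var;
  var faminc;
RUN;
```

The array version keeps full control over the loop (for example, to recode values while reshaping), while PROC TRANSPOSE is shorter when no per-value logic is needed.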

INTRODUCTION TO SAS MACRO LANGUAGE

7 steps to get started using SAS Macros

1. Write your program and make sure it works

2. Use Macro variables to facilitate text substitution

3. Use simple Macro functions

4. Use the symput routine and the symget function to pass information to and from a data step

5. Make the program into a Macro definition

6. Use parameters in the Macro and specify the parameters when the Macro is called

7. Use the iterative SAS language within a Macro definition to execute code iteratively

SAS Macro Language Documentation

SAS MACRO LANGUAGE

Step 1: Write your program and make sure it works


PROC MEANS data=USPopulation;
var population year yearsq;
RUN;
PROC REG data=USPopulation;
model Population=Year YearSq;
RUN;
QUIT;

Step 2: Use Macro variables to facilitate text substitution


Macro variables:
All keywords in statements that relate to macro variables or macro
programs are preceded by a percent sign (%).

To refer to a macro variable in your program, preface its name with an
ampersand (&).

%let defines a macro variable.

%put displays macro variable values as text in the SAS log.


**Step2: use macro variables to facilitate text substitution;


*defining a macro variable;
%let mydata=uspopulation;
%let indvar=year yearsq;
*using a macro variable;
title "the date is &sysdate9 and today is &sysday";
title2 'the date is &sysdate9 and today is &sysday';
PROC MEANS data=&mydata;
var population &indvar;
RUN;
PROC REG data=&mydata;
model Population=&indvar;
RUN;
QUIT;
*displaying text in log;
%put &sysdate9 is the date on which you invoked SAS.;
*displaying SAS system macro variables;
%put _automatic_;


Many functions work with macro variables, including string functions,
evaluation functions and others.
Step 3: Use simple Macro functions
**Step3: use simple Macro functions;
%let k = 1;
%let tot = &k + 1;
%put &tot;

%let tot = %eval(&k + 1);


%put &tot;
%put;

/* %eval handles integer arithmetic only, so the next statement writes an error to the log */

%let tot = %eval(&k + 1.234);


%let tot = %sysevalf(&k + 1.234);
%put &tot;
%put;
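
The evaluation functions above cover arithmetic; the string functions mentioned at the start of this step can be sketched in the same way. This is a minimal illustration that reuses the value uspopulation from Step 2; %upcase, %substr and %length are standard macro functions.

```sas
%let mydata = uspopulation;

/* convert the text to upper case */
%put %upcase(&mydata);      /* writes USPOPULATION to the log */

/* extract 2 characters starting at position 1 */
%put %substr(&mydata,1,2);  /* writes us to the log */

/* number of characters in the value */
%put %length(&mydata);      /* writes 12 to the log */
```

Comments here use the /* */ style because macro triggers inside asterisk-style comments are still processed in open code.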


Step 4: Use the symput routine and the symget function to pass information
to and from a data step
CALL SYMPUT (new_macro_variable, value_in_string_format)
SYMGET ('macro_variable')

Note: the macro variable name here has to be in single quotes.

Step 5: Make the program into a Macro definition

Step 6: Use parameters in the Macro and specify the parameters
when the Macro is called
Start the macro definition with %MACRO macro_name;
End the macro with %MEND macro_name;
To invoke the macro definition, use %macro_name

Note: no semicolon is needed after the macro name when the macro is called.
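
Before combining these pieces into one macro, the round trip between a data step and the macro facility can be sketched on its own. The macro variable name cutoff, its value, and the data set name check are hypothetical and serve only as an illustration.

```sas
/* data step to macro variable: CALL SYMPUT stores a character value */
DATA _null_;
  call symput('cutoff', '2500');
RUN;

/* the macro variable is available once the data step has finished */
%put cutoff is &cutoff;

/* macro variable to data step: SYMGET returns text, so convert it to a number */
DATA check;
  cutoff_num = input(symget('cutoff'), 8.);
RUN;
```

SYMGET returns a character string, which is why the value is converted with INPUT here; adding 0, as in the macro on the next slide, relies on automatic conversion to achieve the same thing.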


*Step4, 5 and Step 6;


%macro mexample(mydata, indvar);
PROC MEANS data=&mydata;
var population &indvar;
output out=stats mean=avg;
RUN;
PROC PRINT data=stats; RUN;
DATA _null_;
set stats;
dt=put(today(), mmddyy10.);
call symput('date', dt);
call symput('average', put(avg,7.2));
RUN;
DATA new&mydata.;
set &mydata;
avg=symget('average')+0;
RUN;
PROC PRINT data=new&mydata; RUN;
%mend;
%mexample(uspopulation,year yearsq)


Step 7: Use the iterative SAS language within a Macro definition to
execute code iteratively
DATA file1 file2 file3 file4;
input a @@;
if _n_ <= 3 then output file1;
if 3 < _n_<= 6 then output file2;
if 6 < _n_ <= 9 then output file3;
if 9 < _n_ <=12 then output file4;
datalines;
1 2 3 4 5 6 7 8 9 10 11 12
;
RUN;
%macro combine(num);
DATA big;
set
%do i = 1 %to &num;
file&i
%end;
;
RUN;
%mend;

%combine(4)
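
To see what the %do loop actually generates, one option is to switch on the MPRINT system option before calling the macro. This sketch assumes the %combine macro and the data sets file1-file4 defined above already exist in the session.

```sas
options mprint;     /* write the SAS statements generated by macros to the log */
%combine(4)
options nomprint;   /* switch the extra log output off again */
```

With MPRINT on, the log should list the generated data step, including a single SET statement that names file1 through file4, which is a quick way to check that the loop built the text you intended.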


DATA logit;
input v1-v5 ind1 ind2;
datalines;
1 0 1 1 0 34 23
0 0 1 0 1 22 32
1 1 1 0 0 12 10
0 1 0 1 1 56 90
0 1 0 1 1 26 80
1 1 0 0 0 46 45
0 0 0 1 1 57 53
1 1 0 0 0 22 77
0 1 0 1 1 44 45
1 1 0 0 0 41 72
;
RUN;

%macro mylogit(num);
%do i = 1 %to &num;
PROC LOGISTIC data=logit des;
model v&i = ind1 ind2;
RUN;
%end;
%mend;

%mylogit(5)

ADVANCED PROCEDURES

Bayesian Analysis of Linear Models using PROC GENMOD

Bayesian Analysis of the Cox Models using PROC PHREG

Mixed Linear Models using PROC MIXED

Panel Regression: PROC PANEL

For the following advanced models, please take a look at the notes of the
training "New features of SAS 9.2":

PROC GLIMMIX - Generalized linear mixed model

PROC COUNTREG - ZINB model

PROC GENMOD
Bayes' theorem:
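
In standard notation, the theorem can be written as follows. The symbols are assumptions made for this sketch: θ denotes the model parameters and y the observed data.

```latex
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
\;\propto\; p(y \mid \theta)\, p(\theta)
```

In words, the posterior distribution is proportional to the likelihood times the prior.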

Gibbs sampling:
It is a special case of the Metropolis-Hastings algorithm, and
thus an example of a Markov chain Monte Carlo algorithm.

It is particularly well-adapted to sampling the posterior distribution of
a Bayesian network, since Bayesian networks are typically specified as a
collection of conditional distributions.
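
The idea of sampling from conditional distributions can be sketched in a data step. This is a toy illustration and not what PROC GENMOD runs internally: it assumes a standard bivariate normal target with correlation rho = 0.8, for which each conditional distribution is normal with mean rho times the other variable and variance 1 - rho**2. The data set name gibbs and all variable names are hypothetical.

```sas
/* Toy Gibbs sampler: alternate draws from x given y and y given x */
DATA gibbs;
  call streaminit(1);              /* fix the seed for reproducibility */
  rho = 0.8;
  sd  = sqrt(1 - rho**2);          /* conditional standard deviation */
  x = 0; y = 0;                    /* arbitrary starting values */
  do iter = 1 to 5500;
    x = rand('normal', rho*y, sd); /* draw x from its distribution given y */
    y = rand('normal', rho*x, sd); /* draw y from its distribution given x */
    if iter > 500 then output;     /* discard the first 500 draws as burn-in */
  end;
  keep iter x y;
RUN;

/* the sample correlation should be close to rho */
PROC CORR data=gibbs;
  var x y;
RUN;
```

Each draw uses only a conditional distribution, yet the retained pairs approximate the joint distribution, which is the property that makes the method convenient for posterior sampling.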
Bayesian Analysis of a Linear Regression Model using Gibbs sampling:

Here is a study of 54 patients undergoing a certain kind of liver
operation in a surgical unit. The data set Surg contains survival
time and certain covariates for each patient.

**Bayesian Analysis of a Linear Regression Model;

ODS html;
ODS graphics on;
PROC GENMOD data=surg;
model y = Logx1 X2 X3 X4 / dist=normal;
bayes seed=1;
ods output PosteriorSample=PostSurg;
RUN;
ODS graphics off;
ODS html close;

BAYES: requests a Bayesian analysis via Gibbs sampling.
SEED=: specifies the random number seed. It is used to maintain reproducibility.
ODS OUTPUT: saves the posterior samples in the SAS data set PostSurg for further processing.
By default, a uniform prior distribution is assumed on the regression coefficients. A
noninformative gamma prior is used for the scale parameter sigma.

DATA prob;
set postsurg;
indicator = (logX1 > 0);
label indicator= 'log(Blood Clotting Score) > 0';
RUN;
PROC MEANS data = prob n mean;
var indicator;
RUN;

The mean of the indicator over the posterior sample estimates the
posterior probability: there is a 1.00 probability of a positive
relationship between the logarithm of the blood clotting score and
survival time, adjusted for the other covariates.

PROC PHREG

Survival data: time to event data

Reasons for using a survival model:

The distribution of survival data tends to be positively skewed and is
unlikely to be normal, and it may not be possible to find a
transformation that makes it normal.

Ordinary regression cannot handle time-varying covariates.

In addition, some durations are censored (censored observations: right
truncation, left truncation, right censoring and left censoring).

Cox Regression Model: $h_i(t) = h_0(t) \exp(\beta^T x_i)$

$h_0(t)$ is the baseline hazard function.

$\exp(\beta^T (x_i - x_j))$ is the hazard ratio (HR) or incident rate ratio.

For example, for a simple Cox Model with only one class variable, the
model is written as:

$$h(t) = \begin{cases} h_0(t) & \text{if } x = 0 \\ h_0(t) \exp(\beta) & \text{if } x = 1 \end{cases}$$

Bayesian Analysis:
The probability that the hazard of x=0 is greater than that of x=1 is:

$$\Pr\{h_0(t) > h_0(t) \exp(\beta)\} = \Pr\{\beta < 0\}$$
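
A short worked example may help connect the coefficient to the hazard ratio in the one-variable case. The symbol β for the regression coefficient follows standard Cox model notation, and the value β = -0.5 is hypothetical, chosen only to illustrate the arithmetic.

```latex
\mathrm{HR} \;=\; \frac{h(t \mid x = 1)}{h(t \mid x = 0)}
\;=\; \frac{h_0(t)\exp(\beta)}{h_0(t)} \;=\; \exp(\beta),
\qquad
\beta = -0.5 \;\Rightarrow\; \mathrm{HR} = \exp(-0.5) \approx 0.61
```

The baseline hazard cancels, so the ratio does not depend on t. A negative coefficient gives a ratio below 1, meaning the group with x=1 has the lower hazard, which is why the event that the hazard of x=0 exceeds that of x=1 is equivalent to a negative coefficient.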

To study the probability of customers switching to other
telecommunication companies: telco.csv

Variable name   Variable information
age             Age in years
marital         Marital status: 0=unmarried 1=married
address         Years at current address
income          Household income in thousands
ed              Level of education: 1=didn't complete high school
                2=high school degree 3=college degree
                4=undergraduate 5=postgraduate
employ          Years with current employer
reside          Number of people in household
gender          Gender: 0=male 1=female
tenure          Months with service
churn           Churn within last month: 0=No 1=Yes
custcat         Customer categories: 1=basic service 2=E-service
                3=plus service 4=complete service

Bayesian Analysis of the Cox Regression Model:

PROC PHREG <DATA=SAS-data-set> ;


MODEL tvar*cvar(list) =predictors;
<program statements;>
BAYES <options> ;
TEST var_list;
BASELINE OUT=<> COVARIATES=<>;
RUN;
MODEL: identifies the model elements.
BAYES: requests a Bayesian analysis of the regression model by using Gibbs sampling.
TEST: tests linear hypotheses about the regression coefficients.
BASELINE OUT=: creates a new SAS data set that contains the baseline function
estimates at the event times of each stratum for every set of covariates given in
the COVARIATES= data set.

DATA telco;
set sas3.telco;
RUN;
PROC PRINT data=telco (obs=10); RUN;
*Cox Regression Model;
PROC PHREG data=telco;
model tenure*churn(0)=marital address income ed employ custcat1
custcat2 custcat3;
custcat1=(custcat=1);
custcat2=(custcat=2);
custcat3=(custcat=3);
cust_categories: test custcat1, custcat2, custcat3;
RUN;
*Bayesian Analysis of the Cox Regression Model;

ODS html;
ODS graphics on;
PROC PHREG data=telco;
model tenure*churn(0)= custcat1 custcat2 custcat3;
custcat1=(custcat=1);
custcat2=(custcat=2);
custcat3=(custcat=3);
bayes seed=1 outpost=post;
RUN;
ODS graphics off;
ODS html close;
DATA New;
set Post;
Indicator=(custcat1 < 0);
label Indicator='Basic service < 0';
RUN;
PROC MEANS data=New(keep=Indicator) n mean;
RUN;

Self Study: Prediction
DATA cov_pat;
marital = 1;
address = 1;
employ = 3;
custcat2 = 0;
custcat3 = 1;
custcat4 = 0;
RUN;

PROC PHREG data=sas3.telco;


model tenure*churn(0)=custcat2 custcat3 custcat4;
custcat2=(custcat=2);
custcat3=(custcat=3);
custcat4=(custcat=4);
baseline out=surv covariates=cov_pat survival=surv / nomean;
RUN;
goptions reset=all;
axis1 label=(a=90 'Survivorship function');
PROC GPLOT data=surv;
plot surv*tenure=marital / vaxis=axis1;
RUN;
QUIT;


PROC MIXED

Recommended reading:

Introduction to Multilevel Modeling by Ita Kreft and Jan de Leeuw

Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling by Tom Snijders and Roel Bosker

Multilevel Analysis: Techniques and Applications by Joop Hox

Repeated measures, nested data: PROC MIXED

Fixed-effects parameters: the parameters of the mean model.

Covariance parameters: the parameters of the variance-covariance model.

The most common of these structures arises from the use of random-effects
parameters, which are additional unknown random variables assumed to affect
the variability of the data. The variances of the random-effects parameters,
commonly known as variance components, become the covariance parameters
for this particular structure.

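Hierarchical notation:

Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$

Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} Z_j + u_{0j}$
         $\beta_{1j} = \gamma_{10} + \gamma_{11} Z_j + u_{1j}$

Mixed model notation:

$Y_{ij} = \gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij} + \gamma_{11} Z_j X_{ij} + u_{0j} + u_{1j} X_{ij} + r_{ij}$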
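
The mixed model form follows from substituting the two Level 2 equations into the Level 1 equation. The Greek symbols are assumptions taken from standard multilevel notation: β for the school-specific coefficients, γ for the fixed effects, and u and r for the random terms.

```latex
\begin{aligned}
Y_{ij} &= \beta_{0j} + \beta_{1j} X_{ij} + r_{ij} \\
       &= (\gamma_{00} + \gamma_{01} Z_j + u_{0j})
        + (\gamma_{10} + \gamma_{11} Z_j + u_{1j})\, X_{ij} + r_{ij} \\
       &= \underbrace{\gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij}
        + \gamma_{11} Z_j X_{ij}}_{\text{fixed part}}
        + \underbrace{u_{0j} + u_{1j} X_{ij} + r_{ij}}_{\text{random part}}
\end{aligned}
```

The cross-level interaction term with coefficient γ11 appears only because the Level 1 slope is itself modeled as a function of the Level 2 predictor.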


Nested data (example: hsb.sas7bdat)

Variable name   Variable information
MATHACH         student-level math achievement score (outcome variable)
SES             socio-economic status of a student (student-level)
MEANSES         the group mean of SES (school-level)
CSES            the group-centered SES, i.e. SES minus MEANSES (student-level)
SECTOR          indicates whether a school is public or catholic (school-level):
                0=public schools 1=catholic schools

PROC SQL;
create table hsb2 as
select *, mean(ses) as meanses, ses-mean(ses) as cses
from sas3.hsb
group by schoolid;
QUIT;

$Y_{ij} = \gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij} + \gamma_{11} Z_j X_{ij} + u_{0j} + u_{1j} X_{ij} + r_{ij}$

The fixed effect refers to the overall expected effect of a student's
socioeconomic status on test scores; the random effect gives information
on whether or not this effect differs between schools.

PROC MIXED COVTEST <DATA=SAS-data-set> ;
CLASS variables;
MODEL dep_var = predictors / SOLUTION ddfm=bw;
RANDOM variables / SUBJECT=var SOLUTION;
RUN;
MODEL: identifies the model elements.
CLASS: specifies the classification variables.
NOTEST: suppresses the hypothesis tests for fixed effects.
SOLUTION: displays parameter estimates.
DDFM=bw: requests the between-within method for computing the
denominator degrees of freedom for the tests of fixed effects, instead of
the default containment method.
RANDOM: requests random coefficients.
SUBJECT=: identifies the subjects in the mixed model.

**Multilevel model (mixed model): PROC MIXED;

PROC PRINT data=hsb2 (obs=10);


var mathach ses meanses cses sector;
RUN;

PROC MIXED covtest data=hsb2;


class schoolid;
model mathach= meanses sector cses meanses*cses sector*cses
/solution ddfm=bw notest;
random intercept cses /subject=schoolid solution;
RUN;

Self Study: Prediction
Plot the predicted math achievement scores with meanses constrained to
low, medium and high values. Use the 25th, 50th and 75th percentiles to
define the low, medium and high strata.
PROC UNIVARIATE data=hsb2;
var meanses;
RUN;
DATA toplot;
set hsb2;
if meanses<=-0.323 then do;
ms=-0.323;
strata="Low";
end;
else if meanses>=0.327 then do;
ms=0.327;
strata="Hig";
end;
else do;
ms=0.032; strata="Med" ; end;
predicted=12.1282+5.3367*ms+1.2245*sector+2.9407*cses+1.0345*ms*cses
-1.6388*sector*cses;
RUN;

PROC SORT data=toplot;
by strata;
RUN;
goptions reset=all;
symbol1 i=join c=red ;
symbol2 i=join c=blue ;
axis1 order=(-4 to 3 by 1) label=("Group Centered SES");
axis2 order=(0 to 22 by 2) label=(a=90 "Math Achievement Score");
PROC GPLOT data = toplot;
by strata;
plot predicted*cses = sector / vaxis = axis2 haxis = axis1;
RUN;
QUIT;


PROC PANEL
Panel data structure: We document values of a total of j factors
for a total of n subjects (e.g. firms) at each of t time points.
Each case is indexed by subject and time point (for example, 12 is
subject 1 at time 2).

Cases (nt)   x1   x2   x3   ...   xj
11
12
...
1t
21
22
...
2t
31
32
...
3t
...
nt


Panel data:

Also called cross-sectional time series data: multiple cases (people,
nations, firms, etc.) observed for two or more time periods.

Cross-sectional information: differences between subjects
(between-subject effects).

Time series information: changes within subjects over time
(within-subject effects).
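
The between/within distinction can be sketched with the same PROC SQL pattern used earlier for the group-centered SES. This sketch assumes the airline data set used later in this session, with firm id i, time period t, and logged cost and quantity lC and lQ; the output data set name airline_bw and the new variable names are hypothetical.

```sas
/* Split each variable into a firm mean (between) and a deviation from it (within) */
PROC SQL;
  create table airline_bw as
  select i, t,
         mean(lC)      as lC_between,   /* firm average over all periods */
         lC - mean(lC) as lC_within,    /* deviation from the firm average */
         mean(lQ)      as lQ_between,
         lQ - mean(lQ) as lQ_within
  from airline
  group by i;
QUIT;
```

The between columns are constant within a firm and capture differences across firms, while the within columns average to zero for every firm and capture only changes over time.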

Two effects models:

PROC REG approaches:
LSDV1   w/o dummy
LSDV2   /NOINT
LSDV3   RESTRICT


PROC PANEL <DATA=SAS-data-set> ;


CLASS variables;
ID cross-section-id time-series-id;
MODEL dep_var = predictors / options;
RUN;
MODEL : identifies the model elements. Options indicate the
model and the type of variance component estimate, such as,
ranone vcomp=fb;
CLASS : specifies the classification variables.
ID: specifies cross-section variable and time-series variable.
Airline cost data (airline.sas7bdat)

The data measure costs, prices of inputs, and utilization rates for six
airlines over the time span 1970-1984. This example analyzes the log
transformations of the cost, price and quantity, and the raw (not logged)
capacity utilization measure.

Variable name   Variable information
I               Firm number (CSID)
T               Time period (TSID)
LF              Load factor (utilization index)
lC              Log transformation of costs
lQ              Log transformation of quantity
lPF             Log transformation of price of fuel

The following two-way fixed-effects model is specified first, followed by the one-way fixed-effects model:

**Fixed effect Panel Model;


PROC PANEL data=airline;
id i t;
model lC = lQ lPF LF / fixtwo;
RUN;
PROC PANEL data=airline;
id i t;
model lC = lQ lPF LF / fixone;
RUN;

Further analysis on random effects:
*random effect panel model;
PROC PANEL data=airline outest=estimates;
id I T;
RANONE:   model lC = lQ lPF lF / ranone vcomp=fb;
RANONEwk: model lC = lQ lPF lF / ranone vcomp=wk;
RANONEwh: model lC = lQ lPF lF / ranone vcomp=wh;
RANONEnl: model lC = lQ lPF lF / ranone vcomp=nl;
RANTWO:   model lC = lQ lPF lF / rantwo vcomp=fb;
RANTWOwk: model lC = lQ lPF lF / rantwo vcomp=wk;
RANTWOwh: model lC = lQ lPF lF / rantwo vcomp=wh;
RANTWOnl: model lC = lQ lPF lF / rantwo vcomp=nl;
RUN;

There are four ways of computing the variance components in the one-way
random-effects model. The Fuller and Battese method (FB, the default) uses a
"fitting of constants" method to estimate them. The Wansbeek and Kapteyn
(WK) method uses the true disturbances, while the Wallace and Hussain (WH)
method uses ordinary least squares residuals. The Nerlove (NL) method is
guaranteed to give estimates of the variance components that are always positive.

Self-study: output results of these random model estimates in tables;

*self study;
DATA table;
set estimates;
VarCS = round(_VARCS_,.00001);
VarTS = round(_VARTS_,.00001);
VarErr = round(_VARERR_,.00001);
Int = round(Intercept,.0001);
lQ2 = round(lQ,.0001);
lPF2 = round(lPF,.0001);
lF2 = round(lF,.0001);
keep _MODEL_ _METHOD_ VarCS VarTS VarErr Int lQ2 lPF2 lF2;
RUN;

title 'Parameter Estimates';
PROC PRINT data=table label noobs;
label _MODEL_ = "Model"
_METHOD_ = "Method"
Int = "Intercept"
lQ2 = "lQ"
lPF2 = "lPF"
lF2 = "lF";
var _METHOD_ _MODEL_ Int lQ2 lPF2 lF2;
RUN;

title 'Variance Component Estimates';
PROC PRINT data=table label noobs;
label _MODEL_ = "Model"
_METHOD_ = "Method"
VarCS = "Cross Sections"
VarTS = "Time Series"
VarErr = "Error Term";
var _METHOD_ _MODEL_ VarCS VarTS VarErr;
RUN;

THANKS!
CAC statistical WIKI page:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/SAS.aspx

Statistical consultation service: lsun@smu.edu.sg
