
SAS Essentials II: Better-Looking SAS for a Better Community
AnnMaria De Mars, The Julia Group, Santa Monica, CA

ABSTRACT
Experienced programmers don't just write code that runs; their work also looks professional. That doesn't refer to their designer
wardrobe (a quick glance around at your co-workers probably told you that) but to their code, log and output. In this paper,
better-looking and better-designed programs are demonstrated using PROC FREQ and macros. SAS system-level options are examined for
their effectiveness in producing better-looking logs and output. PROC FORMAT, PROC TABULATE, PROC PRINT and ODS are
used to create better-looking reports. SAS/GRAPH and Graph-N-Go are used to make good-looking graphs.
Now here's the catch - where do you get the time and opportunity to try out new techniques? The examples used here are from
projects done for various community service organizations, from sports organizations to public schools. In the end, your program
looks good, your output looks good and you've improved both your programming skills and your community. Use your SAS skills to
produce reports for your child's sports league and you'll never have to chaperone in the freezing cold again! And who says nice guys
finish last?
INTRODUCTION
One way to break out of that vicious circle of "can't get a job without experience, can't get experience without a job" is to take
advantage of the many opportunities to use open data to help inform your community. Open data, that is, data freely available to
anyone to use or publish, is available from an enormous range of government and non-profit sources. My personal favorite sources
for open data are the websites of data.gov, the U.S. Census Bureau and the National Center for Education Statistics. These three sites
alone provide several hundred thousand data sets to choose from. Applying your SAS skills to open data gives you experience with
different data set types, statistical techniques and procedures as well as the potential for doing some good for your community.
This is the second of a three-paper series on SAS applied to open data. There is often substantial work involved in preparing a data
set for analysis and an earlier paper dealt with that process (De Mars, 2011a). We're going to assume that work has already been
done. The next step is to produce presentation quality results.
EXAMPLE 1: AMERICAN COMMUNITY SURVEY - MAKING PRESENTATION-QUALITY TABLES

This example uses the 2009 American Community Survey Public Use Microdata Sample (U.S. Census, 2009). The Public Use
Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). There are 3,030,728
records in the data set. Now you see an advantage of open data - it's often huge. You get the opportunity to use BIG DATA (or at
least relatively big data) which is an experience everyone wants you to have but people are rightfully a bit nervous about letting
novices touch their data sets with millions of records, because, well, they might need that data later, no?
Let's begin with a simple procedure for a presentation to middle school students. We want to know how many people on the census
forms select White as their race, how many put Black as their race, how many checked both the boxes for Black and White and
how many checked neither. We could do a PROC FREQ.
PROC FREQ DATA = lib.pums9 ;
TABLES racblk*racwht;
No matter how much you love SAS you must confess that this produces some of the ugliest output ever.
The SAS System
23:18 Saturday, August 6, 2011
The FREQ Procedure

Table of RACBLK by RACWHT

RACBLK(Race includes Black)     RACWHT(Race includes White)

Frequency|
Percent  |
Row Pct  |
Col Pct  |0       |1       |  Total
---------|--------|--------|
0        |3.188E7 |2.342E8 |2.661E8
         |  10.38 |  76.28 |  86.66
         |  11.98 |  88.02 |
         |  45.10 |  99.09 |
---------|--------|--------|
1        |3.881E7 |2148908 |4.095E7
         |  12.64 |   0.70 |  13.34
         |  94.75 |   5.25 |
         |  54.90 |   0.91 |
---------|--------|--------|
Total     7.068E7  2.363E8   3.07E8
            23.02    76.98   100.00
Not only is this table really ugly, it's not even usable for a presentation. Most middle school students are not going to be able to
interpret numbers in scientific notation. We'd like to get rid of the date, give it a title other than The SAS System, and have Black and
White show up for race instead of 0 and 1. We'd also like to get rid of the scientific notation and have numbers as human beings
read them, not calculators. According to the SAS Procedures Guide for Version 8 (SAS Institute, 1999), "When scientific notation is
used, only the first few significant digits are shown. If you need more significant digits than PROC FREQ displays, create an output
data set by specifying OUT= in the TABLES statement. Then use PROC PRINT and assign an appropriate format to the variable
COUNT."
The later documentation has been silent on this issue, so if there is a simple way to get rid of scientific notation in PROC FREQ,
I haven't found it. We want to create an output data set anyway. Think about the fact that you are working with three million records.
Charting, sorting and applying IF statements based on the values of 3,000,000+ records is inefficient. So, our challenge is both to become
more efficient and to change our output to the improved version below, in one and the same program.
Population Distribution by Race

Race includes Black   Race includes White   2009 Population   Percent of Population
No                    Yes                       234,175,873                    76.3
Yes                   No                         38,805,561                    12.6
No                    No                         31,876,214                    10.4
Yes                   Yes                         2,148,908                     0.7

2009 American Community Survey Data
Here is the code to produce the table above. At first glance, it may seem an unreasonable amount of code for one little table, but
there is a method at work here, trust me.
OPTIONS NODATE NONUMBER ;
PROC FORMAT ;
VALUE $YN
"0" = "No"
"1" = "Yes" ;
PROC FREQ DATA = lib.pums9 NOPRINT;
TABLES racblk* racwht / OUT = lib.blkwhitmix ;
WEIGHT pwgtp ;
PROC SORT DATA = lib.blkwhitmix ;
BY DESCENDING PERCENT ;
ODS RTF FILE = "C:\Users\AnnMaria\Documents\pc_pus\sasout\RaceDist.rtf" STYLE = OCEAN ;
TITLE "Population Distribution by Race" ;
FOOTNOTE "2009 American Community Survey Data" ;
PROC PRINT DATA = lib.blkwhitmix SPLIT = " " ;
ID racblk ;
VAR racwht COUNT PERCENT ;
FORMAT COUNT COMMA14. PERCENT 8.1 racblk racwht $yn. ;
LABEL COUNT = "2009 Population"
PERCENT = "Percent of Population" ;
WHAT WE'RE DOING WITH THIS PROGRAM AND WHY WE ARE DOING IT
Starting from the top of our program
OPTIONS NODATE NONUMBER ;
This removes the date and page number from the first line of our output.
PROC FORMAT ;
The format procedure begins with the PROC FORMAT statement.
VALUE $YN
"0" = "No"
"1" = "Yes" ;
The VALUE statement will create a format. You can have as many VALUE statements as you like. A few points to note here:
• A character format begins with a $.
• Even if the values are numbers, if the variable to which your format is going to be applied is a character variable, you need to
  put those numbers in quotes, just like any time you are referencing a character variable's value.
• Unlike other SAS names, a format name cannot end in a number.
• This format is temporary because I did not store it anywhere. Just like a temporary data set, when the program is ended, this
  format will be gone. For this reason, you probably want to give it some thought before applying a temporary format to
  variables stored in a permanent data set.
With these two statements, I have created a new temporary format. A format has to be defined before it can be used in your SAS
program, which is why it is a good habit to put your FORMAT procedure before any other DATA or PROC steps.
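If the temporary format turns out to be a nuisance, one option is to store it in a permanent format catalog instead. The following is a minimal sketch, not part of the original program. It assumes the libref lib used throughout this paper has already been assigned with a LIBNAME statement; the FMTSEARCH= setting is my addition.

```sas
/* Sketch: store the $YN format permanently instead of in WORK.      */
/* Assumes a LIBNAME statement has already assigned the libref lib.  */
PROC FORMAT LIBRARY = lib ;
   VALUE $YN
      "0" = "No"
      "1" = "Yes" ;
RUN ;

/* Tell SAS to also search the formats catalog in lib when it looks up a format. */
OPTIONS FMTSEARCH = (lib) ;
```

In a later session you would still need the LIBNAME and OPTIONS FMTSEARCH= statements, but not the PROC FORMAT step.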
PROC FREQ DATA = lib.pums9 NOPRINT;
The NOPRINT option doesn't really matter for this case, but it's a good habit to get into when you don't need the printed output. With
some variables, income for example, the procedure could produce thousands of lines of useless output.
TABLES racblk* racwht / OUT = lib.blkwhitmix ;
This TABLES statement will produce a cross-tabulation with the first variable being the row variable and the second one the column
variable. Don't forget the * . If you leave out the asterisk you'll get two tables, one a frequency distribution of the variable racblk and
a second a frequency distribution of the racwht variable. The OUT = option will write the counts and percentages to a data set.
WEIGHT pwgtp ;
Many open data sets come from surveys and usually include a weight variable, which you name in a WEIGHT statement. Don't forget
this!!! In the case of the American Community Survey, leaving off the WEIGHT statement means that your counts will be off by a factor of about 101.
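One quick way to see what the weight is doing is to compare the number of records with the sum of the weights. This is only a sketch, assuming pwgtp is stored as a numeric variable in lib.pums9; it is not part of the original program.

```sas
/* N is the number of survey records read;                    */
/* SUM of the weight variable is the estimated population.    */
PROC MEANS DATA = lib.pums9 N SUM MAXDEC = 0 ;
   VAR pwgtp ;
RUN ;
```

If the weight is doing its job, the ratio of SUM to N should be roughly the factor mentioned above.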
To see your output data set, click on the EXPLORER tab, double-click on the LIBRARIES tab, double-click on the name of the library,
in this case, Lib, and then double-click on the data set. The data set created by the FREQ procedure can be seen in the window to
the right.
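If you would rather not click through the Explorer window, the same check can be sketched in code. This assumes the PROC FREQ step above has already run and created lib.blkwhitmix.

```sas
/* List the variables (racblk, racwht, COUNT, PERCENT) and their types. */
PROC CONTENTS DATA = lib.blkwhitmix ;
RUN ;

/* Print the handful of records that PROC FREQ wrote out. */
PROC PRINT DATA = lib.blkwhitmix ;
RUN ;
```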
PROC SORT DATA = lib.blkwhitmix ;

BY DESCENDING PERCENT ;
This step will sort the data set, listing the groups in order of their percentage, from highest to lowest.
ODS RTF FILE = "C:\Users\AnnMaria\Documents\pc_pus\sasout\RaceDist.rtf" STYLE = OCEAN ;
This statement opens an RTF file in the specified directory. The STYLE = option is optional, but I live by the ocean so I was feeling like an
ocean style. The STYLE = option sets the colors and fonts. SAS has dozens of different styles to choose from with no more effort
than changing the word after STYLE = . If you'd like to see what sort of output styles like Meadow, Harvest and Brick
produce, this site from Louisiana State University http://stat.lsu.edu/SAS_ODS_styles/SAS_ODS_styles.htm gives dozens of
examples of SAS styles you can select.
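To see which style names are available in your own installation, a short step like the following should list them in the output window. The list will vary by SAS version and site, so treat any particular style name as something to verify.

```sas
/* List the ODS styles supplied with this SAS installation. */
PROC TEMPLATE ;
   LIST STYLES ;
RUN ;
```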
TITLE "Population Distribution by Race" ;
FOOTNOTE "2009 American Community Survey Data" ;
The TITLE and FOOTNOTE statements add a descriptive title at the top and a footnote at the bottom.
PROC PRINT DATA = lib.blkwhitmix SPLIT = " " ;

The SPLIT= option in the PROC PRINT statement will cause the labels for each variable to split and go to a new line whenever the
character in the quotes is encountered. In this case, the label will start a new line after a blank space.
ID racblk ;
VAR racwht COUNT PERCENT ;
The ID statement will print the racblk variable first in each line rather than an observation number. The VAR statement lists, in
order, the variables to print. These will come after the ID variable.
FORMAT COUNT COMMA14. PERCENT 8.1 racblk racwht $yn. ;
The FORMAT statement specifies that the count variable will have a width of 14, and include commas. The percent variable will have
a width of 8 with one decimal place. Since there are two variables listed before the $yn format, both racblk and racwht will use the $yn
format we created above. NOTE! I created a temporary format and I am using it in the PROC PRINT step. A format used in a PROC
step does not permanently change the format of the stored variable, it only changes it for that step. As a general rule, avoid using
temporary formats for permanent data sets if you can and you will run into fewer format error problems.
LABEL COUNT = "2009 Population"
PERCENT = "Percent of Population" ;
The LABEL statement assigns a label to each variable, and because the SPLIT = option was used, these labels will be split to a new
line between words.
EXAMPLE 2: AMERICAN COMMUNITY SURVEY - GRAPHS

The first part of our presentation to students involves explaining the choices that statisticians make. Results are presented as
unbiased reality, but, in fact, at each point along the way, many decisions were made that involve judgment calls. The first
decision we'll have the students make is whether or not to keep in those people who checked that their race was both white and black. To
decide whether this group whose answer is "both" should be kept as a separate group, we create a bar chart and look at how they
stack up relative to the rest of the population. As we can see below, this is a very small group relative to the other three groups.
The program to create this graph is shown below.

DATA byrace ;
SET lib.blkwhitmix ;
IF racblk = 1 AND racwht = 0 THEN Race = "Black" ;
ELSE IF racblk = 0 AND racwht = 1 THEN Race = "White" ;
ELSE IF racblk = 0 AND racwht = 0 THEN Race = "Other" ;
ELSE IF racblk = 1 AND racwht = 1 THEN Race = "Mixed" ;
PERCENT = PERCENT/ 100 ;
RUN;
AXIS1 LABEL = ( ANGLE = 90 "Percent") ORDER = (0 to 1 by .1 ) ;
AXIS2 ORDER = ("White" "Black" "Mixed" "Other" ) ;
PATTERN1 COLOR = BLACK ;
PATTERN2 COLOR= GRAY ;

PATTERN3 COLOR = BROWN ;
PATTERN4 COLOR=WHITE ;
PROC GCHART DATA=byrace ;
VBAR Race / raxis = axis1 maxis = axis2
SUMVAR= percent
TYPE=SUM
OUTSIDE= SUM
PATTERNID = MIDPOINT ;
LABEL Race = "Race" ;
FORMAT percent percent8.1 ;
At first glance, it really seems a bit of overkill. Why not just open Excel, type the numbers in and be done with it? There are (as we'll
see in the last example in this paper) simpler options for output. The reason for going to this extent with the American Community
Survey is that we are going to produce a lot of output: bar charts, pie charts, tables, scattergrams. Setting up the first chart requires
some effort, but as you will see, the options set stay set throughout the program, and at each step, less and less effort is required.
We want to create a new variable, race, based on the survey respondents' checked answers to the two boxes for black and white,
that is, the variables racblk and racwht. Here is the first time we'll be glad we created an output data set. Rather than perform the logic
in the DATA step for the 3,030,728 records in the PUMS data set, we only need to do it for the four records in the output file from the
frequency procedure.
DATA byrace ;
SET lib.blkwhitmix ;
IF racblk = 1 AND racwht = 0 THEN Race = "Black" ;
ELSE IF racblk = 0 AND racwht = 1 THEN Race = "White" ;
ELSE IF racblk = 0 AND racwht = 0 THEN Race = "Other" ;
ELSE IF racblk = 1 AND racwht = 1 THEN Race = "Mixed" ;
PERCENT = PERCENT/ 100 ;
RUN;
In the data set saved by the frequency procedure, PERCENT is not saved as a proportion; rather, 40.1% is saved as 40.1. For later use,
we want that to be an actual decimal (0.401), so divide it by 100.
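The reason the division matters is that the PERCENT format multiplies by 100 when it displays a value. A tiny sketch, using the 40.1 from the sentence above as a made-up value (the variable name proportion is mine):

```sas
/* Sketch: why the value is divided by 100 before the PERCENT format is applied. */
DATA _NULL_ ;
   percent    = 40.1 ;           /* the value as PROC FREQ stores it           */
   proportion = percent / 100 ;  /* 0.401, which PERCENT8.1 shows as a percent */
   PUT percent= 8.1 proportion= PERCENT8.1 ;
RUN ;
```

Check the log: the first value should print as stored and the second should carry a percent sign. Apply PERCENT8.1 to the undivided value and you would be off by a factor of 100.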
Note that the OPTIONS, TITLE and FOOTNOTE statements earlier in our program set the title and footnote and removed the page number
and date. We don't need to do it again. All of these - TITLE, FOOTNOTE, OPTIONS - will remain the same throughout our program
unless we use another TITLE, FOOTNOTE or OPTIONS statement to change them.
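Should you ever want to go back to the defaults later in a session, say before starting an unrelated report, a sketch of the reset looks like this. It is not part of the original program, which deliberately leaves these settings in place.

```sas
/* A TITLE or FOOTNOTE statement with no text cancels all existing lines. */
TITLE ;
FOOTNOTE ;

/* Turn the date and page number back on. */
OPTIONS DATE NUMBER ;
```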
These next six statements will also apply to any relevant output throughout our program.
AXIS1 LABEL = ( ANGLE = 90 "Percent") ORDER = (0 to 1 by .1 ) ;
The first part of this statement labels the axis with the text in quotes, in this case Percent. It also sets the angle for the axis label
rotation to be 90 degrees, in other words, it prints sideways instead of at the end of the axis. The ORDER = option specifies that the
axis minimum will be 0 and maximum will be 1, with tick marks every .1 . Without the ORDER = option, the axis is set based on the data. In this
case, it would have had a maximum of 80%.
AXIS2 ORDER = ("White" "Black" "Mixed" "Other" ) ;
This statement specifies the order to display the categories. Because the main point of this chart is for the students to use in
discussing whether the Mixed group should be included in a comparison of black and white survey respondents, we wanted it put
right after the black and white bars. The ORDER = option forces that order. Without this option, the responses would have been in
alphabetical order.
PATTERN1 COLOR = BLACK ;
PATTERN2 COLOR= GRAY ;
PATTERN3 COLOR = BROWN ;
PATTERN4 COLOR=WHITE ;
Without PATTERN statements, SAS will pick the colors by default. It makes a more effective graphic to have the bar representing
Black respondents colored black and the one representing White respondents colored white. NOTE! If you look back at the graph,
this can be confusing, since the first bar is white, the second is black, the third gray and the fourth brown. This doesn't seem to match
the PATTERN statements. It does, however, if you consider the fact that the PATTERN statements are separate from the AXIS
statements. The patterns are assigned to the category values in alphabetical order, not in the display order set by AXIS2. Since Black
is first alphabetically, it is assigned the color from PATTERN1. Mixed gets PATTERN2 (gray), Other gets PATTERN3 (brown) and
White gets PATTERN4 (white).
PROC GCHART DATA=byrace ;
This statement begins the GCHART procedure, using the data set byrace.
VBAR Race / raxis = axis1 maxis = axis2

SUMVAR= percent
TYPE=SUM
OUTSIDE= SUM
PATTERNID = MIDPOINT ;
Note! This is all one statement. The VBAR statement will create a vertical bar chart. The variable to be charted is race. Stop right here
and you'll get a no-frills bar chart. We, however, would like a lot of frills, which is what all of the options after the / will give us.
The RAXIS = option specifies the AXIS1 statement be used for the response axis. Why a separate AXIS statement? Doesn't this
seem silly? Why couldn't you just say X and be done with it here and not have a separate statement? If you think about this for a
moment, you'll realize that if you changed your mind and wanted a horizontal bar chart, all of a sudden the response axis is going to be
horizontal instead of vertical. Note! If you don't specify to use axis1 or axis2 on this statement, they will not be used and SAS will use the defaults for
those axes.
The MAXIS = option specifies the AXIS2 statement be used for the midpoint axis (that would be the ones with the categories, or
midpoints).
The SUMVAR = option specifies that the value charted for race is the summary of the given variable. In this example, it is the
percent variable, but any numeric variable in the data set could be used. Without the SUMVAR = option, the value for every
category would be 1 because we are analyzing the data that came from the frequency procedure and that we recoded in our DATA
step above. There is only one record for each race category. Note! SUMVAR = does not stand for sum but rather for summary.
Several types of summary statistics can be specified.
The TYPE = option specifies the type of summary statistic to use. Together with the SUMVAR = option, the TYPE = option
causes SAS to chart for each category the sum of the percent variable. Since there is only one record for each category, the sum
charted will be the percent given on that record.
OUTSIDE = SUM causes SAS to print the value of the sum outside of each bar.
PATTERNID = MIDPOINT assigns patterns based on the value of the midpoint. Given that this is a categorical variable, the
midpoint is each category value. As noted above, the patterns will be assigned in alphabetical order.
LABEL Race = "Race" ;
FORMAT percent percent8.1 ;
These two statements are pretty obvious. LABEL determines the label printed for race. FORMAT uses the percent format for the
percent variable.
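The point made above about response and midpoint axes, rather than X and Y, can be sketched by switching to a horizontal bar chart. This is an illustration only, not something from the original project. It assumes the byrace data set and the AXIS and PATTERN statements above are still in effect; I have left off OUTSIDE = because HBAR prints its statistics differently, and you may want to drop ANGLE = 90 from AXIS1 for a horizontal chart.

```sas
/* Sketch: the same chart turned on its side. RAXIS= and MAXIS= still     */
/* refer to the response and midpoint axes, whichever way the bars run.   */
PROC GCHART DATA = byrace ;
   HBAR Race / RAXIS = AXIS1 MAXIS = AXIS2
               SUMVAR = percent
               TYPE = SUM
               PATTERNID = MIDPOINT ;
   LABEL Race = "Race" ;
   FORMAT percent PERCENT8.1 ;
RUN ;
QUIT ;
```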
ADDING A PIE CHART

After all of this work, we're pretty satisfied, but in discussions with the teacher, we decide it would be useful to have a pie chart and
talk to the students about the value of different types of graphics in answering different questions. In the first chart, we were really
interested in deciding if it would make a big difference whether or not we included the people who had checked both black and white
for race. Our second question involves what percentage of the population is composed of black and white races combined relative to
all other races. We decide that a pie chart would make a good graphic for this.
Since we already have the TITLE, FOOTNOTE, OPTIONS, LABEL, PATTERN, PROC GCHART and FORMAT statements written,
we only need to add one statement to create our pie chart. We don't even need a new procedure. Both PIE and VBAR statements
can be used in the same procedure. Notice we are still using the four records created in our original frequency procedure. We haven't
touched the 3,000,000 plus records since step one.
The one statement is
PIE Race / NOHEADING ASCENDING
SUMVAR= percent
TYPE= SUM
;
PIE is, no surprise here, the statement to create a pie chart. All that is required is PIE variable-name. However, as usual, we'd like a
few options.
NOHEADING removes the default heading, which in this case would be Sum of Percentage by Race. That's unnecessary
information.
ASCENDING will order the pie slices by size. Because our main question to be answered here is, "How much of the population is
black or white, versus everything else?", I want these two largest slices to be together.
The pie chart this statement produces is shown below. There are other options we could have used, for example EXPLODE
= "Other" to pull out the Other slice and see how much of the pie was left. Very good advice on using some of these options to
produce communication-effective pie charts can be found in the presentation by that name from Bessler (2007).
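For anyone who wants to try the EXPLODE = option mentioned above, here is a sketch of the full step. It assumes the byrace data set created earlier; the value in quotes must match a value of Race exactly.

```sas
/* Sketch: the same pie chart with the Other slice pulled out. */
PROC GCHART DATA = byrace ;
   PIE Race / NOHEADING ASCENDING
              SUMVAR = percent
              TYPE = SUM
              EXPLODE = "Other" ;
   FORMAT percent PERCENT8.1 ;
RUN ;
QUIT ;
```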
EXAMPLE 3: AMERICAN COMMUNITY SURVEY - PROC TABULATE

Still in our discussion with middle school students on race in America, we ask them if they think people are biased in who they choose
to date or marry (see De Mars, 2011). To help them answer this question, we'd like a table that shows race by gender, and the
percentage of possible dating / marriage partners in the population if, in fact, dating and marriage occurred without respect to race.
Here is the program to produce the results we want.
PROC FORMAT ;
VALUE $YN
"0" = "No"
"1" = "Yes" ;
VALUE $sex
"1" = "Male"
"2" = "Female" ;
PROC FREQ DATA = lib.pums9 NOPRINT ;
TABLES sex*racblk / OUT = lib.blkwhtsex ;
TABLES st*racblk / OUT = lib.blkwhtst ;
WEIGHT pwgtp ;
WHERE racblk = "1" OR racwht = "1" ;
run ;
%MACRO mkrace(dsn) ;
DATA &dsn ;
SET lib.&dsn ;
IF racblk = 1 THEN Race = "Black" ;
ELSE IF racblk = 0 THEN Race = "White" ;
Percent = percent/ 100 ;
RUN;
%MEND mkrace ;
%mkrace(blkwhtsex) ;
PROC TABULATE DATA = blkwhtsex ;
CLASS race sex ;
VAR count percent ;
TABLE race* sex ALL, count*(SUM= ' '*F=COMMA12.0) percent*(SUM = ' '*F=PERCENT8.1) ;
LABEL Count = "2009 Population"
Percent = "Percent" ;
FORMAT sex $sex. ;
First, go back to the PROC FORMAT and create one more format, for sex. The PROC FORMAT is now:
PROC FORMAT ;
VALUE $YN
"0" = "No"
"1" = "Yes" ;
VALUE $sex
"1" = "Male"
"2" = "Female" ;
Having learned from our earlier experience, the first thing we are going to do is create an output data set from the frequency
procedure.
PROC FREQ DATA = lib.pums9 NOPRINT ;
TABLES sex*racblk / OUT = lib.blkwhtsex ;
TABLES st*racblk / OUT = lib.blkwhtst ;
WEIGHT pwgtp ;
WHERE racblk = "1" OR racwht = "1" ;
run ;
Remember the WEIGHT statement!!! Even though I said that above, I'm reminding you here because forgetting the WEIGHT
statement is a common novice mistake and, in this case, it will cause your results to be, literally, 10,000% wrong. That's a lot of
wrong. The second TABLES statement is not needed for the PROC TABULATE output at the end, but since we'll need the data set of
race by state later, we went ahead and created it in this step. In the actual project, there were a lot of TABLES statements in this step.
The only new statement here from our first step is the WHERE statement. Having concluded our discussion above, we have decided
to drop out the Other category and include those who had checked both categories. We also decided to consider everyone who
checked Black for their race as black, whether they also checked White or not. If some students disagree with us, that is good
because the point of this whole project with the schools is to get them talking and thinking about statistics. If they think our
designation is wrong or unfair, this is going to be the most passion they've ever had about statistics.
We're going to do the same two things with a lot of data sets, because now we have made two decisions. The first is to use the output
from PROC FREQ for analysis, so we're going to be dividing that percent variable by 100 each time. The second is to categorize
people who selected Black as their race as black. Whenever you find yourself doing the same bit of code over and over, think about
creating a macro. Macro programming is not nearly as scary as some people make it out to be. The trick is to start early in your career
with very simple macros and just get progressively more complex. The example below, with only one macro parameter, is about as
simple as you can get. Although we only use it one time in this paper, in the actual project, we used it over and over.
Let's look at this macro line by line.
%MACRO mkrace(dsn) ;
This statement creates a macro named mkrace and specifies that this macro will require one parameter, which is named dsn.
DATA &dsn ;
This statement creates a data set. When the macro is run &dsn will be replaced with whatever we provided when we called the
macro.
SET lib.&dsn ;
This statement reads in a data set from the library referenced by lib and named whatever value I had supplied for &dsn.
IF racblk = 1 THEN Race = "Black" ;
ELSE IF racblk = 0 THEN Race = "White" ;
Percent = percent/ 100 ;
RUN;
These are just IF, ELSE and assignment statements like every other IF, ELSE and assignment statement you have written in your life.
The fact that they occur in the middle of a macro makes no difference whatsoever.
%MEND mkrace ;
This ends the mkrace macro. Now, to call this macro, all I need to do is:
%mkrace(blkwhtsex) ;
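If you are new to macros and want to see exactly what SAS generated, one approach is to turn on the MPRINT system option around the call. This is a sketch of that habit, not a step from the original project.

```sas
/* MPRINT writes the SAS statements generated by the macro to the log, */
/* with &dsn already replaced by blkwhtsex.                            */
OPTIONS MPRINT ;
%mkrace(blkwhtsex)
OPTIONS NOMPRINT ;
```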
Before moving on to the next procedure, let's recap what we did here, because it's important. We used a PROC FREQ to create a
couple of permanent SAS data sets. The first one, blkwhtsex, included four records. We read this tiny data set and created a new,
temporary data set, also four records, with a new variable, race, and the variable percent now in a decimal format.
Your mileage may vary. There are a couple of choices I made here for reasons of my
own. I mention these choices because part of becoming an experienced programmer is making
decisions and judgments. Even if your decision is to copy an example, you should know why the
example includes the specific choices it does. Here are the choices I made and why.
• I did not make the libref a macro parameter. I always used lib as the libref in the LIBNAME
  statement in this project because it saves me having to specify a library as well as a
  data set name when I use a macro. To see how to specify both the library and data set,
  see the earlier paper (De Mars, 2011a).
• Could I have just created a format using PROC FORMAT for race? Yes. I chose not to do that for two reasons.
  First, this is a temporary data set with four records. The time and storage to read every record and create a new variable
  is as close to nothing as one could get. Thus, the usual advantage of PROC FORMAT, that it is
  faster and takes up less storage space than creating a new variable, is really irrelevant.
  Second, I am going to use this race variable a lot. The odds of me forgetting to apply the format at some point and having to
  re-run the analysis to produce some output are high. Given this, it's less trouble for me to create the macro.
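For readers who decide the other way on either choice, here is a sketch of what the alternatives might look like. The names mkrace2, libref and $race are hypothetical, made up for this illustration. The sketch treats racblk as a character variable, in keeping with the $YN format and the WHERE statement used earlier.

```sas
/* Alternative 1 (hypothetical): pass the libref as a second parameter. */
%MACRO mkrace2(libref, dsn) ;
   DATA &dsn ;
      /* Two periods: the first ends the macro variable name,   */
      /* the second separates the libref from the data set name. */
      SET &libref..&dsn ;
      IF racblk = "1" THEN Race = "Black" ;
      ELSE IF racblk = "0" THEN Race = "White" ;
      Percent = percent / 100 ;
   RUN ;
%MEND mkrace2 ;

%mkrace2(lib, blkwhtsex)

/* Alternative 2 (hypothetical): a format instead of a new variable. */
PROC FORMAT ;
   VALUE $race
      "1" = "Black"
      "0" = "White" ;
RUN ;
```

With the second alternative you would put FORMAT racblk $race. in every step that prints or charts the variable, which is exactly the thing I said I was likely to forget.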
Now, we're going to do a PROC TABULATE using this temporary data set.
PROC TABULATE DATA = blkwhtsex ;

This statement is pretty obvious. It begins the TABULATE procedure, using the data set blkwhtsex we created with our macro.
CLASS race sex ;
The CLASS statement specifies the classification or categorical variables that we'll use for our table. All variables used in a table must
have been specified in either a CLASS or a VAR statement. Variables in a CLASS statement can be either character or numeric.
VAR count percent ;
These are numeric variables that will be used in an analysis.
The TABLE statement takes the form
TABLE row-variables , column-variables ;
You can also have a page variable, not included in this example. The last set of variables specified will be the column variables.
Just like the frequency procedure, crossing two variables with an * means that these variables will be cross-classified. Without the
asterisk, results will be produced for each variable separately. The keyword ALL requests that statistics be produced for the total
population. The statistic and format for a variable are specified by an * followed by the statistic or format. You can use parentheses
to group a statistic with its label and format.
count*(SUM= ' '*F=COMMA12.0)
is the same as
count*SUM= ' '*F=COMMA12.0
TABLE race* sex ALL, count*(SUM= ' '*F=COMMA12.0) percent*(SUM = ' '*F=PERCENT8.1) ;
The first part of our TABLE statement, then, requests statistics for race by sex and for the total population. The second part
specifies the first column variable will be count, with the SUM statistic, and this statistic will not have a label over it, that is, the label
text is a blank space. The format will be a width of 12, 0 decimal places and commas. The second column variable will be percent,
with the SUM statistic, again, no label, and in a percent format with one decimal place.
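If you want to experiment with the TABLE statement without touching the survey data, a self-contained sketch like the one below may help. The data set name demo is hypothetical. The four records are typed in by hand from this paper's results, with sex entered as words rather than coded values to keep the example short.

```sas
/* Sketch: a four-record stand-in for the data set the macro creates. */
DATA demo ;
   LENGTH race $ 5 sex $ 6 ;
   INPUT race $ sex $ count percent ;
   DATALINES ;
Black Male 19565078 0.071
Black Female 21389391 0.078
White Male 115771666 0.421
White Female 118404207 0.430
;
RUN ;

PROC TABULATE DATA = demo ;
   CLASS race sex ;
   VAR count percent ;
   /* Rows: race crossed with sex, plus a total row. */
   /* Columns: count, then percent, each summed.     */
   TABLE race*sex ALL,
         count*(SUM = ' '*F = COMMA12.0)
         percent*(SUM = ' '*F = PERCENT8.1) ;
   LABEL count = "2009 Population"
         percent = "Percent" ;
RUN ;
```

Try removing the asterisk between race and sex, or moving ALL to the column dimension, and see how the table changes.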
LABEL Count = "2009 Population"
Percent = "Percent" ;
FORMAT sex $sex. ;
These last two statements should be familiar from above. These just define the labels for the two column variables and specify the
format for sex, which uses the $sex format we created in the PROC FORMAT.
Population Distribution by Race

Race    Sex      2009 Population   Percent
Black   Male          19,565,078      7.1%
        Female        21,389,391      7.8%
White   Male         115,771,666     42.1%
        Female       118,404,207     43.0%
All                  275,130,342    100.0%

2009 American Community Survey Data

Here is our table. Since this is part of the same ODS RTF FILE we specified at the beginning, it is still using the STYLE=OCEAN. It
still uses the same title and footnote specified at the beginning. This is actually a good thing. Having all of your tables match in style
gives your presentation a more professional appearance.
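One housekeeping detail that follows from keeping a single RTF file open: the file is not finished until the destination is closed. Once every table and graph you want in the document has been produced, a statement like this completes it. Where you place it in your own program is up to you.

```sas
/* Close the RTF destination so the file is written out completely. */
ODS RTF CLOSE ;
```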
EXAMPLE 4: AMERICAN COMMUNITY SURVEY - MAKING A MAP

Our final demographic graphic with the American Community Survey is a map. After trying several possible types of maps (not shown
here), it seemed that the best graphic would be a map of the United States with the states shaded by the percentage of their
population that is African-American. Before showing this graph, students are asked to give their guesses as to which states have the
highest and lowest percentage of African-Americans in their population.
To force the percentages to fit specific categories, another VALUE statement was added to the PROC FORMAT at the top of our
program.
VALUE grays
LOW - .002 = "<2%"
.003 - .005 = "3-5%"
.006 - .009 = "6-9%"
.010 - high = "10 - 12%" ;
The syntax
LOW - some number = formatted value
assigns the formatted value on the right hand side of the equals sign to all of the values from the minimum in the data set to the
specified number. Similarly,
some number - HIGH = formatted value
will assign the values from the specified number to the maximum value.
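To see LOW and HIGH at work without running a map, you can apply a format with the PUT function in a DATA step. The format name demofmt, its cut points and its labels are all hypothetical, chosen only for this sketch. It uses the -< range operator, which excludes the upper endpoint, so that the ranges meet with no gaps between them.

```sas
/* Sketch: a format with open-ended ranges at both ends. */
PROC FORMAT ;
   VALUE demofmt
      LOW -< .10  = "Under 10%"
      .10 -< .20  = "10% to under 20%"
      .20 - HIGH  = "20% or more" ;
RUN ;

/* Apply the format to a few test values and write the results to the log. */
DATA _NULL_ ;
   DO x = .05, .10, .199, .35 ;
      shown = PUT(x, demofmt.) ;
      PUT x= shown= ;
   END ;
RUN ;
```

A value that falls in none of the ranges you list is displayed without the format, which is worth checking for in the log before you build a legend on it.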
The rest of the program is:
%mkrace(blkwhtst) ;
DATA blkwhtst ;
SET blkwhtst ;
STATE = INPUT(st,BEST8.) ;
WHERE racblk = "1" ;
pct = ROUND(percent,.001) ;
TITLE "African-American Population by State " ;
TITLE2 "By Percent" ;
PATTERN1 COLOR = White ;
PATTERN2 V=M3N45 color=black;
PATTERN3 COLOR= Gray ;
PATTERN4 COLOR= Black ;
PROC GMAP DATA = blkwhtst
MAP = MAPS.US ;
ID STATE ;
CHORO pct / DISCRETE STATISTIC=MEAN ;
WHERE STATE NE 72 ;
FORMAT pct grays. ;
LABEL pct = "Percentage African-American" ;
%mkrace(blkwhtst) ;
This calls the macro we created above (remember our macro?), creates a temporary data set named blkwhtst, sets the value of
percent to a decimal and creates a variable named race. It's reading in the permanent data set named blkwhtst that we created in
the PROC FREQ step earlier.
WHERE DO WE GET THE MAP?

SAS 9.2 ships with a library of map data sets. In the normal installation, these maps will be stored in a library clearly labeled
maps. You don't have to assign this library or do anything. You should be able to look in your explorer window and see it. Go ahead,
try it. If you open the US data set, and select COLUMN NAMES from the VIEW menu, you'll see that you have a variable named STATE
and it matches exactly the st variable in our own data set, blkwhtst, except for one small problem. The STATE variable in the
MAPS.US data set is numeric. You can tell this by the fact that it is right-justified, while character variables, such as STATECODE, are
left-justified. You know that your st variable is a character variable because you did the PROC CONTENTS when you were running all of
the data quality checks on your open data set. (What? You didn't do the data quality checks? Go back and read the paper on data
quality (De Mars, 2011a) right now!)
Before I can use the MAPS.US data set I need to have a variable in my data set that matches it. This next step creates a
variable to match the STATE variable in the MAPS.US data set. It also does a little more clean up of the data set while we're at it.
DATA blkwhtst ;
SET blkwhtst ;
STATE = INPUT(st,BEST8.) ;
After reading in the data from the blkwhtst data set, our assignment statement creates a new, numeric variable, STATE. The
INPUT function reads the st variable using a numeric informat. If there were letters in this field, that could cause problems, but
there are no letters, just the numbers 1 through 72.
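To see what the INPUT function does with a value that is not a number, here is a small sketch using made-up values rather than the survey data. The ?? modifier is an optional addition that was not used in the program above; it tells SAS to return a missing value quietly instead of writing an invalid-data note to the log.

```sas
/* Made-up values, including one that is not a number */
DATA _NULL_ ;
   LENGTH st $ 2 ;
   DO st = "01", "06", "72", "XX" ;
      /* ?? suppresses the invalid-data note and keeps _ERROR_ at 0 */
      state = INPUT(st, ?? 8.) ;
      PUT st= state= ;
   END ;
RUN ;
```

The three numeric strings convert to numbers, and "XX" becomes a missing value.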
where racblk = "1" ;
Because I want to map the percentage of African-American residents in each state, I only need to keep the records where the
respondent checked black as his or her race.
pct = ROUND(percent,.001) ;
The use of the ROUND function will round the variable, percent, to the nearest .001. Without this, SAS would map each value of
percent with a different color. There is another reason for creating a new variable here. PERCENT is a keyword in the GMAP
procedure. It is generally both a bad idea and confusing to use keywords as variable names.
TITLE "African-American Population by State " ;
Title2 "By Percent" ;
Now I need to change the title. This will replace the previous TITLE statement and add a second title line underneath. Notice
that the footnote on the graph stays the same, since there is no new FOOTNOTE statement.
PATTERN1 COLOR = White ;
PATTERN2 V=M3N45 color=black;
PATTERN3 COLOR= Gray ;
PATTERN4 COLOR= Black ;
The previous patterns started with black. The patterns are used in order from the lowest to the highest value of the variable being
mapped, here the percentage of African-American residents. It would be confusing to have the states with the lowest percentage of
African-Americans black and those with the highest percentage shown on the map in white. PATTERN1, the states with the lowest
percentage, will now show up in white. We need something between white and gray, though. The V= option on the PATTERN statement
gives a value for shading. The default, without the V= option, is solid.

In this case, it doesn't really matter what value we select, as long as it isn't solid, so the M3N45 pattern (medium-density
parallel lines drawn at a 45-degree angle) was as good as any other. You can find a list of pattern types in more detail than you
could possibly ever want to know in the SAS/GRAPH(R) Reference Guide (SAS Institute, 2011).
PROC GMAP DATA = blkwhtst
MAP = MAPS.US ;
The GMAP procedure will create a map using the response data from the blkwhtst data set and the map boundary data from MAPS.US.
ID STATE ;
This is the variable that defines the map area. It must be in both the DATA = and MAP = data sets and the name, type and
length must match in both data sets. We have no worries, because we made sure of that in our DATA step above.
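One quick way to confirm this, sketched here on the assumption that both data sets are available in the current session, is to run PROC CONTENTS on the ID variable in each data set and compare the Type and Len columns in the output.

```sas
/* Attributes of the ID variable in the map data set */
PROC CONTENTS DATA = MAPS.US (KEEP = STATE) ;
RUN ;

/* Attributes of the ID variable in our response data set */
PROC CONTENTS DATA = blkwhtst (KEEP = STATE) ;
RUN ;
```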
CHORO pct / DISCRETE STATISTIC=MEAN ;
The CHORO statement assigns patterns to the map areas based on the formatted value of the variable given, in this case, pct.
The DISCRETE option specifies that a separate color and pattern be used for each discrete formatted value. Without this option, GMAP
treats a numeric variable as continuous and chooses its own response levels instead of using our categories. We don't want
that; we want different patterns just for the four different categories. The STATISTIC=MEAN option tells GMAP to map the mean of
pct if a state has more than one observation.
WHERE STATE NE 72 ;
If you peek into the MAPS.US data set, you'll notice that 72 is Puerto Rico. In this particular analysis, we didn't want Puerto
Rico, so we dropped it. Do not get clever and put WHERE STATE <= 50 on the assumption that the 50 states are numbered 1 to 50.
The state codes skip some numbers, so that will drop the people in Virginia through Wyoming from your data and make them angry.
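If you want to see for yourself which codes are above 50, a short check like the following sketch lists them. It assumes the STATE and STATECODE variables in MAPS.US described earlier. The counts it prints are just the number of boundary points for each area and can be ignored.

```sas
/* List every state code above 50 along with its postal abbreviation */
PROC FREQ DATA = MAPS.US ;
   TABLES STATE * STATECODE / LIST NOCUM NOPERCENT ;
   WHERE STATE > 50 ;
RUN ;
```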
FORMAT pct grays. ;
This applies the format we created above so that our data fall into four categories and also so that the text from the format shows in
the legend.
Label pct = "Percentage African-American" ;
This labels our variable for the legend at the bottom of the page.
EXAMPLE 5: A DIFFERENT TYPE OF OPEN DATA & ODS STATISTICAL GRAPHICS

This next example is the complete opposite of the first four. Rather than a data set with millions of records and hundreds of variables,
it uses 21 records and three variables. The data are not from the federal government but rather from a non-profit sports organization.
The target audience is not middle school students but adults. The purpose of this analysis was not to present information about
statistics and the work of statisticians, but rather, to answer two specific questions. The only common factors between these two
analyses are that both used open data freely available to anyone with an Internet connection and both involved presentation of
information to a non-technical audience.
There had been some discussion regarding whether the number of competitors in judo in the U.S. was declining or not. This may
seem like a simple question, but there are several different organizations that register judo competitors. To complicate matters further,
the national championships had added new divisions several times. Originally, the U.S. national championships were contested only
in the male and female weight divisions that competed in the Olympics. Over the years, separate divisions were added for
competitors over 35 years of age, for visually-impaired competitors and other categories not contested in the Olympics. Also, any
variable observed is going to fluctuate over time. The question was whether this year-to-year fluctuation masked a significant
downward trend. The data were posted to a forum on judo with the request that anyone with expertise be kind enough to analyze the
data and report back.
Reading the data into SAS was a simple matter of writing an INPUT statement, followed by a CARDS statement, pasting the data,
and adding a semi-colon at the end. Yes, a DATALINES statement would have worked as well as a CARDS statement and yes,
anyone who still uses the CARDS statement is old.
DATA competitors ;
INPUT year females males ;
CARDS ;
1990 118 267
1991 84 210
....
2011 66 146
;
The first chart I need to produce to answer this question is very simple and I am going to use Graph-N-Go, which seems to be
custom made for really simple questions. A surprising number of SAS programmers don't even know they have Graph-N-Go
available, but it's right there next to the RUN menu under SOLUTIONS.
When you select Graph-N-Go, a new window will pop up with an icon you are supposed to recognize as a SAS dataset.
Click on that and a new window will pop up with the words "SAS dataset" at the top, an empty box and, next to it, a button with
three dots, causing you to ask yourself, "What the heck am I supposed to do now?" The answer is to click on the button with the three dots.
Click on that and yet another window will pop up. The next window should look familiar. It has the libraries available to you in the
left pane, including the WORK library, SASUSER, MAPS and any libraries you might have defined with a LIBNAME statement. Select
the library you want to use. Then, in the right pane, select the dataset you want to use. In this case we are going to select the
WORK library and the dataset named competitors.
On the left of the window are several buttons. We want a line plot, so we're going to click on the line plot and drag it to the large
pane in the bottom right. An empty box appears with the title "Plot 1". We right-click on the empty box and from the drop-down menu
select PROPERTIES.
In the pop-up window is a drop-down menu with the title DATA MODEL. By this point we are wondering if it might be easier to
learn SAS/GRAPH after all, but we forge ahead, selecting from the drop-down menu the one dataset that we identified previously,
work.competitors.
There are five tabs at the top of the PROPERTIES window: General, Data, Titles/Footnotes, Appearance and
Object Size. We're going to click on the DATA tab and from the drop-down menu next to X, select Males as the variable that we want
to plot and under Y, we'll select Year. We'll also select REGRESSION from the drop-down menu under PLOT STYLES.
We'll click the TITLES tab and give a title for the plot.
We click OK and the chart below is produced. If we didn't like the size, we could right-click on it, select Grow/Shrink and then
drag on the side of the plot to increase or decrease its size, or check the box next to MAXIMIZE, which will make the graph the
maximum size that fits in the window.
We click on MAXIMIZE and are happy with this size, so we simply right-click on the chart, pick EXPORT and from the options
select External File. There will be a pre-filled default directory, name and type, something like:
C:\Users\Yourname\My SAS files\9.2\males.bmp
If you want to change any of that, to the right is the ubiquitous box with the three dots again. Click on that and a new pop-up
window will allow you to change the folder, file name and type.
Here is our plot and it seems pretty clear that there is a downward trend. The middle line is our regression line, showing that the
prediction is a straight line downward. The two dashed lines are the confidence intervals.
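For readers who would rather write code than click through windows, a roughly comparable plot can be sketched with PROC SGPLOT, one of the ODS Statistical Graphics procedures in SAS 9.2. This is an alternative, not the method used to produce the chart described here, and it assumes that year belongs on the horizontal axis and that the competitors data set created above is still in the WORK library.

```sas
/* Scatter plot of male competitors by year with a fitted regression line */
PROC SGPLOT DATA = competitors ;
   REG X = year Y = males / CLM ;   /* CLM adds confidence limits for the mean */
RUN ;
```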
The plot for male competitors worked fine but when we do the same steps to get a plot for female competitors it looks decidedly odd.
There appears to be an upward trend to a point, and then a downward trend. In 1988, women's competition was added to the
Olympics for the first time. I speculate that this may have caused women's competition to swing up, counter to the overall downward
trend, but then after the excitement of having qualified as an Olympic sport faded, they, too, would show a downward trend. Also,
elections are held for a new board of the National Governing Body each Olympic year and they take office the following year.
To test the hypothesis that there was an upward trend followed by a downward trend, we can use PROC REG, the SAS
regression procedure.
ODS GRAPHICS ON ;
This statement turns ODS Statistical Graphics output on. If you have not used ODS Graphics yet, you need to try it. Simply put, SAS
tries to guess what you would most likely want as graphics output and produces it. It's as simple as that.
PROC REG ;
MODEL females = year / STB ;
WHERE year < 2002 ;
The PROC REG statement calls the regression procedure. It will use the most recently created data set which is the temporary
file created above. The MODEL statement gives the dependent variable (females) = the independent (year). The option STB is for
standardized regression coefficient. More information about standardized coefficients can be found in the related paper on Statistics
for Hamsters (De Mars, 2011b). The WHERE statement selects only those records where the year is less than 2002.
The next procedure is identical except for the WHERE statement and produces the same analyses for the years after 2001.
PROC REG ;
MODEL females = year / STB ;
WHERE year > 2001 ;
RUN;
ODS GRAPHICS OFF ;
The statement at the end turns ODS graphics off.
The REG procedure with ODS graphics produces a lot of output. This is the reason you probably want to turn it off if you don't
specifically need the graphics. In with all of the other charts is the one below that addresses our particular question. On the right side
it gives the R-Square value of .1501. The square root of this, that is, the R value, is .39. In other words, we can tell the inquiring minds
that want to know that from 1990 to 2001 there was a correlation of about .40 between year and the number of female competitors. The
correlation is positive, which means the number of female competitors tended to increase over those years.
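If all you want from the regression is the fit plot, one option is to ask for only that plot instead of the whole default panel. The sketch below is not what was run for this paper; it assumes the PLOTS(ONLY)= option of PROC REG in SAS 9.2 and names the competitors data set explicitly rather than relying on the most recently created one.

```sas
ODS GRAPHICS ON ;

/* Request only the fit plot instead of the full set of diagnostic graphics */
PROC REG DATA = competitors PLOTS(ONLY) = FITPLOT ;
   MODEL females = year / STB ;
   WHERE year < 2002 ;
RUN ;
QUIT ;

ODS GRAPHICS OFF ;
```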
Examining the output for our second PROC REG, we find this next plot. This plot has an R-square of .63. The slope is negative, so the
correlation between year and the number of female competitors is -.80. You don't really need the numbers, though, to see that what we had
here was a somewhat modest upward trend followed by a very steep downward trend in the number of competitors. What to do about
it is the decision of the people in the organization, but the facts are very hard to deny when presented in this manner. The number of
competitors is clearly in decline for both males and females, a trend that has been going on for over a decade for women and much
longer for men.
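With a single predictor, R-square is simply the square of the Pearson correlation, and the correlation takes the sign of the slope. As a cross-check on the values above, the correlations can be requested directly. This is a sketch that assumes the competitors data set from the DATA step above; it was not part of the original analysis.

```sas
/* Correlation between year and female competitors, 1990 through 2001 */
PROC CORR DATA = competitors ;
   VAR females ;
   WITH year ;
   WHERE year < 2002 ;
RUN ;

/* The same correlation for the years after 2001 */
PROC CORR DATA = competitors ;
   VAR females ;
   WITH year ;
   WHERE year > 2001 ;
RUN ;
```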
CONCLUSION
One advantage open data has over the data sets used with most textbooks is the potential for analysis of big data sets.
These analyses almost force the programmer to learn more efficient techniques for processing data. While the first example seems an
awful lot of effort to produce a single table, most of this work was re-used over and over throughout our examples. The output data set
created in the frequency procedure was used repeatedly, and the TITLE, FOOTNOTE and OPTIONS statements applied to several graphs
and tables. The PATTERN and AXIS statements applied to several charts. The formats created in the PROC FORMAT step were
also used in various output produced for this project. Completing a textbook exercise, one might not see the advantages of going
through all of these extra steps for one chart or table. "We have the numbers from the PROC FREQ, you can just make a table in
Word or PowerPoint and insert those numbers," a new programmer is likely to complain. To make one graph or table, that is probably
true, but when there are multiple tables and graphs to be produced, the time put in up front pays off. Similarly, code that
performs a simple task, like creating a new variable or formatting a variable, can be reused through a simple macro.
A second advantage of the use of open data is that, given the number of data sets available, these can be used for almost any type of
project, procedure or analysis that the programmer wishes to experience.
The third advantage, as can be seen from our last example, is that even small, simple data sets can lend themselves to a moderately
sophisticated statistical analysis.
A further advantage of the use of open data occurs when analysis is done to assist a particular audience. This in itself is a learning
experience. The days of transom engineering are over. The value of the ability to produce accurate numbers is greatly increased
when paired with the ability to convey information based on those numbers. Presenting national demographics to a class of
seventh-graders or presenting regression analyses to an audience of judo coaches are real challenges that cause the programmer to seek
new and better means of presentation.
Creating and implementing an open data project for a community program provides experience not just in trying different SAS
techniques but also in tailoring the output of those to the needs of the intended audience. Not only does the community organization
served benefit from the project, but the work also increases the marketable skills of the programmer and gives him or her a larger
portfolio of statements, options and procedures with which he or she has professional experience.
REFERENCES
Besler, L. (2007). Communication-effective pie charts. Presentation at the annual meeting of the SAS Users Group International.
www2.sas.com/proceedings/forum2007/134-2007.pdf
De Mars, A. (2011a). SAS Functions for a Better Functioning Community. Paper presented at the annual meeting of Western
Users of SAS Software. San Francisco, CA.
De Mars, A. (2011b). SAS Essentials III: Statistics for Hamsters. Paper presented at the annual meeting of Western Users of
SAS Software. San Francisco, CA.
SAS Institute (1999). SAS Procedures Guide. SAS Institute Inc., Cary, NC.
SAS Institute (2011). SAS/GRAPH(R) 9.2: Reference, Second Edition. SAS Institute Inc., Cary, NC.
ACKNOWLEDGMENTS
Thank you to Kirby Posey of the U.S. Census Bureau for invaluable assistance in verifying the variable coding and estimates.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
AnnMaria De Mars
The Julia Group
2111 7th St. #8
Santa Monica, CA 90405
(310) 717-9089
annmaria@thejuliagroup.com
http://www.thejuliagroup.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. (R) indicates USA registration.
Other brand and product names are trademarks of their respective companies.
