Stata Tutorial

Reading Data and Saving Stata Files

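Use: The Stata use command reads data that has been saved in Stata format. The clear
option discards any data already in memory before the new file is loaded:

use "C:\data\ncr00.dta"
use "C:\data\ncr00.dta", clear

Save: The Stata save command writes the data in memory to a file in Stata format:

save "C:\data\ncr00r.dta"
save "C:\data\ncr00.dta", replace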

If you already have a Stata file named "ncr00.dta" and wish to save an updated version of
the file under the same name, then use the Stata save command with the replace option.

Saving with replace destroys the previous version of your file, so use the option only
if you are certain that you will not need the older version. There is no way to retrieve
the original file once another file has been written over it.
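
One way to guard against an accidental overwrite is to copy the file before saving over
it. The sketch below assumes the auto dataset that ships with Stata as a stand-in for
your own data, and the file names mydata.dta and mydata_backup.dta are hypothetical:

```stata
* Stand-in data: the auto dataset shipped with Stata
sysuse auto, clear
* Hypothetical file name, written to the current working directory
save "mydata.dta", replace
* Keep a backup copy before changing anything
copy "mydata.dta" "mydata_backup.dta", replace
* Modify the data, then overwrite the original file
drop if foreign == 1
save "mydata.dta", replace
```

The backup still holds the full data even though mydata.dta has been overwritten.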

How to Increase Memory

Sometimes you may need to allocate additional memory for your Stata session, such as
when you are working with a large file. If you receive this message from Stata:

no room to add more observations

then you should increase the amount of memory available to your Stata session. Here's
how.

1. Find out how large the file is. First, issue the clear command to remove the file
from memory. Then issue the desc using filename command:

desc using ncr00.dta

At the top of the information listed is the size of the file, in bytes. There are 1,000
bytes in a kilobyte, and 1,000 kilobytes in a megabyte, so if the size is 11,000 then
the file is 11 kilobytes. For example:

Contains data
obs: 899,094 3 Jul 2010 08:39
vars: 76
size: 91,707,588

This shows that the file has 899,094 observations, 76 variables, and is just slightly
over 91.7 megabytes.

2. From the Stata dot prompt, issue the command set memory to increase the
amount of memory. For example, the following command allocates 100 megabytes
of memory to the current Stata session:

set memory 100m

Set the memory to a number slightly larger than the size of the file you are trying
to read.

3. Now read your data file.
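
The three steps can be sketched as follows, with the auto dataset shipped with Stata
standing in for a large file (an assumption for illustration only). Note that from
Stata 12 onward memory is managed automatically and set memory is accepted but ignored,
so step 2 matters only in older versions.

```stata
* Stand-in for a large file: save the auto dataset to a temporary file
sysuse auto, clear
tempfile bigfile
save `bigfile'

* Step 1: clear memory and check the size of the file on disk
clear
describe using `bigfile', short

* The size is reported in bytes; dividing by 1,000,000 gives megabytes
display 91707588 / 1000000

* Step 2: allocate slightly more memory than the file needs
set memory 100m

* Step 3: read the data
use `bigfile', clear
```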

Log and Do Files

Log files

log using "C:\data\ir299_01.log"


log using "C:\data\ir299_01.log", replace
log using "C:\data\ir299_01.log", append
log close

Log and do files are very useful. Logs keep a record of the commands you have issued and
their results during your Stata session. By default, everything displayed on the screen
will be recorded in the log file.

You can give the file any name you like, but you should use names that will help you
remember what analyses you did during that session. If you leave off the extension, Stata
appends one automatically: ".smcl" by default, or ".log" when you request a plain-text
log. A file name ending in ".log", as in the examples above, produces a plain-text log.
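
A typical pattern is sketched below. The file name session01.log is hypothetical, and
the auto dataset shipped with Stata is used only so that there is something to record.
The capture prefix keeps Stata from stopping with an error when no log is open.

```stata
* Close any log left open from an earlier run (no error if none is open)
capture log close
* Start a plain-text log, overwriting an older file of the same name
log using "session01.log", replace text
sysuse auto, clear
summarize price mpg
log close

* Reopen the same log later and add to it instead of overwriting
log using "session01.log", append text
describe, short
log close
```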

Do files

A "do" file is a set of commands, written just as you would type them one by one during
a regular Stata session. Any command you use in Stata can be part of a do file.

Do files are good for long series of commands that may need to be "tweaked" to work
properly. They are particularly useful when you have many commands to issue repeatedly,
and they are necessary to reproduce results, with minor or no changes, on new or
modified datasets.
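
As a sketch, a short do file might look like the following. The file name analysis01.do
and the variable lnprice are hypothetical, and the auto dataset shipped with Stata
stands in for your own data.

```stata
* analysis01.do (hypothetical file name)
* Run it from the Stata prompt with:  do analysis01.do
clear
sysuse auto, clear
* Create a logged price variable and summarize it
gen lnprice = ln(price)
summarize price lnprice
* Cross-tabulate origin by repair record
tab foreign rep78
```

Because every step is written down, rerunning the analysis on a modified dataset only
requires changing the line that loads the data.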

Examining your Data

Describe
Once you have the data in Stata, you will want to make sure that all the variables are
there and that they are in the format you need. You can do this with the "describe"
command. Describe, which can be abbreviated as simply "d," will provide basic
information about the file and the variables. You don’t have to call the data into Stata to
be able to describe it, though. The command:

d using "C:\data\ncr00.dta"

will accomplish this.
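
For illustration, the sketch below runs describe on the auto dataset shipped with Stata
(a stand-in for your own file) and then reads back the saved results r(N) and r(k),
which hold the number of observations and the number of variables.

```stata
sysuse auto, clear
* Describe every variable in memory
describe
* Describe only selected variables
describe price mpg foreign
* Retrieve the number of observations and variables
quietly describe
display "observations: " r(N) "   variables: " r(k)
```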

Keep, Drop, and Rename

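Well, now we’ve created many new variables and converted some old ones. Since we no
longer need all of them, we’ll want to eliminate the ones we don’t really need and
perhaps rename some of the ones we keep. Followed by a list of variable names, keep and
drop select variables; followed by if, as below, they select observations. We can either
keep the observations we are interested in or drop the ones we are not, and rename a
variable: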

keep if prov==71
drop if prov==72
ren p6 age
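
A minimal sketch of both uses of keep and drop, on variables and on observations,
assuming the auto dataset shipped with Stata in place of the census file (mileage is a
hypothetical new name):

```stata
sysuse auto, clear
* Keep only the listed variables; all others are removed
keep make price mpg foreign
* Drop observations that meet a condition
drop if foreign == 1
* Rename a variable
rename mpg mileage
describe
```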

Creating Variables

Stata can store data as either numbers or characters (strings). Because Stata allows
most analyses only on numeric variables, you will need to convert string data to numeric
data before analyzing it.
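
Two commands commonly used for this conversion are destring, for strings that contain
digits, and encode, for strings that contain category labels. The sketch below uses a
small made-up dataset; the variable names agestr, age, region, and regioncode are
hypothetical.

```stata
clear
input str4 agestr str6 region
"34" "north"
"27" "south"
"51" "north"
end
* Convert digits stored as text into a numeric variable
destring agestr, generate(age)
* Convert category labels into numeric codes with value labels attached
encode region, generate(regioncode)
summarize age
tab regioncode
```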

Generate

Often, you will need to create new variables based on the ones you have already. The two
most common ways of creating new variables are "generate" (abbreviated "gen") and
"egen." Here are some examples of gen:

gen total= var1 + var2 + var3


gen kid014= kids06 + kids714
gen lnwage= ln(wage)

In the first example, we generate a new variable called "total" which is simply the
sum of var1, var2, and var3; in the third example, we generate a variable called
“lnwage” which is the natural log of wage.
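
One point to keep in mind is that gen returns a missing value whenever any term in the
expression is missing. The sketch below, on a small made-up dataset, contrasts this with
egen's rowtotal() function, which treats missing values as zero; total2 and lnvar1 are
hypothetical variable names.

```stata
clear
input var1 var2 var3
1 2 3
4 . 6
end
* Missing in the second row because var2 is missing there
gen total = var1 + var2 + var3
* rowtotal() ignores missing values, so the second row sums to 10
egen total2 = rowtotal(var1 var2 var3)
* Natural log of a variable
gen lnvar1 = ln(var1)
list
```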

Dummy variables
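Sometimes we need to generate one or more "dummy" variables. Stata makes this very
easy. If you want a dummy variable to indicate only a particular category, put the
condition in parentheses; a categorical variable such as grader can be built up with gen
and replace: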

ren p7 sex
gen male=(sex==1)
ren p9 mstat
gen married=(mstat==2 | mstat==5)

ren p22 educ


gen grader=1 if educ<16
replace grader=2 if (educ==16 | educ==17)
replace grader=3 if (educ>=21 & educ<=24)
replace grader=4 if (educ==25)
replace grader=5 if (educ>=31 & educ<=58)
replace grader=6 if (educ>=61 & educ<=71)

In the first gen command above, Stata creates a dummy variable such that male = 1 if sex
equals 1 (the code for male), 0 otherwise. In general, using this form of the command
will make Stata create a dummy variable equal to 1 for each observation where the
expression in parentheses is true, and equal to zero otherwise.
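
Be aware that a logical expression evaluates to 0, not missing, when the underlying
variable is missing. A minimal sketch on made-up data, where male2 is a hypothetical
variable that stays missing when sex is missing:

```stata
clear
input sex
1
2
.
end
* Missing sex is coded 0, as if the person were known not to be male
gen male = (sex == 1)
* Adding an if qualifier leaves the dummy missing when sex is missing
gen male2 = (sex == 1) if !missing(sex)
list
```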

Extended Generate (egen)

"egen", or "extended generate" is useful when you need a new variable that is the mean,
median, etc. of another variable, for all observations or for groups of observations. Egen
is also useful when you need to simply number groups of observations based on some
classification variables. Here are some examples:

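ren p2h rel

* flag children by relationship code, then those aged 0-6
gen kids=(rel>=3 & rel<=8)
gen kids06=(kids==1 & (age>=0 & age<=6))
* count per household; kids20 is assumed to have been created beforehand, like kids06
bysort hhid: egen hhkids20=sum(kids20)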
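
The sketch below applies these ideas to a small made-up household file. The variables
hhkids06, hhmeanage, and hhnum are hypothetical, and total() is used here because it is
the documented name in current Stata for the function written as sum() above.

```stata
clear
input hhid age
101 34
101 5
101 2
205 40
205 10
end
gen kids06 = (age >= 0 & age <= 6)
* Number of children aged 0-6 in each household
bysort hhid: egen hhkids06 = total(kids06)
* Mean age within each household
bysort hhid: egen hhmeanage = mean(age)
* Number the households consecutively: 1, 2, ...
egen hhnum = group(hhid)
list
```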

Basic Commands

Now that you have your data in a format you want, check it before doing any analyses.
This can save you quite a bit of frustration later on.

Summarize

"sum", short for summarize, will give you the means, standard deviations, etc. of the
variables listed. If you don’t list any variables, it will give you the information for
all variables in the file. If a variable you thought was numeric shows up as having 0
observations and no mean, then, most likely, Stata still thinks it’s a character
variable.

sum
sum age male
bysort male: sum age
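
As a sketch, the detail option adds percentiles and other statistics, and summarize
saves its results in r() for later use. The auto dataset shipped with Stata stands in
for your own data; make is a string variable in that file.

```stata
sysuse auto, clear
* Percentiles, skewness, and kurtosis in addition to the mean
summarize price, detail
display "mean: " r(mean) "   median: " r(p50)
* A string variable is reported with 0 observations
summarize make
```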

Tabulate

"tab", short for tabulate, will produce frequency tables. By specifying two variables, you
will get a crosstab.

tab grader
tab1 grader sex prov
tab prov sex
tab prov sex, col
tab prov sex, col row
tab prov sex, col nofreq
bysort prov: tab grader sex
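
In these commands, col and row add column and row percentages, and nofreq suppresses the
raw counts. tab also leaves out missing values unless you add the missing option. A
brief sketch, assuming the auto dataset shipped with Stata, in which rep78 is missing
for five cars:

```stata
sysuse auto, clear
* One-way table; observations with missing rep78 are left out
tab rep78
* Include missing values as their own category
tab rep78, missing
* Row percentages only, without the counts
tab foreign rep78, row nofreq
```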
