
Applied Financial Econometrics using Stata

2. Working with Data


Stan Hurn
Queensland University of Technology
& National Centre for Econometric Research
Hurn (NCER) Applied Financial Econometrics using Stata 1 / 43
1. Data Sources
2. General Data Issues
3. Time Series Data
4. Reconfiguring Data
Data Sources
Basic Principle
Stata stores data in .dta files.
One of the implications of the concern for reproducible work: avoid
altering data in a spreadsheet. Rather, you should transfer external data
into the Stata environment as early as possible in the process of analysis,
and only make changes to the data with do-files that can give you an audit
trail of every change made.
Stata Data Files
Reading and writing binary (.dta) files is much faster than dealing
with text (ASCII) files using the insheet or infile commands.
As we shall see, binary files permit variable labels, value labels, and
other characteristics of the file to be saved.
To write a Stata binary file, the command
save file [,replace]
is employed.
The compress command can be used to economise on the disk space
(and memory) required to store variables.
To make the binary file readable by earlier versions, use the saveold
command. Note, however, that Stata 13 uses a new dataset format
to accommodate long string variables. saveold in Stata 13 will create
a dataset usable (except for long strings, or strLs) in version 11 or 12.
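A minimal sketch of the save, compress and saveold steps described above. The file names are illustrative assumptions, and auto.dta is the example dataset shipped with Stata.

```stata
// load an example dataset that ships with Stata
sysuse auto, clear

// shrink each variable to the smallest storage type that holds its values
compress

// write a binary .dta file in the current format (file name is hypothetical)
save "auto_copy.dta", replace

// write a copy readable by earlier versions of Stata
saveold "auto_copy_old.dta", replace
```

Running compress before save is a cheap habit: it never changes the values stored, only the storage types.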
Loading Data
For data saved in other file formats (such as SAS or SPSS) Stata
supports a handy piece of software called Stat/Transfer.
http://www.stata.com/products/stat-transfer/
Unfortunately Stat/Transfer does not come with Stata 13 and is an
additional expense. Neither does it handle EViews files at this point!
From a data perspective, two really useful commands are:
1. webuse
2. freduse
webuse
Specify URL from which dataset will be obtained
webuse set [http://]url[/]
then simply issue the command and the name of the file
webuse mydata.dta
Stata will automatically use the specified URL to access the web and
load the named data file (which must be a .dta file).
All the data used in this set of lectures may be accessed using the
URL
http://www.ncer.edu.au/stata-resources/data/Singapore
freduse
An important source of macro and financial data is the St. Louis Federal
Reserve database (FRED)
http://research.stlouisfed.org/fred2
To download data directly from the database you first need to
download and install a user-written package from the web
findit freduse
click on st0110 from http://www.stata-journal.com/software/sj6-3
This will install the package.
There is also a YouTube video on how to use this facility.
Example - webuse
The following series of commands in a .do file
// real us gdp
clear
webuse set http://www.ncer.edu.au/stata-resources/data/Singapore
webuse "usrealgdp.dta", clear
will go to the NCER website and load the Stata usrealgdp.dta file.
Example - freduse
The Federal Reserve Economic Data (FRED) website can be searched to find the tag for real
US GDP.
// real us gdp
freduse GDPC1
One thing to be a little careful about is the date convention used by FRED:
observations are stamped with daily dates, even for a quarterly series such as GDPC1.
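A hedged sketch of how the FRED date convention might be handled. It assumes the user-written freduse package is installed, that an internet connection is available, and that freduse stores the observation date as a Stata daily date in a variable called daten; the name qdate is introduced here for illustration only.

```stata
// download real US GDP from FRED (requires the freduse package)
freduse GDPC1, clear

// assumption: daten holds a daily date marking the start of each quarter;
// convert it to a quarterly date so that lags and differences work by quarter
gen qdate = qofd(daten)
format qdate %tq

// declare the data to be a quarterly time series
tsset qdate
```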
General Data Issues
Missing Data
Missing value codes in Stata appear as the dot (.)
The missing function mi(varname) returns 1 if the observation is a
missing value, 0 otherwise.
A missing value is in fact treated as larger than any number (effectively
positive infinity), so in the
presence of missing data you do not want to say
generate hiprice = (price > 10000)
but rather
generate hiprice = (price > 10000) if price <.
or preferably
generate hiprice = (price > 10000) if !mi(price)
which then generates an indicator (dummy) variable equal to one for
high prices, zero for low prices, and missing when price is missing.
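The difference between the naive and the guarded indicator can be seen in a small sketch. It uses auto.dta, which ships with Stata; the three missing prices are planted purely for illustration, and the names hi_naive and hi_safe are hypothetical.

```stata
sysuse auto, clear

// plant some missing prices for illustration only
replace price = . in 1/3

// naive version: missing prices compare as larger than 10000, so they are coded 1
generate hi_naive = (price > 10000)

// guarded version: missing prices stay missing
generate hi_safe = (price > 10000) if !mi(price)

// cross-tabulate, showing missing values as their own category
tabulate hi_naive hi_safe, missing
```

The cross-tabulation should show the planted observations counted as "high price" by the naive indicator but left missing by the guarded one.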
More on Missing Data
Stata actually allows for multiple missing value codes (.a,
.b, .c, ..., .z). The standard missing value code (.) is the smallest of
these. This is useful if you wish to encode in the data the fact that
data are missing for different reasons.
Spreadsheet files often use NA to denote missing values, while in
some datasets codes such as -9, -999, or -0.001 are used.
The former is problematic because Stata will read an entire variable as a
string and not a numeric value if NA appears anywhere (this is
particularly irksome when dealing with Excel spreadsheets).
The latter instances are particularly worrisome as they may not be
detected unless the variables' values are carefully scrutinized.
Stata provides two further commands to deal with missing values,
namely, the mvdecode and mvencode commands. mvdecode maps
specified numeric values into missing value codes, and mvencode maps
missing value codes back into numeric values.
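As an illustration of mvdecode and mvencode, the sketch below builds a tiny made-up dataset in which -999 and -9 stand for two different reasons for missingness. The variable names and values are hypothetical.

```stata
clear
input id income
1 52000
2 -999
3 -9
4 41000
end

// turn the numeric codes into distinct extended missing values
mvdecode income, mv(-999 = .a \ -9 = .b)
list

// reverse the mapping, for example before exporting to another package
mvencode income, mv(.a = -999 \ .b = -9)
list
```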
Labels
Variable labels
A variable label is a character string (maximum 80 characters) which describes the
variable
label variable varname "text"
Variable labels are used to identify the variable in printed output and
graphs
Value labels
Value labels associate numeric values with character strings
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl
They exist separately from variables, so that the same mapping of
numerics can be applied to a set of variables (e.g. 1=very
satisfied...5=not satisfied may be applied to all responses to questions
about consumer satisfaction).
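A short sketch of one value label shared by several variables. The survey variables q1 and q2, their values, and the label name satlbl are invented for illustration.

```stata
clear
input q1 q2
1 5
3 2
5 1
end

// describe the variables themselves
label variable q1 "Satisfaction with service"
label variable q2 "Satisfaction with price"

// define the mapping once, then attach it to both variables
label define satlbl 1 "very satisfied" 5 "not satisfied"
label values q1 q2 satlbl

// labelled values appear as text where a label is defined
tabulate q1
```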
Notes on Individual Series
The command
notes : text
will enter notes on the dataset.
. notes list _dta
_dta:
1. This file was constructed from an excel data file provided by Mardi Dungey. It allows reproduction of the
results of estimating a SVAR of the Australian economy developed in a series of papers by Mardi Dungey and
Adrian Pagan (2000, 2009) published in the Economic Record.
2. Dungey, M. and Pagan, A.R. (2000) A Structural VAR of the Australian Economy, Economic Record, 76, 321-342.
3. Dungey, M. and Pagan, A.R. (2009) Revisting an SVAR of the Australian Economy, Economic Record, 85, 1-20.
Notes on Individual Series
The command
notes variablename: text
will enter notes on the named variable.
. notes rgdp tot uscpi djindus
rgdp:
1. Real US GDP in US$ 1979Q3 to 2006Q4 (seasonally adjusted). Source - DATASTREAM
tot:
1. Australian Terms of Trade 1979Q3 to 2006Q4 (seasonally adjusted). Source - RBA Bulletin
uscpi:
1. US Consumer Price Index 1979Q3 to 2006Q4 (not seasonally adjusted). Source - DATASTREAM
djindus:
1. Dow Jones Industrial Price Index 1979Q3 to 2006Q4. Source - DATASTREAM
Order
Stata has one important feature that will be foreign to people used to
coding in Gauss or Matlab. The order of variables in the dataset
matters.
This is because you can use hyphenated lists to include all variables
from the first named variable through the last.
You can also use wildcards to refer to all variables with a certain
prefix. If you have variables pop60, pop70, pop80, pop90, you can
refer to them in a varlist as pop* or pop?0.
The order command (and the older move command) can alter the order of variables.
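The effect of variable order on hyphenated lists can be sketched with auto.dta, which ships with Stata. This assumes a version of Stata in which order accepts the after() option.

```stata
sysuse auto, clear

// hyphenated list: every variable stored from price through weight
summarize price-weight

// wildcard: every variable whose name starts with t
describe t*

// move weight so that it sits immediately after price
order weight, after(price)

// the same hyphenated list now covers only price and weight
summarize price-weight
```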
Time Series Data
Dates
Stata supports date (and time) variables and the creation of a time
series calendar variable.
Dates are expressed, as they are in Excel, as the number of days from
a base date. In Stata's case, that date is 1 Jan 1960 (the same idea as
the Unix/Linux epoch, although that starts at 1 Jan 1970). Negative numbers
represent dates and times before
January 1, 1960, and positive numbers represent dates and times
afterward.
You may set up data on an annual, half-yearly, quarterly, monthly,
weekly or daily calendar, as well as a calendar that merely uses the
observation number. High-frequency data are also supported.
An observation-number calendar is generally necessary for
business-daily data where you want to avoid gaps for weekends,
holidays etc. which will cause lagged values and differences to contain
missing values (creating two calendar variables for the same time
series data can be useful).
Dates and Times
tsset
Useful commands to create date variables are
gen daten = tq(1970q1) + _n - 1   // quarterly
gen daten = tm(1970m1) + _n - 1   // monthly
gen daten = tw(1970w1) + _n - 1   // weekly
gen daten = _n                    // observation number
After creating the date variable you need to tell Stata that the data you
have are time series data. You do this using the command
tsset daten, monthly
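A self-contained sketch of building a quarterly calendar from scratch and declaring it with tsset. The eight empty observations are created only so that there is something to date.

```stata
clear
set obs 8

// quarterly date: 1970q1 for the first observation, 1970q2 for the second, ...
gen daten = tq(1970q1) + _n - 1

// display the underlying integers as quarters
format daten %tq

// declare the data to be a quarterly time series
tsset daten, quarterly
list daten in 1/4
```

The unit given to tsset should match the function used to build the variable (tq with quarterly, tm with monthly, and so on).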
Reading Dates
Say you have date strings like November 3, 2010, 11/3/2010 or
2010-11-03 08:35:12 in a file and you wish to read these into Stata
date format.
This can be achieved using the date function. The date function
takes two arguments, the string to be converted, and a series of
letters called a mask that tells Stata how the string is structured.
In a date mask, Y means year, M means month, D means day and #
means an element should be skipped.
gen date1=date(dateString1,"MDY")
gen date2=date(dateString2,"MDY")
gen date3=date(dateString3,"YMD###")
Be aware that if you have a string with two-digit years,
Stata cannot tell which century is meant. You can overcome this by adding an
optional third argument which specifies the largest year in the data.
This way Stata will ensure that the last year is numbered correctly.
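A small sketch of the date function on made-up strings, including the two-digit-year case. The variable names s and d are illustrative.

```stata
clear
input str20 s
"November 3, 2010"
"11/3/2010"
end

// both strings are month-day-year, so one mask handles them
gen d = date(s, "MDY")
format d %td
list

// two-digit years: either put the century in the mask ...
display %td date("11/3/10", "MD20Y")

// ... or give the largest year in the data as a third argument
display %td date("11/3/10", "MDY", 2020)
```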
Seasonal Dummies
The process of creating a set of quarterly or monthly dummy variables is a
bit more elaborate in Stata than in a package such as EViews.
Assume there is a suitably defined and formatted date identifier called
dateid.
// quarterly dummy variables
forvalues i = 1/3 {
    gen seas`i' = (quarter(dofq(dateid)) == `i')
}
// monthly dummy variables
forvalues i = 1/11 {
    gen seas`i' = (month(dofm(dateid)) == `i')
}
Business Days
Earlier versions of Stata offered little support for
business day calendars. This required a bit of trickery to overcome:
running a calendar with both observation numbers and date identifiers
using, say, FRED to create the date identifier
Version 13 adds the command bcal create.
bcal create creates a business calendar file from the
current dataset and describes the new calendar. sp500.dta is a
dataset installed with Stata that has daily records on the S&P 500
stock market index in 2001. A business calendar for stock trading in
2001 can be automatically created from this dataset as follows:
bcal create sp500, from(date) purpose(S&P 500 for 2001) generate(bizdate)
Time Series Operators
The D., L., and F. operators may be used under a time series
calendar (including panel data) to specify first differences, lags, and
leads, respectively.
These operators understand missing data, and number lists: e.g.
L(1/4).x is the first through fourth lags of x.
The time series operators respect the time series calendar, and will
not mistakenly compute a lag or difference from a prior period if it is
missing (this may be particularly important when working with panel
data).
You can refer to time-series operators on the fly, for example
regress y L(1/4).y
regress y L(-4/4).x
regress D.y L.y
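That the operators respect the calendar can be checked on a toy series with a gap. All names and values below are invented for illustration.

```stata
clear
set obs 6

// time index 1 2 3 5 6 7: period 4 is absent
gen t = _n
replace t = t + 1 if _n > 3
gen x = t^2

tsset t

// lag and first difference computed with time series operators
gen lag1 = L.x
gen diff1 = D.x

// at t = 5 both are missing, because period 4 does not exist
list t x lag1 diff1
```

Using x[_n-1] instead of L.x would silently pair period 5 with period 3, which is the mistake the operators avoid.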
Basic Plotting
// load data
webuse set http://www.ncer.edu.au/stata-resources/data/Singapore
webuse usrealgdp.dta, clear
// declare data to be time series
tsset daten, quarterly
// plot real us gdp
twoway (line rgdp daten), name(realusgdp) ///
xtitle(" ") ytitle("$ Billions") title("Real U.S. GDP")
// graph export "realusgdp.pdf", as(pdf) replace
Alternatively, once the data have been tsset, use
twoway tsline rgdp
A few simple manipulations
// compute growth rate in gdp
gen gr = 400*(log(rgdp)-log(L1.rgdp))
label var gr "Growth rate of real US GDP"
twoway (line gr daten), name(growth) ///
xtitle(" ") ytitle("%") title("Growth Rate of Real U.S. GDP")
// graph export "growth.pdf", as(pdf) replace
// smooth growth rate with 7 period moving average
tssmooth ma c = gr, window(3 1 3)
label var c "US Business Cycle"
twoway (line c daten, lwidth(medthick)) ///
|| (line gr daten, lwidth(vthin) lpattern(dash)), name(cycle) ///
xtitle(" ") ytitle("%") title("U.S. Business Cycle") legend(off)
Some annotation
// provide some annotation
gen max = 10
// major recessions
twoway (area max daten if tin(1973q3,1975q1), base(-5) color(gs12)) ///
(area max daten if tin(2007q3,2009q2), base(-5) color(gs12)) ///
(line c daten), xtitle(" ") ytitle("%") legend(off) ///
title("Major U.S. Contractions") name(g1,replace) nodraw
// major expansions
twoway (area max daten if tin(1991q2,2007q1), base(-5) color(eltgreen)) ///
(line c daten), xtitle(" ") ytitle("%") legend(off) ///
title("Major U.S. Expansion") name(g2,replace) nodraw
graph combine g1 g2, xcom ycom rows(2) cols(1)
graph export "realusgpdannotated.pdf", as(pdf) replace
Snapshot of the log file
Annotated gdp graph
[Figure: two stacked panels titled "Major U.S. Contractions" and "Major U.S. Expansion"; vertical axis in % (-5 to 10), horizontal axis 1950q1 to 2010q1]
Graph Schemes
Reconfiguring Data
Manipulating Datasets
Stata only permits a single data set to be accessed at one time. How,
then, do you work with multiple data sets?
At least one of the datasets to be combined must already have been
saved in Stata format.
Several commands are available, including append, merge, and joinby.
append
The append command combines two Stata-format data sets that possess variables in
common, adding observations to the existing variables.
The datasets to be combined should share the same variable names
and datatypes (string vs. numeric). Appending these two datasets
with common variable names creates a single dataset containing all of
the observations. It is important to note that PRICE and price
are different variables, and one will not be appended to the other.
The same variables need not be present in both files, as long as a
subset of the variables are common to the master and using data
sets. Where required, the values of any variables which are not
common are set to missing.
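A minimal sketch of append on two made-up datasets that share id and var1 but not their third variable. The values and the temporary file name are hypothetical, and the code is meant to be run as one do-file because it relies on a tempfile.

```stata
// first dataset, saved to a temporary file
clear
input id var1 var2
112 10 1.5
216 20 2.5
449 30 3.5
end
tempfile first
save `first'

// second dataset: shares id and var1, but has var3 instead of var2
clear
input id var1 var3
126 40 7
309 50 8
end

// add the observations of the first dataset to the one in memory
append using `first'

// var2 is missing for ids 126 and 309; var3 is missing for the rest
list
```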
Two hypothetical datasets
dataset1

id    var 1   var 2
112   ...     ...
216   ...     ...
449   ...     ...

dataset2

id    var 1   var 2
126   ...     ...
309   ...     ...
421   ...     ...
604   ...     ...

using append
We may append these datasets:
dataset

id    var 1   var 2
112   ...     ...
216   ...     ...
449   ...     ...
126   ...     ...
309   ...     ...
421   ...     ...
604   ...     ...

The rule for append, then, is that if datasets are to be combined, they
should share the same variable names and datatypes (string vs. numeric).
merge
merge is Stata's basic tool for working with more than one dataset.
merge works on a master dataset (the current contents of
memory) and a single using dataset. This distinction is important.
When a variable appears in both datasets, Stata's default behaviour is
to hold the master data inviolate and
discard the using dataset's copy of that variable. This may be
modified by the update option, which specifies that non-missing
values in the using dataset should replace missing values in the
master, and the even stronger update replace, which specifies that
non-missing values in the using dataset should take precedence.
The variable _merge takes on integer values indicating whether an
observation appears in the master only, the using only, or appears in
both. This may be used to determine whether the merge has been
successful, or to remove those observations which remain unmatched.
The _merge variable must be dropped before another merge is
performed on this data set.
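A hedged sketch of a one-to-one merge on id, using the merge 1:1 syntax of Stata 11 and later. The datasets and values are invented, and the code should be run as one do-file because it relies on a tempfile.

```stata
// using dataset: extra variables for some of the ids
clear
input id var3 var4
112 1 2
216 3 4
604 5 6
end
tempfile extra
save `extra'

// master dataset: the one held in memory when merge is called
clear
input id var1 var2
112 10 11
216 20 21
449 30 31
end

// match observations on the key variable id
merge 1:1 id using `extra'

// 1 = master only, 2 = using only, 3 = matched
tabulate _merge

// keep the matched observations and drop _merge before any further merge
keep if _merge == 3
drop _merge
list
```

Here id 449 appears only in the master and id 604 only in the using dataset, so _merge takes all three values before the keep.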
Two hypothetical datasets
dataset1

id    var 1   var 2
112   ...     ...
216   ...     ...
449   ...     ...

dataset2

id    var 3   var 4   var 5
112   ...     ...     ...
216   ...     ...     ...
449   ...     ...     ...

using merge
We may merge these datasets on the common merge key: in this case, the
id variable:
merged dataset

id    var 1   var 2   var 3   var 4   var 5
112   ...     ...     ...     ...     ...
216   ...     ...     ...     ...     ...
449   ...     ...     ...     ...     ...

The rule for merge, then, is that if datasets are to be combined on one or
more merge keys, they each must have one or more variables with a
common name and datatype (string vs. numeric).
postfile and post
A very useful capability is provided by the postfile and post
commands, which permit a Stata data set to be created in the course
of a program.
If you are simulating the distribution of a statistic, fitting a model
over separate samples, or bootstrapping standard errors, you may post
certain numeric values to a postfile within the looping structure using
the command post.
This will create a separate Stata binary data set, which may then be
opened in a later Stata run and analysed. Note that each posted
expression must correspond, in order, to a variable declared in the
postfile command.
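A small sketch of the postfile workflow for simulating the distribution of a sample mean. The handle name, variable names, seed, sample size and number of replications are all illustrative choices, and the code should be run as one do-file because it relies on temporary names.

```stata
clear all
set seed 12345

// a handle for the postfile and a temporary file to hold the results
tempname sim
tempfile results
postfile `sim' rep mean_x using `results'

forvalues r = 1/200 {
    quietly {
        // draw a fresh sample of 50 standard normal values
        drop _all
        set obs 50
        gen x = rnormal()
        summarize x
    }
    // one posted expression per variable declared in postfile
    post `sim' (`r') (r(mean))
}

// close the postfile, then open and analyse the simulated means
postclose `sim'
use `results', clear
summarize mean_x
```

Posting inside the loop keeps the results separate from the data in memory, which is overwritten on every replication.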
Panel Data
Data are often provided in a different orientation than that required for
statistical analysis. The most common example of this occurs with panel,
or longitudinal, data, in which each observation conceptually has both
cross-section (i) and time-series (t) subscripts.
Different models applied to longitudinal data require different
orientations of those data.
SUR needs wide data
Fixed-effects or random-effects regression models prefer that the data
be stacked in the long format.
The reshape command allows you to transfer the data from wide
to long format or vice versa.
Wide format
Consider this data fragment on U.S. states, with the unit's name for the
rows and the year for the columns. This data is in wide form with the same
measurement (population) for different years denoted as separate Stata
variables.
Long format
The command
. reshape long pop, i(state) j(year)
will produce
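Since the original tables are not reproduced here, the following sketch shows the same reshape on a tiny made-up fragment. The state codes and population figures (in millions) are illustrative values only.

```stata
clear
input str2 state pop1990 pop2000 pop2010
"CA" 29.8 33.9 37.3
"TX" 17.0 20.9 25.1
end

// wide to long: one row per state-year, with year taken from the variable suffix
reshape long pop, i(state) j(year)
list, sepby(state)

// and back again to one row per state
reshape wide pop, i(state) j(year)
list
```

The stub pop is the common prefix of the wide variables; i() names the unit identifier and j() names the new variable that will hold the suffix.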
collapse
But what if you want to use average values for each time period,
averaged over states?
The resulting dataset of T observations can be easily created by the
collapse command, which permits you to generate a new data set
comprised of summary statistics of specified variables.
More than one summary statistic can be generated per input variable
as collapse can produce counts, means, medians, percentiles, extrema,
and standard deviations.
collapse replaces the dataset in memory with the summary statistics, so
be sure to save your data before you use the
command!
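A minimal sketch of collapse producing several statistics per period from a made-up long-format panel. The names and values are illustrative.

```stata
clear
input str2 state year pop
"CA" 1990 29.8
"CA" 2000 33.9
"TX" 1990 17.0
"TX" 2000 20.9
end

// one observation per year: mean, standard deviation and count across states
collapse (mean) mean_pop = pop (sd) sd_pop = pop (count) n = pop, by(year)
list
```

After the command runs, the state-level observations are gone from memory and only the T yearly summary rows remain.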
A word or two of caution
With the proper use of reshape, writing data out and reading them
back in is not necessary in Stata. In situations beyond the simple
application of reshape, it may require some experimentation to
construct the appropriate command syntax. This is all the more
reason for enshrining that code in a do-file as some day you are likely
to come upon a similar application for reshape.
Stata provides two commands preserve and restore.
preserve preserves the data, guaranteeing that data will be restored
after program termination
restore forces a restore of the data when required.
These can be helpful in complex .do files.
Current advice from a contact in Stata suggests sparing use of these
commands and saving the data instead!
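For completeness, a short sketch of preserve and restore wrapped around a destructive command, using auto.dta, which ships with Stata.

```stata
sysuse auto, clear

// take a snapshot of the data in memory
preserve

// destructive step: the dataset is replaced by group means
collapse (mean) price, by(foreign)
list

// bring the original 74 observations back
restore
describe, short
```

The alternative suggested above, saving the data to disk and reloading it with use, gives the same protection and leaves a file that can be inspected afterwards.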