
STATA LESSON

Applied Economics
March 2012
Instructor: Ainhoa Aparicio-Fenoll

Outline:
1. Introduction
2. Dataset management
3. Generating and recoding arrays
4. Control flows
5. Data overview
6. Stata for econometrics

Materials:
http://intranet.barcelonagse.eu
http://www.stata.com/links/resources1.html

1. INTRODUCTION

STATA is a general command-driven package for statistical analysis, data management and graphics. It can be considered a statistics package, like SAS, SPSS, RATS or EViews. STATA operates in a graphical (windowed) environment.

Basic features

When to use STATA? STATA has three major strengths: data manipulation, statistics and graphics.

Which STATA to use? The versions differ in the maximum number of variables and the maximum matrix size. In general, Small STATA < STATA/IC < STATA/SE and STATA/MP. The number of observations is limited only by the memory of your computer. These notes refer to STATA 11.

You may want to update STATA regularly by typing update query in the command window. You may also want to install user-written commands with ssc install. You may abbreviate commands and variable names as long as the abbreviation is unambiguous. You should type commands instead of using the menus.

Common file extensions:

.dta: data files in STATA format
.raw: data files in ASCII/text format
.log: STATA output
.do: command file
.gph: STATA graph file
.dct: dictionary with instructions to read an ASCII data file
.ado: STATA program files (automatically loaded do-files)

Interface description

[Figure: screenshot of the STATA interface showing the Variables, Results, Review and Command windows]

By default, STATA opens with four windows:

Results: all of your commands and their results are displayed here (with the exception of graphs, which are displayed in their own window). Anything displayed in blue can be clicked on to get help or other information. When the results are too long, the word more appears; click on it to see the rest. If you want all the results to scroll by without pausing, even if you miss the upper part, type: set more off, permanently

Review: only your commands are displayed here. You can click on any command in this window and it will be pasted into the Command window. The Review window has one extra option in its window icon menu, "Save Review Contents", which saves everything in the Review window to a file for later use. This is not a substitute for using log-files and do-files.

Command: this is where you type your commands when working in interactive mode. Any command typed here is executed when you press the Enter key. Everything you type here is echoed in the Results window as well as in the Review window. The "Page Up" and "Page Down" keys can be used to scroll back and forth through the commands you have executed previously. You can also copy and paste between this window and your do-file.

Variables: a list of all your variables and their labels is displayed here. You can click on a variable and it will be pasted into the Command window.

There are three other windows in STATA:

Do-file editor: this is STATA's simple text editor for writing do-files, or programs. You should do all of your work in a do-file so that you can reproduce it later on. It is accessible through Window>Do-file editor>New Do-file or by clicking on the envelope icon. You can execute all the commands or just some of them, and you can choose whether or not the results are displayed in the Results window.

Viewer: the Viewer is used for displaying help and log-files. As in the Results window, anything displayed in blue can be clicked on for more information. This window is reached by clicking on Help>Content or by typing help in the Command window.

Graph: this is where all of your graphs are displayed.

The tool bar:

[Figure: the STATA toolbar, with its buttons numbered (1) to (10)]

(1): Opens a dataset
(2): Saves a dataset
(3): Prints graphs or contents of viewers
(4): Begins, closes or suspends a log-file
(5): Brings the Viewer to the front
(6): Brings the Results window to the front
(7): Opens a do-file
(8): Opens the data editor
(9): Opens the data browser
(10): Breaks/stops the operation

Menu bar:

File: to open, save, view, launch do-files, save graphs, print graphs or results
Edit: copy and paste options
Prefs: change color scheme, print options, window preferences
Windows: brings the various windows to the front, such as the Command, Results, Variables windows, etc.
Help: contents, search for a term or command, updates from the STATA website
Data/Graphics/Statistics/User: instead of typing the commands for data manipulation, estimation, graphs or statistics in the Command window, you can use these menus.

Getting help

There are several ways of getting help in STATA. The first, of course, is the help command. You can enter the help command in the Command window and the help text will be displayed in the Results window. Or you can use the "STATA command" option in the Help menu and the help text will be displayed in the Viewer window. The help command is useful only if you know the name of the command for which you want help. If you do not know it, you should use the search command, which does a keyword search of the STATA documentation. The search command (or Help menu option) has two versions: one which limits the search to your computer and one which also searches STATA's website ("Search net resources"). You must be connected to the Internet to search the website, and if you are, this is the better choice.

A help page always has the same structure: first the syntax of the command, second the description of the command and all its options, third some examples, and last the related commands. Whenever an expression is written in blue, you can click on it to find further information about it.
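As a small illustration of the help and search commands described above, the following lines could be typed in the Command window. The topics (regress, panel data) are arbitrary choices made for this sketch, and the net option requires an Internet connection.

```stata
* Help for a command whose name you already know
help regress

* Keyword search when you do not know the name of the command
search panel data

* The same keyword search, extended to materials available on the Internet
search panel data, net
```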

Data Editor and Data Browser

Data Editor: it looks like an Excel sheet. You can open it by typing edit in the command window or by clicking on the corresponding icon. You can input data by clicking on a cell, typing the value in the bar above and pressing the Enter key. You should save the data after entering them.

Data Browser: it is equivalent to the Data Editor, but it only allows you to browse the data; you cannot modify them.

Log-file

The log-file saves the STATA output in a text file. It is advisable to always create a log-file, so that you can look at all previous results after you close your session.

You open a log-file by typing log using /logfilename.log. You close a log-file by typing log close.

You can add more results to an already closed log by typing: log using /logfilename.log, append

Or you can simply replace an existing log with the command: log using /logfilename.log, replace

To take a look at a log-file, go to File>Log>View and browse to it.

Do-file

The do-file is an ASCII file that contains a sequence of commands to be executed in order. It is very useful because it makes it easy for anybody to replicate your results. When starting a do-file, you should specify:

- That the old data should be removed from memory; otherwise you will not be able to load your dataset. Use the command clear
- The directory where you will be working (that directory should contain your data, dictionary, etc.). The corresponding command is cd C:/EXAMPLE
- The amount of memory to be allocated to STATA. You must make enough memory available for STATA to load the entire file, since STATA's speed is largely derived from holding the entire dataset in memory. Use the command set memory

In addition, you may want to set the maximum number of variables (set maxvar), the maximum matrix size (set matsize) and switch the more option off (set more off). Now you can load your new dataset and work on it.
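The opening lines of a do-file described above, together with the log commands from the previous section, might be sketched as follows. The log name example.log is hypothetical, the working directory is left commented out because C:/EXAMPLE may not exist on your computer, and the auto dataset shipped with STATA stands in for your own data.

```stata
* Skeleton of a do-file (a sketch; file and folder names are hypothetical)
clear                        // remove any data from memory
set more off                 // do not pause when the output fills the screen
* cd "C:/EXAMPLE"            // uncomment and adjust to your working directory
* set memory 100m            // only needed in STATA 11 and earlier

capture log close            // close a log left open by a previous run, if any
log using example.log, replace

sysuse auto                  // example dataset shipped with STATA
summarize price mpg

log close
```

The capture prefix is used so that the do-file does not stop with an error when no log is open.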

If you want to see the results of the commands in the do-file displayed in the Results window, type do dofilename or click the corresponding icon; if you do not want the results to be displayed, type run dofilename or click the corresponding icon. If you want to execute only some of the commands in the do-file, select them and click on the icon for do or run.

[Figure: the do-file editor toolbar, with its buttons numbered]

(1): New do-file
(2): Open a folder where you can find an old do-file
(3): Save the do-file
(4): Print the do-file
(5): Find in the do-file
(6): Undo
(7): Run the commands of the do-file
(8): Do the commands of the do-file
(9): Search

2. DATASET MANAGEMENT

Getting data into STATA

You can input data by hand:

set obs # (where # is the number of observations)
edit
Type the data in the bar above and press Enter.

Or, alternatively:

input v1 v2 ... vn
(type the numbers in the command window, one observation per line)
end

Note: for missing values, type .

There are several commands to read datasets that already exist. Which one you need depends on the format in which the dataset is provided.
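For instance, entering data by hand with input might look like the sketch below. The variable names and values are invented for illustration; the dot in the second observation is a missing value.

```stata
clear
* Three observations on three variables, typed in by hand
input id wage age
1 1200 34
2 .    51
3 950  28
end

* Check what has been entered
list
```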

insheet: this is used to read ASCII files in which the values are delimited by tabs or commas. For this command to work, the data must contain only one observation per line. For instance, you may type: insheet using mydata.txt. You may also specify the delimiter explicitly: insheet using mydata.txt, delimiter(";")

infix: you need this command if the variables in your data are in a fixed format, that is, if each variable always occupies the same columns. You declare their positions by typing infix v1 1-4 v2 6-7 v3 9-11 using mydata.txt

infile: it is useful when reading complex ASCII files where several observations appear on the same line or where the values are separated by tabs, commas or spaces. For instance, you may type infile var1 var2 ... varn using filename.txt, automatic

use: this is the right command for reading data in STATA format (extension .dta), as in use mydata.dta. You may want to convert data into this format. This is done automatically when you type save mydata, replace.
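A round trip through these commands can be sketched as follows. So that the example is self-contained, it first writes a tab-delimited text file with outsheet (a command not covered in these notes) from the auto dataset shipped with STATA; the file names auto_sample.txt and auto_sample.dta are hypothetical. In recent versions of STATA, insheet and outsheet still work but have been superseded by other commands.

```stata
* Create a tab-delimited ASCII file to read back in
sysuse auto, clear
outsheet make price mpg using auto_sample.txt, replace

* Read the ASCII file: one observation per line, variable names on line 1
clear
insheet using auto_sample.txt
describe

* Convert to STATA format and load it again
save auto_sample, replace
use auto_sample, clear
```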

Finally, it is possible to create a dictionary (extension .dct) containing the instructions to read the data. The dictionary can be written in Windows Notepad and must have the following structure:

dictionary using lfs.txt {
    byte time    %2f  "period"
    byte region  %4f  "region of residence"
    byte npers   %4f  "person identifier"
    byte age     %4f  "age in years"
    byte hh      %4f  "household head"
    byte male    %4f  "male"
    byte ms      %4f  "marital status"
    byte stud    %4f  "level of studies"
    byte ocu     %4f  "occupation"
    long wage    %8f  "wage"
    byte satis   %4f  "job satisfaction"
}

where byte and long are types of variables. You should use the most appropriate type in each case, depending on whether the variable is a string or a numeric variable and on the length of the variable. The existing numeric variable types are summarized in the following table:

Storage type   Minimum                  Maximum                 Closest to 0   Bytes
byte           -127                     100                     +/-1           1
int            -32,767                  32,740                  +/-1           2
long           -2,147,483,647           2,147,483,620           +/-1           4
float          -1.70141173319*10^38     1.70141173319*10^38     +/-10^-38      4
double         -8.9884656743*10^307     8.9884656743*10^307     +/-10^-323     8

Precision for float is 3.795x10^-8. Precision for double is 1.414x10^-16.

Then, use the syntax infile using dictionaryname.dct.

Note: the data and the dictionary must be in the current working directory.

Joining, expanding and collapsing datasets

Joining datasets

1. merge joins corresponding observations from the dataset currently in memory (called the master dataset) with those from the STATA-format dataset stored as filename (called the using dataset) into single observations.

Syntax: merge [varlist] using filename [, keep(varlist) unique uniqmaster uniqusing nolabel update replace nokeep _merge(varname)]

After each merge a new variable is created (the default name is _merge), whose values are:

_merge==1   observation from the master dataset only
_merge==2   observation from only one using dataset
_merge==3   observation from at least two datasets, master or using
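A minimal sketch of a merge is given below. Note the assumption: it uses the merge 1:1 form introduced in STATA 11 rather than the older syntax quoted above, and the datasets wages.dta and ages.dta, with their variables and values, are invented for illustration.

```stata
* Using dataset: age for persons 2, 3 and 4
clear
input id age
2 41
3 29
4 55
end
save ages, replace

* Master dataset: wage for persons 1, 2 and 3
clear
input id wage
1 1200
2 1500
3 900
end
save wages, replace

* Join the two datasets on the identifier id
merge 1:1 id using ages

* _merge shows where each observation comes from
tabulate _merge
list
```

Person 1 appears only in the master dataset (_merge==1), person 4 only in the using dataset (_merge==2), and persons 2 and 3 in both (_merge==3).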

2. append appends a STATA-format dataset stored on disk to the end of the dataset in memory.

Syntax: append using filename [, nolabel keep(varlist)]

Expanding and collapsing datasets

1. collapse [(stat)] varlist [ [(stat)] ... ] [if] [in] [weight] [, options]
It transforms the dataset in memory into a dataset of sums, means, medians, etc.
Example: "collapse (sum) wage, by(zone education age)"

2. expand [=]exp [if] [in]
It replaces each observation with several copies of itself: a fixed number of copies (2 in the first example) or a number given by the value of a variable in that observation (pop in the second example).
Example: "expand 2" / "expand pop"

3. GENERATING AND RECODING ARRAYS

Variables

generate varname=x                      to create a variable
replace varname=x                       to replace the value of a variable
label variable varname "description"    to give a variable a certain description
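The commands introduced so far in this part (generate, replace, label variable, collapse and expand) can be tried together on the auto dataset shipped with STATA. This is only a sketch; the variable name price_k is invented for illustration.

```stata
sysuse auto, clear

* Create a variable, describe it and then change its values
generate price_k = price/1000
label variable price_k "Price in thousands of dollars"
replace price_k = round(price_k, 0.1)

* One observation per value of foreign, holding the group means
collapse (mean) price_k mpg, by(foreign)
list

* Duplicate every observation once: 2 observations become 4
expand 2
count
```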

Note: these commands follow the common syntax

[by varlist:] command [varlist] [if exp] [in range] [, options]

and thus can be applied to a restricted sample or in an extended version, where:

by varlist: it is used to group the data (the data must first be sorted according to the varlist with the command sort varlist)
command: you can find a list of commands below
in: it is used to restrict the sample to a certain range (e.g. in 4 restricts to observation 4; in 1/4 restricts to observations 1 to 4)
options: various options are possible depending on the command (see help), e.g. , replace replaces a variable/dataset etc. in case it already exists
if: you can restrict the sample by using the following expressions:

Arithmetic               Logical      Relational (numeric and string)
+  addition              ~  not       >   greater than
-  subtraction           |  or        <   less than
*  multiplication        &  and       >=  greater than or equal
/  division                           <=  less than or equal
^  power                              ==  equal
                                      !=  not equal
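The by prefix and the if and in qualifiers described above could be combined as in the following sketch, which assumes the auto dataset shipped with STATA.

```stata
sysuse auto, clear

* by requires the data to be sorted by the grouping variable
sort foreign
by foreign: summarize price if mpg > 20

* bysort sorts and groups in one step
bysort foreign: summarize mpg if price >= 5000 & price <= 10000

* in restricts the command to a range of observations
list make price in 1/4
```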

For the construction of a variable you might want to use certain mathematical or statistical functions. The list below shows a selection. Function names are written in lower case.

Command                        Mathematical & statistical function
ln(varx)                       Natural logarithm of variable x
sqrt(varx)                     Square root of variable x
sin(varx)                      Sine of variable x
cos(varx)                      Cosine of variable x
exp(varx)                      Exponential (e^x) of variable x
abs(varx)                      Absolute value of variable x
norm(varx)                     Cumulative standard normal of variable x (named normal() in recent versions)
normden(varx)                  Standard normal density of variable x (named normalden() in recent versions)
uniform()                      uniform(0,1) random number
min(varx1,varx2,...,varxK)     Returns the minimum of the variables x1-xK
max(varx1,varx2,...,varxK)     Returns the maximum of the variables x1-xK
cond(exp,varx,varz)            When the expression is true it returns variable x, else variable z
sign(varx)                     Returns 1 if x>0, 0 if x=0 and -1 if x<0

Furthermore, you might want to use some system variables. Here is a selection:

Command   System variable
_n        Index of the current observation
_N        Total number of observations
_all      All variables
_b        Vector of regression coefficients
_se       Vector of standard errors of the regression coefficients

Note: depending on the mathematical/statistical function you want to use, you might have to use an extension of generate called egen.

egen varname=function(varx) is used to create a variable which is the mean, median, etc. of another variable.

Below is a selection of functions for which you have to use egen:

Command        Mathematical & statistical function
mean(varx)     Mean of variable x
median(varx)   Median of variable x
sum(varx)      Sum of variable x (note: gen varz=sum(varx) makes varz the cumulative sum)
rank(varx)     Rank of variable x; by default the lowest value gets rank 1 (with the field option, the highest value gets rank 1)
count(varx)    The number of nonmissing observations of x

Examples of discrete variables

"generate byte rich=(wage>100)": it generates a dummy variable. Note that a missing wage counts as larger than any number, so rich is 1 when wage is missing.
"generate byte rich=(wage>100) if !missing(wage)": it generates missing values for rich if wage is missing.
"generate young=inrange(age,18,25)": young takes value one if age is between 18 and 25.
"generate cs=2 if inlist(csp,1,2,3,4)": the variable cs takes value 2 if csp takes any of the listed values.
"tabulate csp, gen(csp)": this generates as many dummy variables as there are values taken by csp.
"generate size=recode(nb,0,5,10,15)": this generates a new variable size that takes the values 0, 5, 10 and 15 whenever nb is zero, between 1 and 5, between 6 and 10, and 11 or more, respectively.
"separate treff, by(csp)": this generates as many variables as there are values of csp. Each variable is equal to treff for the observations with a certain value of csp and missing otherwise.
"egen szone=group(size zone)": it is used for generating identifiers. It gives a code to each combination of the values of size and zone.
"egen sizegr=cut(size), group(5)": it divides the variable size into 5 categories of roughly equal frequency.
"egen sizegr=cut(size), at(5,10,15)": it divides size into two categories, one from 5 up to (but not including) 10 and another from 10 up to (but not including) 15. Values outside these intervals are set to missing.
"egen delta=diff(v1 v2 v3)": delta is equal to one if the three variables do not all take the same value, and zero if they do.

Examples of continuous variables

"gen suma=sum(wage)": this gives the cumulative sum of the variable wage. This is different from "egen suma=sum(wage)", which generates the total sum.
"generate id=_n": it gives the values 1, 2, 3,... to the observations, following the order in which they appear in the data.
"(bysort year:) egen avg=mean(wage)" / "egen mini=min(wage)" / "egen maxi=max(wage)" / "egen sd=sd(wage)" / "egen moda=mode(wage)": it generates a variable equal to this statistic for all observations (or for the observations in the same group when using bysort).
"egen avgr=rowmean(wage1 wage2 wage3)": it calculates the mean of the specified variables for each observation.
"egen sumr=rowtotal(wage1 wage2 wage3)": it calculates the sum of the specified variables for each observation.

Examples of string variables

"generate str4 zone4=substr(zone,1,4)": it takes the first 4 characters of the variable zone. Note: zone must be a string.
"egen abc=concat(a b c)": it generates one variable that results from concatenating the three variables a, b and c.

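The difference between generate and egen, and a few of the functions listed above, can be seen in the following sketch. It assumes the auto dataset shipped with STATA; all new variable names are invented for illustration, and total() is used as the current name of the egen function sum().

```stata
sysuse auto, clear

* Dummy variable, left missing when price is missing
generate byte expensive = (price > 6000) if !missing(price)

* generate with sum() gives a running sum; egen gives the overall total
generate running = sum(price)
egen grand_total = total(price)

* Group statistic: the same value for all observations in a group
bysort foreign: egen avg_price = mean(price)

* Observation index in the current sort order, and a string function
generate id = _n
generate str4 make4 = substr(make, 1, 4)

* Four categories of roughly equal frequency
egen pricegr = cut(price), group(4)

list make price expensive running grand_total avg_price in 1/5
```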
Variable elimination

Once we have created the variables, we might want to get rid of some of them. We can do this either by dropping the variables that we are no longer interested in (var1 ... varn):

drop var1 var2 ... varn

or by keeping the ones that we still want to use (var1 var2 ... varn):

keep var1 var2 ... varn

Furthermore, we can change the name of a variable (from oldname to newname):

rename oldname newname

Examples of renaming and recoding

"rename sex gender": it renames the variable sex as gender.
"replace zone=4 if zone==2 | zone==3": it writes 4 whenever the variable zone was 2 or 3.
"recode zone (2 3=4) (6 7=5), gen(zone2)": this generates a new variable that takes the value 4 when zone was 2 or 3 and the value 5 when zone was 6 or 7; other values are left unchanged.

Macros

Macros store information as strings. You can collect several variables under a common macro name.

global macroglobalname var1 var2 ... varn
local macrolocalname var1 var2 ... varn

Both kinds of macro work in the same way: you assign a string to a macro name and refer to it later, as $macroname for a global macro and as `macroname' for a local macro. The main difference is that a global macro may be used by any program, whereas a local macro is only for private use of the program in which you define it. When you want to use the variables contained in a macro, you should use the following syntax:

sum $macroglobalname
sum `macrolocalname'

(be careful with the quotes: the name of a local macro is enclosed between a left single quote ` and an apostrophe ')
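A short sketch of both kinds of macro, assuming the auto dataset shipped with STATA; the macro names controls and extras are invented for illustration. Because a local macro exists only within the program or do-file that defines it, the lines should be run together as one do-file rather than typed one at a time.

```stata
sysuse auto, clear

* Store lists of variable names in a global and in a local macro
global controls mpg weight
local extras length turn

* Refer to the global with $ and to the local with ` and '
summarize $controls
summarize `extras'

* Macros are just strings, so they can also be displayed
display "$controls `extras'"
```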

Scalars

Scalars are variables that contain one single element. To handle them, you must use a specific syntax:

- To define the content of a scalar:
scalar a=normal(0.7)
scalar b=199876
scalar c=V[1,3] (V must be an existing matrix)

- To list the contents of scalars:
scalar list
scalar list a b

- To eliminate scalars from memory:
scalar drop c
scalar drop _all

Matrices

There is a specific matrix programming language called MATA which deals with matrices. It is similar to MATLAB. However, there are many user-written packages that are available in MATLAB but not in MATA (yet). In the context of STATA, you will mostly be using the matrices that result as output of an estimation. You may also want to input some matrices by hand. For instance,

matrix A=(1,2\3,4\5,6)
matrix list A

give as output:

A[3,2]
    c1  c2
r1   1   2
r2   3   4
r3   5   6

or you may want to do small operations with matrices:

matrix B=inv(C)
matrix D=A+E-F
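The scalar and matrix commands above could be combined as in this sketch. The matrices B and C are defined here only so that the example runs on its own; they are not the B and C of the text.

```stata
* Scalars
scalar a = normal(0.7)
scalar b = 199876

* A 3x2 matrix typed in by hand; \ separates the rows
matrix A = (1,2\3,4\5,6)
matrix list A

* A scalar taken from one element of a matrix (row 3, column 2)
scalar c = A[3,2]
scalar list a b c

* Small matrix operations: A'A is a 2x2 matrix that can be inverted
matrix B = A'*A
matrix C = inv(B)
matrix list C

* Clean up
scalar drop _all
matrix drop _all
```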

4. CONTROL FLOWS

If, foreach, forvalues, while

if

The if command allows you to set several conditions; depending on which one is true, a certain command will be executed.

if expression1 {
    command
}
else if expression2 {
    command
}
else {
    command
}

The if command evaluates expression1 (e.g. var1>0). If the result is true, it executes the command inside the first set of braces. If not, it evaluates expression2 (e.g. var1<0) and executes the second block of commands if it is true. If that is not true either, it executes the last set of commands. Unlike the if qualifier used to restrict the sample, the if command evaluates its expression only once, not for every observation.

foreach

This is a fast way to execute a certain command (enclosed in braces) repeatedly for a list of variables that has previously been stored in a local macro.

local varlist var1 var2
tsset timevar
foreach v of local varlist {
    gen lagged`v'=l.`v'    /*l. is a lag operator*/
}

forvalues

This is the fastest way to execute a block of code for a range of numeric values.

forvalues i = 1(1)100 {
    generate x`i' = uniform()
}

while

It evaluates an expression/condition and, as long as it is true, executes the commands enclosed in the braces. Further whiles may be nested within a while.

local i = 1
while `i' < 100 {
    gen u`i' = uniform()
    local i = `i' + 1
}
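The four constructs above can be run together in a small self-contained sketch. The dataset is simulated, and the variable names (t, y, z, and the generated x, u and lagged variables) are invented for illustration.

```stata
clear
set obs 10

* A time variable and two random series
generate t = _n
tsset t
generate y = uniform()
generate z = uniform()

* foreach: create the first lag of every variable in the list
local vars y z
foreach v of local vars {
    generate lagged`v' = l.`v'
}

* forvalues: loop over the numbers 1, 2, 3
forvalues i = 1(1)3 {
    generate x`i' = uniform()
}

* while: the same kind of loop with an explicit counter
local i = 1
while `i' <= 3 {
    generate u`i' = uniform()
    local i = `i' + 1
}

* if: evaluated once, here on the number of observations
if _N > 5 {
    display "more than 5 observations"
}
else {
    display "5 observations or fewer"
}

describe
```

Each closing brace is placed on a line of its own, which is what STATA expects.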

5. DATA OVERVIEW

Descriptive statistics

describe: it provides basic information about the file and the variables (number of observations, number and names of the variables, format of the variables, labels of the variables if attached)
browse: it opens the data browser, where you can see the dataset
list varlist: it displays all the data of the listed variables on the screen
summarize varlist: it provides summary statistics of the variables listed (number of observations, mean, standard deviation, min, max). Note: a possible option is , detail which provides more details such as percentiles, kurtosis etc.
tabulate varlist: it shows all the different values of the variables with their frequencies
table varlist: it displays higher dimensional tables (see help for options)
correlate varlist: it gives the correlation coefficients between the variables named

Graphs

An easy way to create graphs is to use the Graphics option in the menu. There you can find the whole variety of graphs offered by STATA.
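Before turning to the graph commands, the descriptive commands listed above can be tried on the auto dataset shipped with STATA. The choice of variables in this sketch is arbitrary.

```stata
sysuse auto, clear

* Structure of the dataset and a look at the first observations
describe
list make price mpg in 1/5

* Summary statistics, with and without the detail option
summarize price mpg
summarize price, detail

* Frequencies of one variable, then a two-way table
tabulate foreign
tabulate rep78 foreign
table foreign rep78

* Correlation coefficients
correlate price mpg weight
```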

There are two different graph commands in STATA:

graph plottype [var1] [var2]
graph twoway plottype [var1] [var2]

The first command draws (in the announced plot type) a one-dimensional graph of the means of the variables listed. The second command shows the two-dimensional relationship between the variables listed: [var1] appears on the y-axis and [var2] on the x-axis of the graph. The default extension for STATA graph files is .gph, which STATA automatically adds to the file name. You can select the font, colors, line thickness and other options by selecting Graph Preferences in the Prefs menu. The following table gives an overview of the types of graphs you can draw in STATA:

Graphics                                Description
twoway plottype vary varx               2-dimensional family of plots, all of which fit on numeric y and x scales
  (plottype: scatter, line, bar etc.)
graph bar (mean) vary, over(varx)       vertical bar charts (the y axis is numerical and the x axis is categorical)
graph dot (mean) vary, over(varx)       horizontal dot charts (the categorical axis is presented vertically, the numerical axis is presented horizontally)
graph pie vary, over(varx)              pie charts
graph matrix                            matrix plot of two-way scattergrams
graph save graphname                    saves a graph under graphname with the default extension .gph
graph use graphname.gph                 redisplays the graph graphname
graph combine graphname1 graphname2     combines several graphs into one

Examples of graphs

graph hbar (asis) pop, over(voter) title("Population by city") ytitle("Number of inhabitants") note("The source of the data is unknown")

[Figure: horizontal bar chart titled "Population by city", with categories 1 to 15 on the vertical axis, "Number of inhabitants" from 0 to 200,000 on the horizontal axis, and the note "The source of the data is unknown"]

twoway (scatter price mpg, mlabel(mpg) xline(20) xscale(range(10(10)45))) (lfit price mpg), legend(off) title("Price over mileage") ytitle("Price") xtitle("Mileage") graphregion(fcolor(green) lcolor(green))

[Figure: scatter plot titled "Price over mileage", with a fitted line; price (0 to 15,000) is on the vertical axis, mpg (10 to 40) is on the horizontal axis, and each point is labelled with its mpg value]
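A reduced version of the graph commands above, including saving and combining graphs, might look like the following sketch. It assumes the auto dataset shipped with STATA; the graph file names price_mpg and price_foreign are hypothetical.

```stata
sysuse auto, clear

* Scatter plot with a fitted line: price on the y-axis, mpg on the x-axis
twoway (scatter price mpg) (lfit price mpg), ///
    title("Price over mileage") ytitle("Price") xtitle("Mileage (mpg)") ///
    legend(off)
graph save price_mpg, replace

* Bar chart of the mean price for each value of foreign
graph bar (mean) price, over(foreign)
graph save price_foreign, replace

* Put the two saved graphs side by side
graph combine price_mpg.gph price_foreign.gph
```

The three slashes continue a command on the next line of a do-file, which keeps long graph commands readable.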

6. STATA FOR ECONOMETRICS

Estimation commands

The command for a simple OLS regression where y is regressed on several control variables x1, ..., xK is the following:

regress y x1 x2 ... xK

The output in the Results window looks as follows:

      Source |       SS       df       MS              Number of obs =     240
-------------+------------------------------           F(  3,   236) =    1.08
       Model |  .728611679     3    .24287056           Prob > F      =  0.3571
    Residual |  52.9338883   236   .224296137           R-squared     =  0.0136
-------------+------------------------------           Adj R-squared =  0.0010
       Total |     53.6625   239   .224529289           Root MSE      =   .4736

------------------------------------------------------------------------------
        male |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0153606   .0162436    -0.95   0.345    -.0473615    .0166404
       grade |   -.030674   .0201378    -1.52   0.129    -.0703469    .0089989
       toefl |  -.0003715   .0013527    -0.27   0.784    -.0030364    .0022934
       _cons |   1.372898   .5609794     2.45   0.015      .267731    2.478065
------------------------------------------------------------------------------

Anova block: this block shows the sum of squares (SS) for both the explained part of the model and the residuals. df shows the degrees of freedom and MS the mean square, that is, the sum of squares divided by the degrees of freedom.

Model fit block: this block shows the number of observations, the F-statistic and its p-value, the R-squared and adjusted R-squared values and the root mean squared error.

Coefficient block: this block provides the estimation results for all control variables, including the coefficients, the standard errors, the value of the t-statistic and the p-value, as well as the confidence interval.

For further advanced estimation procedures the following commands might be helpful:

Regression procedure   Description
rreg                   Robust regression
ivreg                  Instrumental variables regression (ivregress from STATA 10 onward)
heckman                Heckman's selection model
probit/logit           Probit/Logit analysis
mlogit                 Multinomial logit
tobit                  Censored-normal and Tobit regression
arima*                 Autoregressive integrated MA models
arch*                  AR conditional heteroscedasticity estimators
xt*...                 Panel analysis
st*                    Survival time data

* data must be time-series/panel/survival data; use the command tsset panelvar timevar or stset.

You may want to type quietly in front of the regression command to avoid the whole table being displayed, especially if you just want to make predictions or are only interested in the coefficients.

Post-estimation commands

After running a regression you might want to use the results for further analysis, e.g. a significance test or a graphical analysis. In the following table you can find some useful commands to save regression results, to calculate marginal effects or to do certain types of tests. For more detailed information see the help command.

Post-estimation command    Description
predict name, options      saves predictions (xb), residuals (residuals), influence statistics, etc.
predictnl name, options    point estimates, standard errors, testing, and inference for generalized predictions
mfx                        marginal effects or elasticities in a nonlinear equation (e.g. probit, logit etc.); superseded by margins from STATA 11 onward
estimates store name       saves estimation results
estat                      AIC, BIC, VCE, and estimation sample summary
matrix name=e()            saves the estimation results in a matrix, such as the coefficients (e(b)) or the variance-covariance matrix (e(V))
test                       Wald tests for simple and composite linear hypotheses
testnl                     Wald tests of nonlinear hypotheses
lrtest                     Likelihood-ratio test (to conduct the test, both the unrestricted and the restricted models must be fitted using the ML method)
hausman                    Hausman's specification test: tests whether an efficient and a consistent estimator are significantly different
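A regression followed by some of the post-estimation commands in the table can be sketched as follows. The example assumes the auto dataset shipped with STATA; the model itself and the names price_hat, resid, full and restricted are arbitrary choices made for illustration.

```stata
sysuse auto, clear

* OLS regression
regress price mpg weight foreign

* Fitted values and residuals
predict price_hat, xb
predict resid, residuals

* Wald test that the coefficients on mpg and weight are jointly zero
test mpg weight

* Information criteria (AIC and BIC)
estat ic

* Coefficients and their variance-covariance matrix as matrices
matrix b = e(b)
matrix V = e(V)
matrix list b

* A single coefficient and its standard error
display _b[mpg] "  " _se[mpg]

* Store the results, fit a restricted model quietly, and compare the two
estimates store full
quietly regress price foreign
estimates store restricted
lrtest full restricted
```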
