
Principles of Biostatistics and Data Analysis PHP2510 Lab1 September 14th 2007

Goals for Lab 1:


- Familiarization with the Stata GUI
- Importing data into Stata and the commands associated with different file types
- Data management
- Becoming familiar with your data
- Learning and utilizing commands for summary and descriptive statistics
- Visual depiction of data: stem-and-leaf plots, histograms, box-plots, etc.
- Saving your results

Useful Resources:
- http://www.stata.com/links/resources1.html
- http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands used in this unit:

cd           Change directory
dir or ls    Show files in current directory
insheet      Read ASCII (text) data created by a spreadsheet
infile       Read unformatted ASCII (text) data
infix        Read ASCII (text) data in fixed format
input        Enter data from keyboard
describe     Describe contents of data in memory or on disk
compress     Compress data in memory
save         Store the dataset currently in memory on disk in Stata data format
use          Load a Stata-format dataset
count        Show the number of observations
list         List values of variables
clear        Clear the entire dataset and everything else
codebook     Detailed contents of a dataset
log          Create a log file
summarize    Descriptive statistics
tabstat      Table of descriptive statistics
table        Create a table of statistics
stem         Stem-and-leaf plot
graph        High resolution graphs
sort         Sort observations in a dataset
histogram    Histogram for continuous and categorical variables

Brief Introduction to Stata:


Upon opening Stata, you'll notice that four windows appear: Command line, Results, Review, and Variables.

- Command line window: where you issue and execute commands in Stata. Note that Enter is the execute key.
- Results window: where the results of your commands are displayed. This window also shows each command immediately before its output.
- Review window: lists all the commands you've issued during your Stata session.
- Variables window: displays the names of the variables in your current dataset.

Below the menu bar (where the names of the menus appear) is the toolbar, which contains icons that provide alternate ways to do some of the things you would normally do with the menus. From left to right, these icons are used to open a dataset, save a dataset, print the contents of the Results window, start or stop a log of your session, start the Viewer, bring the Results window to the front, bring the Graph window to the front, open the Do-file Editor, open the Data Editor, open the Data Browser, clear, and break.

Importing data and creating log files:


We will begin by reading a spreadsheet-type data file into Stata. Files of this type are created by programs such as Excel or JMP. For example, in Excel we can save a file in comma-separated-values (.csv) format. Stata reads this type of data using the insheet command. Before we can use this command, there are a few things we need to set up:

- Make a temp folder on your C drive.
- Go to http://www.ats.ucla.edu/stat/stata/notes3/default.htm and click on statadata.zip. Save this set of data to your temp directory and unzip it there so that hs0.csv sits in c:/temp.
- Open Stata if it's not already open and use the following commands to switch to your temp directory, show the files in it, and import the data:

cd c:/temp
ls
insheet using hs0.csv

Now that we've imported our data with the insheet command, we can begin using Stata commands to get a better feel for the type and nature of the data we're dealing with. Before we proceed, however, we want to create a log file. A log file saves all of the commands and their output in a text file. To start a log file, go to File on the menu bar, scroll down to Log, and click Begin. Choose the location where you want to put the results.
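The same log can also be started from the Command line instead of the menu. The lines below are a minimal sketch, not part of the original handout; the file name hs0.smcl is an assumption chosen to match the translate step at the end of this lab.

```stata
* Sketch: open a log in c:/temp, work, then close it.
* The name hs0.smcl is an assumption; the replace option overwrites
* an existing log file of the same name.
cd c:/temp
log using hs0.smcl, replace

* Commands issued from here on are recorded in the log.
insheet using hs0.csv, clear

log close
```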

Data checking and summarizing:

The first command we're going to use is the list command. This command displays the values of each of the variables in the dataset you're working with. If we need to make a change in the data or simply want to view the data in a spreadsheet format, we can use the Data Editor. To do so, navigate to the Stata menu bar, click on Data, and then scroll down to Data Editor.
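With 200 observations the full listing is long, so it can help to list only part of the data. The following is a small sketch, assuming hs0 is loaded with the variable names shown later in this lab.

```stata
* List all variables for the first five observations.
list in 1/5

* List three variables for the first ten observations.
list id read write in 1/10

* Open the Data Browser to look at the data without editing it.
browse
```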

Using the describe command in Stata gives us information on the contents of the data as stored in memory.

describe

  obs:           200
 vars:            11                          20 Jun 2002 12:42
 size:        10,400 (99.9% of memory free)
------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
gender          float  %9.0g       fl
id              float  %9.0g
race            float  %12.0g      rl
ses             float  %9.0g       sl
schtyp          float  %9.0g       scl
prgtype         str8   %9s
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
science         float  %9.0g                  science score
socst           float  %9.0g                  social studies score

Using the summarize command allows the user to view the descriptive statistics for each of the variables in the dataset.

summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      gender |       200        .545    .4992205          0          1
          id |       200       100.5    57.87918          1        200
        race |       200        3.44    1.049719          1          5
         ses |       200       2.055    .7242914          1          3
      schtyp |       200        1.16     .367526          1          2
-------------+--------------------------------------------------------
     prgtype |         0
        read |       200       52.23    10.25294         28         76
       write |       200      52.775    9.478586         31         67
        math |       200      52.645    9.368448         33         75
     science |       195    51.66154    9.866026         26         74
-------------+--------------------------------------------------------
       socst |       200      52.405    10.73579         26         71

Say for instance we were only concerned with viewing the summary statistics for the variable read. The command we would use to do that is pasted below:

summarize read

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200       52.23    10.25294         28         76

Along the same lines, say we are only interested in viewing the summary statistics for variables: math, science, read, and write.

summarize math science read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        math |       200      52.645    9.368448         33         75
     science |       195    51.66154    9.866026         26         74
        read |       200       52.23    10.25294         28         76
       write |       200      52.775    9.478586         31         67

Often, to get an effective feel for the data, we need more summary statistics than the summarize command gives us by default. In these cases we can use the detail option. For instance:

summarize read, detail

                        reading score
-------------------------------------------------------------
      Percentiles      Smallest
 1%         32.5             28
 5%           36             31
10%           39             34       Obs                 200
25%           44             34       Sum of Wgt.         200

50%           50                      Mean              52.23
                        Largest       Std. Dev.      10.25294
75%           60             73
90%           67             73       Variance       105.1227
95%           68             76       Skewness       .1948373
99%         74.5             76       Kurtosis       2.363052

Using the detail option gives us a better feel for the distribution, spread, and concentration of the data. One other very useful command is the sort command. sort arranges the observations of the current data into ascending order based on the values of the variables in varlist. The related bysort prefix sorts the data by a variable and then repeats a command separately for each of its values. For instance, if we were interested in obtaining summary statistics for the variable math for each value of prgtype, we could use the following code.

bysort prgtype: summarize math

----------------------------------------------------------------------
-> prgtype = academic

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        math |       105    56.73333    8.730216         38         75

----------------------------------------------------------------------
-> prgtype = general

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        math |        45    50.02222    7.442168         35         63

----------------------------------------------------------------------
-> prgtype = vocati

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        math |        50       46.42     7.95418         33         75
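The tabstat command appears in the command list for this unit but is not demonstrated. As a hedged sketch, the by-group summary above could also be requested as one compact table; the choice of statistics here is only an illustration.

```stata
* Sketch: count, mean, standard deviation, minimum, and maximum of math
* for each program type, shown in a single table.
tabstat math, by(prgtype) statistics(n mean sd min max)
```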

What if we wanted to view the summary statistics for the variable math only when the variable prgtype (program type) is equal to academic? We can do so in the following way (note: quotation marks must go around academic because the variable prgtype is a string):

summarize math if prgtype == "academic"

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        math |       105    56.73333    8.730216         38         75

Considering the same example, if we wish to have more summary statistics for math we would use the following code:

summarize math if prgtype == "academic", detail

                         math score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           41             38
 5%           43             41
10%           45             41       Obs                 105
25%           50             42       Sum of Wgt.         105

50%           57                      Mean           56.73333
                        Largest       Std. Dev.      8.730216
75%           63             72
90%           69             72       Variance       76.21667
95%           71             73       Skewness       .0610312
99%           73             75       Kurtosis       2.171239

Consider the following: if we wanted to view summary statistics for the variable science by program type, we could use the following code:

tabulate prgtype, summarize(science)

            |      Summary of science score
    prgtype |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   academic |   53.617647   9.0126909         102
    general |   52.186047   9.8301185          43
     vocati |       47.22   10.333796          50
------------+------------------------------------
      Total |   51.661538   9.8660256         195

Another very useful command for becoming familiar with the data you're working with is the inspect command. The inspect command provides a quick summary of a numeric variable that differs from that provided by summarize or tabulate. It reports the number of negative, zero, and positive values; the number of integers and non-integers; the number of unique values; and the number of missing values; and it produces a small histogram. Its purpose is not analytical; it is meant to let you quickly gain familiarity with unknown data.
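A short sketch of how inspect might be used on this dataset follows. The choice of science (which has missing values in the summarize output above) and the companion codebook call are illustrative assumptions, and no output is reproduced here.

```stata
* Sketch: quick look at a numeric variable, including its missing values.
inspect science

* codebook gives a comparable overview and also works for string variables.
codebook prgtype
```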

Graphical Representation of Data:


When analyzing a set of data, it is often helpful to look at a graphical representation of that data. Stata is equipped with several commands that allow the user to carry out that task. One of the most basic graphical representations of data is the stem and leaf plot. Say we are interested in looking at the stem and leaf plot for the variable read.

stem read

Stem-and-leaf plot for read (reading score)

  2. | 8
  3* | 1
  3t |
  3f | 4444445
  3s | 66677
  3. | 99999999
  4* | 11
  4t | 222222222222233
  4f | 444444444444455
  4s | 6777777777777777777777777777
  4. | 8
  5* | 000000000000000000
  5t | 222222222222223
  5f | 45555555555555
  5s | 77777777777777
  5. |
  6* | 0000000001
  6t | 3333333333333333
  6f | 555555555
  6s | 6
  6. | 88888888888
  7* | 11
  7t | 33333
  7f |
  7s | 66

Another graphical display of data we might be interested in using is the box-plot. The command below demonstrates the code we would use to produce a box-plot for the variable math.

graph box math

[Figure: box-plot of math score; vertical axis from 30 to 80]

Consider the following example: what if we want to view box-plots for the variable write by program type? Executing the command below will allow us to do just that.

graph box write, over(prgtype)

[Figure: box-plots of writing score by program type (academic, general, vocati); vertical axis from 30 to 70]

What if we are interested in viewing box-plots for both the read and write variables by program type?

graph box write read, over(prgtype)

[Figure: box-plots of writing score and reading score by program type (academic, general, vocati); vertical axis from 30 to 80]

Another extremely useful depiction of data is achieved through the use of a histogram. The code for creating a density histogram for the variable science is as follows:


histogram science

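[Figure: density histogram of science score; horizontal axis from 30 to 70, vertical axis Density from 0 to .05]

Perhaps we want to adjust the width of the intervals and/or the axes of a frequency histogram. Using the command below will allow us to do that.

histogram science, discrete width(10) start(25) frequency

[Figure: frequency histogram of science score; horizontal axis from 20 to 80, vertical axis Frequency]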

Saving your work:


In the Command window, type log close. The log file is now finished in the Stata output format (.smcl). To convert it to plain text, type translate hs0.smcl hs0.log. Once you have a file ending in .log, you can open it in a text editor like Notepad or TextEdit. Of course, if you didn't open a log file, you can still save the results via copy and paste: open a new Word file (or Notepad, if you like), highlight the results you want in the Results window, go to Edit on the menu bar, select Copy, and then paste the results into the new file.

Saving the data:

If you didn't change the dataset during this session, you don't need to save the data. If you did make changes to the data and want to save them, type save hs0, replace or save hs02. The former gives you an updated hs0.dta dataset. The latter creates a new dataset called hs02.dta.

Exiting Stata:

Now you are ready to exit, and you have two choices. You can type exit in the Stata Command line or simply click on File (on the menu bar at the top of the screen) and then scroll down to Exit.

An example to consider:

Type the following command in the Stata Command window:

sysuse cancer, clear

The sysuse command above finds the sample datasets on your computer by name alone. In this case, the name of the dataset is cancer. The cancer dataset was installed with Stata. This dataset contains 48 observations and 4 variables related to cancer treatment. After loading the cancer dataset:

- Generate detailed summary statistics for each of the variables.
- Generate summary statistics for the variable studytime when age is greater than or equal to 57.
- Create summary statistics for the variable studytime when age is greater than or equal to 60 and drug is equal to 2.
- Produce a box-plot for the variable age. What does this box-plot tell you about the distribution of the data?
- Take a look at the histogram for the variable studytime. Describe the distribution.
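One possible set of commands for these tasks is sketched below. It reuses only commands introduced earlier in this lab, and the interpretation of the output is left to you.

```stata
* Sketch: one way to approach the cancer dataset tasks.
sysuse cancer, clear

* Detailed summary statistics for every variable.
summarize, detail

* studytime for subjects aged 57 or older.
summarize studytime if age >= 57

* studytime for subjects aged 60 or older who received drug 2.
summarize studytime if age >= 60 & drug == 2

* Graphs for the last two questions.
graph box age
histogram studytime
```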

References:
- Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
- Resources to Help You Learn Stata. UCLA Academic Technology Services. 8 September 2007. http://www.ats.ucla.edu/stat/stata/

Principles of Biostatistics and Data Analysis PHP2510 Lab2 September 21st 2007

Goals for Lab 2:


- Creating and importing datasets of various file types into Stata
- Data management and modifying techniques
- Creating subsets of the dataset
- Replacing and generating variables
- Calculation of expressions using Stata

Useful Resources:
- http://www.stata.com/links/resources1.html
- http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands used in this unit:

cd           Change directory
dir or ls    Show files in current directory
insheet      Read ASCII (text) data created by a spreadsheet
infile       Read unformatted ASCII (text) data
infix        Read ASCII (text) data in fixed format
input        Enter data from keyboard
browse       Opens the spreadsheet-like Data Browser for viewing the data
edit         Opens the spreadsheet-like Data Editor where data can be entered or edited
save         Store the dataset currently in memory on disk in Stata data format
use          Load a Stata-format dataset
count        Show the number of observations
list         List values of variables
clear        Clear the entire dataset and everything else
codebook     Detailed contents of a dataset
generate     Creates a new variable using any mixture of old variables, constants, random values or expressions
replace      Replaces the values of an existing variable using any mixture of old variables, constants, random values or expressions
stem         Stem-and-leaf plot
graph        High resolution graphs
sort         Sort observations in a dataset
histogram    Histogram for continuous and categorical variables

Here are the operators that will be useful in this unit:

Logical Operators
&     and
|     or
!     not

Relational Operators
==    is equal to
!=    is not equal to
>     is greater than
<     is less than
>=    is greater than or equal to
<=    is less than or equal to

Creating and importing datasets of various file types into Stata:


Before we can perform an analysis on a particular set of data, we need to organize the raw data into a format that is compatible with Stata. As you recall from last week's lab, the sample dataset we used had already been created for you. What if we weren't provided a dataset, but were expected to create one based on information given to us by our colleagues, research team, or superior? The ability to create and manage a dataset is a significant skill, and those who are familiar with creating datasets are often a valuable part of a research team.

Datasets of various forms can be imported into Stata, including comma-delimited files, space-delimited files, and data saved in a fixed format. As you might remember, the command for importing a comma-delimited dataset is:

insheet using name_of_file.csv

If the dataset we wished to import was saved as a space-delimited file, we would name the variables to be read and use the infile command:

infile varlist using name_of_file.prn

Files saved in a fixed format are read in a similar way with the infix command.

Now that we're familiar with the commands used to import files of various formats, we're going to change direction and focus on how to create datasets in Stata given a description of the data. Before we delve into creating datasets, though, start a log file for this session by navigating to File > Log > Begin and save it in your temp directory.

The simplest way to create a dataset is through Stata's spreadsheet-like Data Editor, which is invoked by executing the edit command, selecting Data > Data Editor from the upper menu bar, or clicking the Data Editor button. Let's consider the following: we would like to create a dataset in Stata corresponding to the following table so we can compute some basic statistics and learn more about our data.
Race-specific infant mortality in the entire United States population in 1987

Race       Live Births     Infant Deaths
Black       641,576.00             11.46
White     2,992,488.00         25,810.00
Other       175,339.00          1,137.00
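Besides the Data Editor, the input command from the command list can enter the same table from the keyboard. The lines below are a minimal sketch: the variable names race, births, and deaths are assumptions, and the values are copied exactly as printed in the table above.

```stata
* Sketch: type the table in with input (variable names are assumptions).
clear
input str5 race births deaths
"Black"  641576   11.46
"White" 2992488   25810
"Other"  175339    1137
end

* Check that the data were entered as intended.
list
describe
```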



Now that you're familiar with how to create a dataset in Stata, use this knowledge to create the following dataset. The total numbers of deaths in the United States in various years are:

Year     Deaths
1990     2148463
1980     1989841
1970     1921031
1960     1711982
1950     1452454
1940     1417269
1930     1239453

Note: before you create this new dataset, save your current dataset to your C:/temp directory (we'll be coming back to it later). After creating a dataset for the data above, compute the following:

- Mean and standard deviation for the number of deaths from 1940-1990.
- Mean and standard deviation for the number of deaths from 1970-1990.
- Mean and standard deviation for the number of deaths from 1960-1980.
- Create a bar chart for the number of deaths from 1940-1990. What relationship, if any, do you notice?
- Create a box-plot of the number of deaths from 1940-1990. Are there any outliers or interesting features of this data?
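One way to set this exercise up is sketched below, assuming the hypothetical variable names year and deaths. Because the dataset also contains 1930, each command needs an if qualifier to restrict it to the requested years.

```stata
* Sketch: enter the deaths data (variable names are assumptions).
clear
input int year long deaths
1990 2148463
1980 1989841
1970 1921031
1960 1711982
1950 1452454
1940 1417269
1930 1239453
end

* Mean and standard deviation for each requested range of years.
summarize deaths if year >= 1940
summarize deaths if year >= 1970
summarize deaths if inrange(year, 1960, 1980)

* Bar chart and box-plot for 1940-1990. With one observation per year,
* the mean that graph bar plots for each year is just that year's value.
graph bar deaths if year >= 1940, over(year)
graph box deaths if year >= 1940
```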

Creating subsets of the data:


We will begin this segment of the lab by introducing a new dataset to work with. First, use the clear command to clear the dataset that's currently loaded in Stata. Following that, use the command below to read in our new dataset:

infile str30 place pop unemp mlife flife using http://www.ats.ucla.edu/stat/stata/examples/sws5/canada.raw

The dataset you've just imported corresponds to the table below:
Table 1: From the Federal, Provincial and Territorial Advisory Committee on Population Health, 1996

                         1995 population  Unemployment  Male life   Female life
Place                    (1000's)         rate (%)      expectancy  expectancy
Canada                   29606.1          10.6          75.1        81.1
Newfoundland               575.4          19.6          73.9        79.8
Prince Edward Island       136.1          19.1          74.8        81.3
Nova Scotia                937.8          13.9          74.2        80.4
New Brunswick              760.1          13.8          74.8        80.6
Quebec                    7334.2          13.2          74.5        81.2
Ontario                  11100.3           9.3          75.5        81.1
Manitoba                  1137.5           8.5          75          80.8
Saskatchewan              1015.6           7            75.2        81.8
Alberta                   2747             8.4          75.5        81.4
British Columbia          3766             9.8          75.8        81.4
Yukon                       30.1           .            71.3        80.4
Northwest Territories       65.8           .            70.2        78

Take note that the variable place corresponds to the provinces and territories of Canada, including Canada itself; pop corresponds to the 1995 population (1000's); unemp corresponds to the unemployment rate (%); mlife corresponds to male life expectancy; and flife corresponds to female life expectancy. On that note, when working with any set of data, it is critical that you spend some time familiarizing yourself with variable names and their corresponding labels.

Now that we've learned some of Stata's basic commands and operators, let's utilize that knowledge to create subsets of our dataset using qualifiers (in or if). Many Stata commands can be restricted to a subset of the data by adding an in or if qualifier. Recall that the list command displays the values of all of the variables in the dataset you apply it to. What if we were only concerned with the values of the variables for Nova Scotia, New Brunswick, and Quebec?

list in 4/6

     +------------------------------------------------+
     |         place      pop   unemp   mlife   flife |
     |------------------------------------------------|
  4. |   Nova Scotia    937.8    13.9    74.2    80.4 |
  5. | New Brunswick    760.1    13.8    74.8    80.6 |
  6. |        Quebec   7334.2    13.2    74.5    81.2 |
     +------------------------------------------------+

The 4/6 tells Stata to list only the 4th through 6th observations. So if we wanted the list for Newfoundland to Yukon, we would use the command list in 2/12, because Newfoundland is the 2nd observation and Yukon is the 12th observation.

The in qualifier is useful; however, its capabilities are limited compared to the if qualifier. The if qualifier selects observations based on specific variable values. For many purposes, we might want to exclude Canada from analyses involving the 12 provinces and territories. One way to do this is to restrict the analysis to only those places whose population is less than 15 million people. So, if we were interested in summary statistics across places, excluding Canada, we could use the following code:

summarize if pop < 15000

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       place |         0
         pop |        12    2467.158    3435.521       30.1    11100.3
       unemp |        10       12.26     4.44877          7       19.6
       mlife |        12      74.225    1.728965       70.2       75.8
       flife |        12    80.68333      1.0116         78       81.8

An alternate way of doing this is to use the command:

summarize if place != "Canada"

Note the difference between unemp for Canada and the mean unemp of all the provinces/territories (10.6 compared with 12.26). Intuitively, we would think that they should be the same number; however, the mean unemp for provinces/territories is not a weighted mean and therefore does not take into consideration the population of each province/territory whose unemp enters the calculation.

Consider the following: say we wanted summary statistics for the unemployment rate of provinces/territories whose population is between 500,000 and 10 million. We can use the command below:

summarize unemp if pop > 500 & pop < 10000

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |         8      11.775    4.152366          7       19.6

What command would we use if we wanted a listing of observations where population is between 100,000 and 2 million or the unemployment rate is less than or equal to 13.8%?

list if unemp <= 13.8 | ( pop > 100 & pop < 2000 )

     +--------------------------------------------------------+
     |                place       pop   unemp   mlife   flife |
     |--------------------------------------------------------|
  1. |               Canada   29606.1    10.6    75.1    81.1 |
  2. |         Newfoundland     575.4    19.6    73.9    79.8 |
  3. | Prince Edward Island     136.1    19.1    74.8    81.3 |
  4. |          Nova Scotia     937.8    13.9    74.2    80.4 |
  5. |        New Brunswick     760.1    13.8    74.8    80.6 |
     |--------------------------------------------------------|
  6. |               Quebec    7334.2    13.2    74.5    81.2 |
  7. |              Ontario   11100.3     9.3    75.5    81.1 |
  8. |             Manitoba    1137.5     8.5      75    80.8 |
  9. |         Saskatchewan    1015.6       7    75.2    81.8 |
 10. |              Alberta      2747     8.4    75.5    81.4 |
     |--------------------------------------------------------|
 11. |     British Columbia      3766     9.8    75.8    81.4 |
     +--------------------------------------------------------+

The example above demonstrates how parentheses allow us to specify the precedence among multiple operators. Try the following examples:

- Summary statistics for population when the unemployment rate is between 8% and 14% inclusive and the male life expectancy is greater than 73 years.
- Listing of observations excluding Canada when the male life expectancy is less than 75 years and the female life expectancy is greater than 75 years.
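The point about weighted means can be checked directly, since summarize accepts analytic weights. The sketch below assumes the canada.raw data are loaded with unemp still in percent and pop still in thousands; the last two commands show one possible approach to the practice examples.

```stata
* Unweighted mean of the provincial/territorial unemployment rates.
summarize unemp if place != "Canada"

* Population-weighted mean of the same rates.
summarize unemp [aweight = pop] if place != "Canada"

* One possible approach to the two practice examples.
summarize pop if unemp >= 8 & unemp <= 14 & mlife > 73
list if place != "Canada" & mlife < 75 & flife > 75
```

The weighted mean would be expected to fall closer to the national figure than the unweighted one, though not necessarily on it, since the two territories with missing unemp drop out of the calculation.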

Generating and Replacing Variables:


The replace and generate commands allow users to change the values of existing variables or create new variables. The replace command is useful if we want to modify a certain variable without having to create a whole new variable. For example, if you refer back to the table for this dataset, you'll notice that unemployment rates are displayed as percentages. If we wanted to convert those percentages into proportions without having to create a new variable, we could use the command below:

replace unemp = unemp/100

In the main part of a generate or replace statement (unlike in if qualifiers) we use a single equals sign. Try the following example: in our dataset, population is given in thousands. Use the replace statement to change the population back to an actual population.

If we wish to add new variables to our dataset, we can do so using the generate statement. Unlike the replace statement, which simply modifies existing variables, the generate statement adds entirely new variables to our current dataset. Say we're interested in adding a variable to our dataset that represents the difference between female life expectancy and male life expectancy:

generate difflife = flife - mlife

Check that this command was executed properly by using the list command.

Try the following example: create a new variable (num_unemp), which is the total number of unemployed persons for each of the provinces/territories, including Canada. The two replace commands below convert unemp to a proportion and pop to an actual population; if you have already run them earlier in this section, skip them so that the conversions are not applied twice.

replace unemp = unemp/100
replace pop = pop * 1000
generate num_unemp = pop * unemp

Check that these commands were executed properly by using the list command.
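If you would rather leave pop and unemp untouched, the conversions can be folded into the generate expression itself. This is a sketch only: it assumes a fresh copy of the data in which pop is still in thousands and unemp is still a percentage, and the name num_unemp2 is made up for illustration.

```stata
* Sketch: number of unemployed persons without modifying pop or unemp.
* Assumes neither replace command above has been run on this copy.
generate num_unemp2 = (pop * 1000) * (unemp / 100)

* Inspect the result next to the variables it was built from.
list place pop unemp num_unemp2
```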

Calculation of Expressions using Stata:


Arithmetic Operators
+     add
-     subtract
*     multiply
/     divide
^     raise to power

The display command in Stata works as an on-screen calculator. To calculate the square root of 527, we can use the command below:

display sqrt(527)
22.956481

display exp(5)*exp(7)
162754.79

display 145^(2.4)
153915.8

display (exp(_pi*sqrt(-1)))+1
????

Does anyone know what answer this should produce?

Saving your work:


In the Command window, type log close. The log file is now finished in the Stata output format (.smcl). To convert it to plain text, type translate file_name.smcl file_name.log. Once you have a file ending in .log, you can open it in a text editor like Notepad or TextEdit. Of course, if you didn't open a log file, you can still save the results via copy and paste: open a new Word file (or Notepad, if you like), highlight the results you want in the Results window, go to Edit on the menu bar, select Copy, and then paste the results into the new file.

Saving the data:

If you didn't change the dataset during this session, you don't need to save the data. If you did make changes to the data and want to save them, type save file_name, replace or save file_name2. The former gives you an updated file_name.dta dataset. The latter creates a new dataset called file_name2.dta.

Exiting Stata:

Now you are ready to exit, and you have two choices. You can type exit in the Stata Command line or simply click on File (on the menu bar at the top of the screen) and then scroll down to Exit.

References:
- Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
- Resources to Help You Learn Stata. UCLA Academic Technology Services. 8 September 2007. http://www.ats.ucla.edu/stat/stata/
- Hamilton, Lawrence. Statistics with Stata. Canada: Duxbury, 2006.

Principles of Biostatistics and Data Analysis PHP2510 Lab3 September 28th 2007

Goals for Lab 3:


- Calculation of probability expressions using Stata
- Calculating conditional probabilities with row and column options
- Calculating conditional probabilities from raw unorganized data

Useful Resources:
- http://www.stata.com/links/resources1.html
- http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands used in this unit:

cd                 Change directory
dir or ls          Show files in current directory
insheet            Read ASCII (text) data created by a spreadsheet
infile             Read unformatted ASCII (text) data
infix              Read ASCII (text) data in fixed format
input              Enter data from keyboard
browse             Opens the spreadsheet-like Data Browser for viewing the data
edit               Opens the spreadsheet-like Data Editor where data can be entered or edited
save               Store the dataset currently in memory on disk in Stata data format
display            Performs a single calculation and displays the results on screen
list               List values of variables
Binomial(n,k,p)    Probability of k or more successes in n trials when the probability of a success on a single trial is p
codebook           Detailed contents of a dataset
generate           Creates a new variable using any mixture of old variables, constants, random values or expressions
replace            Replaces the values of an existing variable using any mixture of old variables, constants, random values or expressions
stem               Stem-and-leaf plot
graph              High resolution graphs
sort               Sort observations in a dataset
histogram          Histogram for continuous and categorical variables

Summary of Today's Lab:


Lab 3 focuses on using Stata to compute various probabilities. As we've seen in previous labs, Stata comes equipped with numerous commands that allow users to compute and generate various statistics and graphics easily and efficiently. When using Stata to compute probabilities, however, there is no magical command that produces results for every probability we might be looking for. Most of the work with probabilities in Stata therefore requires the user to carry out the calculations from start to finish, so it is critical that the user knows both what he or she is attempting to compute and how to get there. The most common Stata command we'll be using today is the display command, which performs a single calculation and, upon execution, displays the result in the Results window.
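As an illustration of display used this way, a binomial tail probability can be evaluated in one line. The sketch below assumes a newer Stata release, in which the function listed above as Binomial(n,k,p) is named binomialtail(n,k,p); the values 10, 3, and 0.2 are made up for illustration.

```stata
* Probability of 3 or more successes in 10 trials when p = 0.2.
* binomialtail() is the newer name; older releases use Binomial().
display binomialtail(10, 3, 0.2)

* Probability of exactly 3 successes, as a difference of two tails.
display binomialtail(10, 3, 0.2) - binomialtail(10, 4, 0.2)
```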

Using Stata to compute Conditional Probabilities:


In the previous example we practiced computing some conditional probabilities. Using that knowledge as well as your knowledge about conditional probability from lecture, try the following example.
                  Lung Cancer
Smoking       Yes       No        Totals
Yes           0.12      0.04      0.16
No            0.03      0.81      0.84
Totals        0.15      0.85      1

Suppose you were given a set of data that contained information on 100 subjects regarding whether they smoked and whether they had lung cancer. How would you recreate this table in Stata? Considering the table above, answer the following questions:

- Given that someone has lung cancer, compute the probability that they are a smoker.
- Given that someone is a smoker, compute the probability that they have lung cancer.
- What is the probability of having lung cancer provided the person doesn't smoke?
- Compute the relative risk (RR) of lung cancer between smokers and non-smokers and interpret the results.
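One possible way to approach these questions is sketched below. It assumes that, with 100 subjects, each proportion in the table corresponds to a whole-number cell count (12, 4, 3, and 81), and it uses the immediate command tabi, which is not in the command list above, to print row and column percentages.

```stata
* Sketch: rebuild the table from counts. Rows are smoking (yes, no)
* and columns are lung cancer (yes, no); counts assume 100 subjects.
tabi 12 4 \ 3 81, row column

* Conditional probabilities computed from the table of proportions.
* P(smoker | lung cancer)
display .12/.15
* P(lung cancer | smoker)
display .12/.16
* P(lung cancer | non-smoker)
display .03/.84
* Relative risk of lung cancer, smokers versus non-smokers
display (.12/.16)/(.03/.84)
```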

Using Stata to Compute Probabilities:


Application: Are Blood Antibodies Independent?
An example of conditional probability in human genetics (adapted from Rick Chappell, Ph.D., University of Wisconsin Department of Biostatistics)

Background: The surfaces of human red blood cells are coated with antigens that are classified into four disjoint blood types: O, A, B, and AB. Each type is associated with blood serum antibodies for the other types, that is:

- Type O blood contains both A and B antibodies. (This makes type O the universal donor, but capable of receiving only type O.)
- Type A blood contains only B antibodies.
- Type B blood contains only A antibodies.
- Type AB blood contains neither A nor B antibodies. (This makes type AB the universal recipient, but capable of donating only to type AB.)


According to the American Red Cross, the U.S. population has the following blood group relative frequencies.
                    Rh factor
Blood type       +         -         Totals
O                0.384     0.077     0.461
A                0.323     0.065     0.388
B                0.094     0.017     0.111
AB               0.032     0.007     0.039
Totals           0.833     0.166     0.999

As you'll notice from the table above, blood is also categorized by the presence (+) or absence (-) of Rh factor. There are therefore eight distinct blood groups under this dual categorization: O+, O-, A+, A-, B+, B-, AB+, and AB-.

Let's first compute the following probabilities:

P(A antibodies) = P(Type O or Type B) = P(O) + P(B) = .461 + .111
display .461 + .111

P(B antibodies) follows by the same logic: P(Type O or Type A) = P(O) + P(A) = .461 + .388
display .461 + .388

P(B antibodies and Rh+) = P(Type O+ or A+) = P(O+) + P(A+) = .384 + .323
display .384 + .323

Based on the calculations we just made, try to answer the following questions. Keep track of your answers for each question.

- Is having A antibodies independent of having B antibodies? Why, or why not?
- Is having A antibodies independent of Rh+? Why, or why not?
- Is having B antibodies independent of Rh+? Why, or why not?
- Compute P(A antibodies | B antibodies) and P(B antibodies | A antibodies). What conclusions can you draw from this?
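One way to approach the first and last questions is to compare the joint probability with the product of the marginals. This is a sketch under one observation taken from the background text: only type O blood carries both A and B antibodies, so P(A antibodies and B antibodies) = P(Type O) = .461.

```stata
* Joint probability: only type O blood has both A and B antibodies
display .461

* Product of the marginals P(A antibodies) * P(B antibodies)
display (.461 + .111) * (.461 + .388)

* P(A antibodies | B antibodies) = P(both) / P(B antibodies)
display .461 / (.461 + .388)

* P(B antibodies | A antibodies) = P(both) / P(A antibodies)
display .461 / (.461 + .111)
```

If the two events were independent, the first two results would agree, and each conditional probability would equal the corresponding unconditional one.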

More Computing of Probabilities:


One last example: An observational study investigates the connection between aspirin use and three vascular conditions: gastrointestinal bleeding, primary stroke, and cardiovascular disease. A sample of patients exhibiting these disjoint conditions has the following prior probabilities: P(GI bleeding) = 0.2, P(Stroke) = 0.3, and P(Cardiovascular Disease) = 0.5. We also have the following conditional probabilities: P(Aspirin | GI bleeding) = 0.09, P(Aspirin | Stroke) = 0.04, and P(Aspirin | Cardiovascular Disease) = 0.02.

- Calculate the following posterior probabilities: P(GI bleeding | Aspirin), P(Stroke | Aspirin), and P(Cardiovascular Disease | Aspirin).
- Interpret what you've calculated by comparing the prior probabilities to the posterior probabilities.
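The posterior probabilities in the aspirin exercise follow from Bayes' rule: each posterior is the prior times the conditional probability of aspirin use, divided by the overall probability of aspirin use. The sketch below stores intermediate values in Stata scalars. The scalar names (p_gi, p_stroke, p_cvd, p_asp) are made up for illustration and are not part of the handout.

```stata
* Prior probabilities of the three disjoint conditions
scalar p_gi     = 0.2
scalar p_stroke = 0.3
scalar p_cvd    = 0.5

* Total probability of aspirin use: sum over the three conditions
scalar p_asp = p_gi*0.09 + p_stroke*0.04 + p_cvd*0.02
display p_asp

* Posterior probabilities by Bayes' rule
display p_gi*0.09     / p_asp
display p_stroke*0.04 / p_asp
display p_cvd*0.02    / p_asp
```

The three posteriors should sum to 1, which is a quick check on the arithmetic.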

Saving your work:


In the Command window, type log close. The log file is now complete in Stata's output format (.smcl). To convert it to plain text, type translate file_name.smcl file_name.log. Once you have a file ending in .log, you can open it in a text editor such as Notepad or TextEdit. If you didn't open a log file, you can still save the results via copy and paste: open a new Word file (or Notepad, if you prefer), highlight the results you want in the Results window, select Copy from the Edit menu, and paste the results into the new file.

Saving the data
If you didn't change the dataset during this session, you don't need to save the data. If you did make changes and want to keep them, type save file_name, replace or save file_name2. The former overwrites file_name.dta with the updated data. The latter creates a new dataset called file_name2.

Exit Stata
When you are ready to exit, you have two choices. You can type exit in the Stata Command window, or click File on the menu bar at the top of the screen and then select Exit.

References:
All three examples outlined in this lab were obtained from the lecture notes of Dr. Ismor Fischer of the University of Wisconsin Department of Biostatistics, for Stat 541. Although these notes are not published, Dr. Fischer deserves acknowledgement for his remarkable examples and commentary.
Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
Resources to Help You Learn Stata. UCLA Academic Technology Service. 8 September 2007. http://www.ats.ucla.edu/stat/stata/
Hamilton, Lawrence. Statistics with Stata. Canada: Duxbury, 2006.

Principles of Biostatistics and Data Analysis PHP2510 Lab4 October 4th 2007

Goals for Lab 4:


- Working with probability mass functions of discrete random variables
- Calculating probabilities via the Binomial distribution, with and without Stata
- Calculating conditional probabilities from raw, unorganized data
- Questions regarding Homework set 1

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:


cd                Change directory
dir or ls         Show files in current directory
insheet           Read ASCII (text) data created by a spreadsheet
infile            Read unformatted ASCII (text) data
infix             Read ASCII (text) data in fixed format
input             Enter data from keyboard
browse            Opens the spreadsheet-like Data Browser for viewing the data
edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
Binomial(n,k,p)   Probability of k or more successes in n trials when the probability of a success on a single trial is p
codebook          Detailed contents of a dataset
generate          Creates a new variable using any mixture of old variables, constants, random values or expressions
replace           Can produce new values, using any mixture of old variables, constants, random values or expressions
stem              Stem-and-leaf plot
graph             High resolution graphs
histogram         Histogram for continuous and categorical variables

Summary of Today's Lab:


Like lab 3, lab 4 focuses on using Stata to compute probabilities. Although most of these probabilities can easily be computed with a calculator, the more you work with Stata, the more comfortable you'll feel using it for more challenging problems. You should find this lab session helpful when you sit down to write up your homework solutions, as the problems we'll be going over today closely resemble some of the problems on your homework assignment. If there is any time at the end of our lab session, I will devote it to answering questions regarding the homework assignment.

A couple examples to get warmed up:


Let X be a discrete random variable that represents the number of diagnostic services a child receives during an office visit to a pediatric specialist; these services include procedures such as blood tests and urinalysis. The probability distribution for X appears below:
x         P(X = x)
0         0.671
1         0.229
2         0.053
3         0.031
4         0.01
5         0.006
Total     1

Considering the table above, answer the following questions:

- Graph the probability distribution of X.
- What is the probability that a child receives exactly three diagnostic services during an office visit?
- What is the probability that a child receives at least one diagnostic service during an office visit?
- What is the probability that the child receives exactly three services given the child has received at least one service?
- Suppose 10 children visit the doctor's office. What is the probability that none receive services?
- Given that at least one service is received, what is the probability that more than one service is received?
- What is E(X)?
- Is Poisson a good model for X? Why, or why not?

Questions? Enjoy your 3 day weekend!
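Most of these questions can be answered with display and the table entries alone. The sketch below is one possible set of calculations, not an answer key. Two assumptions are worth flagging: the 10 children are treated as independent visits, and binomialtail(n,k,p) is used as the newer name for the Binomial(n,k,p) function listed in this handout (if your version of Stata does not recognize it, substitute Binomial).

```stata
* P(X = 3), read directly from the table
display 0.031

* P(X >= 1) = 1 - P(X = 0)
display 1 - 0.671

* P(X = 3 | X >= 1)
display 0.031 / (1 - 0.671)

* P(none of 10 children receive services), assuming independent visits
display 0.671^10

* Same quantity from the binomial upper tail: 1 - P(at least 1 of 10 receives a service)
display 1 - binomialtail(10, 1, 1 - 0.671)

* P(X > 1 | X >= 1)
display (1 - 0.671 - 0.229) / (1 - 0.671)

* E(X) = sum of x * P(X = x)
display 0*0.671 + 1*0.229 + 2*0.053 + 3*0.031 + 4*0.01 + 5*0.006
```

The two calculations for the 10-children question should agree, which makes a useful check that the binomial function is being called with the intended arguments.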

Saving your work:


In the Command window, type log close. The log file is now complete in Stata's output format (.smcl). To convert it to plain text, type translate file_name.smcl file_name.log. Once you have a file ending in .log, you can open it in a text editor such as Notepad or TextEdit. If you didn't open a log file, you can still save the results via copy and paste: open a new Word file (or Notepad, if you prefer), highlight the results you want in the Results window, select Copy from the Edit menu, and paste the results into the new file.

Saving the data
If you didn't change the dataset during this session, you don't need to save the data. If you did make changes and want to keep them, type save file_name, replace or save file_name2. The former overwrites file_name.dta with the updated data. The latter creates a new dataset called file_name2.

Exit Stata
When you are ready to exit, you have two choices. You can type exit in the Stata Command window, or click File on the menu bar at the top of the screen and then select Exit.

References:
All three examples outlined in this lab were obtained from the lecture notes of Dr. Ismor Fischer of the University of Wisconsin Department of Biostatistics, for Stat 541. Although these notes are not published, Dr. Fischer deserves acknowledgement for his remarkable examples and commentary.
Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
Resources to Help You Learn Stata. UCLA Academic Technology Service. 8 September 2007. http://www.ats.ucla.edu/stat/stata/
Hamilton, Lawrence. Statistics with Stata. Canada: Duxbury, 2006.

Principles of Biostatistics and Data Analysis PHP2510 Lab5 October 12th 2007

Goals for Lab 5:


- Working with probability mass functions of discrete random variables
- Working with Binomial and Poisson distributions
- Working with Normal distributions
- Questions regarding Homework set 2 / Questions about test 1

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:


cd                Change directory
dir or ls         Show files in current directory
insheet           Read ASCII (text) data created by a spreadsheet
infile            Read unformatted ASCII (text) data
infix             Read ASCII (text) data in fixed format
input             Enter data from keyboard
browse            Opens the spreadsheet-like Data Browser for viewing the data
edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
qnorm             Plots the quantiles of a variable against the quantiles of a normal distribution
codebook          Detailed contents of a dataset
generate          Creates a new variable using any mixture of old variables, constants, random values or expressions
replace           Can produce new values, using any mixture of old variables, constants, random values or expressions
stem              Stem-and-leaf plot
graph             High resolution graphs
histogram         Histogram for continuous and categorical variables

Summary of Today's Lab:


Lab 5 focuses on problems that you will likely encounter on a homework or exam. This is mostly a review of concepts you've been studying in lecture over the past 2 weeks. The two exercises in this lab deal with the Binomial, Poisson, and Normal distributions. For the second exercise you'll need the table of cumulative probabilities of the standard normal distribution, which you can obtain from the following URL ( http://www.stat.wisc.edu/~ifischer/Statistical_Tables/Z-distribution.pdf ). If there is any time at the end of lab, I will answer questions regarding homework 2 or the upcoming test.

Working with the Binomial and Poisson Distributions:


A new disease occurs in large populations in such a way that the probability of a randomly selected individual having the disease remains constant at p = 0.008, independent of any other randomly selected individual having the disease. Suppose now that a sample of n = 500 individuals is to be randomly selected from this population. Define the discrete random variable X = number of diseased individuals, which can take on any value from 0 to 500.

- Calculate the probability mass function f(x) = P(X = x), that is, the probability that 0, 1, 2, ..., 10 people have the disease among the n = 500 individuals sampled. Do this first using the Binomial distribution and second using the Poisson distribution. (Round each decimal to the thousandths place.)
- Graph both the Binomial and Poisson probability mass functions using what you calculated in the table below. What do these graphs tell you about the distribution of disease?

x         Binomial     Poisson
0
1
2
3
4
5
6
7
8
9
10
etc.      etc.         etc.

- Using both the Binomial and Poisson distributions, what is the mean number of diseased individuals to be expected in the sample, and what is its probability? How does this probability compare with the probabilities of other numbers of diseased individuals?
- Suppose that, after sampling n = 500 individuals, you find that x = 10 of them actually have the disease. Before performing any formal statistical tests, what assumptions, if any, might you suspect have been violated in this scenario? What is the estimate of the probability p of disease, based on this sample?
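The table of Binomial and Poisson probabilities can be filled in with a short loop. This is a sketch that assumes a version of Stata providing the functions binomialp(n,k,p) and poissonp(m,k), which return P(X = k); neither function appears in this handout's command list. The Poisson mean is taken as n*p = 500 * 0.008 = 4.

```stata
* Binomial(n = 500, p = 0.008) versus Poisson(mean = 4), for x = 0, 1, ..., 10
* Columns printed: x, Binomial P(X = x), Poisson P(X = x), each to three decimals
forvalues x = 0/10 {
    display `x' "   " %5.3f binomialp(500, `x', 0.008) "   " %5.3f poissonp(4, `x')
}

* Expected number of diseased individuals under both models: n*p
display 500 * 0.008
```

Each row of output corresponds to one row of the table above, so the two columns can be compared side by side.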

Working with the Normal distribution:
Suppose that in a certain population of adult males the variable Y = total serum cholesterol level (mg/dL) is found to be normally distributed with mean = 220 and standard deviation = 40. For an individual selected at random, what is the probability that his cholesterol level is:

- Under 190? Under 210? Under 230? Under 250?
- Over 240? Over 270? Over 300? Over 330?
- Over 250, given it is over 240?
- Between 214 and 276? Between 202 and 238?

What value of Y constitutes the 80th percentile? 90th percentile? 99th percentile?

Questions?
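As a cross-check on the statistical table, the same probabilities can be computed in Stata by standardizing Y and using normal(), the standard normal cumulative distribution function (assumed to be available in your version of Stata; it is not in this handout's command list), together with invnormal() for percentiles. One representative calculation of each kind is sketched below.

```stata
* P(Y < 190): standardize with mean 220 and standard deviation 40
display normal((190 - 220) / 40)

* P(Y > 240)
display 1 - normal((240 - 220) / 40)

* P(214 < Y < 276)
display normal((276 - 220) / 40) - normal((214 - 220) / 40)

* P(Y > 250 | Y > 240) = P(Y > 250) / P(Y > 240)
display (1 - normal((250 - 220) / 40)) / (1 - normal((240 - 220) / 40))

* 80th percentile of Y: mean + sd * (80th percentile of the standard normal)
display 220 + 40 * invnormal(0.80)
```

The remaining cutoffs and percentiles follow the same pattern with different numbers.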

Saving your work:


In the Command window, type log close. The log file is now complete in Stata's output format (.smcl). To convert it to plain text, type translate file_name.smcl file_name.log. Once you have a file ending in .log, you can open it in a text editor such as Notepad or TextEdit. If you didn't open a log file, you can still save the results via copy and paste: open a new Word file (or Notepad, if you prefer), highlight the results you want in the Results window, select Copy from the Edit menu, and paste the results into the new file.

Saving the data
If you didn't change the dataset during this session, you don't need to save the data. If you did make changes and want to keep them, type save file_name, replace or save file_name2. The former overwrites file_name.dta with the updated data. The latter creates a new dataset called file_name2.

Exit Stata
When you are ready to exit, you have two choices. You can type exit in the Stata Command window, or click File on the menu bar at the top of the screen and then select Exit.

References:
Examples outlined in this lab are variations of labs obtained from the lecture notes of Dr. Ismor Fischer of the University of Wisconsin Department of Biostatistics, for Stat 541. Although these notes are not published, Dr. Fischer deserves acknowledgement for his remarkable examples and commentary.
Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
Resources to Help You Learn Stata. UCLA Academic Technology Service. 8 September 2007. http://www.ats.ucla.edu/stat/stata/
Hamilton, Lawrence. Statistics with Stata. Canada: Duxbury, 2006.

Principles of Biostatistics and Data Analysis PHP2510 Lab6 October 25th 2007

Goals for Lab 6:


- Generating random variables that are uniformly distributed
- Examples of cases where we could use a random variable generated from a uniform distribution
- Generating variables from a normal distribution
- Applications

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:


edit          Opens the spreadsheet-like Data Editor where data can be entered or edited
save          Store the dataset currently in memory on disk in Stata data format
display       Performs a single calculation and displays the results on screen
list          List values of variables
qnorm         Plots the quantiles of a variable against the quantiles of a normal distribution
tabulate      Produces one- and two-way tables of frequencies
generate      Creates a new variable using any mixture of old variables, constants, random values or expressions
set           Sets the values of various system parameters
invnormal()   Returns the inverse cumulative standard normal distribution
uniform()     Random number function based on uniform distribution
histogram     Histogram for continuous and categorical variables

Summary of Today's Lab:


Up to this point, most of the datasets we've used in Stata either already existed or only needed to be converted from a table into a format that Stata can work with. Sometimes, however, we need to create a dataset from scratch. Today's lab focuses on generating data from some of the theoretical distributions we've discussed in lecture and on learning why this might be useful.

Generating random variables from a uniform distribution:


Suppose we want to start a dataset containing 10 random values. The first step is to set the number of observations desired for the new dataset. The command to do this in Stata is as follows:

set obs 10

Recall the generate command (creates a new variable using any mixture of old variables, constants, random values or expressions). If we want our generated dataset to contain values that are uniformly distributed, we can use the following command:

generate randnum = uniform()

Typing list afterwards should produce results like the following:

list

     +----------+
     |  randnum |
     |----------|
  1. | .3595699 |
  2. | .6071514 |
  3. | .4038841 |
  4. | .6268446 |
  5. | .3503966 |
     |----------|
  6. | .7756056 |
  7. | .8306237 |
  8. | .6329162 |
  9. | .4297969 |
 10. | .3735252 |
     +----------+

Note: the values that you obtain will not be the same as those listed in this table.

In combination with Stata's algebraic, statistical, and special functions, uniform() can simulate values sampled from a variety of theoretical distributions. If we want a new variable sampled from a uniform distribution over [0, 428) instead of the typical [0, 1), we can use the command below:

generate newvar = 428 * uniform()

To simulate 100 rolls of a standard six-sided die, we can use:

set obs 100
generate roll = 1 + trunc(6*uniform())

If we want to see how many 1s, 2s, ..., 6s we had in this randomized simulation, we can use the tabulate command.

tabulate roll

       roll |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         20       20.00       20.00
          2 |         19       19.00       39.00
          3 |         11       11.00       50.00
          4 |         18       18.00       68.00
          5 |         13       13.00       81.00
          6 |         19       19.00      100.00

A more interesting example involves simulating 1000 rolls of a pair of six-sided dice. Type:

set obs 1000
generate dice = 2 + trunc(6*uniform()) + trunc(6*uniform())
tabulate dice

       dice |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |         33        3.30        3.30
          3 |         47        4.70        8.00
          4 |         80        8.00       16.00
          5 |        125       12.50       28.50
          6 |        130       13.00       41.50
          7 |        172       17.20       58.70
          8 |        125       12.50       71.20
          9 |        117       11.70       82.90
         10 |         88        8.80       91.70
         11 |         57        5.70       97.40
         12 |         26        2.60      100.00
------------+-----------------------------------
      Total |      1,000      100.00

We can also take a look at the distribution:

histogram dice

[Figure: histogram of dice, with Density on the vertical axis and dice (2 to 12) on the horizontal axis]

Which is precisely what we would expect to find.
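Because the values come from a random number generator, a simulation like this gives different counts each time it is run. The sketch below shows one way to make the dice simulation repeatable and to compare it against theory. The seed value 2510 is arbitrary, and the discrete and fraction options of histogram are additions not used in this handout.

```stata
* Start from an empty dataset and fix the random number seed (2510 is an arbitrary choice)
clear
set seed 2510
set obs 1000

* Sum of two independent six-sided dice
generate dice = 2 + trunc(6*uniform()) + trunc(6*uniform())
tabulate dice

* One bar per distinct value, with bar heights shown as fractions of the 1000 rolls
histogram dice, discrete fraction

* Theoretical probability of rolling a 7: 6 of the 36 equally likely outcomes
display 6/36
```

Comparing the tabulated percentage for 7 with the theoretical value gives a sense of how close 1000 rolls come to the expected distribution.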



Generating random variables from the normal distribution:


It is also possible to generate variables from a normal (Gaussian) distribution using uniform(). The following example creates a dataset with 500 observations and two variables: z from the standard normal N(0,1) population, and x from the N(10,3) population (mean 10, standard deviation 3).

clear
set obs 500
generate z = invnormal(uniform())
generate x = 10 + 3 * invnormal(uniform())

Notice that the actual sample means and standard deviations differ slightly from their theoretical values:

summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           z |       500    .0046837     1.05767  -2.934611   2.877433
           x |       500    9.872388    3.065744   .6584399   20.88839

Consider the following example: Create variables N(0,1) and N(29,23) and fill out the tables below. Summarize your findings via some form of visual display method (i.e., a graph) using Stata.

Table 1: N(0,1)
number of observations     standard deviation     standard error
10
100
1000
10000

Table 2: N(89,23)
number of observations     standard deviation     standard error
10
100
1000
10000

Interpret your results.

Are your results consistent with your expectations prior to this exercise?

If so, why?
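One row of Table 1 can be produced as sketched below. This is an illustration, not the required approach: it relies on the r(sd) and r(N) results that summarize leaves behind, and it takes the standard error to mean the sample standard deviation divided by the square root of the number of observations.

```stata
* Draw 100 observations from N(0,1), then report the sample SD and the standard error
clear
set obs 100
generate z = invnormal(uniform())
summarize z
display "SD = " r(sd) "   SE = " r(sd) / sqrt(r(N))
```

Changing the number in set obs gives the other rows. A variable for the second table can be built the same way as x in the earlier example, by multiplying by the standard deviation and adding the mean.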

Saving your work:


In the Command window, type log close. The log file is now complete in Stata's output format (.smcl). To convert it to plain text, type translate file_name.smcl file_name.log. Once you have a file ending in .log, you can open it in a text editor such as Notepad or TextEdit. If you didn't open a log file, you can still save the results via copy and paste: open a new Word file (or Notepad, if you prefer), highlight the results you want in the Results window, select Copy from the Edit menu, and paste the results into the new file.

Saving the data
If you didn't change the dataset during this session, you don't need to save the data. If you did make changes and want to keep them, type save file_name, replace or save file_name2. The former overwrites file_name.dta with the updated data. The latter creates a new dataset called file_name2.

Exit Stata
When you are ready to exit, you have two choices. You can type exit in the Stata Command window, or click File on the menu bar at the top of the screen and then select Exit.

References:
Examples outlined in this lab are variations of labs obtained from the lecture notes of Dr. Ismor Fischer of the University of Wisconsin Department of Biostatistics, for Stat 541. Although these notes are not published, Dr. Fischer deserves acknowledgement for his remarkable examples and commentary.
Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
Resources to Help You Learn Stata. UCLA Academic Technology Service. 8 September 2007. http://www.ats.ucla.edu/stat/stata/
Hamilton, Lawrence. Statistics with Stata. Canada: Duxbury, 2006.

Principles of Biostatistics and Data Analysis PHP2510 Lab7 November 2nd 2007

Goals for Lab 7:


- Calculating confidence intervals
- Application: breast cancer mortality

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:


edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
qnorm             Plots the quantiles of a variable against the quantiles of a normal distribution
sample #, count   Generates a sample from the current dataset
generate          Creates a new variable using any mixture of old variables, constants, random values or expressions
set               Sets the values of various system parameters
invnormal()       Returns the inverse cumulative standard normal distribution
uniform()         Random number function based on uniform distribution
histogram         Histogram for continuous and categorical variables

Calculating Confidence intervals:


A particular area contains 8000 condominium units. In a survey of the occupants, a simple random sample of size 100 yields the information that the average number of motor vehicles per unit is 1.6, with sample variance .64. The estimated standard error of \(\bar{X}\) is

\[ se(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{.8}{\sqrt{100}} = .08 \]

Since z(.025) = 1.96, a 95% confidence interval for the population average is \(\bar{X} \pm 1.96 \cdot se(\bar{x})\), which comes out to approximately (1.44, 1.76). What would the 90% confidence interval for the population average be? The 99% CI? What would be the estimate of the total number of motor vehicles in the particular area specified in this problem?

T = 8000 × 1.6 = 12,800

Using the result from above, what is the estimated standard error of T? What are the 90%, 95%, and 99% confidence intervals for the total number of motor vehicles?

Application: Cancer Mortality:
Import the cancer dataset into Stata. This dataset contains values for breast cancer mortality from 1950 to 1960 and the adult white female population in 1960 for 301 counties in North Carolina, South Carolina, and Georgia. It is designed to help you understand the differences between samples of various sizes and the true population.

- Make a histogram of the population values for cancer mortality.
- What are the population mean and total cancer mortality? What are the population variance and standard deviation?
- Select a random sample of 25 observations of cancer mortality. What is the sampling distribution of the sample mean?
- From this random sample, calculate the sample mean, variance, and standard error.
- Compare these results to the population mean and variance you calculated. How large are the differences?
- Form 90%, 95%, and 99% confidence intervals for the population mean from the sample mean that you calculated. Do these confidence intervals cover the population values?
- Repeat bullets 3-6 with a random sample of 150 observations of cancer mortality. How does this sample size compare to n = 25? To the population?
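The interval calculations in this lab can be done with display and invnormal(). The sketch below first reproduces the 95% interval for the condominium example, then outlines the sampling steps for the cancer application. The second half assumes the cancer dataset is already in memory and that the mortality variable is named mortality, which is a hypothetical name. The seed 2510 is arbitrary, and the sketch uses the normal critical value 1.96 to match the worked example above.

```stata
* Condominium example: 95% confidence interval for the mean
display 1.6 - invnormal(0.975) * 0.08
display 1.6 + invnormal(0.975) * 0.08

* The total is T = 8000 * mean, so its standard error is 8000 * se(mean)
display 8000 * 0.08

* Cancer application: draw a random sample of 25 counties, keeping the full data safe
* (mortality is a hypothetical variable name; substitute the name in your dataset)
preserve
set seed 2510
sample 25, count
summarize mortality
display "SE = " r(sd) / sqrt(r(N))
display "95% CI: " r(mean) - invnormal(0.975)*r(sd)/sqrt(r(N)) " to " r(mean) + invnormal(0.975)*r(sd)/sqrt(r(N))
restore
```

Replacing 0.975 with 0.95 or 0.995 gives the 90% and 99% intervals, and replacing 25 with 150 repeats the exercise for the larger sample.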

References:

Examples outlined in this lab are variations of labs obtained from the lecture notes of Dr. Ismor Fischer of the University of Wisconsin Department of Biostatistics, for Stat 541. Although these notes are not published, Dr. Fischer deserves acknowledgement for his remarkable examples and commentary.
Acock, Alan. A Gentle Introduction to Stata. Texas: Stata Press, 2006.
Resources to Help You Learn Stata. UCLA Academic Technology Service. 8 September 2007. http://www.ats.ucla.edu/stat/stata/
Hamilton, Lawrence. Statistics with Stata. Canada: Duxbury, 2006.

Principles of Biostatistics and Data Analysis PHP2510 Lab8 November 9th 2007

Goals for Lab 8:


- Some more exposure to confidence intervals
- Maximum likelihood estimation
- Distributions of the sample mean

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:


edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
qnorm             Plots the quantiles of a variable against the quantiles of a normal distribution
sample #, count   Generates a sample from the current dataset
generate          Creates a new variable using any mixture of old variables, constants, random values or expressions
set               Sets the values of various system parameters
invnormal()       Returns the inverse cumulative standard normal distribution
uniform()         Random number function based on uniform distribution
histogram         Histogram for continuous and categorical variables

A little review before we begin the lab.

- We have a population whose distribution is Bernoulli, p = .35. What is the distribution of the following sample means? What are the standard errors?
  Sample mean with n = 100
  Sample mean with n = 75
  What if p = .80 instead?

Recall: The sample mean is a random variable, and the expected value of the sample mean is the same as the expected value for the population. Additionally, the variance of the sample mean is

\[ \mathrm{Var}(\bar{X}) = \frac{\mathrm{Var}(X)}{n}. \]
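For a Bernoulli(p) population, Var(X) = p(1 - p), so the standard error of the sample mean is the square root of p(1 - p)/n. The standard errors asked for in the review can therefore be sketched with display:

```stata
* Standard error of the sample mean for a Bernoulli(p) population: sqrt(p*(1-p)/n)
* p = .35
display sqrt(0.35 * 0.65 / 100)
display sqrt(0.35 * 0.65 / 75)

* p = .80
display sqrt(0.80 * 0.20 / 100)
display sqrt(0.80 * 0.20 / 75)
```

In each case the smaller sample gives the larger standard error, since n appears in the denominator.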

Example 1: From lecture we learned that confidence intervals are often used to convey uncertainty about the estimate of any parameter.

Recall the calculation for a confidence interval:

\[ \bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \]

which is often given as

\[ \left( \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right). \]

When eight persons in Massachusetts experienced an unexplained episode of vitamin D intoxication that required hospitalization, it was suggested that these unusual occurrences might be the result of excessive supplementation of dairy milk. Blood levels of calcium and albumin for each individual at the time of hospital admission are shown below:
Calcium (mmol/l)     Albumin (g/l)
2.92                 43
3.84                 42
2.37                 42
2.99                 40
2.67                 42
3.17                 38
3.74                 34
3.44                 42

a) Construct a two-sided 95% confidence interval for the true mean calcium level of individuals who experience vitamin D intoxication.
b) Compute a 95% confidence interval bound for the true mean albumin level of this group.
c) For healthy individuals, the normal range of calcium levels is 2.12 to 2.74 mmol/l and the range of albumin levels is 32 to 55 g/l. Do you believe that patients suffering from vitamin D intoxication have normal blood levels of calcium and albumin?
d) Using Stata, test the hypothesis that the true mean for calcium blood level is 2.0 mmol/l.
e) Test the hypothesis that the true mean for calcium blood level is 2.5 mmol/l. Compare this to the results obtained from part (d).
f) Test the hypothesis that the true mean for albumin blood level is 36 g/l.
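One way to set up parts (d) through (f) is to type the eight observations in with input and then use ttest. This is a sketch with two caveats: it assumes the calcium and albumin values pair up in the order listed in the table, and ttest is based on the t distribution rather than the z formula recalled above, so its confidence intervals will be somewhat wider for a sample of eight.

```stata
* Enter the eight patients' values (calcium in mmol/l, albumin in g/l)
clear
input calcium albumin
2.92 43
3.84 42
2.37 42
2.99 40
2.67 42
3.17 38
3.74 34
3.44 42
end

* One-sample t tests; the output also reports a 95% confidence interval for the mean
ttest calcium == 2.0
ttest calcium == 2.5
ttest albumin == 36
```

The confidence interval printed with each test can be used for parts (a) and (b) as well.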

Open up the lowbt.dta dataset from Dr. Hogan's course webpage.

Test the following hypotheses:

- The true mean for spb is 40. Interpret your result.
- Calculate the 95% confidence interval for gestage and test the hypothesis that the true mean is 25.
- Using this dataset, formulate and test a hypothesis of your own.

Remember, we'll be having a review session on Sunday, the 11th, at noon in room 241 (121 S. Main St.). I will not be going over any specific topics, so bring questions!

Principles of Biostatistics and Data Analysis PHP2510 Lab9 November 16th 2007

Goals for Lab 9:


- Hypothesis testing
- One-sample test of a proportion
- Two-sample test of a proportion
- One-sample test of a mean
- Two-sample test of a mean

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:


edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
qnorm             Plots the quantiles of a variable against the quantiles of a normal distribution
sample #, count   Generates a sample from the current dataset
generate          Creates a new variable using any mixture of old variables, constants, random values or expressions
set               Sets the values of various system parameters
invnormal()       Returns the inverse cumulative standard normal distribution
uniform()         Random number function based on uniform distribution
histogram         Histogram for continuous and categorical variables

Hypothesis Testing:
Before we can run a z- or t-test, we need two hypotheses: a null hypothesis (H0) and an alternative hypothesis (HA). Although one-sided tests are appropriate if we can categorically exclude negative findings (results in the opposite direction of our hypothesis), we will rarely be this confident in the direction of the results. It is common practice to report a two-tailed test and to note whether the direction of the results is what we expected to see. A two-tailed test is always more conservative than a one-tailed test, so the tendency to rely on two-tailed tests can be viewed as a conservative approach to statistical significance. As we'll see, Stata reports both one-tailed and two-tailed significance.

One-sample test of a proportion:
For this exercise, we'll be using data from the 2002 General Social Survey dataset. The variable SchoolPrayer is coded 1 if the person favors school prayer and 0 if the person opposes school prayer.

Null hypothesis H0: p = 0.5
Alternative hypothesis HA: p ≠ 0.5

Principles of Biostatistics and Data Analysis PHP2510 Lab9 November 16th 2007

The null hypothesis, p = 0.5, uses a value of 0.5 because this represents what the proportion would be if there were no preference for or against school prayer. To carry out the hypothesis test in Stata, navigate to the menu bar and click on Statistics => Summaries, tables => Classical tests of hypothesis => One-sample proportion test. Enter the variable SchoolPrayer and the hypothesized proportion for the null hypothesis, 0.5. Submitting this gives us the following results:

One-sample test of proportion           SchoolPrayer: Number of obs =      320
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
SchoolPrayer |    .303125   .0256929                      .2527678    .3534822
------------------------------------------------------------------------------
    p = proportion(SchoolPrayer)                                  z =  -7.0436
Ho: p = 0.5

    Ha: p < 0.5                 Ha: p != 0.5                   Ha: p > 0.5
 Pr(Z < z) = 0.0000         Pr(|Z| > |z|) = 0.0000          Pr(Z > z) = 1.0000
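The menu steps above also have a command-line form. The following is a minimal sketch, assuming the GSS data are already in memory with the variable SchoolPrayer: prtest runs the test on the data, display reproduces the z statistic by hand from the numbers in the output, and the immediate command prtesti needs only the sample size, the sample proportion, and the hypothesized proportion.

```stata
* Assumes the GSS dataset is already in memory
prtest SchoolPrayer == 0.5

* z by hand: (sample proportion - hypothesized value) / standard error under Ho
display (0.303125 - 0.5) / sqrt(0.5 * (1 - 0.5) / 320)

* Immediate form: n, sample proportion, hypothesized proportion
prtesti 320 0.303125 0.5
```

Note that the standard error shown in the table (.0256929) is computed from the sample proportion and is used for the confidence interval, while the z statistic uses the standard error computed under the null hypothesis.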

The first line indicates that this is a one-sample test of proportion, the variable is SchoolPrayer, and N = 320. The table gives a mean (0.303125) and standard error (0.0256929). The mean is simply the proportion of people coded 1. Thus, 0.30, or 30%, of the people in the survey say that they support prayer in school. The 95% confidence interval tells us that we can be 95% confident that the interval .253 to .353 includes the true proportion of adults who support school prayer. Using percentages, we could say that we are 95% confident that the interval of 25.3% to 35.3% contains the true percentage of adults who support school prayer.

Stata's Ha: p != 0.5 has the same meaning as the alternative hypothesis we previously stated, HA: p ≠ 0.5. The p = 0.0000 does not mean that the p-value is exactly zero, just that it is all zeros to four decimal places. If p = 0.00004, for example, it is rounded to p = 0.0000 in the Stata output. A p-value of .0000 is usually reported as p < .001. This finding is highly statistically significant and allows us to reject the null hypothesis. Fewer than 1 time in 1000 would we obtain results at least this extreme by chance if the null hypothesis were true.

Two-sample test of a proportion:
Often, statisticians want to compare a proportion across two samples. For example, you might have an experiment testing a new drug. You randomly assign 100 study participants so that 50 are in the treatment group that receives the drug and 50 are in the control group that receives a placebo. You record whether the person is cured by assigning a 1 to those who were cured and a 0 to those who were not. The table below represents what the data would look like.

Patient   Treatment   Placebo
1         0           0
2         0           1
3         1           0
4         1           1
5         0           1
.         .           .
.         .           .
.         .           .
50        1           0

Open the drug.dta dataset in Stata and summarize your data. Notice that in the treatment group 35 of the 50 people, or .7, were cured. In the placebo group, 20 of the 50, or .4, were cured. Before going any further, we need a null and an alternative hypothesis. The null hypothesis is that the two groups have the same proportion cured, and the alternative hypothesis is that the proportion cured differs between the treatment and placebo groups. Thus, we have the following:

Null Hypothesis H0: p(treatment) = p(placebo)
Alternative Hypothesis HA: p(treatment) ≠ p(placebo)

Notice that these are independent samples and that the data for the two groups are entered as two variables. To perform the hypothesis test, navigate to the menu bar: Statistics => Summaries, tables => Classical tests of hypothesis => Two-sample proportion test. Performing the hypothesis test will give us the following results:

Two-sample test of proportion              treatment: Number of obs =       50
                                             placebo: Number of obs =       50
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |         .7   .0648074                      .5729798    .8270202
     placebo |         .4    .069282                      .2642097    .5357903
-------------+----------------------------------------------------------------
        diff |         .3   .0948683                      .1140615    .4859385
             |  under Ho:   .0994987     3.02   0.003
------------------------------------------------------------------------------
        diff = prop(treatment) - prop(placebo)                    z =   3.0151
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9987         Pr(|Z| > |z|) = 0.0026          Pr(Z > z) = 0.0013
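
The same analysis can be run from the command line. This is a sketch only, assuming drug.dta sits in the current working directory and contains the variables treatment and placebo named in the output above; the immediate command prtesti takes the two sample sizes and proportions directly.

```stata
* Assumes drug.dta is in the current working directory
use drug.dta, clear
summarize treatment placebo

* Two-sample test of proportions on the two variables
prtest treatment == placebo

* Immediate form: n1, p1, n2, p2
prtesti 50 0.7 50 0.4
```

The z statistic is based on the pooled proportion (55 cured out of 100), which is why the standard error listed "under Ho" differs from the standard error of the difference used for the confidence interval.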

Evaluating the two-sided p-value, which is less than α = 0.05, we reject the null hypothesis and conclude that the proportion of cured individuals in the treatment group is not equal to the proportion of cured individuals in the placebo group.

One-sample test of means:
You can do z tests or t tests for a one-sample mean. A z test is appropriate when you know the population variance; because this is often not the case, we'll cover only t tests, which are the most widely used. Unless you have a small sample, both tests yield very similar results. We will be using the lifexp.dta dataset for this exercise. This dataset contains 538 observations and 4 variables. All observations in this study are a sample from the United States population. The variable we're concerned with is lexp (life expectancy from birth). It was reported in a recent issue of USA Today that the current average life expectancy for Americans is 73 years. Test the null hypothesis that the mean life expectancy is 73 years.

Null Hypothesis H0: μ = 73
Alternative Hypothesis HA: μ ≠ 73

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    lexp |     538    72.89219    .1721484    3.992952    72.55403    73.23036
------------------------------------------------------------------------------
    mean = mean(lexp)                                             t =  -0.6262
Ho: mean = 73                                    degrees of freedom =      537

    Ha: mean < 73               Ha: mean != 73                 Ha: mean > 73
 Pr(T < t) = 0.2657         Pr(|T| > |t|) = 0.5314          Pr(T > t) = 0.7343
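
The output above can be produced from the command line. A minimal sketch, assuming lifexp.dta is in the current working directory: ttest runs the test on the data, and the immediate command ttesti reproduces it from the summary statistics alone.

```stata
* Assumes lifexp.dta is in the current working directory
use lifexp.dta, clear
ttest lexp == 73

* Immediate form: n, sample mean, sample standard deviation, hypothesized mean
ttesti 538 72.89219 3.992952 73
```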

Thus, the two-sided p-value is greater than α = 0.05, so we fail to reject the null hypothesis: results like those of this study would not be unusual if μ = 73.

Two-sample test of means:
Often statisticians want to compare two means to determine whether they differ significantly. In the same USA Today issue, it was also reported that women have a greater life expectancy than men. Using a two-sample test of means, test the null hypothesis that men and women have equal life expectancies.

Null Hypothesis H0: μ(women) = μ(men)
Alternative Hypothesis HA: μ(women) ≠ μ(men)

To carry out the two-sample test of means, navigate to the menu bar: Statistics => Summaries, tables & tests => Classical tests of hypothesis => Two-sample mean comparison test. Set Mlexp as your first variable and Flexp as your second variable, the order used in the output below. Additionally, click the unequal variances box. (Why should we click this box?)

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   Mlexp |     538    71.08736    .1319606    3.060803    70.82814    71.34658
   Flexp |     538    76.05762    .1690386    3.920821    75.72556    76.38968
---------+--------------------------------------------------------------------
combined |    1076    73.57249    .1312677    4.305901    73.31492    73.83006
---------+--------------------------------------------------------------------
    diff |            -4.97026    .2144473               -5.391071   -4.549449
------------------------------------------------------------------------------
    diff = mean(Mlexp) - mean(Flexp)                              t = -23.1771
Ho: diff = 0                     Satterthwaite's degrees of freedom =  1014.26

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Recall from a previous example that p = 0.0000 does not mean that the p-value is exactly zero, just that it is all zeros to four decimal places. A p-value of .0000 is usually reported as p < .001. This finding is highly statistically significant and allows us to reject the null hypothesis. Fewer than 1 time in 1000 would we obtain results at least this extreme by chance if the null hypothesis were true. It is therefore reasonable to conclude that the mean life expectancy for women is not equal to the mean life expectancy for men.
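
The menu steps for this test correspond to the following commands. This is a sketch, assuming lifexp.dta is in memory with the variables Mlexp and Flexp shown in the output. The unpaired option matters here: with two variables and no unpaired option, ttest runs a paired test instead.

```stata
* Assumes lifexp.dta is in memory
ttest Mlexp == Flexp, unpaired unequal

* Immediate form: n1, mean1, sd1, n2, mean2, sd2
ttesti 538 71.08736 3.060803 538 76.05762 3.920821, unequal
```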

Questions:

Principles of Biostatistics and Data Analysis PHP2510 Lab10 November 30th 2007

Goals for Lab 10:
- Setting up two-by-two tables in Stata
- Calculating odds, odds ratio, and relative risk
- Applications

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:

edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
qnorm             Plots quantiles of a variable against quantiles of the normal distribution
sample #, count   Generates a sample from the current dataset
sampsi            Sample size and power determination
csi               Calculates appropriate statistics for epidemiologic datasets
invnormal()       Returns the inverse cumulative standard normal distribution
uniform()         Random number function based on uniform distribution
histogram         Histogram for continuous and categorical variables

Concepts
The concept of relative risk is often useful when we want to compare the probabilities of disease in two different groups. The relative risk, abbreviated RR, is the chance that a member of a group receiving some exposure will develop disease relative to the chance that a member of an unexposed group will develop the same disease. Writing a and b for the numbers of exposed and unexposed cases, and c and d for the numbers of exposed and unexposed noncases (the order in which the counts are passed to csi), we have

$$RR = \frac{a/(a+c)}{b/(b+d)}$$

Another commonly used measure of the relative probabilities of disease is the odds ratio (OR). If an event takes place with probability p, the odds in favor of the event are p/(1-p). The odds ratio is the odds of disease among the exposed divided by the odds of disease among the unexposed:

$$OR = \frac{a/c}{b/d} = \frac{ad}{bc}$$

The relative risk and the odds ratio are two different measures that attempt to explain the same phenomenon. Although the relative risk might seem more intuitive, the odds ratio has better statistical properties.

First exercise:
For the first exercise in this lab we'll be using data collected from the study below:

Chasnoff et al. Temporal patterns of cocaine use in pregnancy. JAMA 1989; 261:12: 1741-1744. Study Abstract:

Fill in the 2x2 table from the data below to estimate the effect of cocaine use throughout the entire pregnancy, compared to women who did not use cocaine at any time during pregnancy, on the risk of low birth weight infants.

                            Cocaine use throughout   No cocaine use at any point
                            pregnancy                during pregnancy
Low birth weight: yes       ______                   ______
Low birth weight: no        ______                   ______

We can use the following command to enter these data directly into Stata:

csi 10 1 42 39


                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        10           1  |         11
        Noncases |        42          39  |         81
-----------------+------------------------+------------
           Total |        52          40  |         92
                 |                        |
            Risk |  .1923077        .025  |   .1195652
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .1673077       |    .0497686    .2848468
      Risk ratio |         7.692308       |    1.026696    57.63305
 Attr. frac. ex. |              .87       |    .0260014    .9826488
 Attr. frac. pop |         .7909091       |
                 +-------------------------------------------------
                               chi2(1) =     6.01  Pr>chi2 = 0.0142
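
The quantities in this output can be checked by hand with display, and csi accepts options that help with the questions that follow. This is a sketch using the cell counts from the command above; the or option adds the odds ratio to the output and level() changes the confidence level.

```stata
* Risks in the exposed and unexposed groups
display 10 / (10 + 42)
display 1 / (1 + 39)

* Relative risk and odds ratio from the cell counts
display (10 / (10 + 42)) / (1 / (1 + 39))
display (10 * 39) / (1 * 42)

* Add the odds ratio to the csi output, then request 99% intervals
csi 10 1 42 39, or
csi 10 1 42 39, or level(99)
```

The first two display results should match the Risk row of the table, and the third should match the Risk ratio row.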

Answer the following questions regarding the data:
- What is the estimated probability of low birth weight infants for women who used cocaine throughout pregnancy? For women who didn't use cocaine at all during pregnancy?
- What is the total probability of low birth weight infants among both women who used cocaine throughout pregnancy and women who didn't use cocaine at all during pregnancy?
- What is the relative risk? Interpret your result.
- What is the odds ratio? Interpret your result.
- Comparing your relative risk estimate to your odds ratio estimate, why do you think the odds ratio and the relative risk are so different?
- Use Stata to compute 95% and 99% confidence intervals for the following: risk difference, relative risk, and odds ratio.
- Calculate the Pearson's chi-square test statistic, and report a p-value for the test of no association. Interpret your results.

Second Exercise:
Download the following article:

Lappe, et al. Vitamin D and calcium supplementation reduces cancer risk: results of a randomized trial. Amer J Clin Nutrition 2007; 85: 1586-91.

Create a 2x2 table from the data below to estimate the effect of Vitamin D and Calcium supplementation (years 1-4) compared to the placebo (years 1-4) group on the risk of developing cancer.

Answer the following questions regarding the data:
- What is the relative risk? Interpret your result.
- What is the odds ratio? Interpret your result.
- Comparing your relative risk estimate to your odds ratio estimate, why do you think the odds ratio and the relative risk are so different?
- Use Stata to compute 95% and 99% confidence intervals for the following: risk difference, relative risk, and odds ratio.
- Calculate the Pearson's chi-square test statistic, and report a p-value for the test of no association. Interpret your results.

Questions:

Principles of Biostatistics and Data Analysis PHP2510 Lab11 December 7th 2007

Goals for Lab 11:
Examine two important regression assumptions
- Linearity
- Homoscedastic errors
In doing so, we will perform the following exercises:
- Fit smoother to determine if linear
- Fit regression
- Evaluate residuals to determine if homoscedasticity holds
- Use robust standard errors if needed

Useful Resources:
http://www.stata.com/links/resources1.html
http://www.ats.ucla.edu/stat/stata/notes3/default.htm

Stata Commands to keep in mind:

edit              Opens the spreadsheet-like Data Editor where data can be entered or edited
save              Store the dataset currently in memory on disk in Stata data format
display           Performs a single calculation and displays the results on screen
list              List values of variables
qnorm             Plots quantiles of a variable against quantiles of the normal distribution
sample #, count   Generates a sample from the current dataset
sampsi            Sample size and power determination
csi               Calculates appropriate statistics for epidemiologic datasets
invnormal()       Returns the inverse cumulative standard normal distribution
uniform()         Random number function based on uniform distribution
histogram         Histogram for continuous and categorical variables

Background/Importance:
Quantitative models always rest on assumptions about the way the world works, and regression models are no exception. There are four principal assumptions which justify the use of linear regression models for purposes of prediction:
1. Linearity of the relationship between dependent and independent variables
2. Independence of the errors (no serial correlation)
3. Homoscedasticity (constant variance) of the errors, versus time and versus the predictions (or versus any independent variable)
4. Normality of the error distribution

If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.

The dataset we'll be working with today to test some of these assumptions (linearity and homoscedasticity) came from a study that evaluated hypertension in pregnancy. For the purpose of this exercise, we'll assume the 2nd and 4th assumptions hold.

Note: The variables are mean arterial pressure (map) and body mass index (bmi). MAP24 is the mean arterial pressure at 24 weeks gestation for a cohort of 330 women followed through pregnancy. BMI is the body mass index, measured in the first trimester of pregnancy.

Before testing the assumption of linearity, we'll first want to look at the scatter-plot of MAP24 and BMI. Use the following command in Stata to generate a scatter-plot:

twoway (scatter map24 bmi)

[Figure: scatter plot of map24 (y-axis) against bmi (x-axis)]

It is often useful to evaluate the correlation between the variables we're interested in. We can do this in Stata multiple ways, but the easiest way is by using the following command:

corr map24 bmi
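
As one of the other ways mentioned above, pwcorr reports the same correlation and can also print a significance level. A minimal sketch, assuming the dataset with map24 and bmi is in memory:

```stata
* Assumes the dataset with map24 and bmi is already in memory
* sig prints the p-value under the correlation; obs prints the number of observations used
pwcorr map24 bmi, sig obs
```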

How do we interpret the results?

The Linearity Assumption:
Violations of linearity are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

How to detect: nonlinearity is usually most evident in a plot of the observed versus predicted values or a plot of residuals versus predicted values, which are a part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

We can evaluate linearity more precisely using the lowess smoother in Stata. The lowess smoother lets us judge whether linearity makes sense by fitting a locally weighted regression of Y on X within moving windows of the data, and in doing so tends to pick up the underlying trajectory. The bandwidth option specifies the size of the windows as a fraction of the data. A bandwidth of .5 uses a moving window containing 50% of the observations, a bandwidth of .3 uses a window containing 30% of the observations, and so on.

lowess map24 bmi, bwidth(.6)

[Figure: Lowess smoother of map24 (y-axis) on bmi (x-axis), bandwidth = .6]

lowess map24 bmi, bwidth(.4)

[Figure: Lowess smoother of map24 (y-axis) on bmi (x-axis), bandwidth = .4]
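
To compare the smoother with a straight line directly, both can be overlaid on the scatter-plot. The following is a sketch, assuming the dataset with map24 and bmi is in memory; it uses the twoway plot types lowess and lfit, which are not part of the original lab commands.

```stata
* Assumes the dataset with map24 and bmi is already in memory
* Scatter-plot with the lowess curve and the least-squares line drawn on top
twoway (scatter map24 bmi) (lowess map24 bmi, bwidth(.6)) (lfit map24 bmi)
```

If the two curves stay close to each other across the range of bmi, a linear fit is reasonable.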

We can see from the lowess smoother that the relationship between map24 and bmi is approximately linear. If it were not linear, we would tend to see curvature in the line.

Now that we've seen that the relationship is approximately linear, we will fit the regression using the regress command in Stata:

regress map24 bmi

What are the estimates of beta0 and beta1? How are they interpreted? What is the R-square? How is this interpreted? What is the MSE?

Plot the fitted line and the data. First generate predicted values, then make a two-way scatter-plot with the regression line superimposed.

predict map24pred
twoway (scatter map24 bmi) (line map24pred bmi)

[Figure: scatter plot of map24 against bmi with the fitted values (regression line) superimposed]
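
The quantities asked about above are also stored by regress and can be printed with display. This is a sketch, assuming regress map24 bmi was the most recent estimation command; _b[] holds the coefficients, e(r2) the R-square, and e(rmse) the root mean squared error, so the MSE is its square.

```stata
* Assumes -regress map24 bmi- has just been run, so its results are in memory
display "beta0 (intercept) = " _b[_cons]
display "beta1 (slope)     = " _b[bmi]
display "R-squared         = " e(r2)
display "MSE               = " e(rmse)^2

* Same picture without creating map24pred: lfit draws the least-squares line
twoway (scatter map24 bmi) (lfit map24 bmi)
```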

Now we will examine the error distribution and, in doing so, evaluate the assumption of homoscedasticity.

The Homoscedasticity Assumption:
Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

How to detect: look at plots of residuals versus time and residuals versus predicted value, and be alert for evidence of residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables.)

In particular, we are interested in whether the variance of the residuals is constant for all values of X. In this case, X is BMI. We will use the predict command to generate the error terms in a variable called e.

predict e, residuals
scatter e bmi

[Figure: scatter plot of residuals (y-axis) against bmi (x-axis)]
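
Stata also has built-in postestimation commands for these diagnostic plots, along with a formal test of constant variance. The following is a sketch that goes slightly beyond the lab's own commands; it assumes regress map24 bmi (without the robust option) is the most recent estimation command.

```stata
* Assumes -regress map24 bmi- is the most recent estimation command
* Residuals versus fitted values, with a reference line at zero
rvfplot, yline(0)

* Residuals versus the predictor bmi
rvpplot bmi, yline(0)

* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant variance
estat hettest
```

A small p-value from estat hettest is evidence against homoscedasticity; the plots remain the more informative check of where the variance changes.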

Getting the correct standard error depends on satisfying the homoscedastic error assumption. When the homoscedastic error assumption is violated, we can use robust standard errors instead. They are called robust because they are robust to violations of this assumption, and provide correct standard errors even when errors do NOT have constant variance. They do require a large sample size in order to be valid (e.g. >100).

regress map24 bmi, robust



Compare your confidence intervals and standard errors for beta1 between the two regressions.
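
One way to make this comparison is to store both sets of results and print them side by side. This is a sketch, assuming the dataset with map24 and bmi is in memory; the names ols and rob are arbitrary labels chosen here for the stored estimates.

```stata
* Fit both models quietly and store the results under arbitrary names
quietly regress map24 bmi
estimates store ols
quietly regress map24 bmi, robust
estimates store rob

* Coefficients and standard errors side by side
estimates table ols rob, b(%9.4f) se(%9.4f)
```

The coefficient estimates are identical in the two columns; only the standard errors, and therefore the confidence intervals, change.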

References: http://www.duke.edu/~rnau/testing.htm