
Stat105 Long Test No. 1
Save your R code in a file named in this format: [fullname.txt]. Good luck.

July 12, 2013 R. Bataller


[100 pts.]

Directions: Answer all the questions completely. Use R to carry out all necessary computations/calculations.

1. (a) A medical researcher, trying to establish the efficacy of a new drug, has begun testing the drug along with a placebo. To make sure that the two groups of volunteer patients, those receiving the drug and those receiving a placebo, are as nearly alike as possible, the researcher has decided not to rely on chance but rather to carefully scrutinize the volunteers and then choose the groupings himself. Is this approach advisable? Why or why not? If not, give the most appropriate approach to conduct this research. [15]

   (b) Explain why it is important that a researcher who is trying to learn about the usefulness of a new drug not know which patients are receiving the new drug and which are receiving the placebo. [10]

2. To determine the proportion of people in your town who are smokers, it has been decided to poll people at one of the following local spots: (a) the pool hall; (b) the bowling alley; (c) the shopping mall; (d) the library. Which of these potential polling places would most likely result in a reasonable approximation to the desired proportion? Why? [10]

3. Determine the type of variable (quantitative or qualitative) and the level of measurement used to measure the following variables: [10]
   (a) Postal zip code
   (b) Performance rating of an employee as excellent, very good, good, fair and bad
   (c) Body weight of a baby measured in kilograms
   (d) Intelligence quotient of a student
   (e) Student number

4. (a) The following data represent the heights of trees in meters, measured to the nearest tenth, of a sample of 50 trees in a certain region. The data are given in trees.csv. Set up a frequency distribution table for these data. Write the table on your answer sheet. Construct a histogram and describe the features of its distribution (shape, skewness, etc.). [15]

   (b) Given a frequency distribution table, we can approximate the mean and standard deviation of the sample using the following formulas:

$$\bar{x}_f = \frac{1}{n} \sum_{i=1}^{k} f_i x_i, \qquad s_f = \sqrt{\frac{1}{n-1} \sum_{i=1}^{k} f_i \left(x_i - \bar{x}_f\right)^2},$$
where $n$ is the sample size, $k$ is the number of classes, $f_i$ is the frequency of the $i$th class, and $x_i$ is the class midpoint of the $i$th class. Approximate the mean and standard deviation of the sample described in the frequency distribution table in (a). [15]

5. A paper by Robertson et al. [1976] discusses the level of plasma prostaglandin E (iPGE) in patients with cancer with and without hypercalcemia. The data are given in robertson76.csv. Note that the variables are the mean plasma iPGE (iPGE) and mean serum calcium levels (serumCa); presumably, more than one assay was carried out for each patient's level. The number of such tests for each patient is not indicated, nor is the criterion for the number. [25]

   (a) Calculate the mean and standard deviation of plasma iPGE level for patients with hypercalcemia (yes); do the same for patients without hypercalcemia (no).

   (b) Calculate all the quartiles for each group of patients. Use type=5 in calculating these quantities.

   (c) Make box plots of plasma iPGE for each group. Can you draw any conclusions from these plots? Discuss the properties of the distribution of each group. Do they suggest that the two groups differ in plasma iPGE levels?

   (d) The article states that normal limits for serum calcium levels are 8.5 to 10.5 mg/dL. It is clear that patients were classified as hypercalcemic if their serum calcium levels exceeded 10.5 mg/dL. Without classifying patients, it may be postulated that high plasma iPGE levels tend to be associated with high serum calcium levels. Make a plot of the plasma iPGE and serum calcium levels to determine if there is a suggestion of a pattern relating these two variables. What is your conclusion based on the plot?
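
As an illustration of the grouped-data formulas in question 4(b), the following is a minimal R sketch. The class midpoints and frequencies are made up for illustration only; they are not taken from trees.csv, and the variable names (midpoints, freq, mean_f, sd_f) are assumptions, not part of the test.

```r
# Hypothetical frequency table (NOT the trees.csv data).
midpoints <- c(2.5, 7.5, 12.5, 17.5)  # x_i: class midpoints
freq      <- c(3, 8, 6, 3)            # f_i: class frequencies

n <- sum(freq)                        # sample size is the sum of the frequencies

# Grouped mean: (1/n) * sum of f_i * x_i
mean_f <- sum(freq * midpoints) / n

# Grouped standard deviation: sqrt( (1/(n-1)) * sum of f_i * (x_i - mean_f)^2 )
sd_f <- sqrt(sum(freq * (midpoints - mean_f)^2) / (n - 1))

# Cross-check: repeating each midpoint f_i times and applying mean() and sd()
# gives the same values as the grouped formulas.
expanded <- rep(midpoints, times = freq)
stopifnot(isTRUE(all.equal(mean_f, mean(expanded))),
          isTRUE(all.equal(sd_f, sd(expanded))))

print(c(mean = mean_f, sd = sd_f))
```

With real data, one possible way to get the midpoints and frequencies is from the object returned by hist(x, plot = FALSE), whose mids and counts components hold the class midpoints and class frequencies for the breaks that were used.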
