Tathagata Bandyopadhyay
August 22, 2016
1. Statistical Inquiry: A statistical inquiry always starts with the formulation of a
research question. A few examples are: What percentage of families in Ahmedabad
afford to send their children to private schools? What is the average cell phone bill per
month for a boy/girl aged between 10 and 16 years? What is the average switching
time (in months) from one mobile phone to another for an adult aged between 30
years and 40 years? How many hours a day, on average, does a college student spend
on online social networking? On average, how many social media accounts does a
college in Ahmedabad have? The interface of an app to book a movie ticket is designed by
two designers. Which interface is more user friendly? A bank is to decide whether
to offer a personal loan of Rs. 25,000 or Rs. 50,000 to all whose income is between
Rs. 20,000 and Rs. 30,000 per month, on production of a UIDAI card, at a 15% rate
of interest. The condition is to pay back the full amount in two years. Which format
will have more takers and which will have fewer defaulters (or more bad loans)? To
answer a research question one needs to design a study for collection of appropriate
data either by conducting a survey or from a secondary source or by conducting an
experiment. This is the design stage of an inquiry. In the design stage you need to
decide on the kind of data to be collected, method of collection, and also the design
of questionnaire.
The next stage is the analysis of data. It comprises summarization of data using tables,
graphs and numbers, modeling the data, and then drawing an inference or conclusion
from the data with a measure of uncertainty attached to it. An example is a statement
like: I am 95% confident that, on average, a college student in Ahmedabad spends
between 5 and 6 hours a day browsing social media sites.
Are all research questions stated above unambiguous? If not what alterations are needed for the ambiguous ones?
2. Nature of Data: Data are collected as numbers, such as the average time spent on
a mobile phone per day by a student, the number of junk mails received in a Gmail account,
or the average number of cigarettes smoked per day. These are called variables. The
other kind of data are called categorical. Categorical data are collected by classifying an individual or object into one of a set of mutually exclusive and exhaustive
categories. Examples are: male, female; smoker and non-smoker; illiterate, studied
up to grade twelve, graduate, postgraduate and above; different categories of professions; taste of a food product observed in different categories; staying experience
in a hotel classified into different categories, etc.
Usually data are collected on individuals or objects which are called units or subjects.
3. Some Technical terms (Population, Sampling unit, Sampling frame, Sample, Random Sample, Parameter and Statistic): Usually a large population size
prohibits collecting data from each and every unit/subject for various reasons, primarily time and cost constraints. Suppose the research question is: what
is the average number of hours per day a college student in Ahmedabad spends in
browsing social media sites in 2016?
Let us define the above technical terms in this specific problem context:
population (all college students in Ahmedabad. In other words, collection of all
students covered by our inquiry), sampling unit (each student could be a sampling
unit, when from the population a sample of students is to be selected. However, a
more efficient process may be to select a sample of colleges and then collect data from
each student of the selected colleges. In the latter case the sampling unit is a college.
If, on the other hand, after selecting colleges from all colleges in Ahmedabad, a sample
of students is selected from each selected college, then the colleges are called first-stage
sampling units and the students are called second-stage sampling units),
sampling frame (the list of all college students in Ahmedabad or the list of all colleges in
Ahmedabad; in other words, the list of all sampling units in the population).
A sample is a collection of sampling units drawn from a frame or frames. A sample
could be drawn according to convenience, or it could be drawn in such a way that every
possible sample has an equal chance of being selected. The former is called a convenience
sample and the latter is called a random sample. A random sample is often
preferred because it avoids bias in selection and usually results in a representative
sample. (A convenience sample in our example could be a sample of students from
the nearby colleges. For drawing a random sample, on the other hand, a list of all
college students (a sampling frame) needs to be created, and then a sample of students
is selected from the list using some random mechanism.)
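The random-sample procedure described above can be sketched in Python. This is only an illustration: the frame of roll numbers S001 to S500, the function name draw_random_sample and the seed are all made up for the example, and Python's random module stands in for the "random mechanism".

```python
import random

def draw_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame so that every
    possible sample of size n is equally likely."""
    rng = random.Random(seed)      # seed only makes the example repeatable
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: a list of all units in the population.
frame = [f"S{i:03d}" for i in range(1, 501)]

sample = draw_random_sample(frame, 10, seed=2016)
print(sample)
```

A convenience sample, by contrast, would simply take whichever units are easiest to reach (say, the first ten entries of the list), so not every sample would have the same chance of selection.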
A parameter is a population characteristic of our interest. In our example it is the
average based on data from all college students in Ahmedabad. It is usually a fixed
unknown number. Other population characteristics that may be of our interest
are the standard deviation of the times spent by the students in the population, and
the percentage of students in the population spending more than 4 hours, etc.
A statistic is a sample analogue of the parameter like sample average or sample
standard deviation. A statistic is calculated on the basis of sample observations. In
the context of our example, the average and standard deviation of times spent in
browsing social media sites by the selected students, percentage of selected students
spending more than four hours in social media sites etc. are examples of statistics.
Notice that for estimating the unknown parameter the sample value of the corresponding statistic is usually used. For example in estimating population average, the
value of the sample average observed for the selected sample is used.
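The distinction between a parameter and a statistic can be illustrated with a small Python sketch. The population below is entirely hypothetical (randomly generated hours for 2,000 students); in a real inquiry the population values, and hence the parameters, would be unknown.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: daily hours on social media for 2,000 students.
population = [round(rng.uniform(0, 10), 1) for _ in range(2000)]

# Parameters: computed from ALL units in the population (fixed numbers).
mu = statistics.mean(population)
sigma = statistics.pstdev(population)
pct_over_4 = 100 * sum(h > 4 for h in population) / len(population)

# Statistics: the sample analogues, computed from a random sample of 50.
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)
sample_pct_over_4 = 100 * sum(h > 4 for h in sample) / len(sample)

print(f"parameter mu = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}, statistic s = {s:.2f}")
print(f"population % over 4h = {pct_over_4:.1f}, sample % = {sample_pct_over_4:.1f}")
```

Drawing a different sample would change the statistics but not the parameters.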
Group assignments:
1. Consider the problem of estimating (i) the average number of hours on a day a
student in Section A spends in preparing for PGP classes and (ii) the proportion of
female students in Section A.
Prepare the sampling frame. Draw two random samples of (i) size five and (ii) size
ten. Find the sample mean and sample proportion for each of the samples that you
have drawn. You will have two means and two proportions for two samples.
2. Consider the problem of estimating (i) the average number of e-mails received by
the PGP I students, (ii) the proportion of PGP I students using an iPhone, (iii) the
proportion of PGP I students using a Samsung Galaxy phone, and (iv) the proportion of
PGP I students using more than one cell phone.
Prepare the sampling frame. Draw two random samples of (i) size twenty and (ii)
size forty. Find the sample mean (for (i)) and the sample proportions (for (ii)-(iv))
for each of the two samples.
You may collect your data through e-mails if you wish. But you need to explain
every step during your presentation.
4. Collection of data: Two methods are usually followed in the collection of data for
research: conducting an experiment, or conducting a survey (often using data already collected by some agency, known as syndicated data). The former is often
called data collected through an experimental study and the latter, data collected
through an observational study.
Data collected through experimental studies are more reliable in comparison
to data collected through observational studies. In an experimental study one
can control the factors that may lead to confounding of the effect that we want
to study. (We will discuss it in the class.) But an experimental study is often expensive
and may not be feasible at all. However, only by using data collected through carefully
designed experimental studies can one talk about some sort of causality.
Using data collected through observational studies one cannot prove causality
unless it is supported by a sound theory. Through observational studies one can talk
only about association.
Consider the following example. Suppose we observe that the incidence of lung
cancer among one million smokers is 20%, while among one million nonsmokers it is only 4%. Am I in a position to conclude that smoking causes
lung cancer? Think about it. We will take it up during the classroom
discussion.
decision, who may not have any clue), or refusal to answer (could be because of fear
or of intention not to divulge). A good survey should attempt to obtain some
information about the group of non-respondents in order to understand
how different or similar they are, as a group, from the group of respondents.
Besides these, there are errors of observation. These may be due to respondents' reporting errors: the respondent may simply not remember correctly, or may not
understand the question properly, as when the head of the household is asked the number of
literates in the household (the meaning of literate may not be clear to the respondent).
Errors could also be due to the inability of the interviewers to elicit
honest responses, to bad design of the instrument, viz. the questionnaire (it has
been observed that the ordering and wording of questions, and the nature of the question (whether
it is open ended or close ended), lead to a lot of variation in responses), or
to coding errors, etc. Any kind of error other than sampling error is known
as non-sampling error.
It is believed that for a moderately large sample survey the non-sampling
error constitutes around 70-80% of the total error. Finally, non-sampling errors increase with the increase in sample size.
In the light of above discussion why do you think a sample survey could
be a better choice? We will discuss this issue in the class.
6. Drawing a random sample or Random sampling from a finite population:
Two kinds of random sampling are used for a finite population. These are simple
random sampling with replacement (SRSWR) and simple random sampling
without replacement (SRSWOR). For all practical purposes SRSWOR is preferred to SRSWR, but in some situations, like random number generation, we need
to use SRSWR. We will soon discuss it.
Let us discuss how to draw a random sample of size 10 from a class of 80 students.
Step 1. Assign serial numbers 01 to 80 to the students, like, roll numbers.
Step 2. Consider a random number generation mechanism that selects one of the
digits 0 to 9 with equal probability, i.e., 1/10. (What could be such a mechanism?
Think about it.)
Step 3. Select two digits using this mechanism to form a two-digit number. If the number is, say, 10,
select the student who is assigned the serial number 10. If it is either 00 or a number
between 81 and 99, reject the number and select a two-digit number again, until a number
between 01 and 80 is obtained.
Question: Prove that by this method the chance of selecting a student in
the first drawing is 1/80
Step 4. Repeat step 3 until you draw nine more two-digit numbers between 01-80.
The students with corresponding serial numbers are selected. Notice in this way you
may select a student more than once. So if you are to draw a simple random sample
with replacement, a student could be selected more than once. However, to draw a
simple random sample without replacement you need to repeat step 3 until you select
9 more distinct two-digit numbers.
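Steps 1 to 4 can be sketched in Python as follows. This is a minimal sketch, not a prescribed implementation: Python's random module is assumed as the digit-generating mechanism, and the function names and the seed are made up for illustration.

```python
import random

def draw_two_digit_number(rng):
    """Step 3: form a two-digit number from two random digits and
    reject 00 and 81-99 until a number between 01 and 80 is obtained."""
    while True:
        number = 10 * rng.randint(0, 9) + rng.randint(0, 9)
        if 1 <= number <= 80:
            return number

def srswr(n, rng):
    """Simple random sample WITH replacement: repeats are allowed."""
    return [draw_two_digit_number(rng) for _ in range(n)]

def srswor(n, rng):
    """Simple random sample WITHOUT replacement: keep drawing
    until n distinct serial numbers are obtained."""
    selected = []
    while len(selected) < n:
        number = draw_two_digit_number(rng)
        if number not in selected:   # discard a student already selected
            selected.append(number)
    return selected

rng = random.Random(42)              # seed only makes the example repeatable
with_replacement = srswr(10, rng)
without_replacement = srswor(10, rng)
print(with_replacement)
print(without_replacement)
```

The only difference between the two schemes is the check for a repeated serial number in srswor.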
Question: Prove that by the above described method the chance of selecting a sample of 10 students in case of SRSWR is $(80)^{-10}$ and in case of
SRSWOR it is $(80 \cdot 79 \cdot 78 \cdots 71)^{-1}$. Also prove that in the fourth drawing the
chance of selecting any one of the 80 students is 1/80 for SRSWOR. Is it
true for any of the ten drawings?
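Before attempting the proof, it may help to check the claims on a population small enough to enumerate. The sketch below uses a hypothetical population of N = 5 units and samples of size n = 3 (not the class of 80), lists every ordered SRSWOR sample, and computes exact chances with fractions. It is a check on a small case, not a proof.

```python
from fractions import Fraction
from itertools import permutations

N, n = 5, 3   # hypothetical small population size and sample size

# Every ordered sample of n distinct units out of N.
ordered_samples = list(permutations(range(1, N + 1), n))

# Chance of one ordered sample under SRSWOR: (1/N)(1/(N-1))...(1/(N-n+1)).
prob_each = Fraction(1)
for k in range(n):
    prob_each *= Fraction(1, N - k)

def chance_at_draw(unit, draw):
    """Exact chance that `unit` is the one selected at the given draw (1-based)."""
    return sum(prob_each for s in ordered_samples if s[draw - 1] == unit)

table = {(u, d): chance_at_draw(u, d)
         for u in range(1, N + 1) for d in range(1, n + 1)}

print(prob_each)                 # chance of any one ordered sample
print(set(table.values()))       # chance of a unit at any draw
```

In this small case every unit has the same chance at every draw, which is the pattern the question asks you to establish for the class of 80.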
Without access to such a mechanism one can use a random number
table or an Excel function like RANDBETWEEN(min, max) to generate random
numbers between min and max.
A random number table is a sequence of digits generated using a mechanism as
discussed above so that in the long run the table contains all the digits 0, 1, ..., 9 in
approximately equal proportions, with no trends in the pattern in which the digits
are generated. Thus if a digit is drawn at random from the random number table the
chance of getting any digit is 1/10.
7. Random sampling from an infinite population:
Truly speaking there is nothing like an infinite population but often the population
is either very large or hypothetical (i.e., non-existent) so that sampling frame is not
available. In such cases, the sampling units should be selected fulfilling the following
two conditions.
1. The sampling unit should be a member of the target population.
2. The sampling units should be selected independently of each other.
For example, suppose the problem is to draw a random sample of customers of Flipkart from
its customer base. Flipkart's customer base is not only very large but dynamic too.
For all practical purposes the population could be approximately considered as an
infinite population. If Flipkart selects a sample of customers by picking up a purchaser
every 5 seconds during the grand sale of 36 hours, the customers in the sample could
be dependent, because there is a possibility that the customers could exhibit similar
buying behaviour. One should always be careful to avoid dependence whenever domain
knowledge suggests it may be present.
Question 1. Devise a method to select a random sample of customers of
Flipkart.
Question 2. Does it intuitively make sense to assume that sampling from
an infinite population is equivalent to sampling from a finite population
with replacement? Explain.
two two-rupee coins and three one-rupee coins; note that the urn contains six coins
altogether. The player draws three coins at random and without replacement. Let
X, Y and Z denote the values of these three coins in rupees.
Define M = median(X, Y, Z), L = min(X, Y, Z), U = max(X, Y, Z), S = (L +
U)/2. Find (a) P(M = 1), (b) P(M = 2), (c) P(S = 1), (d) P(S = 2), (e) P(S = 3),
(f) P(S < M) and (g) E(S). Ans: (a) 0.5, (b) 0.5, (c) 0.05, (d) 0, (e) 0.45, (f) 0.15,
(g) 2.25
2. An alchemist visited the court of a medieval warlord and said, "Your excellency,
here is my tribute to you. I have six envelopes. One of these contains a single copper
coin, another contains two copper coins, while a third one contains three copper
coins. The remaining three envelopes are empty. Kindly pick up any three of these
six envelopes at random and without replacement. I shall convert all the coins in the
selected envelopes to gold coins dating from the period of King Solomon; you can
imagine their value as antiques!" "But what happens if I end up picking only the three
empty envelopes?", thundered the warlord, "I shall behead you then." "Take it easy, your
excellency," calmly replied the alchemist, "I am also a sorcerer; in that extreme case,
I shall make seven gold coins for you, again dating from King Solomon's era, simply
from the air." Assume that all the claims of the alchemist were true and that he kept
all his promises (the latter point is natural given the threat about his head!). Let X
be the number of gold coins that the warlord eventually ended up with. Obtain (a)
P(X = 3), (b) P(X = 4), (c) P(X = 5), (d) P(X = 6), (e) P(X = 7), (f) E(X) and (g)
Var(X). Ans: (a) 0.3, (b) 0.15, (c) 0.15, (d) 0.05, (e) 0.05, (f) 3.35, (g) 2.6275
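The stated answers to Problem 2 can be checked by listing all 20 equally likely selections of three envelopes. The Python sketch below is one way to do this; the variable and function names are chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Copper coins in the six envelopes; the three empty ones are distinct envelopes.
envelopes = [1, 2, 3, 0, 0, 0]

def gold_coins(picked):
    """Gold coins received: the coins picked, or 7 if all three envelopes are empty."""
    total = sum(picked)
    return 7 if total == 0 else total

# Choose envelope positions (not values) so that each empty envelope is counted separately.
selections = list(combinations(range(6), 3))
dist = {}
for idx in selections:
    x = gold_coins([envelopes[i] for i in idx])
    dist[x] = dist.get(x, 0) + Fraction(1, len(selections))

mean = sum(x * p for x, p in dist.items())
var = sum(x * x * p for x, p in dist.items()) - mean ** 2

for x in sorted(dist):
    print(x, float(dist[x]))
print(float(mean), float(var))
```

Note that X can also take the values 1 and 2, which the problem does not ask about.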
3. A textbook on business statistics contains five chapters. A student, who is not
very serious, takes a simple random sample (without replacement) of three chapters.
He studies these three chapters with some seriousness and completely ignores the
remaining two chapters.
In the final examination, the question paper on this subject consists of five questions,
one from each chapter. The questions from Chapters 1 and 2 are compulsory and
carry 18 and 12 marks respectively. The questions from the other three chapters
carry 20 marks each and each student is supposed to answer any one of these three
questions (even if a student answers more than one of these three questions, he/she
gets credit for only one of them). Thus the maximum possible score for any student
is 50.
Obviously, the student under consideration gets zero in any question from a chapter
that he had ignored (so he does his best to avoid such a question, if possible). Furthermore, as he is not very serious with his studies, he gets only 50% of the marks
in any question from a chapter that he had included for study. Let T be his score in
the examination.
Obtain the probability distribution of T and hence the expectation and variance of
T. Ans: The possible values of T are 10, 16, 19 and 25 with respective probabilities
0.1, 0.3, 0.3 and 0.3; E(T) = 19, V(T) = 21.6
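The distribution of T in Problem 3 can be verified in the same way, by listing the 10 equally likely choices of three chapters. This is a sketch under the problem's stated rules; the names FULL_MARKS and score are made up for the example.

```python
from fractions import Fraction
from itertools import combinations

FULL_MARKS = {1: 18, 2: 12, 3: 20, 4: 20, 5: 20}

def score(studied):
    """Score T when the student has studied the given set of chapters."""
    # Compulsory questions (Chapters 1 and 2): 50% if studied, zero otherwise.
    total = sum(Fraction(FULL_MARKS[c], 2) for c in (1, 2) if c in studied)
    # Optional question: he answers one from a studied chapter among 3-5, if any.
    if any(c in studied for c in (3, 4, 5)):
        total += Fraction(20, 2)
    return total

choices = list(combinations(range(1, 6), 3))
dist = {}
for studied in choices:
    t = score(set(studied))
    dist[t] = dist.get(t, 0) + Fraction(1, len(choices))

mean = sum(t * p for t, p in dist.items())
var = sum(t * t * p for t, p in dist.items()) - mean ** 2

for t in sorted(dist):
    print(int(t), float(dist[t]))
print(float(mean), float(var))
```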
For solving the above problems, if the sample sizes are large,
we will have to use the large-sample approximation to the distributions of the sample mean $\bar{X}$ and the sample proportion $\hat{p}$, as discussed in the
probability class.
The distribution of $\bar{X}$ can be approximated by a normal distribution with mean $\mu$ (population mean) and standard deviation $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation
and $n$ is the sample size, if the sample size is greater than or equal to 30 (a thumb rule).
The distribution of $\hat{p}$ can be approximated by a normal
distribution with mean $p$ (population proportion) and standard deviation $\sqrt{p(1-p)/n}$, where $n$ is the sample size,
if both $np$ and $n(1-p)$ are greater than or equal to 5 or 10 (the thumb rule depends upon the
textbook that you are using).
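The two approximations can be tried out numerically with Python's statistics.NormalDist. The sketch below is illustrative only: the function names are made up, the thumb-rule threshold of 5 is one of the two values mentioned above, and the population values (mean 5.5, standard deviation 2, proportion 0.4) are hypothetical.

```python
import math
from statistics import NormalDist

def sampling_dist_of_mean(mu, sigma, n):
    """Normal approximation to the distribution of the sample mean."""
    if n < 30:
        raise ValueError("thumb rule: sample size should be at least 30")
    return NormalDist(mu, sigma / math.sqrt(n))

def sampling_dist_of_proportion(p, n, threshold=5):
    """Normal approximation to the distribution of the sample proportion."""
    if n * p < threshold or n * (1 - p) < threshold:
        raise ValueError("thumb rule: np and n(1-p) are too small")
    return NormalDist(p, math.sqrt(p * (1 - p) / n))

# Hypothetical population values, for illustration only.
mean_dist = sampling_dist_of_mean(mu=5.5, sigma=2.0, n=100)
prob = mean_dist.cdf(5.9) - mean_dist.cdf(5.1)   # P(5.1 < sample mean < 5.9)

prop_dist = sampling_dist_of_proportion(p=0.4, n=60)

print(round(mean_dist.stdev, 3), round(prob, 4))
print(round(prop_dist.stdev, 4))
```

Here the standard deviation of the sample mean is 2/10 = 0.2, so the interval from 5.1 to 5.9 covers two standard deviations on either side of the population mean.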
Proportional Allocation: The sample sizes to different strata are proportional to the
strata sizes.
The sample sizes for strata 1 to 4 are:
$n_1 = W_1 n$,
$n_2 = W_2 n$,
$n_3 = W_3 n$,
$n_4 = W_4 n$,
where $W_i = N_i/N$, $i = 1, 2, 3, 4$.
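Proportional allocation is a one-line computation once the stratum sizes are known. In the Python sketch below, N_i is taken to be the size of stratum i and N the total population size; the four stratum sizes and the total sample size are hypothetical.

```python
def proportional_allocation(strata_sizes, n):
    """Return n_i = W_i * n with W_i = N_i / N for each stratum."""
    N = sum(strata_sizes)
    return [n * N_i / N for N_i in strata_sizes]

# Hypothetical stratum sizes N_1..N_4 and total sample size n.
strata_sizes = [400, 300, 200, 100]
allocation = proportional_allocation(strata_sizes, n=50)
print(allocation)
```

In practice the values W_i n need not be whole numbers, so some rounding rule would have to be chosen; the sketch leaves them unrounded.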
Systematic Sampling
The idea of systematic sampling is simple. Suppose a sample of n shoppers visiting a mall needs to be
selected. The sampling frame is not known, but one may decide to sample
every 10th individual entering the mall. The first individual needs to be selected at random
from the first 10 shoppers entering the mall.
In other words, a 1-in-k systematic sample is obtained by randomly selecting one element
from the first k (10 in the shopper example) elements in the frame and every k-th element
thereafter until n elements are selected.
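A 1-in-k systematic sample as just defined can be sketched in Python. The list of 200 shoppers numbered in order of arrival, the function name and the seed are assumptions made for the example.

```python
import random

def systematic_sample(frame, k, n, seed=None):
    """1-in-k systematic sample: a random start among the first k
    elements, then every k-th element until n elements are selected."""
    rng = random.Random(seed)
    start = rng.randrange(k)        # position 0..k-1, i.e. one of the first k
    return frame[start::k][:n]

# Hypothetical frame: shoppers numbered 1..200 in order of entering the mall.
shoppers = list(range(1, 201))
sample = systematic_sample(shoppers, k=10, n=15, seed=7)
print(sample)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined.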
A systematic sample is generally spread uniformly over the entire population, whereas
a random sample may not be. If there are pockets of heterogeneity in the population it is better
to go for systematic sampling than random sampling. For example, in assessing the quality
of spices to be shipped for export, a systematic sample is usually taken from the container.
Auditors often use systematic sampling for checking vouchers. For example, a 1-in-5
systematic sample of travel vouchers could be inspected to determine the proportion of
vouchers filled incorrectly.
Question: Do the following statements make intuitive sense? Explain.
In case the list of sampling units from which the systematic sample is to be drawn is in
random order, systematic sampling and random sampling would be equivalent. In case
the values of the variable of interest in the list are either in increasing or in decreasing
order, systematic sampling would be better. In case the values of the variable of interest
in the list show a periodic movement, the systematic sample could lead to under- or
over-estimation unless the sampling interval is properly chosen.
Cluster Sampling
A cluster sample is a probability sample in which the sampling units are collections of
units or clusters.
When a sampling frame is not available, cluster sampling could be useful. Collection
of data through cluster sampling is also convenient and economical.
Suppose we need to select a sample of college students in Ahmedabad. Each college
could be considered as a cluster. So select a few colleges at random and get data from each
student in the selected colleges.
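The college example can be sketched in Python as follows. The colleges, their student lists and the function name are all hypothetical; the point is only that the random selection is made among clusters, and every unit in a selected cluster is then included.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Select m clusters at random and take every unit in the selected clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)          # random choice of clusters
    units = [u for name in chosen for u in clusters[name]]
    return chosen, units

# Hypothetical clusters: each college with its list of students.
colleges = {
    "College A": ["A1", "A2", "A3"],
    "College B": ["B1", "B2"],
    "College C": ["C1", "C2", "C3", "C4"],
    "College D": ["D1", "D2", "D3"],
}

chosen, students = cluster_sample(colleges, m=2, seed=3)
print(chosen)
print(students)
```

Note that only a list of colleges is needed here, not a list of all students.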
The other option could be to divide Ahmedabad into blocks, with each block considered
as a cluster. Select a number of blocks at random and then get data from each college
student in the selected blocks.
Which one is a better option? It depends on the following considerations. The
objective of any survey design is to obtain a specified amount of information about the
population characteristic of interest (like population total, population mean, population
proportion, etc.) at minimum cost. The former would be a more economical option than the
latter. But at the same time the measurements on the students in the same college may
be highly correlated, and in such cases the amount of information may not increase as new
measurements are taken within a cluster. From this latter point of view the second option
could be better.
Notice that from the latter consideration the units within a cluster should be as heterogeneous as possible and the clusters should be as homogeneous among themselves as
possible, which is exactly the opposite of stratified random sampling.
So in cluster sampling, to decide how many clusters and what sizes of clusters are
to be chosen, the above considerations should be carefully weighed.