
Week 1: Introduction to Statistics & Data Analysis


Arfika Nurhudatiana

Outline
Concepts:
Descriptive statistics vs. Inferential
statistics
Sample vs. Population
Sampling Procedure
Qualitative vs. Quantitative variables
Discrete vs. Continuous variables
Sample Mean, Median, Range, Variance,
Standard Deviation

A Glimpse of Motivation

What is Statistics?
Branch of mathematics, dealing with the
collection and analysis of data, leading to
statistical inference
3 keywords: mathematics, data analysis,
statistical inference
Statistical inference: drawing a general conclusion about a population
based on collected samples, for example:
Accepting/rejecting a hypothesis
Deriving estimates

Example: manufacturing industry


In the manufacturing industry, it is common to have the following roles:
Process engineers: monitor the different processes involved (1 engineer per process)
Product engineers: monitor the output (1 engineer per product)
Quality assurance engineers: perform statistical investigations, e.g., using ANOVA

The result of the investigation allows the company to determine the
modifications necessary to keep the process at a desired level of quality.

Variations in Data
Two sources of variations:
Variation over time/space: the difference between the value observed at
one point in time (or space) and the value observed at another
Variation in measurement: the difference between the value observed and
the true value

Ideally,
If the observed values in a process were always the same and were
always on target, there would be no need for statistical methods.
If, in one batch of thermometers produced, the thermometers (used on
the same person at the same time under the same environmental conditions)
always gave the same value and that value was accurate (correct),
no statistical analysis would be needed to evaluate the products.

However,
Our data tend to have variations, and thus we need to use statistical
methods to estimate, as closely as possible, the actual value(s)
behind our data.

Descriptive vs. Inferential Statistics


Descriptive statistics: statistics which help describe or characterize
the nature of the dataset. They provide simple summaries about the sample
and the measures.

Example: a student has completed 100 SCUs.
Measures of central tendency: GPA (average score) gives a
hint of the student's overall performance.
Measures of spread: we may also be interested in how many
As, Bs, and Cs he/she has and in which semester he/she
got the scores.

Inferential statistics: statistics used to reach conclusions that
extend beyond the immediate data alone.

Example: is his/her GPA within the range of graduates who
usually get jobs immediately after graduation?


Sample vs. Population


Population: the collection of all individual items of a particular type.

Sample: a collection of observations taken from a population.

Example:
During the campaign period, campaign managers conducted a survey
to understand the conditions of the voters. The population was
Indonesian citizens who had the right to vote in the 2014 presidential
election. The samples were certain numbers of Indonesian
citizens located in various regions in Indonesia with different ages,
genders, and occupations.

Populations and samples can also be students in Binus International,
trees in the forest, fish in the sea, manufacturing products, etc.

Example Scenario:
Consider a market researcher for a soft drink company who
might want to determine the sweetness preferences of
Americans between the ages of 15 and 25.
Obviously, gathering data from every individual in this
population would be nearly impossible and prohibitively
expensive.
It would be more practical to collect data from a subset,
or sample, of the population.
If the sample is unbiased, the sample data can be used
to make inferences about the population.

Sampling Procedure
Population: every American in the age group 15 to 25.
Sample: in order for a sample to be unbiased, it must be:
1) representative of the population
every respondent fulfils the population criteria: American, aged
between 15 and 25

2) randomly selected
everybody in the population has an equal chance of being
selected for the sample
If the sample is drawn from only 1 city or 1 school, the
conclusion is only applicable to that city/school and a
narrower age group

3) sufficiently large
If the research only involved 3 respondents, would you
trust the result? Why?
A small sample is sensitive to bias (1 wrong answer
largely affects the final result)

Various Sampling Procedures

Sampling Procedure
Random Selection:
1) Simple Random Sampling
any particular sample has the same chance of being
selected as any other sample of the same size.

2) Stratified Random Sampling
Used when the sampling units are not homogeneous and
are naturally in non-overlapping groups/segments which
are homogeneous
These groups are called strata
Stratified random sampling means random selection
of a sample within each stratum (singular of strata).
The purpose is to be sure that each of the strata is not
underrepresented (or overrepresented).

3) Cluster Random Sampling
The population is divided into clusters (subgroups); clusters are
selected at random, and respondents are then taken from within the
selected clusters

Sampling Procedure
Example: taking 50 samples from Binus
International population
1) Simple Random Sampling
Randomly take any 50 respondents

2) Stratified Random Sampling
Binus International population is composed of 40% males
and 60% females
Randomly take 20 male respondents & 30 female
respondents to be sure that each of the strata is not
underrepresented (or overrepresented).

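3) Cluster Random Sampling
There are 4 major faculties: computing, communication &
film, business, HTM
Take 11-13 respondents from each faculty
(Strictly speaking, sampling from every faculty treats the faculties as
strata; in cluster sampling proper, only a randomly chosen subset of the
faculties would be sampled.)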
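
The simple random and stratified procedures in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the population of 1000 students, the helper names `simple_random_sample` and `stratified_sample`, and the proportional allocation rule are assumptions made for the demonstration, not part of the original slides.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k units so that every subset of size k is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(population, key, k, seed=None):
    """Draw a random sample inside each stratum, sized in proportion
    to the stratum's share of the population (assumed allocation rule).
    Rounding can make the total differ from k for awkward proportions."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n_stratum = round(k * len(members) / len(population))
        sample.extend(rng.sample(members, n_stratum))
    return sample

# Hypothetical population: 1000 students, 40% male and 60% female.
population = [("M", i) for i in range(400)] + [("F", i) for i in range(600)]

srs = simple_random_sample(population, 50, seed=1)
strat = stratified_sample(population, key=lambda s: s[0], k=50, seed=1)

males = sum(1 for s in strat if s[0] == "M")
females = sum(1 for s in strat if s[0] == "F")
print(len(srs), males, females)
```

With these assumed numbers the stratified sample always contains 20 males and 30 females, whereas the gender split of the simple random sample varies from draw to draw.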

Sampling Procedure
Non-Random Selection:
1) Convenience Sampling / Accidental Sampling
the units that are selected for inclusion in the sample are
the easiest to access
Example: the first 50 respondents (but be careful, most of
them may come from the IS program)

2) Systematic Sampling
the researcher first randomly picks the first item or
subject from the population. Then, the researcher
selects every nth subject from the list.

3) Purposive Sampling / Judgmental Sampling
Usually involves a small sample size
Usually in the form of qualitative/investigative research
to focus on particular characteristics of a population that
are of interest
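
The systematic sampling procedure described in item 2 above (random first pick, then every nth subject) might look like the following sketch. The roster of 500 student IDs and the function name `systematic_sample` are hypothetical, chosen only to make the example concrete.

```python
import random

def systematic_sample(items, k, seed=None):
    """Pick a random starting point, then take every step-th item."""
    step = len(items) // k            # the "n" in "every nth subject"
    if step < 1:
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    start = rng.randrange(step)       # random first pick within the first block
    return [items[start + i * step] for i in range(k)]

# Hypothetical ordered list of 500 student IDs.
roster = list(range(1, 501))
picked = systematic_sample(roster, 50, seed=7)
print(picked[:5])
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined by the order of the list.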

Qualitative vs. Quantitative Variables


Quantitative variables: measures of values or counts and are
expressed as numbers (can be discrete, can be continuous).
Example: how many children do you have? How often do you go
shopping?
Qualitative (categorical) variables: measures of 'types' and may
be represented by a name, symbol, or a number code
Example: which major do you study? What is your occupation?
Qualitative variables can be nominal (no order/ranking sequence)
or ordinal (has order, e.g., like, neutral, dislike)

Discrete vs. Continuous Variables


Discrete variables are countable in a finite amount of time.
Example: the number of students in a classroom (whole numbers; int).
Continuous variables are usually obtained by measuring.
Example: length, weight, and time. Since continuous variables are
real numbers, we usually round them (double).

Measures of Location: The Sample Mean and Median
Measures of location are designed to provide the analyst with some
quantitative values of where the centre, or some other location, of
the data is located.

1) Mean: the average value of the sample,
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Measures of Location: The Sample Mean and Median
2) Median
The purpose of the sample median is to reflect the central tendency of the
sample in such a way that it is uninfluenced by extreme values or outliers.

Median and mean can be quite different.
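
A small sketch of how far apart the two measures can drift when one extreme value is present. The two data lists are made up for illustration; only the standard-library `statistics` module is used.

```python
import statistics

# Hypothetical samples: identical except for one extreme value.
typical = [2, 3, 3, 4, 5]
with_outlier = [2, 3, 3, 4, 50]

mean_typical = statistics.mean(typical)
mean_outlier = statistics.mean(with_outlier)
median_typical = statistics.median(typical)
median_outlier = statistics.median(with_outlier)

print("mean:  ", mean_typical, "->", mean_outlier)
print("median:", median_typical, "->", median_outlier)
```

The single outlier pulls the mean from 3.4 up to 12.4, while the median stays at 3 in both samples.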

Example
Two samples of 10 northern red oak seedlings were planted in a
greenhouse, one containing seedlings treated with nitrogen
and the other containing seedlings with no nitrogen. All other
environmental conditions were held constant. All seedlings
contained the fungus Pisolithus tinctorus.
The stem weights in grams were recorded after 140 days.
Sample mean: x̄ (nitrogen) = ?   x̄ (no nitrogen) = ?
Sample median: x̃ (nitrogen) = ?   x̃ (no nitrogen) = ?

Mean (nitrogen) = 0.565
Median = 0.635

Which group has healthier stems (higher stem weights)?

Other measures of location:
A trimmed mean is computed by trimming away a certain percent of
both the largest and the smallest set of values.
For example, the 10% trimmed mean is found by eliminating the largest
10% and smallest 10% and computing the average of the remaining
values.

Compare the mean and median before and after trimming. What do you observe?
- The mean changes slightly; the median does not change.
- The mean gives more detailed information (it is more sensitive
to variations).
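
The trimmed mean defined above can be sketched as follows. The helper `trimmed_mean` and the ten data values are hypothetical; the trimming percentage is passed as a whole number so that the count of dropped values is computed with integer arithmetic.

```python
import statistics

def trimmed_mean(values, percent):
    """Drop the smallest and largest `percent` % of values, then average."""
    data = sorted(values)
    cut = len(data) * percent // 100      # number of values dropped per end
    kept = data[cut:len(data) - cut]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

# Hypothetical sample with one unusually small and one unusually large value.
sample = [1, 4, 5, 5, 6, 6, 7, 7, 8, 30]

mean_before = statistics.mean(sample)
median_before = statistics.median(sample)
mean_trimmed = trimmed_mean(sample, 10)           # drops 1 and 30
median_trimmed = statistics.median(sorted(sample)[1:-1])

print(mean_before, mean_trimmed, median_before, median_trimmed)
```

For this made-up sample the mean moves from 7.9 to 6.0 after 10% trimming, while the median is 6 both before and after.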

Sample mean vs. Population mean


Is what we just calculated the sample mean or the population mean?
The sample mean gives incomplete information (it is true for the
collected sample only, but not necessarily for the real population).
However, by collecting the right samples, we expect the sample
mean to be as near as possible to the population mean.
Therefore, in future chapters, the sample mean is calculated
as an estimate of the population mean.

Measures of Variability: The Sample Range, Standard Deviation, and Variance
Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7
What are the mean and median of each data set above?

Both data sets have mean 7 and median 7, but we know that the two data
sets are not identical!
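
A quick check of the two data sets, using Python's standard `statistics` module, shows that the measures of location agree while the measures of spread do not. The variable names are illustrative only.

```python
import statistics

data_set_1 = [3, 5, 7, 10, 10]
data_set_2 = [7, 7, 7, 7, 7]

summary = {}
for name, data in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    summary[name] = {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),   # sample variance, divides by n-1
        "stdev": statistics.stdev(data),
    }
    print(name, summary[name])
```

Both data sets report a mean and median of 7, but Data Set 1 has range 7 and sample variance 9.5, whereas Data Set 2 has range 0 and variance 0.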

Measures of Variability: The Sample Range, Standard Deviation, and Variance
How far data points lie from the mean can be measured using
variance, standard deviation, and range.
The variance, standard deviation, and range are basically
measures of spread.

Sample variance (n-1 degrees of freedom):
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Sample range:
$\text{range} = x_{\max} - x_{\min}$

The average of the squared deviations about the mean is called the
sample variance.

Q: Why squared? Why n-1?
If the deviations were not squared, the numerator would always equal 0,
because the negative deviations about the mean always cancel out
the positive deviations about the mean.
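
The cancellation can be written out in one line. The derivation below is a standard identity, with $\bar{x}$ denoting the sample mean; the numeric check uses Data Set 1 (3, 5, 7, 10, 10), whose mean is 7.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
% Data Set 1: (3-7) + (5-7) + (7-7) + (10-7) + (10-7) = -4 - 2 + 0 + 3 + 3 = 0
```

Squaring makes every term non-negative, so the sum is zero only when all values are identical, as in Data Set 2.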

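Sample variance (n-1 degrees of freedom):
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$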

Why n-1?

Because the deviations about the mean sum to zero, the last deviation
is determined by the first n-1 of them.
If n is very large, the difference between dividing by n and by n-1
becomes insignificant.
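
Both points can be checked numerically on Data Set 1 (3, 5, 7, 10, 10). This is a small sketch using the standard `statistics` module, where `variance` divides by n-1 and `pvariance` divides by n; the variable names are illustrative.

```python
import statistics

data = [3, 5, 7, 10, 10]
n = len(data)
mean = statistics.mean(data)

# The deviations sum to zero, so the last one follows from the first n-1.
deviations = [x - mean for x in data]
last_from_others = -sum(deviations[:-1])

s2 = statistics.variance(data)     # divides by n - 1
p2 = statistics.pvariance(data)    # divides by n

print(deviations, last_from_others)
print(s2, p2)

# The two divisors differ by the factor (n - 1) / n, which approaches 1.
for size in (5, 50, 5000):
    print(size, (size - 1) / size)
```

For n = 5 the two versions differ noticeably (9.5 versus 7.6), but the ratio (n-1)/n is already 0.9998 at n = 5000.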

No-nitrogen standard deviation = 0.0728 gram
Nitrogen standard deviation = 0.1867 gram
Conclusion: the group with nitrogen has a larger variance, and
the group without nitrogen tends to be more consistent.

Summary
Concepts:
Descriptive statistics vs. Inferential
statistics
Sample vs. Population
Sampling Procedure: random & nonrandom
Qualitative vs. Quantitative variables
Discrete vs. Continuous variables
Sample Mean, Median, Range, Variance,
Standard Deviation

Exercise

Sample size: 15
Mean: 3.78
Median: 3.6
Trimmed values: 2.5, 2.8, 2.8 (smallest) and 4.8, 5.2, 5.6 (largest)
Trimmed mean: 3.68
Variance: 0.943
Standard deviation: 0.97
