
STATISTICAL METHODS IN

RESEARCH (CHE-503)

Dr. Khuram Maqsood


Assistant Professor
Department of Chemical Engineering
NFC IEFR, Faisalabad
Program Learning Outcomes

■ Critical understanding of relevant theories and technical practice in
core areas of Chemical Engineering
■ Advanced knowledge of specific topics in Chemical Engineering
■ Ability to successfully apply advanced concepts of basic science and
engineering to identify, formulate, and solve complex problems in
Chemical Engineering
■ Ability to successfully apply research methodology and advanced
concepts of Chemical Engineering to the design and analysis of chemical
processes
■ Ability to communicate effectively and efficiently about their own
work to the general public as well as to the experts through well-
structured reports and oral presentations
Course Structure
■ Introduction to Chemometrics: (a) Measurements, (b)
Accuracy and precision, (c) Standard deviation, (d)
Confidence interval, (e) Gaussian distribution
■ Foundations of chemical data analysis: (a) Regression, (b)
Student’s t-test, (c) Paired t-test, (d) F-test, (e) ANOVA, (f)
Detection of outliers.
■ Design of experiments: (a) Randomization and interaction, (b)
Factorial design, (c) Fractional factorial design, (d) Response
surface modeling.
■ Multivariate data analysis: (a) Cluster analysis, (b) Principal
component analysis, (c) Linear discriminant analysis, (d)
Applications of multivariate statistical methods in Chemical
Engineering
Course Marks

The course marks will be allocated as follows:

• Quizzes & Weekly Assignments (30%)


• Midterm Exams (30%)
• Final Exam (40%)
Course References

1. Antony, J., Design of Experiments for Engineers and Scientists.
Butterworth-Heinemann, 2003.
2. Bartolucci, A. A.; Singh, K. P.; Bae, S., Introduction to Statistical Analysis
of Laboratory Data. John Wiley & Sons, 2016.
3. Brereton, R. G., Chemometrics: Data Analysis for the Laboratory and
Chemical Plant. John Wiley & Sons, 2003.
4. Deming, S. N.; Morgan, S. L., Experimental Design: A Chemometric
Approach. 2nd Edition; Elsevier, 1993.
5. Shardt, Y. A. W., Statistics for Chemical and Process Engineers: A
Modern Approach. Springer, 2015.
6. Varmuza, K.; Filzmoser, P., Introduction to Multivariate Statistical
Analysis in Chemometrics. CRC Press, 2009.
MEASUREMENTS
Making Measurements
• Why do we measure?
Variable

• A feature or entity that can assume a
value (observation) from a set of
possible values (observations)
• Some examples:
–length of a tail
–number of seeds in a seed pod
–phosphate concentration of a water
sample
–color of a fish
–ranking of how well you feel
Types of Variables
• Quantitative variables
–Continuous (e.g., length, weight, time,
temperature)
–Discontinuous or discrete (e.g., number of fish in an area,
number of seeds in a seed pod)
–Rank (e.g., one-to-five ranking for the quality of
instruction)
• Derived variables (e.g., density, velocity)
• Character variables (e.g., color, gender)
Systems of Measure

• Two systems predominantly in use:
–English (America)
–Metric or SI (European)
Systems of Measure:
English (America)
• Disadvantages
–No standard base unit for each kind
of measurement
–Subunits within units not based upon
a consistent multiplication factor
–Difficult to make conversions
between units
• Advantages
–We already know it
Systems of Measure:
Metric or SI (European)
• Advantages
–Use a base unit for
each type of measure
–Subunits/superunits
of base unit based
upon multiples of ten
–Conversions are
much easier
Metric Prefixes

• Regardless of the unit, the entire metric
system uses the same prefixes
• Common prefixes are:
–kilo = 1000
–centi = 1/100th
–milli = 1/1,000th
–micro = 1/1,000,000th
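Because every prefix is a power of ten, a conversion between prefixed units is a single multiplication. The sketch below is illustrative only: the `PREFIX_EXPONENT` table and the `convert` helper are hypothetical names, and the empty string is assumed to stand for the unprefixed base unit.

```python
# Powers of ten for the prefixes listed above.
# The empty string "" is an assumed label for the base unit (e.g., metre, gram).
PREFIX_EXPONENT = {"kilo": 3, "": 0, "centi": -2, "milli": -3, "micro": -6}


def convert(value, from_prefix, to_prefix):
    """Convert a value between two metric prefixes of the same base unit."""
    shift = PREFIX_EXPONENT[from_prefix] - PREFIX_EXPONENT[to_prefix]
    return value * 10 ** shift


km_to_cm = convert(2.5, "kilo", "centi")   # 2.5 km expressed in cm
mg_to_g = convert(750, "milli", "")        # 750 mg expressed in g
print(km_to_cm, mg_to_g)
```

Storing integer exponents rather than the factors themselves keeps the arithmetic to a single power of ten per conversion.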
Accuracy Versus Precision
• Accuracy
–How closely a measured value
agrees with the true value
• Precision
–How closely repeated
measurements agree with each
other
• Good measuring devices are both
accurate and precise
[Figure: targets illustrating results that are precise, accurate, and both precise and accurate]
Why Is There Uncertainty?
■ Measurements are performed with instruments,
and no instrument can read to an infinite number
of decimal places
■ Which of the instruments below has the greatest
uncertainty in measurement?
SIGNIFICANT
FIGURES
In Measurements
Significant Figures

■ The significant figures in a measurement
include all of the digits that are known, plus
one last digit that is estimated.
■ The numbers reported in a measurement are
limited by the measuring tool.
Measurement and Significant Figures

■ Every experimental
measurement has a degree
of uncertainty.
■ The volume, V, at right is
certain in the 10’s place,
10mL<V<20mL
■ The 1’s digit is also certain,
17mL<V<18mL
■ A best guess is needed for
the tenths place.
■ To indicate the precision of a measurement,
the value recorded should use all the digits
known with certainty, plus one additional
estimated digit that usually is considered
uncertain by plus or minus 1.
■ No further insignificant digits should be
recorded.
■ The total number of digits used to express such
a measurement is called the number of
significant figures.
■ All but one of the significant figures are known
with certainty. The last significant figure is only
the best possible estimate.
Below are two measurements of the
mass of the same object. The same
quantity is being described at two
different levels of precision or
certainty.
Reading a Meterstick

| 2 . . . . | . . . . | 3 . . . . | . . . . | 4 . . cm

First digit (known) = 2, giving 2.?? cm
Second digit (known) = 0.7, giving 2.7? cm
Third digit (estimated) between 0.05 and 0.08 cm
Length reported = 2.77 cm
or 2.76 cm
or 2.78 cm
Known + Estimated Digits

In 2.77 cm…
• Known digits 2 and 7 are 100% certain

• The third digit 7 is estimated (uncertain)

• In the reported length, all three digits
(2.77 cm) are significant, including the
estimated one
Zero as a Measured Number

| 3 . . . . | . . . . | 4 . . . . | . . . . | 5 . . cm
What is the length of the line?
First digit 5.?? cm
Second digit 5.0? cm
With the last (estimated) digit, the length is 5.00 cm
HOW TO DETERMINE
SIGNIFICANT
FIGURES IN A
PROBLEM
■ Use the following rules:
Exercise 1

How many significant figures does
each quantity have?
1) 107 cm Ans.: 3
2) 10,700 cm Ans.: 3
3) 1,270 cm Ans.: 3
4) 12,703 cm Ans.: 5
5) 1,060,809 cm Ans.: 7
6) 1,040,700 cm Ans.: 5
7) 10,407,005 cm Ans.: 8
8) 100,000,002 cm Ans.: 9
9) 100,000,000 cm Ans.: 1
Exercise 2
1) 0.13070 kg Ans.: 5
2) 1.07000 cm Ans.: 6
3) 0.0007 cm Ans.: 1
4) 22.0000 cm Ans.: 6
5) 0.000009 cm Ans.: 1
6) 1.0400700 cm Ans.: 8
7) 10.407005 cm Ans.: 8
8) 100.0000020 cm Ans.: 10
9) 100,000,000.0 cm Ans.: 10
10) 100,000,000,000 cm Ans.: 1
Rule #1

■ Every nonzero digit is significant

Examples:
24 = 2
3.56 = 3
7 = 1
Rule #2 – Sandwiched 0’s

■ Zeros between non-zeros are significant

Examples:
7003 = 4
40.9 = 3
Rule #3 – Leading 0’s

■ Zeros appearing in front of non-zero digits
are not significant
■ They act as placeholders
■ They can’t be dropped, because they show magnitude

Examples:
0.00024 = 2
0.453 = 3
Rule #4 – Trailing 0’s with DP
■ Zeros at the end of a number and to the right of a
decimal point are significant.

Examples:
43.00 = 4
1.010 = 4
1.50 = 3
Rule #5 – Trailing 0’s without DP
■ Zeros at the end of a number that has no decimal point
aren’t significant

Examples:
300 = 1
27,300 = 3
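Easier Way to do Sig Figs!!
Examples:
123.003 grams
Decimal present: start on the “P” (Pacific, left) side, draw an
arrow through any leading zeros until the first nonzero digit, and
count the digits without an arrow through them.
Answer = 6

10,100 centimeters
Decimal absent: start on the “A” (Atlantic, right) side, draw an
arrow through any trailing zeros until the first nonzero digit, and
count the digits without an arrow through them.
Answer = 3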
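The five rules can be sketched as a short function. This is a minimal illustration, not a general parser: `count_sig_figs` is a hypothetical helper, it expects the number as a string (a float would lose trailing zeros), and it assumes plain decimal notation with optional thousands commas.

```python
def count_sig_figs(text):
    """Count significant figures in a number written as a string.

    Assumes plain decimal notation such as "10,100" or "0.00024";
    scientific notation and a value of exactly zero are not handled.
    """
    s = text.replace(",", "").strip().lstrip("+-")
    if "." in s:
        # Decimal present: leading zeros are placeholders (Rule 3),
        # every other digit counts, including trailing zeros (Rule 4).
        return len(s.replace(".", "").lstrip("0"))
    # Decimal absent: trailing zeros are not significant (Rule 5).
    return len(s.lstrip("0").rstrip("0"))


for quantity in ["123.003", "10,100", "0.00024", "43.00", "27,300"]:
    print(quantity, count_sig_figs(quantity))
```

Zeros sandwiched between nonzero digits (Rule 2) are counted automatically because only the leading and trailing zeros are stripped.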
DATA ANALYSIS
Data Analysis

■ Steps involved in Data Analysis

– Analyze Data
– Communicate Findings
– Use Findings for Improvement of Existing
Systems/Processes
– Use Findings to Suggest New Processes
Data Analysis

■ Think about analysis EARLY


■ Start with a plan
■ Code, enter, clean
■ Analyze
■ Interpret
■ Reflect
– What did we learn?
– What conclusions can we draw?
– What are our recommendations?
– What are the limitations of our analysis?
Data Analysis

■ To make sure the questions and your data collection
instrument will get the information you want.
■ To align your desired “report” with the results of analysis and
interpretation.
■ To improve reliability, that is, consistent measures over time
Data Analysis

■ Purpose of the evaluation


■ Questions/Parameter
■ What you hope to learn from the question/Parameter
■ Analysis technique
■ How data will be presented
Data Analysis

■ There are two types of data


1. Qualitative
2. Quantitative
Quantitative Data

■ Data that is numerical, counted, or compared on a scale


■ Demographic data
■ Answers to closed-ended survey items
■ Attendance data
■ Scores on standardized instruments
Quantitative Data

■ Several approaches
■ Paper and pencil tally
■ Word processing table
■ Spreadsheet
■ Custom database
Qualitative Data

■ Narratives, logs, experience


■ Focus groups
■ Interviews
■ Open-ended survey items
■ Diaries and journals
■ Notes from observations
Qualitative Data

• Textual data
• Interview transcripts
• Case notes/ clinical notes
• Open-ended survey questions
• Photographs
• Video recordings
Summarizing Data

■ Tables
– Simplest way to summarize data
– Data are presented as absolute numbers or percentages
■ Charts and graphs
– Visual representation of data
– Data are presented as absolute numbers or percentages
Summarizing Data

■ Ensure graphic has a title


■ Label the components of your graphic
■ Indicate source of data with date
■ Provide number of observations (n=xx) as a reference point
■ Add footnote if more information is needed
DATA ANALYSIS
Statistical
Data Analysis

■ Mean or Average
It is a single value intended to represent a set of data
or a distribution as a whole. It is a more or less central value
around which the observations in the set of data or distribution
usually tend to cluster. Such a central value is also called a
measure of central tendency.
Data Analysis

■ Mean

$\text{Mean} = \frac{\sum X}{n}$

where “n” is the number of measures in the series and “X” stands for a
score or other measure.

Example:
Find the mean for 7, 11, 6, 10, 13, and 20.
Solution:
$\text{Mean} = \frac{7 + 11 + 6 + 10 + 13 + 20}{6} = \frac{67}{6} = 11.17$
Data Analysis

■ Median
It is the numerical value separating the higher half of a data
sample, a population, or a probability distribution, from the
lower half. The median of a finite list of numbers can be
found by arranging all the observations from lowest value to
highest value and picking the middle one (e.g., the median of
{3, 3, 5, 9, 11} is 5). If there is an even number of
observations, then there is no single middle value; the
median is then usually defined to be the mean of the two
middle values (e.g., the median of {3, 5, 7, 9} is (5 + 7) / 2 = 6).
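As a quick check, the mean and median examples from these slides can be reproduced with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

scores = [7, 11, 6, 10, 13, 20]
odd_sample = [3, 3, 5, 9, 11]
even_sample = [3, 5, 7, 9]

mean_score = mean(scores)          # sum of the scores divided by n
median_odd = median(odd_sample)    # the single middle value
median_even = median(even_sample)  # mean of the two middle values

print(round(mean_score, 2), median_odd, median_even)
```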
Data Analysis

■ Variance
It measures how far a set of numbers is spread out. A
variance of zero indicates that all the values are identical.
Variance is always non-negative: a small variance indicates
that the data tend to be very close to the mean (expected
value) and hence to each other, while a high variance
indicates that the data are very spread out around the mean
and from each other.
Data Analysis

■ Deviation
Deviate: to differ from a standard or mean value.
For example, we have a set of data:
5, 6, 9, 13, 25, 26
The mean is (5 + 6 + 9 + 13 + 25 + 26)/6 = 14
There are two types of deviation: positive and negative. The
numbers 25 and 26 show a positive deviation from the mean, whereas the
numbers 5, 6, 9, and 13 show a negative deviation.
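The deviations in this example can be listed explicitly. A minimal sketch, with illustrative variable names:

```python
data = [5, 6, 9, 13, 25, 26]

center = sum(data) / len(data)            # the mean, 14
deviations = [x - center for x in data]   # signed distance from the mean

positive = [x for x, d in zip(data, deviations) if d > 0]
negative = [x for x, d in zip(data, deviations) if d < 0]

print(deviations)
print(positive, negative)
```

The signed deviations always sum to zero, which is why the variance squares them before averaging.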
Data Analysis

■ Standard Deviation
The square root of the variance of a number of observations.
Why is it called the standard deviation?
Answer: The deviation that can be used to predict the highest and the lowest
scores of a distribution is termed the standard deviation. For approximately
normal data, nearly all scores lie within three standard deviations of the
mean (M), so the maximum and minimum scores are estimated as follows:

Max. Score = M + 3 S.D.
Min. Score = M – 3 S.D.
Data Analysis
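■ Estimate the Mean, Variance, and Standard Deviation for the
given data:
600, 470, 170, 430, and 300.
Sol:
$\text{Mean} = \frac{600 + 470 + 170 + 430 + 300}{5} = \frac{1970}{5} = 394$

$\text{Variance} = \frac{\sum (X - \text{Mean})^2}{n} = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108{,}520}{5} = 21{,}704$

$\text{S.D.} = \sqrt{21{,}704} \approx 147$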
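The same numbers can be checked with Python's `statistics` module. The slide's answer of 21,704 corresponds to dividing by n (the population form, `pvariance`). The sample form that divides by n - 1 (`variance`) is shown only for comparison and is not part of the slide.

```python
from statistics import mean, pstdev, pvariance, variance

data = [600, 470, 170, 430, 300]

m = mean(data)                # 394
var_pop = pvariance(data)     # divides by n, matching the slide
sd_pop = pstdev(data)         # square root of the population variance
var_sample = variance(data)   # divides by n - 1 (comparison only)

print(m, var_pop, round(sd_pop), var_sample)
```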
CONFIDENCE
INTERVALS
The situation

■ Want to estimate the actual population mean μ.
■ But can only get the sample mean.
■ Find a range of values, L < μ < U, that we can be really
confident contains μ.
■ This range of values is called a “confidence interval.”
Confidence Intervals
for Proportions in Newspapers
■ 18% of women, aged 18-24, think they are overweight.
■ The “margin of error” is 5%.
■ The “confidence interval” is 18% ± 5%.
■ We can be really confident that between 13% and 23% of
women, aged 18-24, think they are overweight.
General Form of
most Confidence Intervals
■ Sample estimate ± margin of error
■ Lower limit L = estimate - margin of error
■ Upper limit U = estimate + margin of error
■ Then, we’re confident that the population value is somewhere
between L and U.
Level of Confidence

The level of confidence in a confidence
interval is a probability that represents
the percentage of intervals that will
contain μ (the population mean) if a large number of repeated
samples are obtained.
For example, a 95% level of confidence
would mean that if 100 confidence intervals
were constructed, each based on a different
sample from the same population, we
would expect 95 of the intervals to contain
the population mean.
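The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 180 and a known standard deviation of 30, each sample has 30 observations, and `coverage_rate` is an illustrative helper name.

```python
import math
import random
from statistics import NormalDist


def coverage_rate(mu=180.0, sigma=30.0, n=30, trials=2000,
                  confidence=0.95, seed=1):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials


rate = coverage_rate()
print(rate)  # expected to land close to 0.95
```

The fraction varies slightly from run to run with different seeds, which is the point: the 95% describes the long-run behavior of the procedure, not any single interval.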
Step 1

■ Write down the phenomenon you'd like to test. Let's say you're
working with the following situation: The average weight of a
male student in ABC University is 180 lbs. You'll be testing
how accurately you will be able to predict the weight of male
students in ABC University within a given confidence interval.
Step 2

■ Select a sample from your chosen population. This is what
you will use to gather data for testing your hypothesis. Let's
say you've randomly selected 1,000 male students.
Step 3

■ Calculate your sample mean
and sample standard deviation
■ Sample mean is 180 lbs
■ Let's say the standard
deviation here is 30 lbs.
Step 4

■ Choose your desired confidence level. The most commonly
used confidence levels are 90 percent, 95 percent and 99
percent. This may also be provided for you in the course of a
problem. Let's say you've chosen 95%.
Step 5
Calculate your margin of error

$\text{Margin of error} = Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$

■ $Z_{\alpha/2}$ = the confidence coefficient
■ α = 1 - confidence level (the significance level)
■ σ = standard deviation
■ n = sample size.
The confidence level is 95%. Convert
the percentage to a decimal, .95, and
divide it by 2 to get .475. Then, check
the z table to find the
value that corresponds to an area of
.475, which is 1.96.

The level of confidence determines the z critical value.


99% = 2.58
95% = 1.96
90% = 1.645
Calculation of Margin of Error

$1.96 \times \frac{30}{\sqrt{1000}} = 1.96 \times \frac{30}{31.63} = 1.96 \times 0.95 = 1.86$
Step 6
State your confidence interval
■ The answer is: 180 ± 1.86
■ The upper and lower
bounds of the confidence
interval can be calculated
by adding and subtracting
the margin of error from the
mean. So, your lower bound
is 180 - 1.86, or 178.14,
and your upper bound is
180 + 1.86, or 181.86
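Steps 3 to 6 can be sketched in a few lines of Python. The helper name `z_confidence_interval` is illustrative. The critical value comes from `statistics.NormalDist().inv_cdf` instead of a printed z table, so it is 1.95996 rather than the rounded 1.96.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper, margin) for a z-based confidence interval."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


lower, upper, margin = z_confidence_interval(180, 30, 1000)
print(f"{lower:.2f} to {upper:.2f} (margin {margin:.2f})")
```

For the worked example this gives the same bounds as the slide to two decimal places.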
Question 1

A sample of size n = 100
produced the sample mean of
$\bar{X} = 16$. Assuming the population
standard deviation, σ = 3,
compute a 95% confidence
interval for the population mean.
Sol
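One way to work this out, using the margin-of-error formula from Step 5 and the 95% critical value of 1.96 from the table of z critical values:

```latex
\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
  = 16 \pm 1.96 \times \frac{3}{\sqrt{100}}
  = 16 \pm 0.588
\quad\Longrightarrow\quad
15.41 < \mu < 16.59
```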
Question 2

Assuming the population
standard deviation, σ = 3, how
large should a sample be to
estimate the population mean
with a margin of error not
exceeding 0.5?
Sol
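The question does not state a confidence level, so the sketch below assumes 95%, as in Question 1. Solving the margin-of-error formula for n gives n ≥ (z σ / E)², rounded up to the next whole number. The function name `required_sample_size` is illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, max_margin, confidence=0.95):
    """Smallest n for which z * sigma / sqrt(n) does not exceed max_margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / max_margin) ** 2)


n_required = required_sample_size(sigma=3, max_margin=0.5)
print(n_required)  # 139 under the assumed 95% confidence level
```

With z rounded to 1.96 the hand calculation is (1.96 × 3 / 0.5)² ≈ 138.3, which also rounds up to 139.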
GAUSSIAN OR
NORMAL
DISTRIBUTION
Types of Distribution
■ Frequency Distribution
■ Normal (Gaussian) Distribution
■ Probability Distribution
■ Poisson Distribution
■ Binomial Distribution
■ Sampling Distribution
■ t distribution
■ F distribution
What is Normal (Gaussian) Distribution?

■ The normal distribution is a descriptive model
that describes real world situations.

■ It is defined as a continuous frequency
distribution of infinite range (it can take any value,
not just integers as in the case of the binomial and
Poisson distributions).

■ This is the most important probability distribution
in statistics and an important tool in the analysis of
epidemiological data and the sciences.
Characteristics of Normal Distribution

■ It links frequency distribution to probability distribution

■ Has a Bell Shape Curve and is Symmetric

■ It is symmetric around the mean:
the two halves of the curve are the same (mirror images)
Characteristics of Normal Distribution Cont’d

■ Hence Mean = Median

■ The total area under the curve is 1 (or 100%)

■ Normal Distribution has the same shape as
Standard Normal Distribution.
Characteristics of Normal Distribution Cont’d

■ In a Standard Normal Distribution:

The mean (μ ) = 0 and

Standard deviation (σ) =1


Z Score (Standard Score)

■ $Z = \frac{X - \mu}{\sigma}$
■ Z indicates how many standard deviations away from the
mean the point x lies.

■ Z score is calculated to 2 decimal places.
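A minimal sketch of the z-score calculation, rounded to two decimal places as stated above. The function name `z_score` and the numbers in the usage lines are illustrative, not course data.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return round((x - mu) / sigma, 2)


# Illustrative values: a distribution with mean 127.8 and SD 15.5.
z_one_sd = z_score(143.3, 127.8, 15.5)   # exactly one SD above the mean
z_other = z_score(150, 127.8, 15.5)

print(z_one_sd, z_other)
```

A negative result would mean the point lies below the mean.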


The Normal Distribution

Changing μ shifts the distribution left or right.

Changing σ increases or decreases the spread.

[Figure: bell curve of f(X) against X, centered at μ with spread σ]
The Normal Distribution:
as mathematical function (pdf)

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

This is a bell shaped curve with different centers and
spreads depending on μ and σ.
Note constants:
π = 3.14159
e = 2.71828
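The density function on this slide can be evaluated directly. The sketch below writes it out term by term (`normal_pdf` is an illustrative name) and compares the result with `statistics.NormalDist.pdf` from the standard library; the mean and standard deviation in the comparison are arbitrary illustrative values.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x-mu)/sigma)**2)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)


peak = normal_pdf(0.0)                     # height of the standard normal at its mean
by_hand = normal_pdf(140.0, 127.8, 15.5)
by_library = NormalDist(127.8, 15.5).pdf(140.0)

print(round(peak, 4), by_hand, by_library)
```

The curve is highest at x = μ, and a larger σ lowers and widens it because the coefficient 1/(σ√(2π)) shrinks.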
Diagram of Normal Distribution Curve
(z distribution)
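[Figure: standard normal curve with the axis marked at -3, -2, -1, μ, 1, 2, 3; on each side of the mean, the successive bands hold about 34.1%, 13.6%, 2.1%, and 0.1% of the area]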
Distinguishing Features
■ The mean ± 1 standard deviation covers about 68% of the area
under the curve

■ The mean ± 2 standard deviations covers about 95% of the area
under the curve

■ The mean ± 3 standard deviations covers about 99.7% of the area
under the curve
68-95-99.7 Rule

[Figure: normal curve with nested bands showing 68% of the data within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD]


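68-95-99.7 Rule
in Math terms…

• $\int_{\mu-\sigma}^{\mu+\sigma} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}\, dx = .68$

• $\int_{\mu-2\sigma}^{\mu+2\sigma} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}\, dx = .95$

• $\int_{\mu-3\sigma}^{\mu+3\sigma} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}\, dx = .997$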
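These three integrals have a closed form in terms of the error function: the probability of falling within k standard deviations of the mean is erf(k/√2), whatever the values of μ and σ. A short check using `math.erf` from the standard library (`prob_within` is an illustrative name):

```python
import math


def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))


within = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```

The exact values are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.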
How good is rule for real data?

■ Check some example data:
■ The mean weight of the 120 women runners = 127.8 lbs
■ The standard deviation (SD) = 15.5 lbs
68% of 120 = .68 x 120 = ~82 runners
In fact, 79 runners fall within 1 SD (15.5 lbs) of the mean.

[Figure: histogram of the runners' weights, percent against pounds (80 to 160), with 112.3, 127.8, and 143.3 marked]
95% of 120 = .95 x 120 = ~ 114 runners
In fact, 115 runners fall within 2 SDs of the mean.

[Figure: histogram of the runners' weights, percent against pounds (80 to 160), with 96.8, 127.8, and 158.8 marked]
99.7% of 120 = .997 x 120 = 119.6 runners
In fact, all 120 runners fall within 3 SDs of the mean.

[Figure: histogram of the runners' weights, percent against pounds (80 to 160), with 81.3, 127.8, and 174.3 marked]
Application/Uses of Normal Distribution

■ Its application goes beyond describing distributions.

■ It is used by researchers and modelers.

■ The major use of the normal distribution is the role it plays in
statistical inference.

■ The z score, along with the t score, chi-square, and F statistics, is
important in hypothesis testing.

■ It helps managers/management make decisions.
