Data Description

MTK3006
Department of Mathematics, Faculty of Science and Technology, Universiti Malaysia Terengganu
chee@umt.edu.my

MTK3006 Statistics for Chemists

Part I

Basic Terms

Population: A population is a collection of all subjects or objects of interest.
Sample: A sample is a portion or part of the population of interest.

Variable: A variable is a characteristic or attribute that can assume different values.
Data: The values that a variable can assume are called data.
Data set: A collection of data values or measurements forms a data set.
Types of data:
- Quantitative data is a numerical measurement expressed in terms of numbers.
- Qualitative data is a categorical measurement expressed by means of a natural language description.

Parameter: A parameter is a characteristic or measure obtained by using all the data values from a population.
Statistic: A statistic is a characteristic or measure obtained by using the data values from a sample.

Statistics: Statistics is the science of collecting, organizing, summarizing, analyzing and interpreting data.
Areas of statistics:
- The branch of statistics devoted to the organization, summarization, description and presentation of data sets is called descriptive statistics.
- The branch of statistics concerned with using sample data to draw conclusions about a population is called inferential statistics.

Part II

Describing Data with Tables

Data collected in original form is called raw data. A frequency distribution is the organization of raw data in table form using classes and frequencies. There are three types of frequency distributions:
- Categorical frequency distributions
- Ungrouped frequency distributions
- Grouped frequency distributions

Categorical Frequency Distribution

Can be used for data that can be placed in specific categories.
Examples: political affiliation, religious affiliation, blood type, etc.

Example: Blood Type Data
A, B, B, AB, O, O, O, B, AB, B, B, B, O,
A, O, A, O, O, O, AB, AB, A, O, B, A

Blood Type Frequency Distribution

Class   Frequency   Percent
A       5           20
B       7           28
O       9           36
AB      4           16
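
The blood type table can be reproduced by counting each category and dividing by the number of observations. The following is a minimal Python sketch; the function name `categorical_distribution` is illustrative and not part of the course material.

```python
from collections import Counter

# Blood type data from the example above (25 observations).
blood_types = ("A,B,B,AB,O,O,O,B,AB,B,B,B,O,"
               "A,O,A,O,O,O,AB,AB,A,O,B,A").split(",")

def categorical_distribution(values):
    """Return {category: (frequency, percent)} for a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (freq, 100 * freq / total) for cat, freq in counts.items()}

distribution = categorical_distribution(blood_types)
for cat in ("A", "B", "O", "AB"):
    freq, pct = distribution[cat]
    print(f"{cat:<4}{freq:>3}{pct:>6.0f}%")
```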

Ungrouped Frequency Distribution

Can be used for data that can be enumerated and when the range of values in the data set is not large.
Examples: number of kilometers your instructors have to travel from home to campus, number of girls in a 4-child family, etc.

Example: Number of Kilometers Travelled: 8, 5, 6, 5, 5, 7, 7

Number of Kilometers Travelled

Class   Frequency
5       3
6       1
7       2
8       1

Grouped Frequency Distribution

Can be used when the range of values in the data set is very large.
- Class limits: the smallest and largest data values that can be included in a class are its lower and upper class limits.
- Class boundaries separate the classes. To find a class boundary, average the upper class limit of one class and the lower class limit of the next class.
- The class width is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
- The class midpoint can be calculated by averaging the upper and lower class limits.
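
As a small worked illustration of these definitions, the Python sketch below computes boundaries, width and midpoints for the classes 100-104, 105-109 and 110-114. The helper name `class_boundaries` is an assumption made for this example, and the code assumes equal-width classes with the same gap between neighbouring limits.

```python
def class_boundaries(limits):
    """Given consecutive (lower, upper) class limits, return (lower, upper) boundaries.

    Assumes the gap between one class's upper limit and the next class's
    lower limit is the same for all classes.
    """
    gap = limits[1][0] - limits[0][1]      # e.g. 105 - 104 = 1
    half = gap / 2
    return [(lo - half, hi + half) for lo, hi in limits]

limits = [(100, 104), (105, 109), (110, 114)]

boundaries = class_boundaries(limits)
width = limits[1][0] - limits[0][0]        # lower limit of next class minus lower limit of this one
midpoints = [(lo + hi) / 2 for lo, hi in limits]

print(boundaries)
print(width)
print(midpoints)
```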

Rules for classes:
- There should be 5-20 classes.
- The class width should be an odd number.
- The classes must not overlap.
- The classes must not have breaks.
- The classes must include all the data values.
- The classes must be equal in width.

To construct a grouped frequency distribution:
1. Find the highest and lowest values.
2. Find the range.
3. Choose the number of classes.
4. Find the class width by dividing the range by the number of classes and rounding up.
5. Choose a starting point (usually the lowest value); add the class width to get all the lower limits.
6. Find the upper class limits.
7. Find the class boundaries.
8. Find the frequencies and the cumulative frequencies.

Construct a grouped frequency distribution using 7 classes.

112 100 127 120 134 118 105 110 109 118
117 116 118 122 114 114 105 109 114 115
118 117 118 122 106 110 116 121 113 120
119 111 104 111 120 113 105 110 118 112
114 114 112 110 107 112 108 110 120 117

Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2           2
105 - 109      104.5 - 109.5      8           10
110 - 114      109.5 - 114.5      18          28
115 - 119      114.5 - 119.5      13          41
120 - 124      119.5 - 124.5      7           48
125 - 129      124.5 - 129.5      1           49
130 - 134      129.5 - 134.5      1           50
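
The construction steps can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the course's prescribed method: it assumes integer data (so boundaries lie 0.5 beyond the limits) and starts the first class at the lowest value. The function name `grouped_distribution` is hypothetical.

```python
import math
from itertools import accumulate

# The 50 data values from the example above.
data = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 118,
    117, 116, 118, 122, 114, 114, 105, 109, 114, 115,
    118, 117, 118, 122, 106, 110, 116, 121, 113, 120,
    119, 111, 104, 111, 120, 113, 105, 110, 118, 112,
    114, 114, 112, 110, 107, 112, 108, 110, 120, 117,
]

def grouped_distribution(values, num_classes):
    """Return rows of (lower, upper, lower_bdry, upper_bdry, freq, cum_freq)."""
    lowest, highest = min(values), max(values)
    # Class width: range divided by the number of classes, rounded up.
    # Caveat: if the division is exact, the top class would stop one short
    # of the maximum, and a wider width or an extra class would be needed.
    width = math.ceil((highest - lowest) / num_classes)
    lowers = [lowest + k * width for k in range(num_classes)]
    freqs = [sum(lo <= v <= lo + width - 1 for v in values) for lo in lowers]
    cums = list(accumulate(freqs))
    return [(lo, lo + width - 1, lo - 0.5, lo + width - 0.5, f, c)
            for lo, f, c in zip(lowers, freqs, cums)]

table = grouped_distribution(data, 7)
for row in table:
    print(row)
```

With 7 classes the range 134 - 100 = 34 gives a width of 5, which also satisfies the odd-width rule.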

Part III

Measures of Central Tendency

Given a set of data, we often would like to have one number that is representative of a population or sample. There are several standard ways to measure the center.
- Mean: the average of the data set
- Median: the midpoint of the data set
- Mode: the value that occurs most often in the data set

The Mean
Denote by $x_i$ the $i$th observed data value in the population or sample. Denote by $N$ and $n$ the population and sample sizes respectively.

The population mean is the sum of all the population values divided by the total number of population values:
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i . \]

The sample mean is the sum of all the sample values divided by the number of sample values:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \]

Find the sample mean of 20, 26, 40, 36, 23, 42, 35, 24, 30.
Answer: $\bar{x} = 30.67$
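
The sample mean exercise can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
from statistics import mean

sample = [20, 26, 40, 36, 23, 42, 35, 24, 30]

# Sum of the values divided by the number of values.
x_bar = sum(sample) / len(sample)

print(round(x_bar, 2))
# The standard-library helper gives the same value.
print(round(mean(sample), 2))
```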

The Median

The median is the middle value, or the average of the middle two values, of a population or sample, when the data values are arranged from smallest to largest.
- The median will be one of the data values if there is an odd number of values.
- The median will be the average of two data values if there is an even number of values.

Find the median of 684, 764, 656, 702, 856, 1133, 1132, 1303.
Answer: Median = 810
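
The median rule (middle value for an odd count, average of the two middle values for an even count) can be sketched in Python as below. The function name `median_of` is chosen for this example only.

```python
from statistics import median

def median_of(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

values = [684, 764, 656, 702, 856, 1133, 1132, 1303]
result = median_of(values)
print(result)
```

Here the sorted middle values are 764 and 856, so the median is their average.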

The Mode

The mode is the value in the population or sample that occurs most frequently. It is sometimes said to be the most typical case. There may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal).

Find the mode of 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.
Find the mode of 104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109.
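
A minimal Python sketch for finding modes is given below. It assumes the convention that a data set in which every value occurs exactly once has no mode; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # assumed convention: no repeated value means no mode
    return [v for v, c in counts.items() if c == top]

first = modes([18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10])
second = modes([104, 104, 104, 104, 104, 107, 109, 109, 109,
                110, 109, 111, 112, 111, 109])
print(first)
print(second)
```

The first data set is unimodal; in the second, 104 and 109 each occur five times, so it is bimodal.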

Properties of the Mean

- Uses all data values.
- Sample mean varies less than the sample median or mode.
- Used in computing other statistics, such as the variance.
- Unique, usually not one of the data values.
- Affected by extremely high or low values, called outliers.

Properties of the Median

- Gives the midpoint.
- Used when it is necessary to find out whether the data values fall into the upper half or lower half of the data set.
- Affected less than the mean by extremely high or extremely low values.

Properties of the Mode

- Used when the most typical case is desired.
- Easiest to compute.
- Not always unique, and may not exist.

Measures of Dispersion

Dispersion refers to the spread or variability in a data set. Measures of dispersion include range, variance, standard deviation, etc.

The Range
The range is the difference between the highest and lowest values of a population or sample.

Two experimental brands of outdoor paint are tested to see how long each will last before fading. Six cans of each brand constitute a small population. The results (in months) are:

Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25

The population mean for both brands is the same. Which brand would you buy?
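
The paint example can be explored with a short Python sketch that computes the mean and range of each brand. The variable and function names are illustrative.

```python
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def data_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

mean_a = sum(brand_a) / len(brand_a)
mean_b = sum(brand_b) / len(brand_b)
range_a = data_range(brand_a)
range_b = data_range(brand_b)

print(mean_a, range_a)
print(mean_b, range_b)
```

Both brands last 35 months on average, but Brand A's results span 50 months while Brand B's span only 20, so Brand B is the more consistent of the two.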

The Variance
The variance is the average of the squares of the distance each value is from the mean. The population variance is
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 . \]
The sample variance is
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . \]
This formula for $s^2$ makes a better estimator of $\sigma^2$ than if we had divided by $n$.
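
The two variance formulas can be sketched in Python as follows, using the paint data from the range example (six cans per brand). The function names are chosen for this illustration; the standard library's `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
from statistics import pvariance, variance

def population_variance(values):
    """Average squared distance from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared distances from the mean divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

pop_var_a = population_variance(brand_a)   # 1750 / 6
pop_var_b = population_variance(brand_b)   # 250 / 6
print(round(pop_var_a, 2), round(pop_var_b, 2))
print(sample_variance(brand_a), sample_variance(brand_b))
```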


The Standard Deviation

The standard deviation is the square root of the variance. The population standard deviation is $\sigma$. The sample standard deviation is $s$.
- The standard deviation is measured in the same unit as the measurements in the population or sample.
- A large standard deviation indicates that the data values are far from the mean, whereas a small standard deviation indicates that they are clustered closely around the mean.

Alternate Formula for the Sample Standard Deviation

\[ s = \sqrt{ \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] } \]

- Saves time when calculating by hand.
- Does not use the sample mean.

Find the sample standard deviation of 11.2, 11.9, 12.0, 12.8, 13.4, 14.3.
Answer: $s = 1.13$
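
A short Python sketch of the alternate formula is shown below, checked against the standard library's `statistics.stdev`. The function name `stdev_shortcut` is an assumption for this example. For the exercise data, the sum of the values is 75.6 and the sum of their squares is 958.94.

```python
import math
from statistics import stdev

def stdev_shortcut(values):
    """Sample standard deviation from the sum and sum of squares only."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s = stdev_shortcut(sample)
print(round(s, 2))
```

On a computer this shortcut can lose precision when the values are large relative to their spread, so the definition-based formula is usually preferred in software.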

Measures of Position

Measures of position or location are used to locate the relative position of a data value in the data set. These measures include:
- z-score
- quartiles
- outlier

The z-score

A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The formula for the population (or sample) z-score is
\[ z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s} . \]

A z-score represents the number of standard deviations a value is above or below the mean.
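
As a sketch of the calculation, the Python function below computes a z-score. The numbers used (a value of 65 with mean 50 and standard deviation 10) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Hypothetical figures for illustration only.
above = z_score(65, 50, 10)
below = z_score(40, 50, 10)
print(above, below)
```

A positive result means the value is above the mean and a negative result means it is below.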

The Quartiles

Quartiles separate the data set into 4 equal groups.
- The first quartile (Q1) is the value that lies 25% of the way up from the smallest value.
- The second quartile (Q2) is the value that lies 50% of the way up from the smallest value, and is equivalent to the median.
- The third quartile (Q3) is the value that lies 75% of the way up from the smallest value.
- The interquartile range (IQR) is the difference between the upper and lower quartiles, i.e., IQR = Q3 - Q1.

The Outlier

An outlier is an extremely high or low data value when compared with the rest of the data values. A data value less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR can be considered an outlier.
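
The outlier rule can be sketched in Python as follows. The data set and its quartiles (Q1 = 9, Q3 = 20) are hypothetical values chosen for illustration, and the function names are not from the course material.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values lying below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Hypothetical data; Q1 and Q3 are taken as the medians of the lower and upper halves.
values = [5, 6, 12, 13, 15, 18, 22, 50]
q1, q3 = 9, 20

fences = outlier_fences(q1, q3)
outliers = find_outliers(values, q1, q3)
print(fences)
print(outliers)
```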

Part IV

Describing Data with Graphs

Graphs used for qualitative data:
- Bar charts
- Pareto charts

Graphs used for quantitative data:
- Histograms
- Frequency polygons
- Stem and leaf plots
- Box plots
- Time series plots

Bar Chart

A bar chart is a chart with rectangular bars. The bars can be plotted vertically or horizontally.

Example: Modes of Transportation to Work
- The vertical scale shows frequencies.
- The horizontal scale shows categories.

[Figure: bar chart "How people get to work"; vertical axis: People (10 to 30); categories: Car, Bus, Train, Walk]

Pareto Chart

A Pareto chart can be used to represent a categorical frequency distribution. It is a bar chart arranged in descending order of height from left to right.

[Figure: Pareto chart "How people get to work"; vertical axis: People (10 to 30); bars in descending order: Car, Train, Bus, Walk]
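
The only difference from an ordinary bar chart is the ordering of the bars. As a text-only sketch in Python, the blood type frequencies from the categorical frequency distribution earlier in these notes can be put in Pareto order like this (the counts for the transportation figure are not recoverable, so they are not used here):

```python
# Blood type frequencies from the categorical frequency distribution.
frequencies = {"A": 5, "B": 7, "O": 9, "AB": 4}

# Pareto order: categories sorted by frequency, largest first.
pareto_order = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

# Crude text rendering: one '#' per observation.
for category, freq in pareto_order:
    print(f"{category:<3} {'#' * freq} {freq}")
```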

Histogram
The histogram is a graph that displays the quantitative data by using vertical bars of various heights to represent the frequencies of the classes. The histogram is similar to the bar chart, but it is drawn without gaps between the bars. The class boundaries are represented on the horizontal axis.

[Figure: histogram "Record High Temperatures"; horizontal axis: Temperature (°F), class boundaries 99.5 to 134.5; vertical axis: Frequency]

Frequency Polygon
The frequency polygon is a graph that displays the quantitative data by using lines that connect points plotted for the frequencies at the class midpoints. The frequencies are represented by the heights of the points. The class midpoints are represented on the horizontal axis.

[Figure: frequency polygon "Record High Temperatures"; horizontal axis: Temperature (°F), class midpoints 102 to 132; vertical axis: Frequency]

Stem and Leaf Plot

A stem and leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.
- In a stem and leaf plot, each data value is split into a stem and a leaf.
- The leaf is usually the last digit of the data value and the other digits to the left of the leaf form the stem. For example, the number 123 would be split as: stem 12, leaf 3.
- The stems are listed on the left and the corresponding leaves on the right.

Construct a stem and leaf plot.

25 14 36 32 31 43 32 52 20 2
33 44 32 57 32 51 13 23 44 45

0 | 2
1 | 3 4
2 | 0 3 5
3 | 1 2 2 2 2 3 6
4 | 3 4 4 5
5 | 1 2 7
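
A stem and leaf plot for this data can be built with a short Python sketch. It assumes non-negative integers with the last digit as the leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

data = [25, 14, 36, 32, 31, 43, 32, 52, 20, 2,
        33, 44, 32, 57, 32, 51, 13, 23, 44, 45]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 57 -> stem 5, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```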

Box Plot

A box plot is a graph that presents information from a five-number summary. The five-number summary is composed of the minimum, Q1, median, Q3 and maximum. The five-number summary can be graphically represented by using a box plot.

To construct a box plot:
1. Find the five-number summary.
2. Draw a horizontal axis with a scale that includes the maximum and minimum data values.
3. Draw a box with vertical sides through Q1 and Q3, and draw a vertical line through the median.
4. Draw a line from the minimum data value to the left side of the box and a line from the maximum data value to the right side of the box.

Construct a box plot for the data:

89, 47, 164, 296, 30, 215, 138, 78, 48, 39

Five-number summary: 30, 47, 83.5, 164, 296

[Figure: box plot with minimum 30, Q1 47, median 83.5, Q3 164 and maximum 296; horizontal axis marked at 100, 200, 300]
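
The five-number summary above can be reproduced with the Python sketch below. It assumes one particular quartile convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when the count is odd. Other conventions (for example the default of `statistics.quantiles`) can give different quartiles. The function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]   # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

data = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
summary = five_number_summary(data)
print(summary)
```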

Time Series Plot


A time series plot represents data that occur over a specific period of time. It is a line graph where the time is represented on the horizontal axis and the quantity that varies over time is represented on the vertical axis.

[Figure: time series plot "Temperature over a 9-Hour Period"; horizontal axis: Time (12, 1, 2, ..., 9); vertical axis: Temperature (°F), 35 to 60]

Part V

R and R Commander

R
- A language and environment for statistical computing and graphics
- Available as free software at http://www.r-project.org/
- A command-driven statistical program

R Commander
- A graphical user interface for R
- Its interface includes menus, buttons and a few other elements
