Data Description
MTK3006 Statistics for Chemists
Department of Mathematics, Faculty of Science and Technology, Universiti Malaysia Terengganu
chee@umt.edu.my
Part I
Basic Terms
Population: A population is a collection of all subjects or objects of interest.
Sample: A sample is a portion or part of the population of interest.
Variable: A variable is a characteristic or attribute that can assume different values.
Data: The values that a variable can assume are called data.
Data set: A collection of data values or measurements forms a data set.
Types of data: Quantitative data is a numerical measurement expressed in terms of numbers. Qualitative data is a categorical measurement expressed by means of a natural language description.
Parameter: A parameter is a characteristic or measure obtained by using all the data values from a population.
Statistic: A statistic is a characteristic or measure obtained by using the data values from a sample.
Statistics: Statistics is the science of collecting, organizing, summarizing, analyzing and interpreting data.
Areas of statistics: The branch of statistics devoted to the organization, summarization, description and presentation of data sets is called descriptive statistics. The branch of statistics concerned with using sample data to draw conclusions about a population is called inferential statistics.
Part II
Describing Data with Tables
Data collected in original form is called raw data. A frequency distribution is the organization of raw data in table form using classes and frequencies. There are three types of frequency distributions:
  Categorical frequency distributions
  Ungrouped frequency distributions
  Grouped frequency distributions
Categorical Frequency Distribution
Can be used for data that can be placed in specific categories.
Examples: political affiliation, religious affiliation, blood type, etc.
Example: Blood Type Data
A, B, B, AB, O, O, O, B, AB, B, B, B, O, A, O, A, O, O, O, AB, AB, A, O, B, A
Blood Type Frequency Distribution
Class   Frequency   Percent
A       5           20
B       7           28
O       9           36
AB      4           16
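The tally above can be reproduced with a short script. This is a minimal sketch in Python (the slides themselves use R and R Commander); the function name `categorical_frequency` is an illustrative choice, not something from the course material.

```python
from collections import Counter

# Blood type data from the example above (25 observations).
blood_types = [
    "A", "B", "B", "AB", "O", "O", "O", "B", "AB", "B", "B", "B", "O",
    "A", "O", "A", "O", "O", "O", "AB", "AB", "A", "O", "B", "A",
]


def categorical_frequency(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)
    total = len(values)
    return {cls: (freq, 100 * freq / total) for cls, freq in counts.items()}


table = categorical_frequency(blood_types)

# Print the classes in the same order as the table above.
for cls in ["A", "B", "O", "AB"]:
    freq, pct = table[cls]
    print(f"{cls:<6}{freq:>3}{pct:>8.0f}")
```

The percent column is simply the class frequency divided by the total number of observations, times 100.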
Ungrouped Frequency Distribution
Can be used for data that can be enumerated and when the range of values in the data set is not large.
Examples: number of kilometers your instructors have to travel from home to campus, number of girls in a 4-child family, etc.
Example: Number of Kilometers Travelled: 8, 5, 6, 5, 5, 7, 7
Number of Kilometers Travelled
Class   Frequency
5       3
6       1
7       2
8       1
Grouped Frequency Distribution
Can be used when the range of values in the data set is very large.
Class limits: the lower and upper class limits are the smallest and largest data values that can be included in a class.
Class boundaries separate the classes. To find a class boundary, average the upper class limit of one class and the lower class limit of the next class.
The class width is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
The class midpoint is found by averaging the lower and upper class limits of the class.
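As a rough illustration of these definitions, the sketch below computes the width, boundaries and midpoints for the first three classes of the worked example in this part (100 - 104, 105 - 109, 110 - 114). It is written in Python as an assumption of convenience, and it assumes equal-width classes with the same gap between every pair of neighbouring classes; `describe_classes` is a hypothetical helper name.

```python
# Class limits taken from the worked example in this part.
limits = [(100, 104), (105, 109), (110, 114)]


def describe_classes(limits):
    """Return the class width and, per class, its boundaries and midpoint.

    Assumes equal-width classes with a constant gap between the upper
    limit of one class and the lower limit of the next.
    """
    # Width: lower limit of the next class minus lower limit of this class.
    width = limits[1][0] - limits[0][0]
    # Gap between neighbouring classes (1 for whole-number data).
    gap = limits[1][0] - limits[0][1]
    rows = []
    for lower, upper in limits:
        rows.append({
            "limits": (lower, upper),
            # upper + gap / 2 equals the average of this upper limit
            # and the next lower limit.
            "boundaries": (lower - gap / 2, upper + gap / 2),
            "midpoint": (lower + upper) / 2,
        })
    return width, rows


width, rows = describe_classes(limits)
print("class width:", width)
for row in rows:
    print(row["limits"], row["boundaries"], row["midpoint"])
```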
Rules for classes
There should be 5 to 20 classes.
The class width should preferably be an odd number, so that the class midpoints have the same place value as the data.
The classes must not overlap.
The classes must be continuous, with no gaps between them.
The classes must include all the data values.
The classes must be equal in width.
To construct a grouped frequency distribution:
1. Find the highest and lowest values.
2. Find the range.
3. Choose the number of classes.
4. Find the class width by dividing the range by the number of classes and rounding up.
5. Choose a starting point (usually the lowest value); add the class width to get all the lower limits.
6. Find the upper class limits.
7. Find the class boundaries.
8. Find the frequencies and the cumulative frequencies.
Construct a grouped frequency distribution using 7 classes.
112 100 127 120 134 118 105 110 109 118
117 116 118 122 114 114 105 109 114 115
118 117 118 122 106 110 116 121 113 120
119 111 104 111 120 113 105 110 118 112
114 114 112 110 107 112 108 110 120 117
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2           2
105 - 109      104.5 - 109.5      8           10
110 - 114      109.5 - 114.5      18          28
115 - 119      114.5 - 119.5      13          41
120 - 124      119.5 - 124.5      7           48
125 - 129      124.5 - 129.5      1           49
130 - 134      129.5 - 134.5      1           50
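The construction steps listed earlier can be sketched in code. The following Python sketch assumes whole-number data (so class boundaries sit 0.5 below and above the limits) and uses the lowest value as the starting point; `grouped_frequency` is an illustrative name. The 50 values are listed in the order they appear in the exercise, which does not affect the counts.

```python
# The 50 data values from the exercise above.
temperatures = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 118,
    117, 116, 118, 122, 114, 114, 105, 109, 114, 115,
    118, 117, 118, 122, 106, 110, 116, 121, 113, 120,
    119, 111, 104, 111, 120, 113, 105, 110, 118, 112,
    114, 114, 112, 110, 107, 112, 108, 110, 120, 117,
]


def grouped_frequency(data, num_classes):
    """Build (lower, upper, lower boundary, upper boundary, freq, cum. freq) rows.

    Assumes whole-number data and starts the first class at the lowest value.
    """
    low, high = min(data), max(data)
    # Range divided by the number of classes, rounded up to a whole number.
    # Using floor + 1 also bumps the width when the division is exact,
    # so that the highest value always falls inside the last class.
    width = (high - low) // num_classes + 1
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = low + k * width
        upper = lower + width - 1  # one unit below the next lower limit
        freq = sum(lower <= x <= upper for x in data)
        cumulative += freq
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows


rows = grouped_frequency(temperatures, 7)
for lower, upper, lb, ub, freq, cum in rows:
    print(f"{lower} - {upper}   {lb} - {ub}   {freq:>2}   {cum:>2}")
```

Here the range is 134 - 100 = 34, and 34 / 7 rounds up to a class width of 5, which matches the table above.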
Part III
Measures of Central Tendency
Given a set of data, we often would like to have one number that is representative of a population or sample. There are several standard ways to measure the center.
  Mean: the average of the data set
  Median: the midpoint of the data set
  Mode: the value that occurs most often in the data set
The Mean
Denote by $x_i$ the $i$th observed data value in the population or sample. Denote by $N$ and $n$ the population and sample sizes respectively.
The population mean is the sum of all the population values divided by the total number of population values:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i.$$
The sample mean is the sum of all the sample values divided by the number of sample values:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Find the sample mean of 20, 26, 40, 36, 23, 42, 35, 24, 30.
Answer: $\bar{x} = 30.67$
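The sample mean exercise can be checked with a few lines of code. This is a minimal Python sketch of the formula above; the helper name `mean` is illustrative.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


sample = [20, 26, 40, 36, 23, 42, 35, 24, 30]
x_bar = mean(sample)  # 276 / 9
print(round(x_bar, 2))
```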
The Median
The median is the middle value, or the average of the middle two values, of a population or sample, when the data values are arranged from smallest to largest.
The median will be one of the data values if there is an odd number of values.
The median will be the average of two data values if there is an even number of values.
Find the median of 684, 764, 656, 702, 856, 1133, 1132, 1303.
Answer: Median = 810
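The odd and even cases of the median can be sketched as follows. This is an illustrative Python version of the rule stated above, not code from the course.

```python
def median(values):
    """Middle value of the sorted data, or the average of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the median is one of the data values.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [684, 764, 656, 702, 856, 1133, 1132, 1303]
print(median(data))
```

For the exercise data the two middle values after sorting are 764 and 856, whose average is 810.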
The Mode
The mode is the value in the population or sample that occurs most frequently. It is sometimes said to be the most typical case.
There may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal).
Find the mode of 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.
Find the mode of 104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109.
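The two mode exercises can be worked with a frequency count. The Python sketch below returns every value that attains the highest frequency; treating a data set in which no value repeats as having no mode is an assumed convention, and `modes` is an illustrative name.

```python
from collections import Counter


def modes(values):
    """Return the list of modes (empty if no value occurs more than once)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)


first = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
second = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109]

print(modes(first))   # one mode: unimodal
print(modes(second))  # two modes: bimodal
```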
Properties of the Mean
Uses all data values.
The sample mean varies less from sample to sample than the sample median or mode.
Used in computing other statistics, such as the variance.
Unique, and usually not one of the data values.
Affected by extremely high or low values, called outliers.
Properties of the Median
Gives the midpoint.
Used when it is necessary to find out whether the data values fall into the upper half or lower half of the data set.
Affected less than the mean by extremely high or extremely low values.
Properties of the Mode
Used when the most typical case is desired.
Easiest to compute.
Not always unique, and may not exist.
Measures of Dispersion
Dispersion refers to the spread or variability in a data set. Measures of dispersion include range, variance, standard deviation, etc.
The Range
The range is the difference between the highest and lowest values of a population or sample.
Two experimental brands of outdoor paint are tested to see how long each will last before fading. Six cans of each brand constitute a small population. The results (in months) are:
Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25
The population mean for both brands is the same. Which brand would you buy?
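A quick calculation makes the comparison concrete. The Python sketch below takes the Brand A values as 10, 60, 50, 30, 40, 20 and the Brand B values as 35, 45, 30, 35, 40, 25, which is the column reading under which both means come out equal, as the example states.

```python
# Months before fading, read column by column from the results above.
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]


def data_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


for name, months in [("Brand A", brand_a), ("Brand B", brand_b)]:
    mean = sum(months) / len(months)
    print(name, "mean:", mean, "range:", data_range(months))
```

Both brands last 35 months on average, but Brand A ranges over 50 months while Brand B ranges over only 20, so Brand B is the more consistent of the two.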
The Variance
The variance is the average of the squares of the distance each value is from the mean.
The population variance is
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2.$$
The sample variance is
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
This formula for $s^2$ makes a better estimator of $\sigma^2$ than if we had divided by $n$.
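The only difference between the two formulas is the divisor, which the sketch below makes explicit. It is an illustrative Python version; the Brand A paint data (taken as a small population) and the six-value sample from the standard deviation exercise in this part are reused as inputs.

```python
def population_variance(values):
    """Average squared distance from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Squared distances from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


brand_a = [10, 60, 50, 30, 40, 20]            # treated as a population
sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]  # treated as a sample

print(round(population_variance(brand_a), 2))
print(round(sample_variance(sample), 3))
```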

The Standard Deviation
The standard deviation is the square root of the variance. The population standard deviation is $\sigma$. The sample standard deviation is $s$.
The standard deviation is measured in the same unit as the measurements in the population or sample.
A large standard deviation indicates that the data values are far from the mean, whereas a small standard deviation indicates that they are clustered closely around the mean.
Alternate Formula for the Sample Standard Deviation
$$s = \sqrt{\frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]}$$
Saves time when calculating by hand.
Does not use the sample mean.
Find the sample standard deviation of 11.2, 11.9, 12.0, 12.8, 13.4, 14.3.
Answer: $s = 1.13$
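The shortcut needs only the sum of the values and the sum of their squares. The Python sketch below applies it to the exercise data and compares the result with `statistics.stdev` from the standard library, which computes the sample standard deviation; `sample_sd_shortcut` is an illustrative name.

```python
import math
from statistics import stdev


def sample_sd_shortcut(values):
    """Sample standard deviation from the sum and the sum of squares."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s = sample_sd_shortcut(sample)
print(round(s, 2))
print(round(stdev(sample), 2))  # definition-based value for comparison
```

For this sample the sum is 75.6 and the sum of squares is 958.94, so the bracketed term is 958.94 - 75.6^2 / 6 = 6.38 and s is the square root of 6.38 / 5.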
Measures of Position
Measures of position or location are used to locate the relative position of a data value in the data set. These measures include:
  z-scores
  quartiles
  outliers
The z-score
A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The formula for the population (or sample) z-score is
$$z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s}.$$
A z-score represents the number of standard deviations a value is above or below the mean.
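As an illustration, the sample version of the formula can be applied to the six values from the standard deviation exercise. This Python sketch uses `statistics.mean` and `statistics.stdev` for the sample mean and sample standard deviation; the choice of data set is only for illustration.

```python
from statistics import mean, stdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - center) / spread


sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
x_bar = mean(sample)
s = stdev(sample)

z_scores = [z_score(x, x_bar, s) for x in sample]
for x, z in zip(sample, z_scores):
    print(f"{x:>5}  z = {z:+.2f}")
```

Values below the mean get negative z-scores and values above it get positive ones; the largest value, 14.3, lies about 1.5 standard deviations above the mean.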
The Quartiles
Quartiles separate the data set into 4 equal groups.
The first quartile ($Q_1$) is the value that lies 25% of the way up from the smallest value.
The second quartile ($Q_2$) is the value that lies 50% of the way up from the smallest value, and is equivalent to the median.
The third quartile ($Q_3$) is the value that lies 75% of the way up from the smallest value.
The interquartile range (IQR) is the difference between the upper and lower quartiles, i.e., $\mathrm{IQR} = Q_3 - Q_1$.
The Outlier
An outlier is an extremely high or low data value when compared with the rest of the data values. A data value less than $Q_1 - 1.5 \times \mathrm{IQR}$ or greater than $Q_3 + 1.5 \times \mathrm{IQR}$ can be considered an outlier.
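The quartiles and the outlier rule can be sketched together. Several conventions for computing quartiles exist; the Python sketch below assumes the "median of each half" convention, which reproduces the five-number summary given for the box plot exercise in Part IV (whose ten values are used here). The helper names are illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


data = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
q1, q2, q3 = quartiles(data)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", q3 - q1)
print("outliers:", outliers(data))
```

For these values the IQR is 164 - 47 = 117, so the fences are at -128.5 and 339.5 and no value is flagged as an outlier.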
Part IV
Describing Data with Graphs
Graphs used for qualitative data:
  Bar charts
  Pareto charts
Graphs used for quantitative data:
  Histograms
  Frequency polygons
  Stem and leaf plots
  Box plots
  Time series plots
Bar Chart
A bar chart is a chart with rectangular bars. The bars can be plotted vertically or horizontally.
Example: Modes of Transportation to Work
The vertical scale shows frequencies. The horizontal scale shows categories.
[Figure: bar chart "How people get to work"; vertical axis: People; horizontal axis: Car, Bus, Train, Walk]
Pareto Chart
A Pareto chart can be used to represent a categorical frequency distribution. It is a bar chart arranged in descending order of height from left to right.
[Figure: Pareto chart "How people get to work"; vertical axis: People; horizontal axis: Car, Train, Bus, Walk]
Histogram
The histogram is a graph that displays the quantitative data by using vertical bars of various heights to represent the frequencies of the classes. The histogram is similar to the bar chart, but it is drawn without gaps between the bars. The class boundaries are represented on the horizontal axis.
[Figure: histogram "Record High Temperatures"; vertical axis: Frequency; horizontal axis: Temperature (°F), marked at the class boundaries 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
Frequency Polygon
The frequency polygon is a graph that displays the quantitative data by using lines that connect points plotted for the frequencies at the class midpoints. The frequencies are represented by the heights of the points. The class midpoints are represented on the horizontal axis.
[Figure: frequency polygon "Record High Temperatures"; vertical axis: Frequency; horizontal axis: Temperature (°F), marked at the class midpoints 102, 107, 112, 117, 122, 127, 132]
Stem and Leaf Plot
A stem and leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes. The leaf is usually the last digit of the data value, and the other digits to the left of the leaf form the stem. For example, the number 123 would be split as stem 12 and leaf 3. The stems are listed on the left and the corresponding leaves on the right.
Construct a stem and leaf plot.
25 14 36 32 31 43 32 52 20 2 33 44 32 57 32 51 13 23 44 45
Stem | Leaves
0 | 2
1 | 3 4
2 | 0 3 5
3 | 1 2 2 2 2 3 6
4 | 3 4 4 5
5 | 1 2 7
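The splitting rule (last digit as the leaf, remaining digits as the stem) can be sketched as follows. This Python version assumes non-negative whole numbers; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the plot as a list of 'stem | leaves' lines.

    Assumes non-negative whole numbers; the leaf is the last digit.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 123 -> stem 12, leaf 3
        groups[stem].append(leaf)
    lines = []
    # List every stem from the smallest to the largest, even if it has no leaves.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


data = [25, 14, 36, 32, 31, 43, 32, 52, 20, 2,
        33, 44, 32, 57, 32, 51, 13, 23, 44, 45]

for line in stem_and_leaf(data):
    print(line)
```

Sorting the data first puts the leaves on each stem in increasing order.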
Box Plot
A box plot is a graph that presents information from a five-number summary. The five-number summary is composed of the minimum, $Q_1$, median, $Q_3$ and maximum. The five-number summary can be graphically represented by using a box plot.
To construct a box plot:
1. Find the five-number summary.
2. Draw a horizontal axis with a scale that includes the maximum and minimum data values.
3. Draw a box with vertical sides through $Q_1$ and $Q_3$, and draw a vertical line through the median.
4. Draw a line from the minimum data value to the left side of the box and a line from the maximum data value to the right side of the box.
Construct a box plot for the data:

89, 47, 164, 296, 30, 215, 138, 78, 48, 39
Five-number summary: minimum 30, $Q_1$ = 47, median 83.5, $Q_3$ = 164, maximum 296
[Figure: box plot on a horizontal axis marked at 100, 200 and 300, with whiskers from 30 to 296, box sides at 47 and 164, and the median line at 83.5]
Time Series Plot

A time series plot represents data that occur over a specific period of time. It is a line graph where the time is represented on the horizontal axis and the quantity that varies over time is represented on the vertical axis.
[Figure: time series plot "Temperature over a 9-Hour Period"; vertical axis: Temperature (°F), 35 to 60; horizontal axis: Time, 12 to 9]
Part V
R and R Commander
R
  A language and environment for statistical computing and graphics
  Available as free software at http://www.r-project.org/
  A command-driven statistical program
R Commander
  A graphical user interface for R
  Its interface includes menus, buttons and a few other elements