MTK3006
Department of Mathematics, Faculty of Science and Technology, Universiti Malaysia Terengganu
chee@umt.edu.my
Data Description
Part I
Basic Terms
Population: A population is the collection of all subjects or objects of interest.
Sample: A sample is a portion or part of the population of interest.
Variable: A variable is a characteristic or attribute that can assume different values.
Data: The values that a variable can assume are called data.
Data set: A collection of data values or measurements forms a data set.
Types of data:
- Quantitative data are numerical measurements expressed in terms of numbers.
- Qualitative data are categorical measurements expressed by means of a natural language description.
MTK3006 Statistics for Chemists Data Description
Parameter: A parameter is a characteristic or measure obtained by using all the data values from a population.
Statistic: A statistic is a characteristic or measure obtained by using the data values from a sample.
Statistics: Statistics is the science of collecting, organizing, summarizing, analyzing and interpreting data.
Areas of statistics:
- Descriptive statistics is the branch of statistics devoted to the organization, summarization, description and presentation of data sets.
- Inferential statistics is the branch of statistics concerned with using sample data to draw conclusions about a population.
Part II
Data collected in original form is called raw data. A frequency distribution is the organization of raw data in table form using classes and frequencies. There are three types of frequency distributions:
- Categorical frequency distributions
- Ungrouped frequency distributions
- Grouped frequency distributions
Categorical Frequency Distributions
Can be used for data that can be placed in specific categories.
Examples: political affiliation, religious affiliation, blood type, etc.
Example: Blood Type Data
A, B, B, AB, O, O, O, B, AB, B, B, B, O, A, O, A, O, O, O, AB, AB, A, O, B, A

Class   Frequency   Percent
A       5           20
B       7           28
O       9           36
AB      4           16
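The blood type table can be reproduced by counting each category and dividing by the total. The following is a minimal sketch in Python; the function name `categorical_frequency` is hypothetical and not part of the course material.

```python
from collections import Counter

# Blood type data from the example above.
blood_types = ("A,B,B,AB,O,O,O,B,AB,B,B,B,O,"
               "A,O,A,O,O,O,AB,AB,A,O,B,A").split(",")


def categorical_frequency(values):
    """Return {category: (frequency, percent)} for a list of categories."""
    counts = Counter(values)
    total = len(values)
    return {cat: (freq, 100 * freq / total) for cat, freq in counts.items()}


table = categorical_frequency(blood_types)

# Print the classes in the same order as the table in the example.
for cat in ("A", "B", "O", "AB"):
    freq, pct = table[cat]
    print(f"{cat:<6}{freq:>3}{pct:>6.0f}")
```

The same counting approach applies to an ungrouped frequency distribution, where each distinct numerical value plays the role of a class.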
Ungrouped Frequency Distributions
Can be used for data that can be enumerated and when the range of values in the data set is not large.
Examples: the number of kilometers your instructors travel from home to campus, the number of girls in a four-child family, etc.
Example: Number of Kilometers Travelled: 8, 5, 6, 5, 5, 7, 7

Class   Frequency
5       3
6       1
7       2
8       1
Grouped Frequency Distributions
Can be used when the range of values in the data set is very large.
- Class limits are the smallest and largest data values that can be included in a class; they are called the lower and upper class limits.
- Class boundaries separate the classes. To find a class boundary, average the upper class limit of one class and the lower class limit of the next class.
- The class width is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
- The class midpoint is the average of the lower and upper class limits.
Rules for classes
- There should be 5 to 20 classes.
- The class width should be an odd number.
- The classes must not overlap.
- The classes must not have breaks.
- The classes must include all the data values.
- The classes must be equal in width.
Example
Construct a grouped frequency distribution using 7 classes for the following data:
112 100 127 120 134 118 105 110 109 118 117 116 118 122
114 114 105 109 114 115 118 117 118 122 106 110 116 121
113 120 119 111 104 111 120 113 105 110 118 112 114 114
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104       99.5 - 104.5       2            2
105 - 109      104.5 - 109.5       8           10
110 - 114      109.5 - 114.5      18           28
115 - 119      114.5 - 119.5      13           41
120 - 124      119.5 - 124.5       7           48
125 - 129      124.5 - 129.5       1           49
130 - 134      129.5 - 134.5       1           50
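The boundary, width, midpoint and cumulative frequency columns follow mechanically from the class limits and frequencies. The sketch below, in Python, derives them from the table's class limits and frequencies; the variable names are illustrative only, and the code assumes equal-width classes with a constant gap between adjacent class limits.

```python
from itertools import accumulate

# Class limits and frequencies taken from the table above.
class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]

# A boundary lies halfway between adjacent limits, i.e. half a gap
# below each lower limit and half a gap above each upper limit.
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in class_limits]

# Class width: lower limit of the next class minus lower limit of this class.
width = class_limits[1][0] - class_limits[0][0]

# Class midpoint: average of the lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

for (lo, hi), (bl, bu), f, cf in zip(class_limits, boundaries,
                                     frequencies, cumulative):
    print(f"{lo} - {hi}   {bl} - {bu}   {f:>2}   {cf:>2}")
```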
Part III
Given a set of data, we often would like to have one number that is representative of a population or sample. There are several standard ways to measure the center.
- Mean: the average of the data set
- Median: the midpoint of the data set
- Mode: the value that occurs most often in the data set
The Mean
Denote by $x_i$ the $i$th observed data value in the population or sample. Denote by $N$ and $n$ the population and sample sizes respectively.
The population mean is the sum of all the population values divided by the total number of population values:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i .$$
The sample mean is the sum of all the sample values divided by the number of sample values:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
Example: Find the sample mean of 20, 26, 40, 36, 23, 42, 35, 24, 30.
Answer: $\bar{x} = 30.67$
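As a check on the example, the sample mean formula translates directly into code. This is a minimal Python sketch; the helper name `mean` is chosen here for illustration.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


sample = [20, 26, 40, 36, 23, 42, 35, 24, 30]
x_bar = mean(sample)
print(round(x_bar, 2))  # 30.67
```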
The Median
The median is the middle value, or the average of the middle two values, of a population or sample, when the data values are arranged from smallest to largest. The median will be one of the data values if there is an odd number of values. The median will be the average of two data values if there is an even number of values. Find the median of 684, 764, 656, 702, 856, 1133, 1132, 1303. Answer: Median = 810
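The two cases in the definition (odd and even number of values) can be sketched as follows in Python. The function name `median` is illustrative; Python's standard `statistics.median` behaves the same way.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the median is one of the data values.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [684, 764, 656, 702, 856, 1133, 1132, 1303]
print(median(data))  # 810.0
```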
The Mode
The mode is the value in the population or sample that occurs most frequently. It is sometimes said to be the most typical case. There may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal). Find the mode of 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10. Find the mode of 104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109.
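The two exercises can be checked by counting occurrences and keeping every value that attains the highest count. This Python sketch assumes the convention that a data set in which no value repeats has no mode; the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: every value occurs once, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


first = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
second = [104, 104, 104, 104, 104, 107, 109, 109, 109,
          110, 109, 111, 112, 111, 109]

print(modes(first))   # [10]
print(modes(second))  # [104, 109]
```

The first data set is unimodal, while the second is bimodal because 104 and 109 each occur five times.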
Properties of the Mean
- Uses all data values.
- Across repeated samples from the same population, the sample mean varies less than the sample median or mode.
- Used in computing other statistics, such as the variance.
- Unique, but usually not one of the data values.
- Affected by extremely high or low values, called outliers.
Properties of the Median
- Gives the midpoint.
- Used when it is necessary to find out whether the data values fall into the upper half or lower half of the data set.
- Affected less than the mean by extremely high or extremely low values.
Properties of the Mode
- Used when the most typical case is desired.
- Easiest to compute.
- Not always unique, and may not exist.
Measures of Dispersion
Dispersion refers to the spread or variability in a data set. Measures of dispersion include range, variance, standard deviation, etc.
The Range
The range is the difference between the highest and lowest values of a population or sample.
Example: Two experimental brands of outdoor paint are tested to see how long each will last before fading. Six cans of each brand constitute a small population. The results (in months) are:

Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25

The population mean for both brands is the same (35 months). Which brand would you buy?
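A short Python sketch makes the comparison concrete. It reads the paint results as Brand A: 10, 60, 50, 30, 40, 20 and Brand B: 35, 45, 30, 35, 40, 25 (an assumption about how the two columns pair up); `data_range` is an illustrative helper name.

```python
def data_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, months in (("Brand A", brand_a), ("Brand B", brand_b)):
    mean = sum(months) / len(months)
    print(name, "mean =", mean, "range =", data_range(months))
```

Both brands have the same mean, but Brand B has the smaller range, so its results are more consistent.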
The Variance
The variance is the average of the squared distances of the values from the mean.
The population variance is
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 .$$
The sample variance is
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 .$$
The standard deviation is the square root of the variance. The population standard deviation is $\sigma$. The sample standard deviation is $s$. The standard deviation is measured in the same unit as the measurements in the population or sample. A large standard deviation indicates that the data values are far from the mean, whereas a small standard deviation indicates that they are clustered closely around the mean.
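The two variance formulas differ only in the divisor. The Python sketch below applies the population formula to the paint data from the range example, read as Brand A: 10, 60, 50, 30, 40, 20 and Brand B: 35, 45, 30, 35, 40, 25 (an assumption about the column pairing); the function names are illustrative.

```python
import math


def population_variance(values):
    """Average squared distance from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared distances from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, months in (("Brand A", brand_a), ("Brand B", brand_b)):
    var = population_variance(months)
    print(name, round(var, 2), round(math.sqrt(var), 2))
```

Under these assumptions Brand A has a population standard deviation of about 17.08 months and Brand B about 6.45 months, which quantifies the difference in spread that the equal means hide.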
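A shortcut formula for the sample standard deviation is
$$s = \sqrt{\frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]} .$$
- Saves time when calculating by hand.
- Does not use the sample mean.
Example: Find the sample standard deviation of 11.2, 11.9, 12.0, 12.8, 13.4, 14.3.
Answer: $s = 1.13$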
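The shortcut needs only the sum of the values and the sum of their squares. The following Python sketch applies it to the example and compares the result with the standard library's `statistics.stdev`, which uses the definitional form; `sample_sd_shortcut` is a hypothetical name.

```python
import math
from statistics import stdev


def sample_sd_shortcut(values):
    """Sample standard deviation from sum(x) and sum(x^2) only."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s = sample_sd_shortcut(sample)
print(round(s, 2))  # 1.13

# Agrees with the definitional formula up to floating-point rounding.
print(abs(s - stdev(sample)) < 1e-9)  # True
```

In floating-point arithmetic the shortcut can lose precision when the values are large relative to their spread, so it is mainly a convenience for hand calculation.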
Measures of Position
Measures of position or location are used to locate the relative position of a data value in the data set. These measures include:
- z-score
- quartiles
- outlier
The z-score
A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The formula for the population (or sample) z-score is
$$z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s} .$$
A z-score represents the number of standard deviations a value is above or below the mean.
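As an illustration, the sample z-score of the largest value in the data set 11.2, 11.9, 12.0, 12.8, 13.4, 14.3 (reused here from the standard deviation example; the choice of value is only for demonstration) can be computed as follows. The function name `z_score` is hypothetical.

```python
from statistics import mean, stdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - center) / spread


sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
z = z_score(14.3, mean(sample), stdev(sample))
print(round(z, 2))  # 1.5
```

The value 14.3 therefore lies about 1.5 sample standard deviations above the sample mean.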
The Quartiles
Quartiles separate the data set into 4 equal groups.
- The first quartile (Q1) is the value that lies 25% of the way up from the smallest value.
- The second quartile (Q2) is the value that lies 50% of the way up from the smallest value, and is equivalent to the median.
- The third quartile (Q3) is the value that lies 75% of the way up from the smallest value.
- The interquartile range (IQR) is the difference between the upper and lower quartiles, i.e., IQR = Q3 - Q1.
The Outlier
An outlier is an extremely high or low data value when compared with the rest of the data values. A data value less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR can be considered an outlier.
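The quartile and outlier rules can be sketched in Python as below. Several conventions exist for computing quartiles; this sketch assumes the "median of each half" method (Q1 is the median of the lower half, Q3 the median of the upper half, with the middle value left out when the number of values is odd). The data are reused from the median example, and the value 3000 is a hypothetical addition used only to trigger the rule.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (needs n >= 2)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def outliers(values):
    """Values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


data = [684, 764, 656, 702, 856, 1133, 1132, 1303]
print(quartiles(data))          # (693.0, 810.0, 1132.5)
print(outliers(data))           # []
print(outliers(data + [3000]))  # [3000]
```

Statistical software may use a different quartile convention and so report slightly different values for Q1 and Q3 on the same data.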
Part IV
Bar Chart
The bars can be plotted vertically or horizontally.
Example: Modes of Transportation to Work. The vertical scale shows frequencies. The horizontal scale shows categories.
[Figure: bar chart of People (vertical axis, marked 10 to 30) by mode of transportation: Car, Bus, Train, Walk.]
Pareto Chart
A Pareto chart can be used to represent a categorical frequency distribution. It is a bar chart arranged in descending order of height from left to right.
[Figure: Pareto chart of People (vertical axis, marked 10 to 25) by mode of transportation, ordered Car, Train, Bus, Walk.]
Histogram
The histogram is a graph that displays the quantitative data by using vertical bars of various heights to represent the frequencies of the classes. The histogram is similar to the bar chart, but it is drawn without gaps between the bars. The class boundaries are represented on the horizontal axis.
[Figure: histogram "Record High Temperatures". Horizontal axis: Temperature (°F), marked at the class boundaries 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5. Vertical axis: Frequency.]
Frequency Polygon
The frequency polygon is a graph that displays the quantitative data by using lines that connect points plotted for the frequencies at the class midpoints. The frequencies are represented by the heights of the points. The class midpoints are represented on the horizontal axis.
[Figure: frequency polygon "Record High Temperatures". Horizontal axis: Temperature (°F), marked at the class midpoints 102, 107, 112, 117, 122, 127, 132. Vertical axis: Frequency.]
Stem and Leaf Plot
A stem and leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes. Each data value is split into a stem and a leaf: the leaf is usually the last digit of the data value, and the other digits to the left of the leaf form the stem. For example, the number 123 would be split as stem 12, leaf 3. The stems are listed on the left and the corresponding leaves on the right.
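The split into stem and leaf is an integer division by 10. The Python sketch below builds a plot for the data set 20, 26, 40, 36, 23, 42, 35, 24, 30 (reused from the sample mean example, purely for illustration); it assumes non-negative integer data, and `stem_and_leaf` is a hypothetical name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem to the sorted list of its leaves (last digits)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 123 -> stem 12, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


data = [20, 26, 40, 36, 23, 42, 35, 24, 30]
plot = stem_and_leaf(data)

# Stems on the left, leaves on the right.
for stem in sorted(plot):
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in plot[stem])}")
```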
Box Plot
A box plot is a graph that presents information from a five-number summary. The five-number summary is composed of the minimum, Q1, median, Q3 and maximum.
To construct a box plot:
1. Find the five-number summary.
2. Draw a horizontal axis with a scale that includes the maximum and minimum data values.
3. Draw a box with vertical sides through Q1 and Q3, and draw a vertical line through the median.
4. Draw a line from the minimum data value to the left side of the box and a line from the maximum data value to the right side of the box.
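The first step, finding the five-number summary, can be sketched with Python's standard `statistics.quantiles` function (available from Python 3.8). Its default "exclusive" method is one of several quartile conventions and is an assumption here, not necessarily the convention used in the course. The data are reused from the sample mean example for illustration, and `five_number_summary` is a hypothetical name.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


data = [20, 26, 40, 36, 23, 42, 35, 24, 30]
print(five_number_summary(data))  # (20, 23.5, 30.0, 38.0, 42)
```

The box then spans Q1 = 23.5 to Q3 = 38, with the median line at 30 and the whiskers extending to 20 and 42.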
[Figure: example box plot on a horizontal axis marked 100, 200 and 300, with the values 47, 83.5, 164 and 296 labeled.]
[Figure: plot of Temperature (°F) on the vertical axis (marked 35 to 55) against Time on the horizontal axis.]
Part V
R and R Commander
R
- A language and environment for statistical computing and graphics
- Available as free software at http://www.r-project.org/
- A command-driven statistical program
R Commander
- A graphical user interface for R
- Its interface includes menus, buttons and a few other elements