Statistics is defined as the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions. It involves both descriptive statistics, which organize and summarize data, and inferential statistics, which draw conclusions about populations based on samples. Key terms discussed include population, sample, parameter, statistic, categorical vs. numerical variables, and levels of measurement for variables. Frequency distributions and various graphs like histograms and bar charts are used to present descriptive statistics.
Statistics: the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions. OR: a collection of numerical information is also called statistics.
Dr. Iftikhar Hussain Adil
WHAT IS STATISTICS
Broadly defined, it is the science, technology and art of extracting information from observational data, with an emphasis on solving real-world problems. It is a logic and methodology for the measurement of uncertainty and for examination of the consequences of that uncertainty in the planning and interpretation of experimentation and observation.

TYPES OF STATISTICS
Statistical methods divide into descriptive statistics and inferential statistics.
DESCRIPTIVE STATISTICS: Methods of organizing, summarizing, and presenting data in an informative way.
INFERENTIAL STATISTICS: The methods used to determine something about a population on the basis of a sample.
Inferential Statistics
Inferential statistics aims to draw conclusions about a population that extends beyond the dataset or sample at hand.

Population versus Sample
A population is the complete set of all items that interest an investigator. The population size, N, can be very large or even infinite, e.g.
  All the registered voters of Pakistan
  All the students at NUST
A sample is an observed subset of the population values, with sample size given by n.

Sampling Techniques
  Simple Random Sampling
  Systematic Sampling
  Stratified Sampling. Possible strata: male and female; resident and non-resident; White, Black, Hispanic, and Asian; Protestant, Catholic, Jewish, Muslim, etc.
  Cluster Sampling
  Sample of Convenience

Parameter and Statistic
A parameter is a specific characteristic of a population. A statistic is a specific characteristic of a sample.
e.g. NBS surveyed its students to determine the average daily expense. From a sample of 80 students, the average expense was computed as Rs. 133. What is the population? What is the sample? What is the parameter? What is the statistic? Is Rs. 133 a parameter or a statistic?

Types of Variables
Variable: A characteristic of an item or individual that will be analyzed by using statistics, e.g. gender; party affiliation of registered voters; household (HH) income of citizens who live in a specific geographic area; publishing category of a book (hardcover, trade paperback, mass-market paperback, textbook); number of televisions in a household.

Example (Types of variables)
Reg #   Gender   Age    FA/FSC or equivalent   Family Members
1       M        18.2   67                     4
2       F        19     70                     3
3       M        20     80                     5
4       F        19.4   85                     6
5       F        20.6   73                     3
6       M        21     76                     4
7       F        20.3   67                     5
8       F        19.8   89                     4
Types of Variables
Categorical Variables: A categorical variable can take on one of a limited, and usually fixed, number of possible values, selected from an established list of categories, e.g. Male/Female; Pass/Fail; SA, A, D, SD.
Numerical Variables: The values of these variables involve a counted or measured value.
  Discrete Variables: The values of these variables are counts, e.g. number of people living in a household.
  Continuous Variables: These variables take continuous values; any value can theoretically occur, limited only by the precision of the measuring process, e.g. time to complete a task, air pressure in a tyre.

Levels of Measurement
The level of measurement often dictates the calculations that can be done to summarize and present the data. It also determines the statistical tests that should be performed. e.g. Balls in a bag are of different colors, such as brown, yellow, blue, green, orange or red; the colors can be counted but not ordered or averaged.

Types of Levels of Measurement
Ratio Level Data: When a scale consists not only of equidistant points but also has a meaningful zero point, we refer to it as a ratio scale. The ratio scale is the most sophisticated of the scales, since it incorporates all the characteristics of the nominal, ordinal and interval scales, e.g. income data.
Properties of Ratio Level
  Equal differences in the characteristic are represented by equal differences in the numbers assigned to the classifications.
  Values can be added or subtracted: $X_1 + X_2$ or $X_1 - X_2$ is possible.
  Values can be multiplied or divided: $X_1 \cdot X_2$ or $X_1 / X_2$ is possible.
  Values can be ordered: $X_1 < X_2$ or $X_1 > X_2$.
  There is a meaningful zero point.
Interval Scale: An interval scale supports differences ($x_2 - x_1$) and ordering ($x_2 > x_1$ or $x_1 > x_2$) but not ratios, e.g. 100° is not twice as warm as 50° (no meaningful zero point and no ratio, but $x_2 > x_1$ or $x_1 > x_2$ still holds).
Ordinal Scale: When items are classified according to having more or less of a characteristic, the scale used is referred to as an ordinal scale. This scale is common in marketing, satisfaction and attitudinal research, e.g. Excellent, Very good, Good, Fair, Poor (no zero point, no equal gap, no ratio, just comparison).
Nominal Scale: A discrete classification of data in which data are neither measured nor ordered; subjects are merely allocated to distinct categories, for example male or female; married, unmarried, widowed or separated (no ratio, no zero point, no equal gap and no comparison).

Example
A sample of customers in a specialty ice cream store was asked a series of questions:
  What is your favorite flavor of ice cream?
  How many times do you eat ice cream?
  Do you have children under the age of ten living in your home?
  Have you tried our latest ice cream flavor?

Self Review 1-1
Chicago-based Market Facts asked a sample of 1,960 consumers to try a newly developed chicken dinner by Boston Market. Of the 1,960 sampled, 1,176 said they would purchase the dinner if it is marketed.
(a) What could Market Facts report to Boston Market regarding acceptance of the chicken dinner in the population?
(b) Is this an example of descriptive statistics or inferential statistics? Explain.

DESCRIPTIVE STATISTICS
FREQUENCY DISTRIBUTION: A grouping of data into mutually exclusive classes showing the number of observations in each.
The raw data are more easily interpreted if organized into a frequency distribution, which helps answer questions such as:
  What is the maximum of the data?
  What is the minimum of the data?
  Where do the data cluster?
  What is the typical price of a vehicle?
Step 1: Decide on the number of classes.
Step 2: Determine the class interval or width.
Step 3: Set the individual class limits.
Step 4: Tally the vehicle selling prices into the classes.
Step 5: Count the number of items in each class.

Class frequency: The number of observations in each class.
Class midpoint: The point halfway between the lower limits of two consecutive classes.
Class interval: The difference between the lower limits of two consecutive classes.
Relative frequency: The class frequency divided by the total number of observations.
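The five steps above can be sketched in Python. This is only an illustration under stated assumptions: the number of classes defaults to the common "smallest k with 2^k > n" rule of thumb, which the slides do not name; classes are equal-width and include their lower limit but not their upper limit; and the `prices` list is hypothetical data, not the vehicle prices from the lecture.

```python
import math


def frequency_distribution(values, num_classes=None):
    """Return one (lower, upper, midpoint, frequency, relative_frequency) tuple per class."""
    n = len(values)

    # Step 1: number of classes. Assumed rule of thumb: smallest k with 2**k > n.
    if num_classes is None:
        num_classes = 1
        while 2 ** num_classes <= n:
            num_classes += 1

    # Step 2: class width, chosen so that the largest value falls inside the last class.
    low, high = min(values), max(values)
    width = math.floor((high - low) / num_classes) + 1

    # Steps 3 and 4: class i covers [low + i*width, low + (i+1)*width); tally each value.
    frequencies = [0] * num_classes
    for v in values:
        frequencies[int((v - low) // width)] += 1

    # Step 5: report limits, midpoint, frequency and relative frequency per class.
    table = []
    for i, freq in enumerate(frequencies):
        lower = low + i * width
        upper = lower + width
        table.append((lower, upper, (lower + upper) / 2, freq, freq / n))
    return table


# Hypothetical selling prices (in thousands), for illustration only.
prices = [15, 17, 18, 20, 21, 21, 23, 24, 25, 27, 28, 30, 31, 33, 35, 36]
table = frequency_distribution(prices)

for lower, upper, midpoint, freq, rel in table:
    print(f"{lower:>3} up to {upper:<3} midpoint={midpoint:<5} f={freq} rel={rel:.3f}")
```

In practice the width is usually rounded to a convenient number and the first lower limit is set slightly below the minimum; the sketch skips that refinement to stay short.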
Self Review 2.2
Barry Bonds of the San Francisco Giants established a new single-season home run record by hitting 73 home runs during the 2001 Major League Baseball season. The longest of these home runs traveled 488 feet and the shortest 320 feet. You need to construct a frequency distribution of these home run lengths.
(a) How many classes would you use?
(b) What class interval would you suggest?
(c) What actual classes would you suggest?

Exercise Page 31
1. A set of data consists of 38 observations. How many classes would you recommend for the frequency distribution?
2. A set of data consists of 45 observations between $0 and $29. What size would you recommend for the class interval?
3. A set of data consists of 230 observations between $235 and $567. What class interval would you recommend?
4. A set of data contains 53 observations. The lowest value is 42 and the largest is 129. The data are to be organized into a frequency distribution.
   a. How many classes would you suggest?
   b. What would you suggest as the lower limit of the first class?
5. Wachesaw Manufacturing, Inc. produced the following number of units the last 16 days.
   27, 27, 27, 28, 27, 25, 25, 28, 26, 28, 26, 28, 31, 30, 26, 26
   The information is to be organized into a frequency distribution.
   a. How many classes would you recommend?
   b. What class interval would you suggest?
   c. What lower limit would you recommend for the first class?
   d. Organize the information into a frequency distribution and determine the relative frequency distribution.
   e. Comment on the shape of the distribution.
HISTOGRAM: A graph in which the classes are marked on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are represented by the heights of the bars, and the bars are drawn adjacent to each other.
Frequency Polygon: It consists of line segments connecting the points formed by the intersections of the class midpoints and the class frequencies.
Cumulative Frequency Distribution and Cumulative Frequency Polygon: A cumulative frequency distribution shows how many observations fall below the upper limit of each class; a cumulative frequency polygon plots these running totals.
Pareto Diagram: A Pareto diagram is a bar chart that displays the frequency of defect causes, with the bars ordered from most to least frequent.
Line Graphs
Bar Charts: A bar chart can be used to depict any of the levels of measurement: nominal, ordinal, interval, or ratio. In the lecture example, the level of education is an ordinal-scale variable and is reported on the horizontal axis.
Difference between Histogram and Bar Chart: In a histogram, the horizontal axis refers to a ratio-scale variable, here vehicle selling price. This is a continuous variable; hence there is no space between the bars. Another difference between a bar chart and a histogram is the vertical scale. In a histogram the vertical axis is the frequency or number of observations. In a bar chart the vertical scale refers to an amount.
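As a rough illustration of frequencies, cumulative frequencies and a histogram-like display, the sketch below uses the 16 Wachesaw production figures from the exercise above. Treating each distinct value as its own class is an assumption made here for brevity, not the class structure the exercise asks you to choose.

```python
from collections import Counter
from itertools import accumulate

# Units produced over the last 16 days (Wachesaw exercise data).
units = [27, 27, 27, 28, 27, 25, 25, 28, 26, 28, 26, 28, 31, 30, 26, 26]

counts = Counter(units)
classes = sorted(counts)                      # distinct values, smallest first
frequencies = [counts[c] for c in classes]    # class frequencies
cumulative = list(accumulate(frequencies))    # running totals
relative = [f / len(units) for f in frequencies]

# Text histogram: one '#' per observation, bars listed in class order.
for c, f, cf, rf in zip(classes, frequencies, cumulative, relative):
    print(f"{c:>3} | {'#' * f:<4} | freq={f} cum={cf} rel={rf:.3f}")
```

The last cumulative frequency always equals the number of observations, which is a quick check on the tally.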
DESCRIPTIVE STATISTICS Measures of Location
Measures of Variability
Measure of Relative Position
Measure of Shape
Measures of Location
POPULATION MEAN: For raw data, that is, data that have not been grouped in a frequency distribution, the population mean is the sum of all the values in the population divided by the number of values in the population, or
$\mu = \frac{\sum X}{N}$
The Sample Mean: For raw data, that is, ungrouped data, the sample mean is the sum of all the sampled values divided by the total number of sampled values, or
$\bar{X} = \frac{\sum X}{n}$
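A minimal sketch of the sample mean, using the Family Members column from the eight-student table earlier in these notes. Treating those eight students as a sample is an assumption made for illustration.

```python
def mean(values):
    """Arithmetic mean of raw (ungrouped) data: sum of values divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Family Members column of the eight-student example table.
family_members = [4, 3, 5, 6, 3, 4, 5, 4]
sample_mean = mean(family_members)
print(sample_mean)  # 34 / 8 = 4.25
```

The same function gives a population mean when the list holds every value in the population; only the interpretation (parameter versus statistic) changes.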
Examples:
1. To obtain grade A, Ben must achieve an average of at least 80 percent in five tests. If his average mark for the first four tests is 78, what is the lowest mark he can get in his fifth test and still obtain grade A?
2. The speeds, to the nearest mile per hour, of 120 vehicles passing a checkpoint were recorded and grouped into the table below. Estimate the mean of this distribution.

Speed (mph)       21-25   26-30   31-35   36-45   46-60
No. of vehicles   22      48      25      16      9

Properties of Mean
1. Every set of interval- or ratio-level data has a mean.
2. All the values are included in computing the mean.
3. The mean is unique.
4. The sum of the deviations of each value from the mean will always be zero.

The Weighted Mean
The weighted mean is a special case of the arithmetic mean. It occurs when there are several observations of the same value. With weights $w$ attached to the values $X$, it is
$\bar{X}_w = \frac{\sum wX}{\sum w}$
Example: A candidate obtained the following results at NBS:

Quizzes   Mid   Assignments   Final
92%       95%   90%           65%

The regulations state that quizzes have a weight of 15%, assignments 10%, the mid 25% and the final 50%. What is the candidate's final percentage?

The Median: The midpoint of the values after they have been ordered from the smallest to the largest, or the largest to the smallest.
Properties of Median
  The median is unique.
  It is not affected by extremely large or small values.
  It can be computed for ratio-level, interval-level, and ordinal-level data.

MODE: The value of the observation that appears most frequently.
Properties of Mode
  It is a robust measure, that is, it is not affected by extreme values.
  Some data sets have no mode, and others have more than one mode.
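The mean examples above can be checked with a short sketch. It assumes the four results map to the components in the order listed (quizzes 92, mid 95, assignments 90, final 65), and it estimates the grouped mean of the speed data by treating each class midpoint as the representative speed, which is the usual approximation rather than an exact value.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Candidate's final percentage (assumed mapping of scores to components).
scores = {"quizzes": 92, "mid": 95, "assignments": 90, "final": 65}
weights = {"quizzes": 15, "mid": 25, "assignments": 10, "final": 50}
final_percentage = weighted_mean(
    [scores[name] for name in scores],
    [weights[name] for name in scores],
)

# Grouped speed data: class midpoints weighted by the number of vehicles.
midpoints = [23, 28, 33, 40.5, 53]   # midpoints of 21-25, 26-30, 31-35, 36-45, 46-60
vehicles = [22, 48, 25, 16, 9]
estimated_mean_speed = weighted_mean(midpoints, vehicles)

# Ben needs a total of 5 * 80 over five tests and already has 4 * 78.
lowest_fifth_mark = 5 * 80 - 4 * 78

print(final_percentage, round(estimated_mean_speed, 2), lowest_fifth_mark)
```

Using one function for both cases reflects that a grouped mean is simply a weighted mean whose weights are the class frequencies.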
Geometric Mean
The geometric mean is useful in finding the average of percentages, ratios, indexes, or growth rates. For n positive values it is the n-th root of their product:
$GM = \sqrt[n]{X_1 X_2 \cdots X_n}$
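A small sketch of the geometric mean applied to growth rates. The two growth rates (5% and 15%) are hypothetical, and the computation goes through logarithms, which is one common way to avoid overflow in the product; the standard library's `statistics.geometric_mean` would serve equally well.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical growth of 5% then 15%, expressed as growth factors.
growth_factors = [1.05, 1.15]
gm = geometric_mean(growth_factors)
average_growth_rate = gm - 1

print(f"average growth rate: {average_growth_rate:.4%}")
```

Note that the result is slightly below the arithmetic mean of the two rates (10%), which is why the geometric mean is preferred for compounding quantities.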
Measures of Variability
Why Study Dispersion
1. An average alone may not be representative of the data when the spread is large.
2. A second reason for studying the dispersion in a set of data is to compare the spread in two or more distributions.
A small value for a measure of dispersion indicates that the data are clustered closely, say, around the arithmetic mean. The mean is therefore considered representative of the data. Conversely, a large measure of dispersion indicates that the mean is not reliable.

Range: The range is based on the largest and the smallest values in the data set. It is the difference between the largest and the smallest value.
Range = Largest value - Smallest value

MEAN DEVIATION: The arithmetic mean of the absolute values of the deviations from the arithmetic mean.
$MD = \frac{\sum |X - \bar{X}|}{n}$

Advantages and Drawback of Mean Deviation
Advantages: It uses all the values in the computation. It is easy to understand.
Drawback: It uses absolute values, which are difficult to work with, so this measure is not frequently used.
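The range and the mean deviation can be sketched as follows. The five data values are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)


# Hypothetical observations; their mean is 50.
observations = [20, 40, 50, 60, 80]

spread = data_range(observations)     # 80 - 20
md = mean_deviation(observations)     # (30 + 10 + 0 + 10 + 30) / 5

print(spread, md)
```

The range depends on only two observations, while the mean deviation uses all of them, which is the advantage noted above.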
VARIANCE: The arithmetic mean of the squared deviations from the mean.
STANDARD DEVIATION: The square root of the variance.
Population Variance:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
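A minimal sketch of the variance and standard deviation, with the sample variance (divisor n - 1 instead of N) included for comparison. The data are hypothetical; the standard library's `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the mean (divisor N)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean divided by n - 1."""
    if len(values) < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / len(values)
    return sum((v - x_bar) ** 2 for v in values) / (len(values) - 1)


# Hypothetical observations; mean 50, squared deviations sum to 2000.
observations = [20, 40, 50, 60, 80]

pop_var = population_variance(observations)   # 2000 / 5
pop_std = math.sqrt(pop_var)                  # square root of the variance
samp_var = sample_variance(observations)      # 2000 / 4

print(pop_var, pop_std, samp_var)
```

The sample variance is always a little larger than the population formula applied to the same numbers, because it divides by n - 1.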
Sample Variance:
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$

CHEBYSHEV'S THEOREM: For any set of observations (sample or population), the proportion of the values that lie within k standard deviations of the mean is at least $1 - 1/k^2$, where k is any constant greater than 1.

EMPIRICAL RULE: For a symmetrical, bell-shaped frequency distribution, approximately 68 percent of the observations will lie within plus and minus one standard deviation of the mean; about 95 percent of the observations will lie within plus and minus two standard deviations of the mean; and practically all (99.7 percent) will lie within plus and minus three standard deviations of the mean.

Quartiles, Deciles, and Percentiles
A percentile (or centile) is the value of a variable below which a certain percent of observations fall. The location of the P-th percentile in an ordered data set of n values is
$L_p = (n + 1)\frac{P}{100}$
Example data: 91, 75, 61, 101, 43, 104

Box Plots
A box plot is a graphical display, based on quartiles, that helps us picture a set of data. To construct a box plot, we need only five statistics: the minimum value, $Q_1$ (the first quartile), the median, $Q_3$ (the third quartile), and the maximum value.
Outlier: An outlier is a value that is inconsistent with the rest of the data.
Interquartile Range: The interquartile range is the distance between the first and the third quartile.

Skewness
Symmetric: In a symmetric set of observations the mean and median are equal and the data values are evenly spread around these values. The data values below the mean and median are a mirror image of those above.
Positively Skewed: A set of values is skewed to the right, or positively skewed, if there is a single peak and the values extend much further to the right of the peak than to the left of the peak. In this case the mean is larger than the median.
Negatively Skewed: In a negatively skewed distribution there is a single peak but the observations extend further to the left, in the negative direction, than to the right. In a negatively skewed distribution the mean is smaller than the median.
Bimodal: A bimodal distribution has two peaks. This is often the case when the values are from two populations.

How to Assess Skewness with the Help of a Boxplot
Symmetric
  The distance from Min to $Q_2$ = the distance from $Q_2$ to Max
  The distance from Min to $Q_1$ = the distance from $Q_3$ to Max
  The distance from $Q_1$ to $Q_2$ = the distance from $Q_2$ to $Q_3$
Right Skewed
  The distance from $Q_2$ to Max > the distance from Min to $Q_2$
  The distance from $Q_3$ to Max > the distance from Min to $Q_1$
  The distance from $Q_2$ to $Q_3$ > the distance from $Q_1$ to $Q_2$
Left Skewed
  The distance from Min to $Q_2$ > the distance from $Q_2$ to Max
  The distance from Min to $Q_1$ > the distance from $Q_3$ to Max
  The distance from $Q_1$ to $Q_2$ > the distance from $Q_2$ to $Q_3$
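The percentile location formula and the box-plot comparisons above can be tried on the six values listed in the percentiles discussion (91, 75, 61, 101, 43, 104). The sketch assumes that a fractional location is resolved by linear interpolation between the two neighbouring ordered values, which is one common convention; other software may use different quantile rules and give slightly different quartiles.

```python
def percentile(values, p):
    """Value at location (n + 1) * p / 100 in the ordered data (linear interpolation)."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100          # 1-based position in the ordered data
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]
    lower_position = int(location)        # whole part of the location
    fraction = location - lower_position  # how far to move toward the next value
    lower = ordered[lower_position - 1]
    upper = ordered[lower_position]
    return lower + fraction * (upper - lower)


def boxplot_skew(minimum, q1, q2, q3, maximum):
    """Apply the three left-versus-right distance comparisons from the box plot."""
    pairs = [
        (q2 - minimum, maximum - q2),   # Min to Q2 versus Q2 to Max
        (q1 - minimum, maximum - q3),   # Min to Q1 versus Q3 to Max
        (q2 - q1, q3 - q2),             # Q1 to Q2 versus Q2 to Q3
    ]
    if all(left == right for left, right in pairs):
        return "symmetric"
    if all(right > left for left, right in pairs):
        return "right skewed"
    if all(left > right for left, right in pairs):
        return "left skewed"
    return "inconclusive"


data = [91, 75, 61, 101, 43, 104]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1
shape = boxplot_skew(min(data), q1, q2, q3, max(data))

print(q1, q2, q3, iqr, shape)
```

With only six observations the verdict is a rough indication at best; the point is to show how the three distance comparisons are read off the five-number summary.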
Measures of Skewness
Univariate vs. Bivariate
Scatter Diagram: A graph used to show the relationship between two variables is called a scatter diagram.
CONTINGENCY TABLE: A table used to classify observations according to two identifiable characteristics.
Stem and Leaf Plot
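The notes name the stem-and-leaf plot without describing it, so the following is only a sketch of the usual construction: each value is split into a stem (all digits but the last) and a leaf (the last digit). It assumes non-negative whole numbers and reuses the six values from the percentile example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines 'stem | leaves' for non-negative whole numbers."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 104 -> stem 10, leaf 4
        stems[stem].append(leaf)

    lines = []
    # Include empty stems so gaps in the data remain visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


data = [91, 75, 61, 101, 43, 104]
plot_lines = stem_and_leaf(data)
print("\n".join(plot_lines))
```

Unlike a frequency distribution, this display keeps every original value readable while still showing where the data cluster.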