
Main topic

Topic and what it says

Histogram
Summarizes data and presents it in graphical form. Shows how many times the data points occur in each range of values (the frequency).

Outliers
Understand what the outlier implies, then decide whether to leave it in or remove it from the data set.

Mean
The average value of the data set. It weighs the value of every data point, so outliers can easily distort the mean and skew it toward them. When plotting a histogram, a mean that has moved toward the right of the graph indicates that outliers are pulling it in that direction.

Median
The middle value of the data set when arranged in ascending order. It is the answer to the problem of data skewed by outliers.

Mode
The most frequently occurring value in the data set. Use it when the numeric value itself is not important. When data has two modes, the distribution is called bimodal.

Standard deviation
Measures the level of dispersion (variability) in the data set, that is, how far the data lies from the mean. A higher standard deviation means the data is widely spread around the mean; a lower standard deviation means the data lies closer to the mean.

Coefficient of variation
Measured as the standard deviation divided by the mean.

Two variables
A scatter plot is a visual summary of the relationship between two variables. When one variable is time, the relationship is called a time series. An apparent relationship can be false and arise purely from coincidence, so look out for hidden variables. A scatter plot does not prove causality: it never proves that one variable causes the other.

Correlation
Quantifies the extent to which there is a linear relationship between two variables. It takes values between -1 and 1: -1 represents a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation. Even if the correlation is 0, a relationship may exist that is simply not linear (for example, a U-shaped pattern).

- Influence of outliers
Outliers can strongly influence the correlation. Do not judge the relationship on the number alone; a graphical summary may clearly show the outliers. Try to reconcile the outliers by studying the data further.
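As a rough illustration of the summary measures above, the following sketch uses Python's standard `statistics` module. The data set is hypothetical (it is not taken from the worksheet) and includes one deliberate outlier to show how the mean moves while the median does not.

```python
import statistics

# Hypothetical data set: eight typical values plus one outlier (40).
data = [2, 3, 3, 4, 5, 5, 5, 6, 40]

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value
stdev = statistics.stdev(data)    # sample standard deviation (n - 1 divisor)
cv = stdev / mean                 # coefficient of variation

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"stdev={stdev:.2f} cv={cv:.2f}")
```

Dropping the value 40 from the list and rerunning shows the mean falling back toward the median, which is the distortion described under Mean.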

Generating random samples
- Avoid bias.
- The sample should be representative of the whole population.

Sample size
- The sample size should give us the level of accuracy we need; the goal is accuracy, not sheer numbers.
- A larger population does not by itself call for a larger sample; what matters is that the sample is representative of the population.

Learning about sample
- How do we collect the sample (mail, phone, in person, survey, handouts, etc.)?
- Each method has several disadvantages.

Response rate
- For surveys with a low response rate, contact the non-respondents, or make sure that the non-respondents share the opinions of those who responded.
- It is better to take a small sample and pursue a high response rate than the other way around.

Use of confidence levels
- The sample mean is the best point estimate of the population mean.
- Construct an interval that tells how close the sample mean is to the population mean.

Data description
Normal Distribution

What's special
- The curve is bell-shaped, with the mean at the center.
- The x-axis is the variable we are studying and the y-axis is the likelihood of the different values that occur.
- The mean and median are the same. The probability of a value less than the mean is 50%, and of a value more than the mean is 50%.
- The location and the width or narrowness of the curve depend on the mean and the standard deviation.

Importance of std dev
A large standard deviation makes the curve flat; a small standard deviation makes the curve narrow and tall (with values closer to the mean).

Factors that affect interval level
- The sample mean should be at the center of the range.
- A higher standard deviation means greater uncertainty about the population, so a wider range is needed to reach the same confidence.
- A small sample size demands a wider range to create confidence that the population mean (PM) lies within the range around the sample mean (SM).
- The more confident we want to be that the SM represents the PM, the wider the range.

Rule of thumb
About 68% of the time, a value lies within 1 standard deviation of the mean. With about 95% probability, it lies within 2 standard deviations of the mean.

Z value
Translates any value into the corresponding z value by subtracting the mean and dividing by the standard deviation. Multiplying z by the standard deviation and adding it to, or subtracting it from, the mean gives a range and the probability that a value falls within that range (about 68%, 95%, and 99.7% for z = 1, 2, and 3).
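The z value and the rule of thumb can be checked with a short sketch. It assumes Python's standard `statistics.NormalDist`; the mean of 100, standard deviation of 15, and observed value of 130 are hypothetical numbers chosen for illustration.

```python
from statistics import NormalDist


def z_value(x, mean, std_dev):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / std_dev


std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_within(k):
    """Probability that a normal value lies within k std devs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Hypothetical example: mean 100, std dev 15, observed value 130.
z = z_value(130, 100, 15)
low, high = 100 - z * 15, 100 + z * 15  # range: mean -/+ z * std dev

print(z, (low, high))
for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The loop prints the share of values within 1, 2, and 3 standard deviations, which is where the 68% and 95% figures come from.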

If we start from the far left of the curve, the area under it up to a given value measures the cumulative probability. These probability calculations work only on the normal distribution curve (not on all curves).

How to find cumulative probability

First, standardize the value of the variable using the Excel STANDARDIZE function (this gives the value of z). Second, use the NORMSDIST function to find the cumulative probability. The other option is to use NORMDIST with the cumulative argument set to TRUE, which returns the cumulative probability directly.
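The two routes to a cumulative probability can be compared in Python as a sanity check. This is a minimal sketch using `statistics.NormalDist` in place of the Excel functions; the mean, standard deviation, and value are hypothetical.

```python
from statistics import NormalDist

# Hypothetical variable: mean 50, standard deviation 10, value of interest 65.
mean, std_dev, x = 50.0, 10.0, 65.0

# Route 1: standardize first (as STANDARDIZE does), then look up the
# standard normal cumulative probability (as NORMSDIST does).
z = (x - mean) / std_dev
p_via_z = NormalDist().cdf(z)

# Route 2: ask the normal distribution with this mean and std dev directly
# (as NORMDIST does when its cumulative argument is TRUE).
p_direct = NormalDist(mu=mean, sigma=std_dev).cdf(x)

print(round(z, 2), round(p_via_z, 4), round(p_direct, 4))
```

Both routes give the same probability, because standardizing only rescales the variable.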

How to find the z value if you have the cumulative probability

How to find the value of the variable if you have the cumulative probability, the mean, and the standard deviation

Central Limit Theorem

- The sample mean is distributed approximately normally, regardless of the distribution of the population.
- The larger the sample size, the better the approximation to the normal distribution.
- The mean of the distribution of sample means equals the population mean.

Confidence intervals

Use the properties of the normal distribution to extract information from the sample.

It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are saying that 95% of the time a range that is two standard deviations wide centered around the sample mean contains the population mean.

How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an interval, that interval is too wide.

Estimating population mean

Increase confidence level
Accept a wider range or increase the sample size.

How wide the interval

Std dev of sample mean

How to find confidence interval
This works only if the sample size is greater than 30. You need to know the level of confidence.

Obtaining Z value
Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is essential because Excel's NORMSINV function and the z-table work with cumulative probabilities.

For smaller sample sizes (less than 30), we have to use the t value. Degrees of freedom = sample size - 1.

Choosing sample size
Based on an initial estimate, find the standard deviation and decide the maximum deviation allowed, then solve for the sample size that achieves it.

Summary of how to build the range that contains the population mean

Working with proportions
Often used to indicate the frequency of some phenomenon in the population. p-bar is the proportion of "yes" answers in the total sample selected.

Sample size selection
The selected sample size should satisfy the conditions n x p-bar >= 5 and n x (1 - p-bar) >= 5.

Method

Calculations/Formulas

Use the Excel function (under the Analysis ToolPak).

The Greek letter mu represents the mean of the data set. Use the AVERAGE function in Excel.

Use the MEDIAN function in Excel.

Use the MODE function in Excel.

Greek letter sigma

Use the Excel function STDEV.

Can be used to compare different sets of data.

Use the Excel CORREL function to find the correlation.

- Select elements from the population at random.
- Analyze the sample.
- Draw inferences about the total population we are interested in.
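The three steps above can be sketched in Python with the standard `random` module. The population here is simulated and entirely hypothetical; in practice it would be the real list of units being studied.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Step 1: select elements at random, without replacement.
sample = rng.sample(population, 100)

# Step 2: analyze the sample.
sample_mean = statistics.mean(sample)

# Step 3: use the sample to say something about the whole population.
# Here the population mean is known only because the data is simulated.
population_mean = statistics.mean(population)
print(round(sample_mean, 2), round(population_mean, 2))
```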

You need to know x-bar (the sample mean), s (the standard deviation of the sample), and n (the sample size). z represents the confidence level: the higher the value of z, the higher the confidence level.

STANDARDIZE, NORMSDIST

NORMDIST

NORMSINV (example result: 2.807033768)

NORMINV

Standard deviation of the population divided by the square root of n.
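A small simulation can make this formula, and the central limit theorem behind it, more concrete. The sketch below is illustrative only: the uniform population, the sample size of 36, and the 5,000 repetitions are assumptions chosen for the example.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: uniform on [0, 1).
population_sd = math.sqrt(1 / 12)  # std dev of a uniform(0, 1) variable
n = 36         # size of each sample
trials = 5000  # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.mean(rng.random() for _ in range(n)) for _ in range(trials)
]

observed = statistics.stdev(sample_means)  # spread of the sample means
predicted = population_sd / math.sqrt(n)   # std dev divided by sqrt(n)
print(round(observed, 4), round(predicted, 4))
```

The observed spread of the sample means should land close to the predicted value, and the sample means cluster around the population mean of 0.5 even though the population itself is flat rather than bell-shaped.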

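To convert the desired confidence level, take 1 minus the desired confidence level and divide by 2. Then add the result to the desired confidence level. For example, a 95% confidence level gives (1 - 0.95) / 2 = 0.025, and 0.95 + 0.025 = 0.975.

Use TINV. Input 1 minus the confidence level, and the degrees of freedom.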
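The conversion from a confidence level to a z value can be sketched as follows. It assumes `statistics.NormalDist.inv_cdf` as a stand-in for NORMSINV; the helper name `z_for_confidence` is hypothetical. The t value for small samples is not covered, since Python's standard library has no t distribution.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Return z such that the middle `confidence` share of the standard
    normal curve lies between -z and +z."""
    tail = (1 - confidence) / 2      # area in one tail
    cumulative = confidence + tail   # cumulative probability up to +z
    return NormalDist().inv_cdf(cumulative)


for level in (0.68, 0.95, 0.995):
    print(level, round(z_for_confidence(level), 3))
```

A 95% level gives z of about 1.96. The value 2.807033768 that appears next to NORMSINV in these notes is consistent with a cumulative probability of 0.9975, that is, a 99.5% confidence level.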

Solve the equation or use the Excel utility.

Use the Excel utility (Confidence Interval).

Use the Excel utility. Number of rooms available divided by the upper limit of the confidence interval.

n x p-bar >= 5, n x (1 - p-bar) >= 5

Confidence Interval Utility


Type of Estimate: Mean
Sample Size: n >= 30
Input Area: n = 70; x-bar = 4.5; s = 1.2; confidence level = 0.95
Output Area: Center of Interval = 4.50; z*s/sqrt(n) = 0.28; Lower end of int'l = 4.22; Upper end of int'l = 4.78; Interval width = 0.56
Other Calculations: 1-confidence level = 0.05; (1-confidence level)/2 = 0.025; z = 1.96; sqrt(n) = 8.37; s/sqrt(n) = 0.14

Type of Estimate: Mean
Sample Size: n < 30
Input Area: n = 20; x-bar = 5; s = 10; confidence level = 0.95
Output Area: Center of Interval = 5.00; t*s/sqrt(n) = 4.68; Lower end of int'l = 0.32; Upper end of int'l = 9.68; Interval width = 9.36
Other Calculations: 1-conf = 0.05; t = 2.09; sqrt(n) = 4.47; s/sqrt(n) = 2.24

Type of Estimate: Proportions
Sample Size: n >= 30
Input Area: n = 100; p-bar = 0.1; confidence level = 0.95
Output Area: Center of Interval = 0.10; z*s/sqrt(n) = 0.06; Lower end of int'l = 0.041; Upper end of int'l = 0.159; Interval width = 0.12
Other Calculations: 1-confidence level = 0.05; (1-confidence level)/2 = 0.025; z = 1.96; (p)(1-p) = 0.09; s = sqrt[(p)(1-p)] = 0.30; sqrt(n) = 10.00; s/sqrt(n) = 0.03
Check assumptions: np > 5: OK; n(1-p) > 5: OK
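The large-sample rows of the utility can be reproduced with a short sketch. It assumes `statistics.NormalDist` in place of NORMSINV, and `z_interval` is a hypothetical helper name. The n < 30 case is left out because it needs a t value, which Python's standard library does not provide.

```python
import math
from statistics import NormalDist


def z_interval(center, s, n, confidence):
    """Return (lower, upper) for center -/+ z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    half_width = z * s / math.sqrt(n)
    return center - half_width, center + half_width


# Mean, n >= 30: n = 70, x-bar = 4.5, s = 1.2, confidence level = 0.95
mean_low, mean_high = z_interval(4.5, 1.2, 70, 0.95)
print(round(mean_low, 2), round(mean_high, 2))  # 4.22 4.78

# Proportion: n = 100, p-bar = 0.1, with s = sqrt(p-bar * (1 - p-bar))
p_bar, n = 0.1, 100
assert n * p_bar >= 5 and n * (1 - p_bar) >= 5  # sample-size condition
s = math.sqrt(p_bar * (1 - p_bar))
prop_low, prop_high = z_interval(p_bar, s, n, 0.95)
print(round(prop_low, 3), round(prop_high, 3))  # 0.041 0.159
```

Both results agree with the lower and upper ends shown in the utility for the same inputs.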

146230062.xlsx.ms_office

Confidence Intervals


Sample Size Utility


Type of Estimate: Mean
Input Area: Sample Standard Deviation, s = 50; Desired Accuracy: Half Width of Interval, d = 5; Confidence level = 0.95
Output Area: Required Sample Size = 385
Other Calculations: 1-confidence level = 0.05; (1-confidence level)/2 = 0.025; z = 1.96; z*s = 98.00; z*s/d = 19.60; Minimal n = 384.15

Type of Estimate: Proportion
Input Area: Estimate of p = 0.1; Desired Accuracy: Half Width of Interval, d = 0.02; Confidence level = 0.95
Output Area: Required Sample Size = 865
Other Calculations: 1-confidence level = 0.05; (1-confidence level)/2 = 0.025; z = 1.96; (p)(1-p) = 0.09; s = sqrt[(p)(1-p)] = 0.30; z*s = 0.59; z*s/d = 29.40
Minimal n to ensure np > 5: 50.0
Minimal n to ensure n(1-p) > 5: 5.6
Minimal n to ensure z*s/sqrt(n) <= d: 864.3
Minimal n to satisfy all constraints: 864.3

Assumptions: Sample Size will be above 30. If not, raise sample size to 30 to make assumptions valid. Proportion Estimate is the maximum you expect p to be. If you don't have a good estimate of the proportion, use p = .5, which gives maximal standard deviation.
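The sample-size calculation in the utility can be sketched in Python. The function names are hypothetical; the arithmetic follows the "Other Calculations" rows (z*s/d, squared and rounded up), together with the stated assumption that the sample size is raised to at least 30.

```python
import math
from statistics import NormalDist


def z_for(confidence):
    """z value for a two-sided interval at the given confidence level."""
    return NormalDist().inv_cdf(confidence + (1 - confidence) / 2)


def sample_size_for_mean(s, d, confidence):
    """Smallest n with z * s / sqrt(n) <= d, raised to at least 30."""
    n_accuracy = (z_for(confidence) * s / d) ** 2
    return max(math.ceil(n_accuracy), 30)


def sample_size_for_proportion(p, d, confidence):
    """Smallest n meeting the accuracy target and the n*p, n*(1-p) checks."""
    s = math.sqrt(p * (1 - p))
    n_accuracy = (z_for(confidence) * s / d) ** 2
    n_needed = max(n_accuracy, 5 / p, 5 / (1 - p))
    return max(math.ceil(n_needed), 30)


print(sample_size_for_mean(50, 5, 0.95))            # 385
print(sample_size_for_proportion(0.1, 0.02, 0.95))  # 865
```

Passing p = 0.5 to `sample_size_for_proportion` gives the most conservative (largest) answer, in line with the advice above for when no good estimate of the proportion is available.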


Sample Size


Cereal: 100% Bran; All-Bran; Almond Delight; Apple Cinnamon Cheerios; Apple Jacks; Bran Chex; Bran Flakes; Cap'n'Crunch; Cheerios; Cinnamon Toast Crunch; Cocoa Puffs; Corn Chex; Corn Flakes; Corn Pops; Count Chocula; Cracklin' Oat Bran; Cream of Wheat (Quick); Crispix; Double Chex; Froot Loops; Frosted Flakes; Frosted Mini-Wheats; Fruit & Fibre Dates, Walnuts, and Oats; Fruity Pebbles; Golden Grahams; Grape Nuts Flakes; Grape-Nuts; Great Grains Pecan; Honey Nut Cheerios; Honey-comb; Kix; Life; Lucky Charms; Maypo; Muesli Raisins, Dates, & Almonds; Muesli Raisins, Peaches, & Pecans; Mueslix Crispy Blend; Nut&Honey Crunch; Nutri-grain Wheat; Post Nat. Raisin Bran; Product 19; Puffed Rice; Puffed Wheat; Quaker Oat Squares; Raisin Bran; Raisin Nut Bran; Raisin Squares; Rice Chex; Rice Krispies; Shredded Wheat; Smacks; Special K; Total Corn Flakes; Total Raisin Bran; Total Whole Grain; Triples; Trix; Wheat Chex; Wheaties

Protein (grams per serving): 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 6 6
mean: 2.49
median: 2.00

Complex Carbohydrates (grams per serving): 12 12 12 13 13 13 13 13 14 14 15 23 9 10 11 11 11 12 14 15 15 15 16 18 21 21 21 21 21 22 22 10 11 11 12 12 13 13 14 14 15 15 16 17 17 17 17 18 20 21 5 7 12 14 16 16 16 16 17
mean: 14.81
median: 14.00

Variable 1: -1.0 -1.0 1.0 1.0 -1.0 1.0 -1.0 -1.0 -1.0 1.0 1.0 1.0
Variable 2: 1.0 1.0 -1.0 -1.0 1.0 -1.0 1.0 1.0 1.0 -1.0 -1.0 -1.0
Correlation: -1.000000

Age: 53 43 33 45 46 55 41 55 36 45 55 50 49 47 69 51 48 62 45 37 50 50 50 58 53 57 53 61 47 56 44 46 58 48 38 74 60 32 51 50 40 61 63 56 45 61 70 59 57 69 44 56 50 56 43 48 52 62 48
Salary ($thousands): 145 621 262 208 362 424 339 736 291 58 498 643 390 332 750 368 659 234 396 300 343 536 543 217 298 1103 406 254 862 204 206 250 21 298 350 800 726 370 536 291 808 543 149 350 242 198 213 296 317 482 155 802 200 282 573 388 250 396 572
Correlation: 0.13
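The correlation figures can be explored with a small Pearson function, a plain-Python counterpart to Excel's CORREL. The sketch assumes that the 24 Variable 1 / Variable 2 values above are listed column by column (the first twelve being Variable 1), a reading that reproduces the -1.000000 shown. The U-shaped data is hypothetical and illustrates a real relationship with zero linear correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Variable 1 / Variable 2 (assumed column-wise): every pair has opposite signs.
variable_1 = [-1, -1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1]
variable_2 = [1, 1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1]
r_variables = pearson(variable_1, variable_2)
print(r_variables)  # -1.0

# Hypothetical U-shaped data: a clear relationship, but not a linear one.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
r_u_shape = pearson(xs, ys)
print(r_u_shape)  # 0.0
```

The same function can be applied to the Age and Salary lists to examine the weak correlation reported for them.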
