Basic Stats Information

Main topic, topic, and what it says

Histogram
- Summarizes data and presents it in graphical form.
- Shows how many times the data points fall at each value or range (their frequency).

Outliers
- Understand what the outlier implies.
- Decide whether to leave it in or remove it from the data set.

Mean
- Shows the average value of the data set.
- Weighs the value of every data point, so outliers can easily distort the mean and skew it toward them.
- When plotting a histogram, a mean that sits toward the right of the graph means that outliers on that side are pulling the mean toward them.

Median
- The middle value of the data set when arranged in ascending order.
- The answer to the problem of data skewed by outliers.

Mode
- The most frequently occurring value in the data set.
- Use it when the numeric magnitude of the values is not important.
- When the data has two modes, it is called a bimodal distribution.

Standard deviation
- Measures the level of dispersion (variability) in the given set of data.
- Tells us how far the data lies from the mean.
- A higher standard deviation means the data is widely spread around the mean; a lower standard deviation means the data is closer to the mean.

Coefficient of variation
- Measured as the standard deviation divided by the mean.

Two variables
- Scatter plot: a visual summary of the relationship between two variables.
- When one variable is time, the relationship is called a time series.
- A false relationship can be purely due to coincidence. Look out for hidden variables.
- A scatter plot does not prove causality; it never proves that one variable causes the other.

Correlation
- Quantifies the extent to which there is a linear relationship between two variables.
- Takes a value between -1 and 1. -1 represents a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation.
- Even if the correlation is 0, a relationship may exist; it may simply not be linear (for example, a U-shaped pattern).
- Influence of outliers: outliers can strongly influence the correlation. Do not decide on the relationship based on the number alone. A graphical summary may clearly show the outliers. Attempt to reconcile the outliers by studying the data further.
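The two cautions about correlation (a zero coefficient does not rule out a relationship, and a single outlier can change the number drastically) can be illustrated with a small sketch. The data values and the helper name `pearson` are made up for illustration; the formula is the standard Pearson coefficient, which is assumed here to be what Excel's CORREL computes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: y is fully determined by x, but the pattern is U-shaped.
x = [-3, -2, -1, 0, 1, 2, 3]
u_shaped = [v * v for v in x]
r_u = pearson(x, u_shaped)

# Hypothetical data: a perfect straight line, then the same line plus one outlier.
line_y = [2 * v + 1 for v in x]
r_line = pearson(x, line_y)
r_outlier = pearson(x + [4], line_y + [-20])

print(f"U-shaped:     {r_u:.3f}")
print(f"Straight line: {r_line:.3f}")
print(f"With outlier:  {r_outlier:.3f}")
```

In this made-up example one extra point is enough to turn a perfect positive correlation into a weakly negative one, which is why the scatter plot should be checked alongside the number.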
Generating random samples

Sample size
- Avoid bias.
- The sample should be representative of the whole population.
- The sample size should be chosen for the level of accuracy it gives, not for the sake of large numbers.
- A larger population does not necessitate a larger sample size, and a larger sample does not help if it is not representative of the population.

Learning about sample
- How do we collect the sample (mail, phone, in person, survey, handouts, etc.)? Each method has several disadvantages.

Response rate
- For surveys with a low response rate, contact the non-respondents, or make sure that the opinions of the non-respondents are reflected by those who responded.
- It is better to take a small sample and pursue a high response rate than the other way around.

Use of confidence levels
- The sample mean is the best point estimate of the population mean.
- Construct an interval that tells how close the sample mean is to the population mean.
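A minimal sketch of the sampling steps (select elements at random, analyze the sample, infer something about the population). The population here is invented and the seed is fixed only so the run is repeatable; none of these numbers come from the notes.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 made-up measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Step 1: select elements from the population at random (without replacement).
sample = rng.sample(population, 100)

# Step 2: analyze the sample.
sample_mean = statistics.mean(sample)
sample_std_dev = statistics.stdev(sample)

# Step 3: use the sample mean as the point estimate of the population mean.
population_mean = statistics.mean(population)
print(f"sample mean {sample_mean:.2f} vs population mean {population_mean:.2f}")
```

In practice the population mean is unknown; it is computed here only to show how close the point estimate lands.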
Data description
Factors that affect the interval
- The sample mean should be at the center of the range.
- A higher standard deviation means greater uncertainty about the population, so a wider range is needed to reach the same confidence.
- A small sample size demands a wider range to be confident that the population mean (PM) lies within the range around the sample mean (SM).
- The more confident we want to be that the SM represents the PM, the wider the range must be.

Normal distribution

What's special
- The curve is bell shaped with the mean at the center.
- The x-axis is the variable we are studying and the y-axis is the likelihood of the different values that occur.
- The mean and the median are the same. The probability of a value less than the mean is 50%, and of a value more than the mean is 50%.
- The location and the width or narrowness of the curve depend on the mean and the standard deviation.

Importance of standard deviation
- A large standard deviation makes the curve flat; a small standard deviation makes the curve narrow and tall (with values closer to the mean).

Rule of thumb
- About 68% of the time, the value lies within 1 standard deviation of the mean.
- About 95% of the time, the value lies within 2 standard deviations of the mean.

Z value
- Translates any value into the corresponding z value by subtracting the mean and dividing by the standard deviation: z = (x - mean) / std dev.
- z multiplied by the standard deviation, then added to and subtracted from the mean, gives the range and the probability that a value falls within that range (68%, 95%, 99%).
- Starting from the far left of the curve, the area up to a given value measures the cumulative probability. These probabilities apply only to the normal distribution curve (not to all curves).

How to find cumulative probability
- First standardize the value of the variable with Excel's STANDARDIZE function (this gives the value of z). Second, use the NORMSDIST function to find the cumulative probability.
- The other option is to use NORMDIST with the cumulative argument set to TRUE. This returns the cumulative probability directly.
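The two routes to a cumulative probability can be sketched in Python with `statistics.NormalDist` (standard library, Python 3.8+). The mean, standard deviation and value below are hypothetical, and the mapping to the Excel functions in the comments is an assumption about equivalent behavior, not something the notes state.

```python
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration.
mean, std_dev = 100.0, 15.0
x = 130.0

# Route 1: standardize first (like STANDARDIZE), then look up the
# standard normal cumulative probability (like NORMSDIST).
z = (x - mean) / std_dev
standard_normal = NormalDist()  # mean 0, standard deviation 1
p_from_z = standard_normal.cdf(z)

# Route 2: ask for the cumulative probability directly
# (like NORMDIST with the cumulative argument set to TRUE).
p_direct = NormalDist(mean, std_dev).cdf(x)


def within(k):
    """Probability of landing within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


print(f"z = {z}, cumulative probability = {p_from_z:.4f}")
print(f"within 1 std dev: {within(1):.4f}, within 2 std dev: {within(2):.4f}")
```

Both routes give the same probability, and `within(1)` and `within(2)` reproduce the rule-of-thumb figures of roughly 68% and 95%.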
How to find the z value if you have the cumulative probability

How to find the value of the variable if you have the cumulative probability, the sample mean and the standard deviation

Central Limit Theorem
- The sample mean is distributed approximately normally, regardless of the distribution of the population.
- The larger the sample, the better the approximation to the normal distribution.
- The mean of the distribution of sample means equals the population mean.

Confidence intervals
- Use the properties of the normal distribution to extract information from the sample.
- It's important to emphasize: we are not saying that 95% of the time our sample mean is the population mean. We are saying that 95% of the time a range extending two standard deviations of the sample mean on either side of the sample mean contains the population mean.

Estimating population mean

Increase confidence level
- Accept a wider range or increase the sample size.

How wide the interval
- How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an interval, that interval is too wide.

Std dev of sample mean

How to find confidence interval
- This works only for large samples (sample size of at least 30).
- We need to know the level of confidence.
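The Central Limit Theorem statements above can be checked with a small simulation. This is only a sketch: the exponential population, the sample size of 40 and the 2,000 repetitions are arbitrary choices made for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 40                  # size of each sample
num_samples = 2000      # how many samples are drawn

# Hypothetical skewed population: exponential with mean 1 and std dev 1.
sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Standard deviation of the sample mean: population std dev / sqrt(n).
predicted_sd = 1.0 / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"std dev of sample means: {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
```

Even though the population is far from bell shaped, the sample means cluster around the population mean with a spread close to the population standard deviation divided by the square root of n.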
Obtaining Z value
- Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is essential because Excel's NORMSINV function and the z-table work with cumulative probabilities.

For smaller sample sizes (less than 30)
- We have to use the t value instead of the z value.
- Degrees of freedom = sample size - 1.

Choosing sample size
- Based on an initial estimate, find the standard deviation s, and decide the maximum deviation d that is allowed.
- Apply the formula n = (z * s / d)^2 to get the desired sample size.

Summary of how to build the range that contains the population mean
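The conversion from a confidence level to a z value can be sketched as follows. The function name `z_for_confidence` is made up, and `NormalDist().inv_cdf` is assumed to play the role of Excel's NORMSINV.

```python
from statistics import NormalDist


def z_for_confidence(confidence_level):
    """z value whose central area under the standard normal curve
    equals the given confidence level."""
    tail = (1 - confidence_level) / 2       # area left over in each tail
    cumulative = confidence_level + tail    # cumulative probability up to +z
    return NormalDist().inv_cdf(cumulative)


for level in (0.68, 0.95, 0.995):
    print(f"confidence {level}: z = {z_for_confidence(level):.6f}")
```

A 95% confidence level becomes a cumulative probability of 0.975 and a z of about 1.96. The 2.807033768 listed beside NORMSINV in these notes is what this conversion gives for a 99.5% confidence level, although the sheet does not say which input produced it. The t value for samples under 30 is left out because Python's standard library has no t-distribution.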
Working with proportions
- Often used to indicate the frequency of some phenomenon in the population.
- p-bar is the proportion of "yes" responses out of the total in the sample.

Sample size selection
- The selected sample size should satisfy the conditions n x p-bar >= 5 and n x (1 - p-bar) >= 5.
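Method and calculations/formulas

- Histogram: use the Excel histogram function (under the Analysis ToolPak).
- Mean: the Greek letter mu represents the mean of the data set. Use the AVERAGE formula in Excel.
- Median: use the MEDIAN formula in Excel.
- Mode: use the MODE formula in Excel.
- Standard deviation: represented by the Greek letter sigma. Use the Excel formula STDEV.
- Coefficient of variation: can be used to compare different sets of data.
- Correlation: use the Excel CORREL function to find the correlation.
- Generating random samples: select elements from the population at random, analyze the sample, and draw inferences about the total population we are interested in.
- Use of confidence levels: we need to know x-bar (the sample mean), s (the standard deviation of the sample) and n (the sample size). z represents the confidence level: the higher the value of z, the higher the confidence level.
- Cumulative probability: STANDARDIZE, then NORMSDIST; or NORMDIST.
- z value from a cumulative probability: NORMSINV (the sheet shows the result 2.807033768).
- Value of the variable from a cumulative probability: NORMINV.
- Std dev of sample mean: the standard deviation of the population divided by the square root of n.
- Obtaining Z value: to convert the desired confidence level, take 1 minus the desired confidence level and divide by 2, then add the result to the desired confidence level.
- t value: use TINV. Input 1 minus the confidence level and the degrees of freedom.
- Choosing sample size: solve the equation or use the Excel utility.
- Building the range: use the Excel utility (confidence interval).
- Working with proportions: use the Excel utility; number of rooms available divided by the upper limit of the confidence interval.
- Sample size selection: n x p-bar >= 5, n x (1 - p-bar) >= 5.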
Confidence Interval Utility

Type of estimate: Mean. Sample size: n >= 30

Input Area
n: 70
x-bar: 4.5
s: 1.2
Confidence level: 0.95

Output Area
Center of interval: 4.50
z*s/sqrt(n): 0.28
Lower end of interval: 4.22
Upper end of interval: 4.78
Interval width: 0.56

Other Calculations
1 - confidence level: 0.05
(1 - confidence level)/2: 0.025
z: 1.96
sqrt(n): 8.37
s/sqrt(n): 0.14

Type of estimate: Mean. Sample size: n < 30

Input Area
n: 20
x-bar: 5
s: 10
Confidence level: 0.95

Output Area
Center of interval: 5.00
t*s/sqrt(n): 4.68
Lower end of interval: 0.32
Upper end of interval: 9.68
Interval width: 9.36

Other Calculations
1 - confidence level: 0.05
t: 2.09
sqrt(n): 4.47
s/sqrt(n): 2.24

Type of estimate: Proportions. Sample size: n >= 30

Input Area
n: 100
p-bar: 0.1
Confidence level: 0.95

Output Area
Center of interval: 0.10
z*s/sqrt(n): 0.06
Lower end of interval: 0.041
Upper end of interval: 0.159
Interval width: 0.12

Other Calculations
1 - confidence level: 0.05
(1 - confidence level)/2: 0.025
z: 1.96
(p)(1-p): 0.09
s = sqrt[(p)(1-p)]: 0.30
sqrt(n): 10.00
s/sqrt(n): 0.03
Check assumptions: np > 5: OK; n(1-p) > 5: OK
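The utility's arithmetic can be reproduced with a short sketch. The function names are invented for illustration. The t multiplier 2.093 is an assumption: it is the standard table value for 19 degrees of freedom at 95% confidence (the sheet displays it rounded to 2.09), and it is passed in by hand because Python's standard library has no t-distribution.

```python
import math
from statistics import NormalDist


def z_multiplier(confidence_level):
    """z for a two-sided interval: cumulative probability = level + (1 - level)/2."""
    return NormalDist().inv_cdf(confidence_level + (1 - confidence_level) / 2)


def interval(center, s, n, multiplier):
    """Return (lower, upper) = center -/+ multiplier * s / sqrt(n)."""
    half_width = multiplier * s / math.sqrt(n)
    return center - half_width, center + half_width


# Mean, n >= 30: inputs taken from the utility (n=70, x-bar=4.5, s=1.2).
large = interval(4.5, 1.2, 70, z_multiplier(0.95))

# Mean, n < 30: n=20, x-bar=5, s=10; t value supplied by hand (assumption).
small = interval(5, 10, 20, 2.093)

# Proportion: n=100, p-bar=0.1, with s = sqrt(p * (1 - p)).
p_bar, n = 0.1, 100
assumptions_ok = n * p_bar >= 5 and n * (1 - p_bar) >= 5
prop = interval(p_bar, math.sqrt(p_bar * (1 - p_bar)), n, z_multiplier(0.95))

print("mean, n >= 30:", tuple(round(v, 2) for v in large))
print("mean, n < 30: ", tuple(round(v, 2) for v in small))
print("proportion:   ", tuple(round(v, 3) for v in prop), "assumptions ok:", assumptions_ok)
```

Passing the multiplier in as an argument keeps one `interval` function usable for the z case, the t case and the proportion case.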
146230062.xlsx.ms_office
Confidence Intervals
11/16
Sample Size Utility

Type of estimate: Mean

Input Area
Sample standard deviation, s: 50
Desired accuracy (half width of interval), d: 5
Confidence level: 0.95

Output Area
Required sample size: 385

Other Calculations
1 - confidence level: 0.05
(1 - confidence level)/2: 0.025
z: 1.96
z*s: 98.00
z*s/d: 19.60
Minimal n: 384.15

Type of estimate: Proportion

Input Area
Estimate of p: 0.1
Desired accuracy (half width of interval), d: 0.02
Confidence level: 0.95

Output Area
Required sample size: 865

Other Calculations
1 - confidence level: 0.05
(1 - confidence level)/2: 0.025
z: 1.96
(p)(1-p): 0.09
s = sqrt[(p)(1-p)]: 0.30
z*s = z*sqrt[(p)(1-p)]: 0.59
z*s/d = {z*sqrt[(p)(1-p)]}/d: 29.40
Minimal n to ensure np > 5: 50.0
Minimal n to ensure n(1-p) > 5: 5.6
Minimal n to ensure z*s/sqrt(n) <= d: 864.3
Minimal n to satisfy all constraints: 864.3

Assumptions: The sample size will be above 30. If not, raise the sample size to 30 to make the assumptions valid. The proportion estimate is the maximum you expect p to be. If you don't have a good estimate of the proportion, use p = 0.5, which gives the maximal standard deviation.
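The sample size calculation can be sketched in the same way. The function names are made up; the formula n = (z*s/d)^2, the np and n(1-p) constraints and the floor of 30 follow the utility's own rows and its assumptions note.

```python
import math
from statistics import NormalDist


def z_multiplier(confidence_level):
    """z for a two-sided interval at the given confidence level."""
    return NormalDist().inv_cdf(confidence_level + (1 - confidence_level) / 2)


def sample_size_for_mean(s, d, confidence_level):
    """Smallest whole n with z*s/sqrt(n) <= d, raised to 30 if below."""
    minimal_n = (z_multiplier(confidence_level) * s / d) ** 2
    return max(math.ceil(minimal_n), 30)


def sample_size_for_proportion(p, d, confidence_level):
    """Smallest whole n meeting the accuracy target and both np checks."""
    s = math.sqrt(p * (1 - p))
    n_accuracy = (z_multiplier(confidence_level) * s / d) ** 2
    n_np = 5 / p            # ensures n*p reaches 5
    n_n1p = 5 / (1 - p)     # ensures n*(1-p) reaches 5
    return max(math.ceil(max(n_accuracy, n_np, n_n1p)), 30)


n_mean = sample_size_for_mean(s=50, d=5, confidence_level=0.95)
n_prop = sample_size_for_proportion(p=0.1, d=0.02, confidence_level=0.95)
print(n_mean, n_prop)
```

With the utility's inputs this gives 385 for the mean and 865 for the proportion, matching the Required sample size rows.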
Cereal (59 cereals, listed alphabetically):
100% Bran; All-Bran; Almond Delight; Apple Cinnamon Cheerios; Apple Jacks; Bran Chex; Bran Flakes; Cap'n'Crunch; Cheerios; Cinnamon Toast Crunch; Cocoa Puffs; Corn Chex; Corn Flakes; Corn Pops; Count Chocula; Cracklin' Oat Bran; Cream of Wheat (Quick); Crispix; Double Chex; Froot Loops; Frosted Flakes; Frosted Mini-Wheats; Fruit & Fibre Dates, Walnuts, and Oats; Fruity Pebbles; Golden Grahams; Grape Nuts Flakes; Grape-Nuts; Great Grains Pecan; Honey Nut Cheerios; Honey-comb; Kix; Life; Lucky Charms; Maypo; Muesli Raisins, Dates, & Almonds; Muesli Raisins, Peaches, & Pecans; Mueslix Crispy Blend; Nut&Honey Crunch; Nutri-grain Wheat; Post Nat. Raisin Bran; Product 19; Puffed Rice; Puffed Wheat; Quaker Oat Squares; Raisin Bran; Raisin Nut Bran; Raisin Squares; Rice Chex; Rice Krispies; Shredded Wheat; Smacks; Special K; Total Corn Flakes; Total Raisin Bran; Total Whole Grain; Triples; Trix; Wheat Chex; Wheaties

Protein (grams per serving), in the order the values appear in the sheet (ascending):
1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 6 6

Complex carbohydrates (grams per serving), in the order the values appear in the sheet (ascending within each protein level):
12 12 12 13 13 13 13 13 14 14 15 23 9 10 11 11 11 12 14 15 15 15 16 18 21 21 21 21 21 22 22 10 11 11 12 12 13 13 14 14 15 15 16 17 17 17 17 18 20 21 5 7 12 14 16 16 16 16 17

Summary
Protein: mean 2.49, median 2.00
Complex carbohydrates: mean 14.81, median 14.00
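The summary figures for the Protein column can be recomputed with Python's `statistics` module. The list below is built from counts of each value transcribed from the Protein column (twelve 1s, nineteen 2s, nineteen 3s, seven 4s, two 6s). `statistics.stdev` is the sample standard deviation, which is assumed here to correspond to Excel's STDEV.

```python
import statistics

# Protein (grams per serving), rebuilt from the counts of each value
# transcribed from the sheet.
protein = [1] * 12 + [2] * 19 + [3] * 19 + [4] * 7 + [6] * 2

mean = statistics.mean(protein)
median = statistics.median(protein)
modes = statistics.multimode(protein)   # every value tied for most frequent
std_dev = statistics.stdev(protein)     # sample standard deviation
coefficient_of_variation = std_dev / mean

print(f"n = {len(protein)}")
print(f"mean = {mean:.2f}, median = {median:.2f}, mode(s) = {modes}")
print(f"std dev = {std_dev:.2f}, coefficient of variation = {coefficient_of_variation:.2f}")
```

`multimode` is used rather than `mode` so that a tie between two values, the bimodal case mentioned in the notes, shows up instead of being hidden.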
Variable 1: -1.0 -1.0 1.0 1.0 -1.0 1.0 -1.0 -1.0 -1.0 1.0 1.0 1.0
Variable 2: 1.0 1.0 -1.0 -1.0 1.0 -1.0 1.0 1.0 1.0 -1.0 -1.0 -1.0
Correlation: -1.000000

Age: 53 43 33 45 46 55 41 55 36 45 55 50 49 47 69 51 48 62 45 37 50 50 50 58 53 57 53 61 47 56 44 46 58 48 38 74 60 32 51 50 40 61 63 56 45 61 70 59 57 69 44 56 50 56 43 48 52 62 48
Salary ($ thousands): 145 621 262 208 362 424 339 736 291 58 498 643 390 332 750 368 659 234 396 300 343 536 543 217 298 1103 406 254 862 204 206 250 21 298 350 800 726 370 536 291 808 543 149 350 242 198 213 296 317 482 155 802 200 282 573 388 250 396 572
Correlation: 0.13
