
Lecture 4: Probability and Statistics

Goals of the lecture
- Finding a representative value that best characterizes the average of the data set.
- Finding a representative value that provides a measure of the variation in the measured data set.
- Establishing an interval about the representative average value in which the true value is expected to lie.

Introduction to Uncertainty and Estimation of Precision Uncertainty


Two classes of experiments exist:
- Single-sample experiment: the measurement is taken exactly once.
- Repeat-sample experiment: the same measurement is taken several times, under identical conditions.

Repeat sampling allows an estimate of the measurement to be made via statistical methods.

The total uncertainty $U_x$ in a measurement of x is calculated from the bias and precision uncertainties:
- Given $B_x$ = bias uncertainty and $P_x$ = precision uncertainty.
- Assume the sources of bias and precision error are independent.
- Total uncertainty: $U_x = \sqrt{B_x^2 + P_x^2}$, where $U_x$, $B_x$ and $P_x$ are all at the same odds (coverage, confidence).
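As a minimal sketch of the root-sum-square combination of independent bias and precision uncertainties, the following Python snippet may help. The function name and the numerical values are illustrative assumptions, not part of the lecture.

```python
import math

def total_uncertainty(bias, precision):
    """Root-sum-square of bias and precision uncertainties (assumed independent)."""
    return math.hypot(bias, precision)

# Hypothetical values, both quoted at the same confidence level
B_x = 0.3
P_x = 0.4
U_x = total_uncertainty(B_x, P_x)
print(f"U_x = {U_x:.2f}")
```

Because the two terms add in quadrature, the larger of the two tends to dominate the total.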

Error Distribution: Characterizes the probability that an error of a given size will occur during repeat-sample experiments.

Probability: an expression of the likelihood of a particular event taking place, measured with reference to all possible events.
The probability density function (PDF) for the entire population of possible precision error values is generally assumed to be Gaussian (normal, bell-shaped). Since total precision error is random, each individual measurement in the sample will have a distinct error, and the likelihood of an error occurring (roughly) decreases with its size.

Normal Distribution Function (Gaussian Function)

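For a normal distribution, if we know the mean $\mu$ and standard deviation $\sigma$, we can estimate the probability that a single measurement will lie within a band around the mean. The Gaussian probability density function is:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$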
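The probability of falling within a symmetric band of k standard deviations about the mean can be evaluated with the error function. The sketch below assumes an ideal normal distribution; the helper name `prob_within` is hypothetical.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

# Bands of one, two and three standard deviations
for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {100 * prob_within(k):.2f}%")
```

The result depends only on k, not on the particular mean or standard deviation.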

Probability density function
In simple terms, a probability density function (PDF) is constructed by drawing a smooth curve fit through the vertically normalized histogram.

Histograms:
A histogram is constructed by divvying up the n measurements of a sample into J bins or intervals (also called classes) such that for the first bin (j = 1), $x_1 < x \le x_2$, for the second bin (j = 2), $x_2 < x \le x_3$, etc. We define $x_{mid,j}$ as the middle value of x in bin j. For example, $x_{mid,2} = (x_2 + x_3)/2$. Generally between 6 and 15 bins are used, i.e., $6 \le J \le 15$. Then a bar plot is made of the frequency (the number of measurements in each bin) versus the value of x, as sketched. (The frequency is also called the class frequency.) The bin width (also called the interval width or class width) is usually constant, although it does not have to be. In the sketch above, the bin width $\Delta x_3$ for the third bin (j = 3) is shown.

The histogram can be modified by dividing the vertical axis by the total number of measurements, n. The resulting probability histogram has the same shape, but the vertical axis represents a relative frequency or probability, i.e.,

$$\text{relative frequency of bin } j = \frac{n_j}{n}$$

where $n_j$ is the number of measurements in bin j.

We can also define a vertically normalized histogram by further dividing the vertical axis by the bin width or interval width. The vertical axis of the vertically normalized histogram is defined as:

$$f(x_{mid,j}) = \frac{n_j}{n\,\Delta x_j}$$
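The three histogram variants (frequency, probability, vertically normalized) can be sketched in a few lines of Python. The sample values, the bin edges and the function name are assumptions made for illustration. Bin j is taken to hold values with `edges[j] < x <= edges[j + 1]`.

```python
def normalized_histogram(data, edges):
    """Return per-bin counts, relative frequencies and normalized heights."""
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for x in data:
        for j in range(len(counts)):
            if edges[j] < x <= edges[j + 1]:
                counts[j] += 1
                break
    rel_freq = [c / n for c in counts]
    heights = [c / (n * (edges[j + 1] - edges[j])) for j, c in enumerate(counts)]
    return counts, rel_freq, heights

# Hypothetical sample and bin edges
sample = [1.2, 1.7, 2.1, 2.4, 2.6, 2.9, 3.3, 3.8]
edges = [1.0, 2.0, 3.0, 4.0]
counts, rel_freq, heights = normalized_histogram(sample, edges)

# The vertically normalized histogram encloses unit area
area = sum(h * (edges[j + 1] - edges[j]) for j, h in enumerate(heights))
```

The unit area is what lets the vertically normalized histogram serve as an estimate of the PDF.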

To display this data as a histogram, we need to choose the number of small intervals (bins), K.

For small N, K should be chosen so that the number of measurement results in at least one bin is ≥ 5.

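For intermediate values of N:

$$K = 1.87\,(N - 1)^{0.40} + 1$$

For example, with N = 20 measurements:

$$K = 1.87\,(20 - 1)^{0.40} + 1 = 7.1$$

Minimum Value = 0.68, Maximum Value = 1.34

The data span 1.34 − 0.68 = 0.66, so about 7 bins require a width of roughly 0.66/7 ≈ 0.094. Therefore a bin width $\Delta x$ of 0.10 is chosen.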
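The bin-count estimate and the resulting bin width for the example above can be reproduced as follows. This is a sketch only: rounding the width up to one decimal place is an assumed convention, and the function name is hypothetical.

```python
import math

def bin_count(n_measurements):
    """Estimated number of histogram bins: K = 1.87 (N - 1)^0.40 + 1."""
    return 1.87 * (n_measurements - 1) ** 0.40 + 1

K = bin_count(20)                    # N = 20 measurements
n_bins = round(K)                    # nearest whole number of bins
raw_width = (1.34 - 0.68) / n_bins   # data range divided by number of bins
bin_width = math.ceil(raw_width * 10) / 10  # round up to one decimal place
print(f"K = {K:.2f}, bins = {n_bins}, bin width = {bin_width}")
```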


This histogram is an estimate of the data set probability density function.

$n_j$ is the number of samples in each bin. p(x) defines the probability that the measured variable might assume any particular value upon any individual measurement.

The Mean Value (Central Tendency) and Standard Deviation


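Continuous Random Variable

Mean Value: $\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$

Variance: $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$

Discrete Random Variable

Mean Value: $\mu = \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} x_i$

Variance: $\sigma^2 = \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$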

Finite Statistics
Unless we have made a very large number of measurements, we don't have an accurate estimate of the mean or standard deviation of a data set. If we assume the values are normally distributed, we can estimate the mean and standard deviation from the data.
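The sample mean and sample variance are given by:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\text{and}\quad s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$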
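A short Python sketch of the finite-sample estimates, using the N − 1 divisor for the variance, is given below. The readings are made-up values, and the cross-check against the standard library's `statistics.variance` is included only as a sanity test.

```python
import statistics

def sample_mean_and_variance(values):
    """Sample mean and sample variance (N - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance

# Hypothetical repeated readings of the same quantity
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance = sample_mean_and_variance(readings)
std_dev = variance ** 0.5

# The standard library uses the same N - 1 convention
library_variance = statistics.variance(readings)
```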

How close are these values to the true mean and standard deviation? That depends on how many samples we have.

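For a normally distributed data set, the range within which a sample, $x_i$, is expected to lie about the data set mean value, $\bar{x}$, at a probability P is given by

$$x_i = \bar{x} \pm t_{\nu,P}\, s_x \quad (P\%)$$

where $t_{\nu,P}$ is the Student's t variable for $\nu = N - 1$ degrees of freedom.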

Standard Deviation of the Means


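If we take a set of N measurements of the same variable, then repeat this process M times, the mean of each data set will differ somewhat from the others. It can be shown that the mean values themselves will tend to follow a normal distribution even if the original distribution is not normal.
The standard deviation of the means is given by:

$$s_{\bar{x}} = \frac{s_x}{\sqrt{N}}$$

At a probability P, the mean of a sample of N values differs from the true mean of the distribution by an amount no larger than

$$t_{\nu,P}\, s_{\bar{x}}$$

where $t_{\nu,P}$ is the Student's t variable for $\nu = N - 1$ degrees of freedom.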
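The sketch below estimates the standard deviation of the means from a single sample and turns it into an interval about the sample mean. The readings are hypothetical. The t value is passed in as a parameter because the Python standard library has no Student's t function; 2.262 is the tabulated two-sided value for 9 degrees of freedom at 95%.

```python
import math
import statistics

def mean_confidence_interval(values, t_value):
    """Return the sample mean, standard deviation of the means, and interval half-width."""
    n = len(values)
    mean = statistics.mean(values)
    s_x = statistics.stdev(values)       # sample standard deviation (N - 1)
    s_mean = s_x / math.sqrt(n)          # standard deviation of the means
    half_width = t_value * s_mean
    return mean, s_mean, half_width

# Hypothetical sample of N = 10 readings
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0]
T_9_95 = 2.262  # tabulated Student's t, nu = 9, P = 95%
mean, s_mean, half_width = mean_confidence_interval(readings, T_9_95)
print(f"true mean estimate: {mean:.3f} +/- {half_width:.3f} (95%)")
```

Quadrupling the number of readings halves the standard deviation of the means, which is why the interval narrows slowly as N grows.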

Normalized probability density function
A normalized probability density function is constructed by transforming both the abscissa (horizontal axis) and ordinate (vertical axis) of the PDF plot as follows:

$$z = \frac{x - \mu}{\sigma}, \qquad f(z) = \sigma\, f(x)$$

The above transformations accomplish two things:
- The first transformation normalizes the abscissa such that the PDF is centered around z = 0.
- The second transformation normalizes the ordinate such that the PDF is spread out in similar fashion regardless of the value of standard deviation.

The Gaussian or Normal probability density function
- It is symmetric about the mean.
- The mean, median, and mode are all equal to $\mu$, the expected value (at the peak of the distribution).
- Its plot is commonly called a bell curve because of its shape.
- The actual shape depends on the magnitude of the standard deviation. Namely, if $\sigma$ is small, the bell will be tall and skinny, while if $\sigma$ is large, the bell will be short and fat, as sketched.
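To illustrate why normalization is useful, the sketch below assumes the usual standardization z = (x − μ)/σ with f(z) = σ f(x) and shows that two Gaussian PDFs with different means and standard deviations collapse onto the same normalized value. The function names and numbers are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density at x for mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(x, mu, sigma):
    """Map (x, f(x)) to (z, f(z)) using z = (x - mu) / sigma and f(z) = sigma * f(x)."""
    z = (x - mu) / sigma
    return z, sigma * gaussian_pdf(x, mu, sigma)

# Two hypothetical distributions, each evaluated one standard deviation above its mean
z_a, f_a = normalize(12.0, mu=10.0, sigma=2.0)
z_b, f_b = normalize(105.0, mu=100.0, sigma=5.0)
```

Both points map to z = 1 with the same normalized ordinate, so a single table or curve can serve every normal distribution.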

Confidence level is defined as the probability that a random variable lies within a specified range of values. The range of values itself is called the confidence interval. For example, for a normal distribution we are 95.45% confident that a purely random variable lies within two standard deviations from the mean.

Level of significance, $\alpha$, is defined as the probability that a random variable lies outside of a specified range of values. In the above example, we are 100 − 95.45 = 4.55% confident that a purely random variable lies more than two standard deviations below or above the mean. (We usually round this off to 5% for practical engineering statistical analysis.)
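Going the other way, from a chosen confidence level to the width of the confidence interval, can be sketched with the standard library's `statistics.NormalDist`. The helper names are hypothetical, and a normal distribution with known standard deviation is assumed.

```python
from statistics import NormalDist

def significance(confidence_level):
    """Level of significance alpha = 1 - confidence level."""
    return 1.0 - confidence_level

def z_half_width(confidence_level):
    """Half-width of the symmetric confidence interval, in standard deviations."""
    alpha = significance(confidence_level)
    # alpha is split equally between the two tails of the distribution
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

z_95 = z_half_width(0.95)
print(f"95% confidence: +/-{z_95:.3f} standard deviations")
```

This shows why rounding the two-sigma significance to 5% is a small approximation: an exact 95% interval is slightly narrower than two standard deviations.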

Regression Analysis
Regression analysis is used to find an equation for y as a function of x that provides the best fit to the data. Typically, y is some measured output as a function of some known input, x. Recall that the linear correlation coefficient is used to determine if there is a trend.

Linear regression analysis


Linear regression analysis is also called linear least-squares fit analysis. The goal of linear regression analysis is to find the best fit straight line through a set of y vs. x data. The technique for deriving equations for this best-fit or least-squares fit line is as follows:
1. An equation for a straight line that attempts to fit the data pairs is chosen as Y = ax + b.
2. In the above equation, a is the slope (a = dy/dx; most of us are more familiar with the symbol m rather than a for the slope of a line), and b is the y-intercept, the y location where the line crosses the y axis (in other words, the value of Y at x = 0).



3. An upper case Y is used for the fitted line to distinguish the fitted data from the actual data values, y.
4. In linear regression analysis, coefficients a and b are optimized for the best possible fit to the data.
5. The optimization process itself is actually very straightforward.
6. For each data pair $(x_i, y_i)$, error $e_i$ is defined as the difference between the predicted or fitted value and the actual value: $e_i$ = error at data pair i, or:

$$e_i = Y_i - y_i = a x_i + b - y_i$$

$e_i$ is also called the residual. Note: Here, what we call the actual value does not necessarily mean the correct value, but rather the value of the actual measured data point.

7. We define E as the sum of the squared errors of the fit, a global measure of the error associated with all n data points. The equation for E is:

$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (a x_i + b - y_i)^2$$



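8. It is now assumed that the best fit is the one for which E is the smallest.
9. In other words, coefficients a and b that minimize E need to be found. These coefficients will be the ones that create the best-fit straight line Y = ax + b.
10. How can a and b be found such that E is minimized? Well, as any good engineer or mathematician knows, to find a minimum (or maximum) of a quantity, that quantity is differentiated, and the derivative is set to zero.
11. Here, two partial derivatives are required, since E is a function of two variables, a and b. Therefore, we obtain:

$$\frac{\partial E}{\partial a} = \sum_{i=1}^{n} 2\,(a x_i + b - y_i)\,x_i = 0, \qquad \frac{\partial E}{\partial b} = \sum_{i=1}^{n} 2\,(a x_i + b - y_i) = 0$$

Solving these two equations simultaneously for a and b gives:

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$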
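The least-squares procedure outlined in the steps above can be sketched in plain Python. Setting the two partial derivatives of E to zero gives two linear equations in a and b, and the closed-form solution of that pair is what the function evaluates. The data pairs and function names are hypothetical.

```python
def linear_least_squares(xs, ys):
    """Slope a and intercept b of the line Y = a x + b that minimizes E."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    return a, b

def sum_squared_error(xs, ys, a, b):
    """E: sum of squared residuals e_i = a x_i + b - y_i."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical measured data pairs
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b = linear_least_squares(xs, ys)
E_best = sum_squared_error(xs, ys, a, b)
print(f"Y = {a:.3f} x + {b:.3f}, E = {E_best:.4f}")
```

Any other choice of slope or intercept gives a larger E for the same data, which is a quick way to check a fit.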


Correlation coefficient:
In engineering analysis, we often want to fit a trend line or curve to a set of x-y data. Consider a set of n measurements of some variable y as a function of another variable x. Typically, y is some measured output as a function of some known input, x. In general, in such a set of measurements, there may be:
- Some scatter (precision error or random error).
- A trend: in spite of the scatter, y may show an overall increase with x, or perhaps an overall decrease with x.

The linear correlation coefficient is used to determine if there is a trend. If there is a trend, regression analysis is used to find an equation for y as a function of x which provides the best fit to the data. The linear correlation coefficient $r_{xy}$ is defined as:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the mean values of x and y.
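A minimal sketch of the linear correlation coefficient, assuming the standard definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations), is shown below. The data sets and the function name are made up for illustration.

```python
import math

def linear_correlation(xs, ys):
    """Linear correlation coefficient r_xy of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: a perfect increasing trend, a perfect decreasing trend, and scatter
r_up = linear_correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_down = linear_correlation([1, 2, 3, 4], [8, 6, 4, 2])
r_scatter = linear_correlation([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.0])
```

Values near +1 or −1 indicate a strong linear trend, while values near 0 indicate no linear trend.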

Data Outlier Detection


How do you handle spurious data points? The most common and simplest approach is to label points that lie outside the range of about 99.7% probability of occurrence, $\bar{x} \pm 3 s_x$, as outliers. This three-sigma test works well with data sets of 10 or more points.
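The three-sigma test can be sketched as follows. It assumes the limits are the sample mean plus or minus three sample standard deviations, both computed with the suspect point included. The readings and the function name are hypothetical.

```python
import statistics

def three_sigma_outliers(values):
    """Return the points lying more than three sample standard deviations from the mean."""
    mean = statistics.mean(values)
    s_x = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > 3.0 * s_x]

# Hypothetical data set of 20 readings with one spurious point
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1, 10.0, 10.0,
            9.8, 10.2, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.0, 15.0]
outliers = three_sigma_outliers(readings)
print("flagged as outliers:", outliers)
```

Because the suspect point inflates both the mean and the standard deviation, it is worth recomputing the statistics after removing any flagged value.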

Number of Measurements Required


Some sample statistics must be known to estimate the variation in the data set and therefore estimate a confidence interval in the data yet to be acquired.
