
Stats week 1

Basic Probability and Statistics terms:


Axioms of probability:
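Probability is a number assigned to each member of a collection of events from a random
experiment, and it satisfies the following properties.

If S is the sample space and E is any event in a random experiment, then

$$P(S) = 1$$
$$0 \le P(E) \le 1$$

and, for two events E1 and E2 that have no outcomes in common,

$$P(E_1 \cup E_2) = P(E_1) + P(E_2)$$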
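
These properties can be checked numerically on a small example. The sketch below assumes a fair six-sided die with equally likely outcomes (an illustration, not part of the original notes); the helper `prob` and the event names are hypothetical. It checks that P(S) = 1, that a probability lies between 0 and 1, and that the probabilities of two disjoint events add.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, so every outcome has probability 1/6.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space) when outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
low_odd = {1, 3}  # shares no outcome with `even`

print(prob(sample_space) == 1)                              # True
print(0 <= prob(even) <= 1)                                 # True
print(prob(even | low_odd) == prob(even) + prob(low_odd))   # True
```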

Random variable:
We often summarize the outcome of a random experiment by a single number. In many examples of
random experiments, the sample space is only a description of the possible outcomes. The variable
that associates a number with each outcome of a random experiment is referred to as a random
variable.
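
As an illustration, a random variable can be written as an ordinary function from outcomes to numbers. The sketch below assumes an experiment of tossing a fair coin twice (not an example from the original notes); the function `number_of_heads` is a hypothetical name for the random variable.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed experiment: toss a fair coin twice; outcomes are pairs such as ("H", "T").
outcomes = list(product("HT", repeat=2))

def number_of_heads(outcome):
    """Random variable X: maps an outcome to the number of heads it contains."""
    return outcome.count("H")

# Probability of each value of X, given four equally likely outcomes.
counts = Counter(number_of_heads(o) for o in outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```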

Events:
An event is a subset of the sample space of a random experiment.

Sample Space:
To model and analyze a random experiment, we must understand the set of possible outcomes
from the experiment. The set of all possible outcomes of a random experiment is called the
sample space of the experiment. The sample space is denoted as S.

Independence:
In probability, two events A and B are independent if any one of the following equivalent statements
is true:

$$P(A \mid B) = P(A)$$
$$P(B \mid A) = P(B)$$
$$P(A \cap B) = P(A)\,P(B)$$

Here, a mutually exclusive relationship between two events is based only on the outcomes that
comprise the events. However, an independence relationship depends on the probability model
used for the random experiment.
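
The difference between independence and mutual exclusivity can be seen on a small example. The sketch below assumes a fair six-sided die and tests independence through the product rule P(A ∩ B) = P(A)P(B); the helper names and the chosen events are hypothetical.

```python
from fractions import Fraction

# Assumed probability model: a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def independent(a, b):
    """Product rule: A and B are independent when P(A and B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
odd = {1, 3, 5}

print(independent(even, at_most_four))  # True: 1/3 == 1/2 * 2/3
print(independent(even, odd))           # False: mutually exclusive, but 0 != 1/4
```

Under this model, `even` and `odd` are mutually exclusive yet not independent, which matches the remark that the two notions are different.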

Bayes Theorem:
Conditional probabilities give the probability of an event given a condition. But after a random
experiment generates an outcome, we are naturally interested in the probability that a condition
was present given that outcome. Bayes' Theorem answers this question:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \quad P(B) > 0$$

This is a useful result that enables us to solve for P(A|B) in terms of P(B|A).
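If E1, E2, ..., Ek are k mutually exclusive and exhaustive events and B is any event with P(B) > 0, then

$$P(E_1 \mid B) = \frac{P(B \mid E_1)\,P(E_1)}{P(B \mid E_1)\,P(E_1) + P(B \mid E_2)\,P(E_2) + \cdots + P(B \mid E_k)\,P(E_k)}$$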
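
A minimal sketch of this calculation follows. It assumes the priors P(Ei) and the likelihoods P(B|Ei) are already known; the function name `bayes_posteriors` and the numbers are hypothetical and chosen only for illustration. P(B) is obtained by summing P(B|Ei)P(Ei) over the mutually exclusive and exhaustive events.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(Ei | B) for each i, given priors P(Ei) and likelihoods P(B | Ei)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | Ei) * P(Ei)
    p_b = sum(joint)                                      # P(B), summed over all Ei
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return [j / p_b for j in joint]

# Hypothetical numbers: two exhaustive, mutually exclusive events E1 and E2.
priors = [Fraction(1, 5), Fraction(4, 5)]         # P(E1), P(E2)
likelihoods = [Fraction(9, 10), Fraction(1, 10)]  # P(B | E1), P(B | E2)

posteriors = bayes_posteriors(priors, likelihoods)
print(posteriors[0])  # 9/13
print(posteriors[1])  # 4/13
```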

Basic statistical terms


Numerical Summaries:
In field applications of statistics, well-constructed data summaries and displays are essential to
good statistical thinking, because they can focus the engineer on important features of the data or
provide insight about the type of model that should be used in solving the problem. We often find
it useful to describe data features numerically. To characterize the location or central tendency
of the data, we mostly use the arithmetic mean, which is called the sample mean when the data are
regarded as a sample.

1. Sample mean: If the n observations in a sample are denoted by x1, x2, ..., xn, the sample
mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Although the sample mean is useful, it does not convey all of the information about a sample of
data. The variability or scatter in the data may be described by the sample variance or the sample
standard deviation.

2. Sample variance: If x1, x2, ..., xn is a sample of n observations, the sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The sample standard deviation, s, is the positive square root of the sample variance.

3. Sample range: If the n observations in a sample are denoted by x1, x2, ..., xn, the sample range is

$$r = \max(x_i) - \min(x_i)$$
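
The three numerical summaries can be sketched in a few lines of Python. The function names and the data are hypothetical. The variance uses the usual n - 1 divisor, and the range is taken as the largest observation minus the smallest.

```python
import math

def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)

# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sample_mean(data)        # 5.0
variance = sample_variance(data)
std_dev = math.sqrt(variance)   # sample standard deviation s
data_range = sample_range(data) # 7

print(mean, variance, std_dev, data_range)
```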

Data Visualization tools for Statistical modelling:


Often the end result of a statistical analysis is the estimation of the parameters of a postulated model.
This is natural for scientists and engineers, since they often deal with modeling. A statistical model is
not deterministic but, rather, must entail some probabilistic aspects. The user of statistical
methods cannot generate sufficient information or experimental data to characterize the
population totally, but sets of data are often used to learn about certain properties of the
population. Plots and other effective displays therefore complement the study of a
statistical population.

Scatter Plot
Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data
points. However, they have a very specific purpose: they show how one variable is related to
another. The relationship between two variables is called their correlation.
Scatter plots usually consist of a large body of data. The closer the plotted data points come
to forming a straight line, the higher the correlation between the two variables, or the stronger the
relationship.
If the data points form a straight line going from the origin out to high x- and y-values, then the
variables are said to have a positive correlation. If the line goes from a high value on the y-axis
down to a high value on the x-axis, the variables have a negative correlation.
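
One common way to put a number on how closely points follow a straight line is Pearson's correlation coefficient. The notes do not name a specific measure, so its use here is an assumption, and the function name and data are hypothetical. The coefficient is +1 for a perfect increasing line and -1 for a perfect decreasing line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: points lying exactly on a rising and a falling line.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]
y_falling = [10, 8, 6, 4, 2]

r_positive = pearson_r(x, y_rising)   # close to +1
r_negative = pearson_r(x, y_falling)  # close to -1
print(r_positive, r_negative)
```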

Histogram:
Dividing each class frequency by the total number of observations, we obtain the proportion of
the set of observations in each of the classes. A table listing relative frequencies is called a relative
frequency distribution. The information provided by a relative frequency distribution in tabular
form is easier to grasp if presented graphically. Using the midpoint of each interval and the
corresponding relative frequency, we construct a relative frequency histogram.
Here is how the histogram looks after plotting.
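
As a rough text-only stand-in for such a plot, the construction can be sketched in Python: count the observations in each class, divide by the total number of observations, and pair each relative frequency with the class midpoint. The function name, the equal-width classes, and the data are assumptions made for illustration.

```python
def relative_frequency_table(data, lower, width, n_classes):
    """Return (midpoint, relative frequency) pairs for equal-width classes.

    Class k covers the interval [lower + k*width, lower + (k+1)*width).
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    total = len(data)
    return [(lower + (k + 0.5) * width, counts[k] / total) for k in range(n_classes)]

# Hypothetical observations, chosen only for illustration.
data = [1, 2, 2, 3, 5, 6, 6, 7, 8, 9]
table = relative_frequency_table(data, lower=0, width=5, n_classes=2)

for midpoint, rel_freq in table:
    bar = "#" * round(rel_freq * 20)  # crude text bar: 20 characters = 100%
    print(f"{midpoint:5.1f}  {rel_freq:.2f}  {bar}")
```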

Skewness of a plot:
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive or negative, or even undefined. For
example, consider the two distributions in the figure just below. Within each graph, the bars on
the right side of the distribution taper differently than the bars on the left side. These tapering
sides are called tails, and they provide a visual means for determining which of the two kinds of
skewness a distribution has:
Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of
the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left.
Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of
the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right.
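
The sign of the skewness can also be computed from data. The sketch below uses the moment-based coefficient (third central moment divided by the 1.5 power of the second central moment), which is one of several definitions in use and is an assumption here; the function name and the data sets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using central moments with divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets, chosen only for illustration.
right_tailed = [1, 2, 2, 3, 10]         # long right tail
left_tailed = [-x for x in right_tailed]  # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # True: positive skew
print(skewness(left_tailed) < 0)   # True: negative skew
print(skewness(symmetric))         # 0.0
```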

Box Plot:
The box plot is a graphical display that simultaneously describes several important features of a
data set, such as center, spread, departure from symmetry, and identification of unusual
observations or outliers. A box plot displays the three quartiles, the minimum, and the maximum
of the data on a rectangular box, aligned either horizontally or vertically. The box encloses the
interquartile range with the left (or lower) edge at the first quartile, q1, and the right (or upper)
edge at the third quartile, q3. A line is drawn through the box at the second quartile (which is the
50th percentile or the median). A line, or whisker, extends from each end of the box. The lower
whisker is a line from the first quartile to the smallest data point within 1.5 interquartile ranges
from the first quartile. The upper whisker is a line from the third quartile to the largest data point
within 1.5 interquartile ranges from the third quartile.
Data farther from the box than the whiskers are plotted as individual points. A point beyond a
whisker, but less than three interquartile ranges from the box edge, is called an outlier. A point
more than three interquartile ranges from the box edge is called an extreme outlier.
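
The quantities a box plot displays can be computed directly. The sketch below is a minimal version under stated assumptions: quartiles come from Python's `statistics.quantiles` with `method="inclusive"` (quartile conventions differ between textbooks and packages), and the function name and data are hypothetical.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker ends, outliers, and extreme outliers of a data set."""
    ordered = sorted(data)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers end at the most extreme data points within 1.5 IQR of the box.
    inside = [x for x in ordered if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    outliers, extreme_outliers = [], []
    for x in ordered:
        distance = max(q1 - x, x - q3)  # how far x lies beyond the nearer box edge
        if distance > 3 * iqr:
            extreme_outliers.append(x)
        elif distance > 1.5 * iqr:
            outliers.append(x)
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_whisker": min(inside), "upper_whisker": max(inside),
        "outliers": outliers, "extreme_outliers": extreme_outliers,
    }

# Hypothetical observations, chosen only for illustration.
summary = box_plot_summary([2, 3, 4, 5, 6, 7, 8, 15, 30])
print(summary)
```

For this data the box runs from 4 to 8, so 15 lies between 1.5 and 3 interquartile ranges above the box and 30 lies more than 3 interquartile ranges above it.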

Measure of Central Tendency:


A measure of central tendency is a single value that attempts to describe a set of data by identifying the
central position within that set of data. As such, measures of central tendency are sometimes called
measures of central location. They are also classed as summary statistics. The mean is most likely the
measure of central tendency that you are most familiar with, but there are others, such as the median
and the mode.

Mean:
The mean is the most well-known measure of central tendency, and it can be used with both discrete and
continuous data.
The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually
denoted by $\bar{x}$, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Median:
The median is the numerical value separating the higher half of a data sample, a population or a
probability distribution from the lower half.
For a small data set, first arrange all the observations in order and let n be the total number of
observations.
If n is odd, then Median (M) = value of the ((n + 1)/2)th item.
If n is even, then Median (M) = [value of the (n/2)th item + value of the (n/2 + 1)th item] / 2.
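
The odd and even cases translate directly into code. In this sketch the function name and the data are hypothetical. Python lists are indexed from 0, so the ((n + 1)/2)th item is `ordered[n // 2]` when n is odd.

```python
def median(values):
    """Median of a data set, following the odd/even rule on the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1)/2)th item
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the (n/2)th and (n/2 + 1)th items

print(median([7, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```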

Mode:
The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the
highest bar. You can, therefore, sometimes consider the mode as being the most popular
option. An example of a mode is presented below.
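
A small computed example, with hypothetical data and a hypothetical function name: the sketch counts how often each value occurs and returns every value that reaches the highest count, since a data set can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3, 4]))         # [2]
print(modes([1, 2, 2, 3, 3]))         # [2, 3]
print(modes(["bus", "car", "bus"]))   # ['bus']
```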
