STATISTICS
ASSESSMENT
Review types of data, collecting data, sorting data, measures of
central tendency, measures of spread and displaying data.
Interpreting results from two sets of data (i.e. back to back stem and
leaf displays, histograms, double column graphs, or box and
whiskers plots).
Find the range, interquartile range and standard deviation as
measures of spread of data sets
- Find the mean and standard deviation of a set of data using
digital technologies or calculators
- Compare and describe the spread of sets of data with the same
mean but different standard deviations
Bivariate Data: recognises the difference between dependent and
independent variables. Describes the strength and direction of the
relationship between two variables displayed in a scatter plot, e.g.
Strong positive relationships, weak negative relationships with
justifications.
Uses lines of best fit to predict what might happen between known
data values (interpolation) and predict what might happen beyond
known data values (extrapolation).
Know the six processes for setting up statistical investigations.
Identify reasons why data in a display may be misrepresented.
EXPECTATIONS
Use measures of central tendency (mean, mode, median) and the
range to analyse data that is displayed in a frequency table, stem-
and-leaf plot or dot plot.
Use terms skewed or symmetrical when describing the shape of a
distribution.
Compare two sets of data and draw conclusions by finding the
mean, mode and/ or median, and the range of both sets.
Construct a cumulative frequency table, histogram and polygon
(ogive) for ungrouped data.
Use cumulative frequency to find the median.
Group data into class intervals.
Construct a cumulative frequency table and histogram for grouped
data.
Find the mean and modal class of grouped data.
Determine the upper and lower quartiles for a set of scores.
Construct a box-and-whisker plot using the five-point summary.
Use a calculator to find the standard deviation of a set of scores.
Use the mean and standard deviation to compare two sets of data.
Compare the relative merits of measures of spread (range,
interquartile range and standard deviation).
STATISTICS TERMINOLOGY
BIVARIATE DATA
- data that has two variables
BOX PLOT (BOX-AND-WHISKERS PLOT)
- a diagram obtained from the five number summary
- the box shows the middle 50% of scores (the interquartile range)
- the whiskers show us the extent of the bottom and top quartiles
as well as the range
CENSUS
- a survey of a whole population
CUMULATIVE FREQUENCY
- the number of scores less than or equal to a particular outcome
- e.g. For the data 3,6,5,3,5,5,4,3,3,6 the cumulative frequency of
5 is 8 (there are 8 scores of 5 or less)
CUMULATIVE FREQUENCY HISTOGRAM (AND POLYGON)
- these show the outcomes and their cumulative frequencies
DATA
- the pieces of information (or scores) to be examined
- categorical: data that uses non-numerical categories
- ordered data involves a ranking, e.g. exam grades, garment
sizes
- distinct data has no order, e.g. colours, types of cars
- numerical: data that uses numbers to show how much
- continuous data can have any numerical value within a range,
e.g. height
- discrete data is restricted to certain numerical values, e.g.
number of pets
DOT PLOT
- a type of graph that uses one axis and a number of dots above
the axis
EXTRAPOLATION
- predicting data beyond the range of values given
FIVE NUMBER SUMMARY
- a set of numbers consisting of the minimum score, the three
quartiles and the maximum score
FREQUENCY
- the number of times an outcome occurs in the data
- e.g. for the data 3,6,5,3,5,5,4,3,3,6 the outcome 5 has a
frequency of 3
FREQUENCY DISTRIBUTION TABLE
- a table that shows all the possible outcomes and their
frequencies (it usually is extended by adding other columns such
as the cumulative frequency)
FREQUENCY HISTOGRAM
- a type of column graph showing the outcomes and their
frequencies.
FREQUENCY POLYGON
- a type of line graph showing outcomes and their frequencies
- to complete the polygon, the outcomes immediately above and
below the actual outcomes are used (the height of these columns
is zero)
GROUPED DATA
- data that is organised into groups or classes
- class intervals: the size of the groups into which the data is
organised e.g. 1-5 (5 scores); 11-20 (10 scores)
- class centre: the middle outcome of a class
e.g. the class 1-5 has a class centre of 3
INTERPOLATION
- estimating data that lie within the domain of the values given
INTERQUARTILE RANGE
- the range of the middle 50% of scores
- the difference between the median of the upper half of scores
and the median of the lower half of scores
- IQR = Q3-Q1
LINE OF BEST FIT
- a line that best fits the data on a scatter plot
MEAN
- the number obtained by evening out all the scores until they are
equal
- e.g. if the scores 3,6,5,3,5,5,4,3,3,6 were evened out, the
number obtained would be 4.3
- to obtain the mean, we divide the sum of the scores by the
total number of scores
MEDIAN
- the middle score for an odd number of scores or the mean of the
middle two scores for an even number of scores
- the median class is the class in grouped data that contains the median
MODE (MODAL CLASS)
- the outcome or class that contains the most scores
OGIVE
- this is another name for the cumulative frequency polygon
OUTCOME
- a possible value of the data
OUTLIER
- a score that is separated from the main body of scores
QUARTILES
- the points that divide the scores up into quarters
- the second quartile, Q2, divides the scores into halves (Q2 =
median)
- the first quartile, Q1, is the median of the lower half of scores
- the third quartile, Q3, is the median of the upper half of scores
RANGE
- the difference between the highest and lowest scores
SAMPLE
- a part (usually a small part) of a large population
- random sample: a sample taken so that each member of the
population has the same chance of being included
- systematic sample: a sample selected according to some
ordering scheme, e.g. every tenth member
- stratified sample: a sample is proportionally taken from each
subgroup in a population
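The three sampling methods can be sketched in Python. This is only an illustration under stated assumptions: the population of 100 numbered students, the split into two year groups and all function names are hypothetical, not part of these notes.

```python
import random

def random_sample(population, k, seed=0):
    """Every member has the same chance of being chosen."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(population, k)

def systematic_sample(population, step, start=0):
    """Take every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(strata, fraction, seed=0):
    """Take the same proportion from each subgroup (stratum)."""
    rng = random.Random(seed)
    chosen = []
    for members in strata.values():
        k = round(len(members) * fraction)
        chosen.extend(rng.sample(members, k))
    return chosen

# Hypothetical population: students numbered 1 to 100.
students = list(range(1, 101))
# Hypothetical subgroups: 60 students in one year group, 40 in another.
strata = {"Year 9": students[:60], "Year 10": students[60:]}

simple = random_sample(students, 10)
every_tenth = systematic_sample(students, 10)
proportional = stratified_sample(strata, 0.1)

print(simple)
print(every_tenth)
print(proportional)
```

With a fraction of 0.1, the stratified sample takes 6 students from the group of 60 and 4 from the group of 40, so each subgroup is represented in proportion to its size.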
SCATTER PLOT
- a graph that uses points on a number plane to show the
relationship between two variables.
SHAPE (OF A DISTRIBUTION)
- a set of scores can be symmetrical or skewed
SOURCES OF DATA
- primary: the data has been collected by yourself
- secondary: the data has come from an external source, e.g.
newspapers, internet
STANDARD DEVIATION
- a measure of spread that can be thought of as the average
distance of scores from the mean
- the larger the standard deviation, the larger the spread
STATISTICS
- the collection, organisation and interpretation of numerical data
STEM-AND-LEAF PLOT
- a graph that shows the spread of scores without losing the
identity of the data
- ordered stem-and-leaf plot: the leaves are placed in order
- back-to-back stem-and-leaf plot: this can be used to compare two
sets of scores, one set on each side
VARIABLE
- something that can be observed, measured or counted to provide
data
STATISTICS
TYPES OF DATA
The data we collect is made up of variables. These are pieces of
information like a quantity or a characteristic that can be observed or
measured. They may change either over time or between individual
observations. The main types of data are:
COLLECTING DATA
There are three main ways of collecting data, including:
CENSUS
- a whole population is surveyed, e.g. every student in the school
is questioned
SAMPLE
- a selected group of a population is surveyed, e.g. a small number
in each class is questioned
OBSERVATION
- numerical facts are collected and tabulated, e.g. sports data,
weather, sales figures, etc.
SORTING DATA
A large amount of data needs to be tabulated (organised into a table) so
that it can be analysed. A common form of table is the frequency
distribution table.
DISCRETE DATA
OUTCOME (x)   TALLY       FREQUENCY (f)   f × x   CUMULATIVE FREQUENCY
1             |||         3               3       3
2             ||||        4               8       7
3             |||||||     7               21      14
4             |||||||||   9               36      23
5             |||||       5               25      28
6             ||          2               12      30
TOTAL                     30              105
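The columns of a frequency distribution table can be rebuilt from raw scores. The sketch below assumes the 30 scores implied by the frequencies in the table above; the function name `frequency_table` is only an illustrative choice.

```python
from collections import Counter

# Raw scores reconstructed from the frequencies in the table.
scores = [1] * 3 + [2] * 4 + [3] * 7 + [4] * 9 + [5] * 5 + [6] * 2

def frequency_table(data):
    """Return rows of (outcome, frequency, f * x, cumulative frequency)."""
    counts = Counter(data)
    rows = []
    running_total = 0
    for outcome in sorted(counts):
        f = counts[outcome]
        running_total += f  # cumulative frequency so far
        rows.append((outcome, f, f * outcome, running_total))
    return rows

table = frequency_table(scores)
total_f = sum(row[1] for row in table)
total_fx = sum(row[2] for row in table)
mean = total_fx / total_f  # sum of the f * x column over sum of the f column

for row in table:
    print(row)
print("total f =", total_f, "| total fx =", total_fx, "| mean =", mean)
```

The totals match the table (30 and 105), giving a mean of 105 ÷ 30 = 3.5.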
GROUPED DATA
Used to cluster discrete data into groups or to divide continuous data into
adjoining groups.
CLASS    CLASS CENTRE (c.c.)   TALLY       FREQUENCY (f)   f × c.c.   CUMULATIVE FREQUENCY
1-<5     3                     |||         3               9          3
5-<9     7                     ||||        4               28         7
9-<13    11                    |||||||     7               77         14
13-<17   15                    |||||||||   9               135        23
17-<21   19                    |||||       5               95         28
21-<25   23                    ||          2               46         30
TOTAL                                      30              390
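For grouped data, the mean is estimated from the class centres. The sketch below assumes each class centre is the midpoint of its class (3, 7, 11, 15, 19, 23), which reproduces the total of 390 shown in the table; the variable names are illustrative.

```python
# Each class is (lower bound, upper bound, frequency).
# A class written a-<b holds scores from a up to, but not including, b.
classes = [(1, 5, 3), (5, 9, 4), (9, 13, 7), (13, 17, 9), (17, 21, 5), (21, 25, 2)]

def class_centre(lower, upper):
    """Midpoint of the class interval (an assumption for this sketch)."""
    return (lower + upper) / 2

total_f = sum(f for _, _, f in classes)
total_f_cc = sum(f * class_centre(lo, hi) for lo, hi, f in classes)

grouped_mean = total_f_cc / total_f              # estimate of the mean
modal_class = max(classes, key=lambda c: c[2])   # class with the highest frequency

print("sum of f x c.c. =", total_f_cc)
print("mean =", grouped_mean)
print("modal class =", f"{modal_class[0]}-<{modal_class[1]}")
```

Because the original scores inside each class are unknown, this mean is an estimate rather than an exact value.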
After the data has been sorted, certain key numbers can be determined.
Some measure how the data clusters around the centre. These are called
measures of central tendency (or measures of location). Others measure
how the data spreads from the centre. These are called measures of
spread.
MODE
- the score or outcome that occurs the most, i.e. with the highest
frequency.
- the mode is the score that occurs most often
- e.g. 3,4,3,4,6,7,5,7,3,5,2 the mode from the set of data is 3
MEAN
- the sum of the scores divided by the number of scores, i.e. the
usual definition of average.
- for raw data, mean = (sum of scores) ÷ (number of scores)
- for tabulated data, mean = (sum of f × x column) ÷ (sum of frequency column)
MEDIAN
- the middle score when the scores are placed in order. If there is
an even number of scores, the median is the average of the two
middle scores.
- the median is (when scores are arranged from lowest to highest)
- the middle score (for an odd number of scores)
- the average of the two middle scores (for an even number of
scores)
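The three measures of central tendency can be checked with Python's built-in `statistics` module. The first data set is the one used in the mode example above; the second is the ten-score set from the terminology list, included here to show the even-number case for the median.

```python
import statistics

scores = [3, 4, 3, 4, 6, 7, 5, 7, 3, 5, 2]      # 11 scores (odd number)
even_scores = [3, 6, 5, 3, 5, 5, 4, 3, 3, 6]    # 10 scores (even number)

mode = statistics.mode(scores)        # most frequent score
mean = statistics.mean(scores)        # sum of scores / number of scores
median = statistics.median(scores)    # middle score of the ordered list

# For an even number of scores the median is the average of the middle two.
even_median = statistics.median(even_scores)
even_mean = statistics.mean(even_scores)

print("mode =", mode, "| mean =", round(mean, 2), "| median =", median)
print("even set: mean =", even_mean, "| median =", even_median)
```

For the second set, the ordered scores are 3, 3, 3, 3, 4, 5, 5, 5, 6, 6, so the median is the average of 4 and 5.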
MEASURES OF SPREAD
RANGE
- the highest score minus the lowest score
- for grouped data, unless the original scores are known, only the
maximum possible range can be determined, using the class groupings
- RANGE = highest score - lowest score
QUARTILES
- the median being the middle score, divides a set of data into two
equal parts. Quartiles divide the data into four equal parts.
- the first quartile is often referred to as the lower quartile and is
represented by the symbol Q1. It is the value below which 25% of
the scores lie.
- the second quartile is the middle value; it is also the median. It is
the value that separates the lower 50% of scores from the upper
50% of scores.
- the third quartile is often called the upper quartile and is
represented by the symbol Q3. It is the value above which 25%
of the scores lie.
INTERQUARTILE RANGE
- the interquartile range is the difference between the upper
quartile and the lower quartile.
- INTERQUARTILE RANGE = upper quartile - lower quartile
= Q3 - Q1
- the interquartile range covers the middle 50% of scores and so
ignores very low or very high scores (outliers)
- the interquartile range is not meaningful for a small set of scores
- associated with these measures of spread is the five number
summary of a set of data that is defined as: minimum score, first
quartile Q1, median Q2, third quartile Q3, maximum score.
[Diagram: the interquartile range spans the middle 50% of scores, with 25% of scores below it and 25% above it]
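The five number summary, range and interquartile range can be computed by taking medians of the lower and upper halves of the ordered scores. This sketch assumes one common convention: when the number of scores is odd, the median itself is left out of both halves. Some textbooks and calculators use slightly different rules, so results may differ for small data sets. The function names are illustrative.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: skip the middle score when n is odd.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

scores = [3, 4, 3, 4, 6, 7, 5, 7, 3, 5, 2]
minimum, q1, q2, q3, maximum = five_number_summary(scores)
score_range = maximum - minimum
iqr = q3 - q1

print("five number summary:", (minimum, q1, q2, q3, maximum))
print("range =", score_range, "| IQR =", iqr)
```

Here the ordered scores are 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 7, so the lower half is 2, 3, 3, 3, 4 and the upper half is 5, 5, 6, 7, 7.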
DISPLAYING DATA
DOT PLOT
- a simple display where each score is represented by a dot.
- the mode is easy to identify as the highest column of dots
- the highest and lowest scores determine the range
- a clear impression of the spread of the scores is given
- any outliers are easily identified
[Figure: dot plot with scores 1 to 5 on the axis]
[Figure: frequency and cumulative frequency graphs; horizontal axis: scores (25 to 85); vertical axes labelled "number of ..." (20 to 280) and "cumulative frequency" (200 to 800)]
BOX PLOT
- this is drawn using the five number summary for a set of data
- it gives an impression of the spread of the data and also whether
it is symmetrical or skewed from its centre
- this will be indicated by the box being nearer to one end than the
other
- if there are more low scores, the skew is said to be positive; more
high scores would mean the data is negatively skewed
[Figure: box plot on a scale from 1 to 14, labelled from left to right: minimum value, lower quartile, median, upper quartile, maximum value]
STEM-AND-LEAF PLOT
- a stem-and-leaf plot resembles a histogram (on its side) in which
the data is grouped
- the individual scores can still be identified
- the data may be unordered
- two sets of data can be compared using a back-to-back stem-
and-leaf plot
- the range and mode are easily identified
- when the leaves are placed in order, the median and quartiles can
be determined by counting
9 6 6 4 4 | 3 | 0 2 2 2 6 9
    3 2 2 | 4 | 1 5 5 6 7 8 8 9
        5 | 5 | 2 9 9 9
          | 6 | 0
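A back-to-back stem-and-leaf plot can be generated from two lists of two-digit scores. The sketch below is an illustration only: the two score lists are a reading of the plot above (tens digit as stem, units digit as leaf), and the function names are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Map each stem (tens digit) to its ordered leaves (units digits)."""
    plot = defaultdict(list)
    for score in sorted(scores):
        plot[score // 10].append(score % 10)
    return dict(plot)

def back_to_back(left_scores, right_scores):
    """Return the plot as text lines; left leaves read outward from the stem."""
    left, right = stem_and_leaf(left_scores), stem_and_leaf(right_scores)
    lines = []
    for stem in sorted(set(left) | set(right)):
        left_leaves = "".join(str(d) for d in reversed(left.get(stem, [])))
        right_leaves = "".join(str(d) for d in right.get(stem, []))
        lines.append(f"{left_leaves:>6} | {stem} | {right_leaves}")
    return lines

# Scores as read from the plot above (an assumption).
left_set = [34, 34, 36, 36, 39, 42, 42, 43, 55]
right_set = [30, 32, 32, 32, 36, 39, 41, 45, 45, 46,
             47, 48, 48, 49, 52, 59, 59, 59, 60]

lines = back_to_back(left_set, right_set)
for line in lines:
    print(line)
```

Because the leaves are ordered, the median of either set can be found by counting in from one end of the plot.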
FEATURES OF A DISPLAY OF DATA
OUTLIERS
- an outlier is a value that is clearly separated from the main body
of the data
[Figure: frequency histogram (frequency axis 0 to 30, score axis 0 to 15) with one isolated column labelled "outlier"]
CLUSTERS
- cluster refers to whether the data is bunched (close together)
[Figure: display of data with a bunched group of scores labelled "cluster"]
SYMMETRY AND SKEWNESS
- the general shape of a display or distribution refers to whether it
is symmetrical or skewed (lopsided)
- the following histogram and stem-and-leaf plot are both
symmetrical in shape, thus we can infer that the data is
consistent.
[Figure: symmetrical histogram]
STEM | LEAF
0 | 6
1 | 3 7 9
2 | 5 8
3 | 0 2 3 3 7 8
4 | 0 4
5 | 6 7 8
6 | 9
- if a distribution is not symmetrical, it is skewed
- in a skewed distribution, most of the data are clustered at one
end of the distribution and taper off towards the tail at the
other end
GROUPED DATA
- when measuring the masses of students, there could be a large
number of different masses
- constructing frequency tables and graphs would not provide
useful information for this data and so, to overcome this problem,
the scores are grouped together in class intervals
MEASURES OF SPREAD
- the mode, median and mean are measures of central tendency
as they give an indication of a central value
- the range is a measure of spread
- measures of spread indicate how much a set of data is spread
out
QUARTILES
- the median, being the middle score, divides a set of data into two
equal parts
- quartiles are the values that divide the set of data into four equal
parts
INTERQUARTILE RANGE
- the interquartile range is the difference between the upper
quartile and the lower quartile
- the interquartile range takes into account the middle 50% of
scores and ignores very high or very low scores (outliers)
BOX-AND-WHISKER PLOTS
- the lower extreme (lowest score), lower quartile, median, upper
quartile and upper extreme (highest score) together make the
five number summary
- these points can be shown on a box-and-whisker plot
STANDARD DEVIATION
- the interquartile range measures the spread of the scores about
the median
- the standard deviation of a set of scores is a measure of the
spread of the scores about the mean
- if the standard deviation of a set of scores is small, there will be
little spread of the scores about the mean
- the lower the standard deviation, the more consistent the data
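Two data sets can share a mean yet differ greatly in spread, and the standard deviation captures that difference. The sketch below uses two hypothetical data sets and the population standard deviation (the square root of the average squared distance from the mean). Many calculators also offer a sample version that divides by one less than the number of scores; which one a course expects is an assumption to check.

```python
import math
import statistics

def population_sd(scores):
    """Square root of the mean of the squared distances from the mean."""
    mean = sum(scores) / len(scores)
    squared_distances = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_distances) / len(scores))

# Hypothetical data sets with the same mean (50) but different spreads.
set_a = [48, 49, 50, 51, 52]
set_b = [30, 40, 50, 60, 70]

mean_a, mean_b = statistics.mean(set_a), statistics.mean(set_b)
sd_a, sd_b = population_sd(set_a), population_sd(set_b)

print("set A: mean =", mean_a, "| sd =", round(sd_a, 2))
print("set B: mean =", mean_b, "| sd =", round(sd_b, 2))
```

Set A has the smaller standard deviation, so its scores sit closer to the mean and are more consistent than those in set B.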
INTERPOLATION
- an interpolation is a prediction between given data points
- the process of estimating data within the domain of the values
given
- this is valid when a definite relationship exists between the two
variables
EXTRAPOLATION
- an extrapolation is a prediction beyond given data points
- the process of predicting data beyond the values given
- often not useful and can lead to false results, as there is no
guarantee that an observed pattern will continue beyond the
data presented
BIVARIATE DATA
- when data is collected from two different variables that may or
may not be related
- used to analyse the relationship between two variables
- DEPENDENT VARIABLE: the variable that is measured
- INDEPENDENT VARIABLE: the variable that is changed
SCATTER PLOTS
MODERATE POSITIVE RELATIONSHIP
- looking at a positive scatter plot gives the general impression
that as one variable increases, so does the other
- this is said to be a positive relationship though it may not be
exact
- if there was an exact relationship between two variables the
points would lie along a straight line
WEAK RELATIONSHIP
- if the scatter plot seems totally random, it would suggest there is
no direct relationship between the two variables
NO CHANGE
- the scatter plot shows a linear pattern, but because the line of
points is horizontal, one variable has no bearing on the other,
thus there is no relationship linking the variables
LINE OF BEST FIT
- for scatter plots that appear to show a relationship between the
two variables, a line can be drawn that runs through the middle
of the plotted points
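In these notes the line of best fit is drawn by eye through the middle of the points. As a stand-in, the sketch below uses the least-squares method, a different but standard technique, to calculate a line and then uses it for interpolation and extrapolation. The data values and function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical bivariate data: x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = least_squares_line(xs, ys)

def predict(x):
    return slope * x + intercept

interpolated = predict(2.5)   # inside the known x values (1 to 5)
extrapolated = predict(8)     # beyond the known x values

print("slope =", round(slope, 2), "| intercept =", round(intercept, 2))
print("interpolation at x = 2.5:", round(interpolated, 3))
print("extrapolation at x = 8:", round(extrapolated, 3))
```

The positive slope indicates a positive relationship. The extrapolated value should be treated with caution, since nothing guarantees the pattern continues beyond x = 5.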
COLLECTING DATA
- once we have posed questions, we need to collect data to answer
them
- before we do the actual collecting, we have to decide on how we
will collect the data, the type of data we will collect and the
sources from which we can collect them
- the sources can be either primary or secondary
- it is important that the data to be collected are from reliable
sources and not from some obscure website or outdated book,
otherwise the data may not be accurate
- some reliable sources of note are government organisations such
as the Australian Bureau of Statistics and the Bureau of
Meteorology, which have strict data collection methodologies in
place to ensure the accuracy and reliability of their data
ORGANISING DATA
- in the third stage, we arrange the data we have collected into a
form that gives structure and order to the data
- a common way of accomplishing this is to use a table, e.g. a
frequency table
- how this data will be organised will vary as a function of the
nature of the statistical investigation
WRITING A REPORT
- once we have finished analysing the data, it is time to put
everything together in a written report
- the report should address the background and aim of the statistical
inquiry and the questions it sought to answer, detail the data
collection method (including sources and types of data), involve a
thorough discussion of the findings, list and explain the reasoning
behind the conclusions and, if appropriate, include
recommendations for the future
FINANCIAL
PRINCIPAL
- the original amount of money invested (or lent) for the
purpose of earning interest
INTEREST
- the payment made for the use of money invested (or borrowed)
- financial institutions (such as banks and credit unions) reward
investors by paying them interest on their savings or investments
- conversely, when borrowing money, the borrower pays interest to
the financial institution on that loan
- the original amount of money invested or borrowed is called the
principal
SIMPLE INTEREST
- the interest paid on the original principal
- the interest calculated on the original investment amount or the
amount borrowed
- the same interest is paid for each time period; this is also known
as flat rate interest
COMPOUND INTEREST
- simple interest is calculated only on the original amount (the
principal) invested or borrowed and so the interest for each
period remains the same
- for compound interest, the interest earned after one period is
added to the principal so that, next time, the interest is
calculated on a larger principal
- this means more interest because we are also earning interest on
the interest we have already earned
- the interest earned during one time period will then earn interest
in the next time period
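The difference between simple and compound interest can be seen by working through a hypothetical investment. The sketch assumes the standard formulas, simple interest I = P × R × N and compound amount A = P × (1 + R)^N, where P is the principal, R the rate per period as a decimal and N the number of periods. The figures ($1000 at 5% p.a. for 3 years) are made up for illustration.

```python
def simple_interest(principal, rate, periods):
    """Interest calculated only on the original principal: I = P * R * N."""
    return principal * rate * periods

def compound_amount(principal, rate, periods):
    """Final amount when interest is added to the principal each period."""
    return principal * (1 + rate) ** periods

principal = 1000   # hypothetical investment in dollars
rate = 0.05        # 5% per annum, written as a decimal
years = 3

flat = simple_interest(principal, rate, years)
final_amount = compound_amount(principal, rate, years)
compound_interest = final_amount - principal

# Year-by-year view: the balance that earns interest grows each year.
balance = principal
for year in range(1, years + 1):
    interest = balance * rate
    balance += interest
    print(f"year {year}: interest = {interest:.2f}, balance = {balance:.2f}")

print(f"simple interest = {flat:.2f}")
print(f"compound interest = {compound_interest:.2f}")
```

The simple interest is the same each year, while the compound interest grows each year because the interest already earned also earns interest.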
TERMS
- p.a. = per annum (yearly)
- six-monthly (twice a year) = every 6 months (divide by 2)
- quarterly = every 3 months (divide by 4)
- monthly (divide by 12)
- weekly (divide by 52)
- daily (divide by 365)
COMPARISON
- comparing six-monthly
- divide the rate R by 2 and multiply the number of periods N by 2
- comparing quarterly
- divide the rate R by 4 and multiply the number of periods N by 4
- comparing monthly
- divide the rate R by 12 and multiply the number of periods N by 12
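The effect of dividing R and multiplying N can be compared across compounding periods. This sketch assumes the standard compound interest formula applied with the adjusted rate and number of periods; the investment ($1000 at 6% p.a. for 2 years) and the names used are hypothetical.

```python
# Number of compounding periods in one year for each option.
PERIODS_PER_YEAR = {"yearly": 1, "six-monthly": 2, "quarterly": 4, "monthly": 12}

def compound_amount(principal, annual_rate, years, periods_per_year):
    """Final amount when interest compounds several times a year."""
    rate_per_period = annual_rate / periods_per_year   # R divided by the number of periods per year
    number_of_periods = years * periods_per_year       # N multiplied by the number of periods per year
    return principal * (1 + rate_per_period) ** number_of_periods

principal = 1000     # hypothetical investment in dollars
annual_rate = 0.06   # 6% per annum, written as a decimal
years = 2

amounts = {
    name: compound_amount(principal, annual_rate, years, n)
    for name, n in PERIODS_PER_YEAR.items()
}

for name, amount in amounts.items():
    print(f"{name:>12}: {amount:.2f}")
```

The more often the interest compounds, the larger the final amount, although the gain from each extra step becomes smaller.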