
MATHS

NOTES | KEVIN LUU | 5.2 5.3 MATHEMATICS

STATISTICS
ASSESSMENT
Review types of data, collecting data, sorting data, measures of
central tendency, measures of spread and displaying data.
Interpreting results from two sets of data (i.e. back to back stem and
leaf displays, histograms, double column graphs, or box and
whiskers plots).
Find the range, interquartile range and standard deviation as
measures of spread of data sets
- Find the mean and standard deviation of a set of data using
digital technologies or calculators
- Compare and describe the spread of sets of data with the same
mean but different standard deviations
Bivariate Data: recognises the difference between dependent and
independent variables. Describes the strength and direction of the
relationship between two variables displayed in a scatter plot, e.g.
Strong positive relationships, weak negative relationships with
justifications.
Uses lines of best fit to predict what might happen between known
data values (interpolation) and predict what might happen beyond
known data values (extrapolation).
Know the six stages of setting up a statistical investigation.
Identify reasons why data in a display may be misrepresented.

EXPECTATIONS
Use measures of central tendency (mean, mode, median) and the
range to analyse data that is displayed in a frequency table, stem-
and-leaf plot or dot plot.
Use terms skewed or symmetrical when describing the shape of a
distribution.
Compare two sets of data and draw conclusions by finding the
mean, mode and/or median, and the range of both sets.
Construct a cumulative frequency table, histogram and polygon
(ogive) for ungrouped data.
Use cumulative frequency to find the median.
Group data into class intervals.
Construct a cumulative frequency table and histogram for grouped
data.
Find the mean and modal class of grouped data.
Determine the upper and lower quartiles for a set of scores.
Construct a box-and-whisker plot using the five-point summary.
Use a calculator to find the standard deviation of a set of scores.
Use the mean and standard deviation to compare two sets of data.
Compare the relative merits of measures of spread (range,
interquartile range and standard deviation).

STATISTICS TERMINOLOGY
BIVARIATE DATA
- data that has two variables
BOX PLOT (BOX-AND-WHISKER PLOT)
- a diagram obtained from the five number summary
- the box shows the middle 50% of scores (the interquartile range)
- the whiskers show us the extent of the bottom and top quartiles
as well as the range
CENSUS
- a survey of a whole population
CUMULATIVE FREQUENCY
- the number of scores less than or equal to a particular outcome
- e.g. For the data 3,6,5,3,5,5,4,3,3,6 the cumulative frequency of
5 is 8 (there are 8 scores of 5 or less)
CUMULATIVE FREQUENCY HISTOGRAM (AND POLYGON)
- these show the outcomes and their cumulative frequencies
DATA
- the pieces of information (or scores) to be examined
- categorical: data that uses non-numerical categories
- ordered data involves a ranking, e.g. exam grades, garment
sizes
- distinct data has no order, e.g. colours, types of cars
- numerical: data that uses numbers to show how much
- continuous data can have any numerical value within a range,
e.g. height
- discrete data is restricted to certain numerical values, e.g.
number of pets
DOT PLOT
- a type of graph that uses one axis and a number of dots above
the axis
EXTRAPOLATION
- predicting data beyond the range of values given
FIVE NUMBER SUMMARY
- a set of numbers consisting of the minimum score, the three
quartiles and the maximum score
FREQUENCY
- the number of times an outcome occurs in the data
- e.g. for the data 3,6,5,3,5,5,4,3,3,6 the outcome 5 has a
frequency of 3
FREQUENCY DISTRIBUTION TABLE
- a table that shows all the possible outcomes and their
frequencies (it usually is extended by adding other columns such
as the cumulative frequency)
FREQUENCY HISTOGRAM
- a type of column graph showing the outcomes and their
frequencies.
FREQUENCY POLYGON
- a type of line graph showing outcomes and their frequencies
- to complete the polygon, the outcomes immediately above and
below the actual outcomes are used (the height of these columns
is zero)
GROUPED DATA
- data that is organised into groups or classes
- class intervals: the size of the groups into which the data is
organised e.g. 1-5 (5 scores); 11-20 (10 scores)
- class centre: the middle outcome of a class
e.g. the class 1-5 has a class centre of 3
INTERPOLATION
- estimating data that lie within the domain of the values given
INTERQUARTILE RANGE
- the range of the middle 50% of scores
- the difference between the median of the upper half of scores
and the median of the lower half of scores
- IQR = Q3-Q1
LINE OF BEST FIT
- a line that best fits the data on a scatter plot
MEAN
- the number obtained by evening out all the scores until they are
equal
- e.g. if the scores 3,6,5,3,5,5,4,3,3,6 were evened out, the
number obtained would be 4.3
- to obtain the mean, we divide the sum of the scores by the
total number of scores
MEDIAN
- the middle score for an odd number of scores or the mean of the
middle two scores for an even number of scores
- for grouped data, the median class is the class containing the median
MODE (MODAL CLASS)
- the outcome or class that contains the most scores
OGIVE
- this is another name for the cumulative frequency polygon
OUTCOME
- a possible value of the data
OUTLIER
- a score that is separated from the main body of scores
QUARTILES
- the points that divide the scores up into quarters
- the second quartile, Q2, divides the scores into halves (Q2 =
median)
- the first quartile, Q1, is the median of the lower half of scores
- the third quartile, Q3, is the median of the upper half of scores
RANGE
- the difference between the highest and lowest scores
SAMPLE
- a part (usually a small part) of a large population
- random sample: a sample taken so that each member of the
population has the same chance of being included
- systematic sample: a sample selected according to some
ordering scheme, e.g. every tenth member
- stratified sample: a sample is proportionally taken from each
subgroup in a population
SCATTER PLOT
- a graph that uses points on a number plane to show the
relationship between two variables
SHAPE (OF A DISTRIBUTION)
- a set of scores can be symmetrical or skewed
SOURCES OF DATA
- primary: the data has been collected by yourself
- secondary: the data has come from an external source, e.g.
newspapers, internet
STANDARD DEVIATION
- a measure of spread that can be thought of as the average
distance of scores from the mean
- the larger the standard deviation, the larger the spread
STATISTICS
- the collection, organisation and interpretation of numerical data
STEM-AND-LEAF PLOT
- a graph that shows the spread of scores without losing the
identity of the data
- ordered stem-and-leaf plot: the leaves are placed in order
- back-to-back stem-and-leaf plot: this can be used to compare two
sets of scores, one set on each side
VARIABLE
- something that can be observed, measured or counted to provide
data

1
STATISTICS
TYPES OF DATA
The data we collect is made up of variables. These are pieces of
information like a quantity or a characteristic that can be observed or
measured. They may change either over time or between individual
observations. The main types of data are:

CATEGORICAL VARIABLES ARE CATEGORIES


- ordered | e.g. exam grades, garment sizes
- distinct | e.g. types of cars, eye colour

NUMERICAL VARIABLES ARE NUMBERS


- discrete | e.g. goals scored, number of pets
- continuous | e.g. height of a person, distance thrown

COLLECTING DATA
There are three main ways of collecting data:

CENSUS
- a whole population is surveyed, e.g. every student in the school
is questioned
SAMPLE
- a selected group of a population is surveyed, e.g. a small number
in each class is questioned
OBSERVATION
- numerical facts are collected and tabulated, e.g. sports data,
weather, sales figures, etc.

A sample is usually random to limit the chances of bias occurring.
However, it may be systematic if the members of the sample are chosen
according to a rule, such as every 10th member of a population. If a
population is composed of various sub-groups, the sample could be
stratified to ensure a proportionate representation of each group in the
sample.
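
The three sampling methods can be sketched in Python. This is only an illustration: the population of 100 ID numbers, the two subgroups and the sample size of 10 are all made-up assumptions, not data from these notes.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: student ID numbers 1 to 100.
population = list(range(1, 101))

# Random sample: every member has the same chance of being included.
random_sample = random.sample(population, 10)

# Systematic sample: every 10th member, starting from a random position.
start = random.randrange(10)
systematic_sample = population[start::10]

# Stratified sample: each subgroup is sampled in proportion to its size.
# The subgroups (60 juniors, 40 seniors) are assumptions for illustration.
strata = {"juniors": population[:60], "seniors": population[60:]}
sample_size = 10
stratified_sample = {
    name: random.sample(members, round(sample_size * len(members) / len(population)))
    for name, members in strata.items()
}

print(random_sample)
print(systematic_sample)
print(stratified_sample)
```

With these numbers the stratified sample takes 6 juniors and 4 seniors, matching the 60:40 split of the population.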

Primary source data is collected first hand by observation or survey.

Secondary source data is obtained from an external source such as a
newspaper, website or another person's research.

SORTING DATA
A large amount of data needs to be tabulated (organised into a table) so
that it can be analysed. A common form of table is the frequency
distribution table.

DISCRETE DATA
OUTCOME (x)   TALLY       FREQUENCY (f)   f x   CUMULATIVE FREQUENCY
1             |||         3               3     3
2             ||||        4               8     7
3             |||||||     7               21    14
4             |||||||||   9               36    23
5             |||||       5               25    28
6             ||          2               12    30
TOTAL                     30              105
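
As a rough check, the columns of the table above can be rebuilt in Python. The list of raw scores is an assumption: it is simply constructed to have the same frequencies as the table.

```python
from collections import Counter
from itertools import accumulate

# Assumed raw scores, built to match the frequencies in the table.
scores = [1] * 3 + [2] * 4 + [3] * 7 + [4] * 9 + [5] * 5 + [6] * 2

frequency = Counter(scores)                       # outcome -> frequency
outcomes = sorted(frequency)                      # outcomes in ascending order
f_column = [frequency[x] for x in outcomes]       # frequency (f) column
fx_column = [x * frequency[x] for x in outcomes]  # f x column
cf_column = list(accumulate(f_column))            # running total of f

for row in zip(outcomes, f_column, fx_column, cf_column):
    print(row)

total_f = sum(f_column)     # total of the frequency column
total_fx = sum(fx_column)   # total of the f x column
mean = total_fx / total_f   # mean for tabulated data
print(total_f, total_fx, mean)
```

The totals come out as 30 and 105, so the mean of this data is 105 ÷ 30 = 3.5.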

GROUPED DATA
Used to cluster discrete data into groups or to divide continuous data into
adjoining groups.
CLASS    CLASS CENTRE (c.c.)   TALLY       FREQUENCY (f)   f x c.c.   CUMULATIVE FREQUENCY
1-<5     3                     |||         3               9          3
5-<9     7                     ||||        4               28         7
9-<13    11                    |||||||     7               77         14
13-<17   15                    |||||||||   9               135        23
17-<21   19                    |||||       5               95         28
21-<25   23                    ||          2               46         30
TOTAL                                      30              390
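
A short Python sketch of the grouped-data calculation, taking each class centre as the midpoint of its class boundaries. The class boundaries and frequencies are those of the grouped table; the variable names are just illustrative.

```python
# Class boundaries (lower, upper) and frequencies from the grouped table.
classes = [(1, 5), (5, 9), (9, 13), (13, 17), (17, 21), (21, 25)]
frequencies = [3, 4, 7, 9, 5, 2]

# Class centre = midpoint of the lower and upper boundaries.
centres = [(low + high) / 2 for low, high in classes]

# f x c.c. column, its total, and the mean of the grouped data.
f_cc = [f * c for f, c in zip(frequencies, centres)]
grouped_mean = sum(f_cc) / sum(frequencies)

# Modal class = the class with the highest frequency.
modal_class = classes[frequencies.index(max(frequencies))]

print(centres)       # [3.0, 7.0, 11.0, 15.0, 19.0, 23.0]
print(sum(f_cc))     # 390.0
print(grouped_mean)  # 13.0
print(modal_class)   # (13, 17)
```

Because the original scores are replaced by class centres, the mean of 13 is an estimate rather than an exact value.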

After the data has been sorted, certain key numbers can be determined.
Some measure how the data clusters around the centre. These are called
measures of central tendency (or measures of location). Others measure
how the data spreads from the centre. These are called measures of
spread.

MEASURES OF CENTRAL TENDENCY


The mean, median and mode are called measures of location because
they give an indication of a central value (or average) around which a set
of scores tend to cluster.

MODE
- the score or outcome that occurs the most, i.e. with the highest
frequency.
- e.g. for the data 3,4,3,4,6,7,5,7,3,5,2 the mode is 3
MEAN
- the sum of the scores divided by the number of scores, i.e. the
usual definition of average.
- for raw data, mean = (sum of scores) ÷ (number of scores)
- for tabulated data, mean = (sum of f x column) ÷ (sum of frequency column)
MEDIAN
- the middle score when the scores are placed in order. If there is
an even number of scores, the median is the average of the two
middle scores.
- the median is (when scores are arranged from lowest to highest)
- the middle score (for an odd number of scores)
- the average of the two middle scores (for an even number of
scores)
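
The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch using the data from the mode example plus the ten scores used in the terminology list, to show the even-number case for the median.

```python
import statistics

# Data from the mode example (11 scores, an odd number).
scores = [3, 4, 3, 4, 6, 7, 5, 7, 3, 5, 2]

mode = statistics.mode(scores)      # the score that occurs most often
mean = statistics.mean(scores)      # sum of scores / number of scores
median = statistics.median(scores)  # middle score once sorted

print(mode, round(mean, 2), median)

# An even number of scores: the median is the average of the two middle scores.
even_scores = [3, 6, 5, 3, 5, 5, 4, 3, 3, 6]
even_median = statistics.median(even_scores)
print(sorted(even_scores), even_median)
```

For the first set the mode is 3, the median is 4 and the mean is 49 ÷ 11, about 4.45. For the second set the two middle scores are 4 and 5, so the median is 4.5.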

MEASURES OF SPREAD
RANGE
- the highest score minus the lowest score
- for grouped data, unless the original scores are known, only the
maximum possible range can be determined, by using the class groupings
- RANGE = highest score - lowest score
QUARTILES
- the median being the middle score, divides a set of data into two
equal parts. Quartiles divide the data into four equal parts.
- the first quartile is often referred to as the lower quartile and is
represented by the symbol Q1. It is the value below which 25% of
the scores lie.
- the second quartile is the middle value; it is also the median. It is
the value that separates the lower 50% of scores from the upper
50% of scores.
- the third quartile is often called the upper quartile and is
represented by the symbol Q3. It is the value above which 25%
of the scores lie.
INTERQUARTILE RANGE
- the interquartile range is the difference between the upper
quartile and the lower quartile.
- INTERQUARTILE RANGE = upper quartile - lower quartile
= Q3 - Q1
- the interquartile range is the range of the middle 50% of scores,
so it ignores very low or very high scores (outliers)
- the interquartile range is not meaningful for a small set of scores
- associated with these measures of spread is the five number
summary of a set of data that is defined as: minimum score, first
quartile Q1, median Q2, third quartile Q3, maximum score.

[Diagram: the lower quartile (Q1), median (Q2) and upper quartile (Q3)
split the scores into 25%, 50% and 25%; the interquartile range spans
the middle 50%, from Q1 to Q3]
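
A minimal Python sketch of the five number summary, using the "median of each half" definition of the quartiles given in these notes. The function name is hypothetical, and calculators or software may use slightly different quartile rules, so results can differ a little from other tools.

```python
import statistics

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) by the median-of-halves method."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # scores below the middle
    upper = ordered[half + n % 2:]  # scores above the middle (skips the median if n is odd)
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )

scores = [3, 4, 3, 4, 6, 7, 5, 7, 3, 5, 2]
minimum, q1, q2, q3, maximum = five_number_summary(scores)

data_range = maximum - minimum  # RANGE = highest score - lowest score
iqr = q3 - q1                   # INTERQUARTILE RANGE = Q3 - Q1

print((minimum, q1, q2, q3, maximum))  # (2, 3, 4, 6, 7)
print(data_range, iqr)                 # 5 3
```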

DISPLAYING DATA
DOT PLOT
- a simple display where each score is represented by a dot.
- the mode is easy to identify as the highest column of dots
- the highest and lowest scores determine the range
- a clear impression of the spread of the scores is given
- any outliers are easily identified
[Figure: dot plot with the scores 1 to 5 on the horizontal axis]

FREQUENCY HISTOGRAM AND POLYGON


- the frequency of each score is represented by a column in a
histogram and a dot in a polygon
- these dots coincide with the centre of each column
- the dots are joined to form the polygon, which is completed by
joining it to the horizontal axis at each end
- the mode is identified by the highest column
- a clear impression of the spread of the scores is given
- any outliers are easily identified
- for grouped data, the classes can be represented on the
horizontal axis by the class centres

[Figure: frequency histogram and polygon. Horizontal axis: mark (560 to
880). Vertical axis: number (20 to 280).]

CUMULATIVE FREQUENCY HISTOGRAM AND POLYGON


- graphing the cumulative frequency results in columns of
increasing height, the last column representing the total
frequency
- the polygon is formed by joining the corners of adjoining columns
- the polygon is useful for indicating the median and quartiles

[Figure: cumulative frequency histogram and polygon. Horizontal axis:
scores (25 to 85). Vertical axis: cumulative frequency (200 to 800).]

BOX PLOT
- this is drawn using the five number summary for a set of data
- it gives an impression of the spread of the data and also whether
it is symmetrical or skewed from its centre
- this will be indicated by the box being nearer to one end than the
other
- if there are more low scores, the skew is said to be positive; more
high scores would mean the data is negatively skewed

[Figure: box plot above a number line from 1 to 14, labelled with the
minimum value, lower quartile, median, upper quartile and maximum value]

STEM-AND-LEAF PLOT
- a stem-and-leaf plot resembles a histogram (on its side) in which
the data is grouped
- the individual scores can still be identified
- the data may be unordered
- two sets of data can be compared using a back-to-back stem-
and-leaf plot
- the range and mode are easily identified
- when the leaves are ordered, the median and quartiles can be
determined by counting

   LEAF   STEM   LEAF
    988    0
7432220    1     6679
9887731    2     048
  96644    3     022269
    322    4     15567889
      5    5     2999
           6     0

FEATURES OF A DISPLAY OF DATA
OUTLIERS
- an outlier is a value that is clearly separated from the main body
of the data

[Figure: frequency histogram (horizontal axis 0 to 15, vertical axis:
frequency, 0 to 30) with one column labelled as an outlier]

CLUSTERS
- cluster refers to whether the data is bunched (close together)

[Figure: histogram with a group of neighbouring columns labelled as a
cluster; axis labels: number criteria, number of clusters]
SYMMETRY AND SKEWNESS
- the general shape of a display or distribution refers to whether it
is symmetrical or skewed (lopsided)
- the following histogram and stem-and-leaf plot are both
symmetrical in shape: the scores are spread evenly on either side
of the centre
[Figure: symmetrical histogram]

STEM   LEAF
0      6
1      379
2      58
3      023378
4      04
5      678
6      9
- if a distribution is not symmetrical, it is skewed
- in a skewed distribution, most of the data are clustered at one
end of the distribution and taper off towards the tail at the
other end

[Diagrams: negative skew (tail to the left) and positive skew (tail to
the right)]

COMPARING THE MEAN, MEDIAN AND MODE


- when the mean, median and mode are found for a set of data, it
is necessary to decide which measure is most appropriate
- the mean is usually the most appropriate measure of location as
it takes every score into account
- if there are any outliers in the set of data, then the mean may be
affected by these extreme scores and will not accurately
represent all of the scores
- in that case the median is a better measure as it is not affected by
outliers
- the mode is useful when the most common score is important, or
when the data is categorical (such as hair colour or make of car).
When dealing with categorical data, it is not possible to have a
mean or median
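
The effect of an outlier on the mean and the median can be seen in a small Python sketch. The scores are made up for illustration and are not from these notes.

```python
import statistics

# Hypothetical scores, then the same scores with one extreme value added.
scores = [4, 5, 5, 6, 7]
with_outlier = scores + [45]

mean_before = statistics.mean(scores)
median_before = statistics.median(scores)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # 5.4 5
print(mean_after, median_after)    # 12 5.5
```

The single outlier more than doubles the mean, while the median moves only from 5 to 5.5.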

CUMULATIVE FREQUENCY TABLES AND GRAPHS


- the cumulative frequency is a progressive total of the frequency
- the cumulative frequency for a particular score is the sum of the
frequencies for that score and for all scores less than it
- a cumulative frequency histogram and polygon (also called the
ogive) can be drawn using the score and the cumulative
frequency columns
- the cumulative frequency can be used to find the median

GROUPED DATA
- when measuring the masses of students, there could be a large
number of different masses
- constructing frequency tables and graphs would not provide
useful information for this data and so, to overcome this problem,
the scores are grouped together in class intervals

MEASURES OF SPREAD
- the mode, median and mean are measures of central tendency
as they give an indication of a central value
- the range is a measure of spread
- measures of spread indicate how much a set of data is spread
out

QUARTILES
- the median, being the middle score, divides a set of data into two
equal parts
- quartiles are the values that divide the set of data into four equal
parts

INTERQUARTILE RANGE
- the interquartile range is the difference between the upper
quartile and the lower quartile
- the interquartile range takes into account the middle 50% of
scores and ignores very high or very low scores (outliers)

BOX-AND-WHISKER PLOTS
- the lower extreme (lowest score), lower quartile, median, upper
quartile and upper extreme (highest score) together make the
five number summary
- these points can be shown on a box-whisker plot

STANDARD DEVIATION
- the interquartile range measures the spread of the scores about
the median
- the standard deviation of a set of scores is a measure of the
spread of the scores about the mean
- if the standard deviation of a set of scores is small, there will be
little spread of the scores about the mean
- the lower the standard deviation, the more consistent the data
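
A sketch comparing two made-up data sets with the same mean but different standard deviations. It assumes the population standard deviation (the square root of the average squared distance from the mean); many calculators also offer a sample version, which gives slightly larger values.

```python
import math
import statistics

def population_sd(scores):
    """Population standard deviation: root of the mean squared deviation."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_deviations) / len(scores))

# Hypothetical data sets, both with a mean of 50.
set_a = [48, 49, 50, 51, 52]  # scores close to the mean
set_b = [30, 40, 50, 60, 70]  # scores far from the mean

sd_a = population_sd(set_a)
sd_b = population_sd(set_b)

print(round(sd_a, 2), round(sd_b, 2))  # 1.41 14.14

# The standard library gives the same values.
print(statistics.pstdev(set_a), statistics.pstdev(set_b))
```

Set A has the smaller standard deviation, so its scores are more consistent even though both sets have the same mean.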

THE NORMAL DISTRIBUTION


- if a frequency distribution of a population (such as heights of all
Australian women) is normal, it can be represented by a bell-
shaped curve called the normal curve or normal distribution
curve
- it is symmetrical about the mean and is unimodal
- approximately 68% of the population will lie within one standard
deviation of the mean
- approximately 95% of the population will lie within two standard
deviations of the mean
- approximately 99.7% of the population will lie within three standard
deviations of the mean
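
These percentages can be checked with `statistics.NormalDist` from the Python standard library. The sketch uses a standard normal curve (mean 0, standard deviation 1), since the proportions are the same for any normal distribution.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

def proportion_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 1))
```

The exact values are about 68.3%, 95.4% and 99.7%, which are usually rounded to 68%, 95% and 99.7%.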

COMPARING THE RANGE, INTERQUARTILE RANGE AND STANDARD DEVIATION
- the standard deviation is usually the most appropriate measure
of spread because it takes into account all of the values in the
data set
- the range is easiest to calculate, but its value only depends upon
two scores, the highest and lowest score
- if there are any outliers in the set of data, then the standard
deviation and range will be affected by these extreme scores and
will then give an exaggerated representation of the spread
- in that case the interquartile range is a better measure because it
concentrates only on the middle 50% of scores and so is not affected
by outliers
STANDARD DEVIATION 2
- the standard deviation is a measure of how far the scores are
spread about the mean. It can be thought of as the average
distance of the scores from the mean
- the smaller the standard deviation, the less spread out the scores
are (they lie closer to the mean)
- the larger the standard deviation, the more spread out the scores
are (they lie further from the mean)

INTERPOLATION
- an interpolation is a prediction between given data points
- the process of estimating data within the domain of the values
given
- this is valid when a definite relationship exists between the two
variables
EXTRAPOLATION
- an extrapolation is a prediction beyond given data points
- the process of predicting data beyond the values given
- often not useful and can lead to false results as there is no
guarantee that an observed pattern will continue beyond the
data presented

BIVARIATE DATA
- when data is collected from two different variables that may or
may not be related
- used to analyse the relationship between two variables
- DEPENDENT VARIABLE: the variable that is measured
- INDEPENDENT VARIABLE: the variable that is changed

SCATTER PLOTS
MODERATE POSITIVE RELATIONSHIP
- looking at a positive scatter plot gives the general impression
that as one variable increases, so does the other
- this is said to be a positive relationship though it may not be
exact
- if there was an exact relationship between two variables the
points would lie along a straight line

MODERATE NEGATIVE RELATIONSHIP


- looking at a negative scatter plot gives the general impression
that as one variable increases, the other decreases
- this is said to be a negative relationship between the two variables
- if a line were drawn through the points, it would have a
negative slope

WEAK RELATIONSHIP
- if the scatter plot seems totally random, it would suggest there is
no direct relationship between the two variables

NO CHANGE
- the scatter plot shows a linear pattern, but because the line of
points is horizontal, it suggests that one variable has no bearing
on the other, thus there is no relationship linking the variables
LINE OF BEST FIT
- for scatter plots that appear to show a relationship between the
two variables, a line can be drawn that runs through the middle
of the plotted points

- the gradient of the line can be calculated by using two
convenient points through which the line passes and the
gradient formula (rise over run)
- the range, mode and median can be determined for each of the
variables from a scatter plot by observation and counting
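
A small Python sketch of using a line of best fit to make predictions. The two points and the range of known x-values are hypothetical, chosen only to illustrate the gradient formula and the difference between interpolation and extrapolation.

```python
def line_through(p1, p2):
    """Return (gradient, y-intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    gradient = (y2 - y1) / (x2 - x1)  # rise over run
    intercept = y1 - gradient * x1    # rearranged from y = gradient * x + intercept
    return gradient, intercept

# Hypothetical points read off a line of best fit.
gradient, intercept = line_through((2, 7), (6, 15))

def predict(x):
    """Predicted y-value on the line of best fit."""
    return gradient * x + intercept

known_x_range = (1, 8)  # assumed smallest and largest x-values in the data

for x in (4, 12):
    inside = known_x_range[0] <= x <= known_x_range[1]
    kind = "interpolation" if inside else "extrapolation"
    print(x, predict(x), kind)
```

Here the line is y = 2x + 3. The prediction at x = 4 lies between known data values, so it is an interpolation; the prediction at x = 12 lies beyond them, so it is an extrapolation and should be treated with more caution.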

THE 6 STAGES OF DATA ANALYSIS


POSING QUESTIONS
- the first stage is to pinpoint the final information that will be
needed in order to be able to draw a conclusion
- this involves coming up with questions that, if answered, would
lead to meaningful information that would allow us to draw a
conclusion and to make recommendations

COLLECTING DATA
- once we have posed questions, we need to collect data to answer
them
- before we do the actual collecting, we have to decide on how we
will collect the data, the type of data we will collect and the
sources from which we can collect them
- the sources can be either primary or secondary
- it is important that the data to be collected are from reliable
sources and not from some obscure website or outdated book,
otherwise the data may not be accurate
- some reliable sources of note are government organisations such
as the Australian Bureau of Statistics and the Bureau of
meteorology, which have strict data collection methodologies in
place to ensure the accuracy and reliability of their data

ORGANISING DATA
- in the third stage, we arrange the data we have collected into a
form that gives structure and order to the data
- a common way of accomplishing this is to use a table, e.g. a
frequency table
- how this data will be organised will vary as a function of the
nature of the statistical investigation

SUMMARISING AND DISPLAYING DATA


- once we have organised the data, we need to present the data in
a form that will be easy to read, understand and analyse
- most often this will be accomplished by using a graph such as a
column graph, bar graph, dot plot or line chart
- the particular type of graph to be used will depend on the
purpose of the investigation
- besides displaying the data in a graph, it may also be beneficial
to summarise the data using statistical quantities such as the
mean, median, mode and range

ANALYSING DATA AND DRAWING CONCLUSIONS


- after we have finished summarising and displaying the data, it is
time to examine and interpret the data, to decide what it means
and ultimately to draw conclusions from it
- this may involve identifying the trends and patterns from the
graph, and identifying how those trends and patterns change
over time or across categories (such as across different
populations). From these trends, we can then draw conclusions
and possibly make predictions about future outcomes

WRITING A REPORT
- once we have finished analysing the data, it is time to put
everything together in a written report
- the report should address the background and aim of the statistical
inquiry and the questions it sought to answer, detail the data
collection method (including sources and types of data), include a
thorough discussion of the findings, list and explain the reasoning
behind the conclusions and, if appropriate, include
recommendations for the future
2
FINANCIAL
PRINCIPAL: the original amount of money invested (or lent) for the
purpose of earning interest

SIMPLE INTEREST: interest paid only on the original sum of money
(principal) invested and not on any interest earned by that sum

COMPOUND INTEREST: interest paid on the sum (principal) invested as
well as on any accumulated interest

INTEREST
- the payment made for the use of money invested (or borrowed)
- financial institutions (such as banks and credit unions) reward
investors by paying them interest on their savings or investments
- conversely, when borrowing money, the borrower pays interest to
the financial institution on that loan
- the original amount of money invested or borrowed is called the
principal

SIMPLE INTEREST
- the interest paid on the original principal
- the interest calculated on the original investment amount or the
amount borrowed
- the same interest is paid for each time period; simple interest is
also known as flat rate interest

CALCULATION OF SIMPLE INTEREST


- I = PRT
- I as the interest
- P as the principal
- R as the interest rate per period, expressed as a decimal
- T as the number of periods
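
A quick Python illustration of I = PRT. The principal, rate and time are made-up values, not figures from these notes.

```python
def simple_interest(principal, rate, periods):
    """Simple interest I = PRT, with the rate per period as a decimal."""
    return principal * rate * periods

# Hypothetical investment: $2000 at 5% p.a. for 3 years.
P = 2000
R = 0.05  # 5% written as a decimal
T = 3

I = simple_interest(P, R, T)
print(round(I, 2))      # interest earned
print(round(P + I, 2))  # total value of the investment
```

The interest is $100 each year (the same every period), giving $300 over the 3 years and a total of $2300.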

COMPOUND INTEREST
- simple interest is calculated only on the original amount (the
principal) invested or borrowed and so the interest for each
period remains the same
- for compound interest, the interest earned after one period is
added to the principal so that, next time, the interest is
calculated on a larger principal
- this means more interest because we are also earning interest on
the interest we have already earned
- the interest earned during one time period will then earn interest
in the next time period

CALCULATION OF COMPOUND INTEREST


FIRST STEP OF CALCULATION
- A = P(1 + R)^n
- A as the total amount of investment
- P as the principal
- R as the interest rate per period expressed in decimal
- n as number of periods

SECOND STEP OF CALCULATION


- compound interest = final amount - principal
- I = A - P
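
The two steps, A = P(1 + R)^n followed by I = A - P, can be sketched in Python. The figures are hypothetical ($2000 at 5% p.a. for 3 years, compounded yearly).

```python
def compound_amount(principal, rate, periods):
    """First step: total amount A = P(1 + R)^n, with R as a decimal rate per period."""
    return principal * (1 + rate) ** periods

# Hypothetical investment: $2000 at 5% p.a. compounded yearly for 3 years.
P = 2000
R = 0.05
n = 3

A = compound_amount(P, R, n)  # final amount
I = A - P                     # second step: compound interest

print(round(A, 2), round(I, 2))  # 2315.25 315.25
```

Simple interest at the same rate over the same time would be 2000 × 0.05 × 3 = $300, so compounding earns an extra $15.25 here.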

TERMS
- p.a. = per annum (yearly)
- six-monthly (twice a year) = every 6 months (annual rate divided by 2)
- quarterly = every 3 months (annual rate divided by 4)
- monthly (annual rate divided by 12)
- weekly (annual rate divided by 52)
- daily (annual rate divided by 365)

COMPARISON
- comparing six-monthly
- R is divided by 2 and n is multiplied by 2
- comparing quarterly
- R is divided by 4 and n is multiplied by 4
- comparing monthly
- R is divided by 12 and n is multiplied by 12
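
A Python sketch of how the compounding period changes the final amount. The annual rate is divided by the number of periods per year and the number of years is multiplied by it. The investment figures are hypothetical.

```python
def amount_after(principal, annual_rate, years, periods_per_year):
    """A = P(1 + R)^n with R = annual rate / k and n = years * k."""
    rate_per_period = annual_rate / periods_per_year
    number_of_periods = years * periods_per_year
    return principal * (1 + rate_per_period) ** number_of_periods

# Hypothetical investment: $2000 at 6% p.a. for 2 years.
P, annual_rate, years = 2000, 0.06, 2

results = {
    "yearly": amount_after(P, annual_rate, years, 1),
    "six-monthly": amount_after(P, annual_rate, years, 2),
    "quarterly": amount_after(P, annual_rate, years, 4),
    "monthly": amount_after(P, annual_rate, years, 12),
}

for name, amount in results.items():
    print(name, round(amount, 2))
```

The more often the interest is compounded, the larger the final amount, although the gain from each extra step is small.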
