
RESEARCH

Statistical tests in orthopaedic research


A A Qureshi
T Ibrahim

Abstract
An understanding of the basic principles of statistical analysis is vital before commencing research. This article aims to provide a concise overview of this extensive subject, highlighting the important concepts. Statistical analysis should be considered at the planning stage of any study so as to establish hypotheses, specify the primary outcome of interest and undertake a sample power calculation. The research question, scale of measurement and distribution of the outcome variable all have a bearing on the appropriate choice of statistical test. A statistical test can only be employed if the distribution assumptions of the test have been met. The interpretation of significance must be tempered by the limitations of the method of analysis, as well as by recognizing the variability of the effect of interest using an interval estimate. The various descriptive statistics used in diagnostic studies are also explored.

Keywords confidence intervals; p values; power calculation; statistics

A A Qureshi MB BS MSc MRCS Specialist Trainee in Orthopaedic Surgery, University Hospitals of Leicester, Leicester Royal Infirmary, Leicester LE1 5WW, UK.

T Ibrahim MB BS(Hons.) MD FRCS(Tr & Orth) Clinical Lecturer in Orthopaedic Surgery, University of Leicester, Leicester General Hospital, Leicester LE5 4PW, UK.

Introduction: the importance of statistics in clinical practice

A fundamental understanding of statistical analysis is a necessary pre-requisite to undertaking clinical research. Despite this, many otherwise well designed studies are let down by poor analysis and incorrect application of tests, an unfortunate consequence of insufficient knowledge or attention being devoted to this vital part of the research. To a certain extent this reflects deficiencies amongst clinicians in understanding and implementing statistical methods. Although importance is attached to study design, critical analysis of research based on the appropriate use of statistics is often suboptimal. All too frequently an assigned p value assumes overwhelming importance in the results of a study, and this has been demonstrated to be a source of publication bias.1 This may have significant healthcare implications if ineffective, costly treatments are adopted whilst beneficial interventions are marginalized despite evidence that is not capable of standing up to scientific scrutiny. The demands generated by the ever increasing development of medical technologies cannot be met by finite healthcare resources. The appropriate utilization of resources alongside improved patient care are the anticipated sequelae of evidence-based practice. Statistical science will always remain an essential step in the progression of this paradigm.

The purpose of this article is to highlight the basic principles of statistical analysis in orthopaedic research without reference to complex mathematical theorems or scientific proofs. The appropriate use of statistics is intimately related to the major considerations in study design and ultimately is driven by the research question. Thus, it is crucial that statistical analysis is considered at an early stage in the inception of a study, as this can help to avoid several potential pitfalls later on. Although preliminary discussion with a statistician is beneficial to a study, it is no substitute for a basic grounding in statistical methods amongst the trial developers. Important concepts and considerations relating to the design of studies have been covered in our previous article.2 The intention of this article is to deliver a concise overview of how statistics can be appropriately utilized to generate robust findings from a study. This understanding is twinned with the acquisition of skills enabling critical analysis of the interpretation and presentation of results in scientific papers, ultimately endowing the reader with the insight to question the legitimacy of the conclusions drawn from any study they read.

Why are statistics necessary?

Scientific reasoning has traditionally involved considering entities as discrete and absolute, where measurements are unwavering despite endless repetition or altered circumstances. However, even on the smallest conceivable scale of observation this perspective has shifted. The birth of quantum mechanics in the early 20th century arose from the realization that phenomena involving electrons could not be explained in terms of classical mechanics.3 Heisenberg's uncertainty principle proposed that an electron's spatial location was best understood as existing within a cloud of probability where a precise location was "uncertain." This uncertainty arises from the understanding that there are countless factors exerting varying magnitudes of influence on the behaviour of this smallest of species. As the complexity of the phenomenon of interest increases, the number of governing factors and the accompanying uncertainty must doubtlessly increase. Thus it can be seen that when biological systems are subjected to scientific observation, the one intended true measure of a variable is rarely observed, owing to variation in the phenomenon of interest as a result of the complex interplay of competing influences. In essence, this is why we call these measured properties variables: because they vary, and this variation is often described as random. Usually we are investigating the effect of altering a variable, known as the independent variable, on the behaviour of an outcome or dependent variable. With the understanding that all variables are subject to variation, the extent of which is related to the number of determining factors, we can elevate our thinking to consider the following points:
- Can we quantify the observed variation for a particular dependent variable?
  e.g. What is the 10 year survivorship of a specific total hip replacement?
- Can we determine which independent variables are important in determining the extent of this variation?


  e.g. Age of patient, BMI, postoperative infection, length of inpatient stay: which of these variables affect survivorship?
- Can we determine the direction of effect when we change one of the independent variables?
  e.g. Is an increased body mass index associated with enhanced survivorship or early failure?
- Can we determine the magnitude of effect when we change one of the independent variables?
  e.g. What is the difference in survivorship in number of years between a patient aged 60 years and a patient aged 70 years?
- Can we formulate a predictive model to explain the variation of the dependent variable?
  e.g. By knowing the patient age, BMI and the nature of any postoperative complications, can we predict the length of survivorship of the implant?

Statistical science attempts to address such questions. The most comprehensive definition of statistics is given by Kirkwood4 as "the science of collecting, summarizing, presenting, and interpreting data, and the using of them to test hypotheses." It is logical to surmise that the quality of statistical analysis is directly related to the quality of the data and, by extension, of the study itself. Effective study design is the crucial foundation of research. Well formulated statistical analysis can be rendered redundant by a poorly executed study, whereas a well designed study marred by poor statistical analysis can still be redeemed by repeat analysis. The research question itself drives the study design and in turn the analysis. Deficiencies in the planning stage are one of the main reasons for poor studies. These elements may be considered as sequential mechanisms within the research engine (Figure 1). The engine will not start if one of the early gears has failed, even if the subsequent ones have been furnished to perfection. Conversely, poor statistical analysis in a well designed study will permit only limited conclusions. The importance of study design will not be dealt with here, having been covered in our previous article. Let us instead consider that we have finished a study, gathered the results and are now looking to analyze the data.

[Figure 1 Statistical analysis in the research sequence: planning (study design), execution (data gathering), statistics (data analysis and data interpretation) and presentation (publication).]

What type of data do we have?

Correct statistical analysis depends on the scale of measurement of the variables, as this determines the distribution of the data and the appropriate statistical tests. Data generated from studies can be broadly split into two scale types: categorical data, which are fundamentally qualitative and do not possess numerical properties, and numerical data, in which quantitative analysis is possible (Figure 2).

[Figure 2 Scale of measurement of different variables: categorical (nominal or ordinal) and numerical (discrete or continuous; continuous data may be interval or ratio).]

Categorical data
This can be of two types:

Nominal: the sole property is that the data can be named, enabling distinction, for example hair colour, blood group or type of hip prosthesis. The only further level of differentiation is that of equivalence. No ranking of this dataset is possible.

Ordinal: although different categories exist, it is possible to rank these data in a specific order. An example is the Likert scale used in questionnaires, whereby patients register their agreement with a statement by choosing one of five categories: strongly agree, agree, neither agree nor disagree, disagree and strongly disagree. Although these responses can be ranked in an order of increasing agreement, the relationships between different ranks cannot be precisely defined. Although such scales generate numerical data, we should not treat them as quantitative data, e.g. by comparing the sums of scores on such questionnaires.
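As a minimal illustration of how these two categorical scale types might be encoded before analysis, the sketch below uses the pandas library; the blood groups and Likert responses are invented for the example.

import pandas as pd

# Nominal data: names only, so the sole meaningful operation is testing equivalence
blood_group = pd.Series(pd.Categorical(["A", "O", "B", "O", "AB"]))

# Ordinal data: ranked categories, but the gaps between ranks are undefined
likert = pd.Series(pd.Categorical(
    ["agree", "strongly agree", "neither agree nor disagree", "disagree", "agree"],
    categories=["strongly disagree", "disagree", "neither agree nor disagree",
                "agree", "strongly agree"],
    ordered=True,
))

print(blood_group.value_counts())   # counts are legitimate for nominal data
print(likert.min(), likert.max())   # ordering is meaningful for ordinal data
# Summing or averaging the Likert codes would treat ordinal data as quantitative,
# which the text above cautions against.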
Numerical data
These data are quantitative and two scales of measurement are possible.

Discrete: these data consist of counts or frequencies. The variable can only assume a finite number of possibilities, where in-between values do not exist, e.g. number of operations.

Continuous: measurements can take any value within a specified range. Examples include the SI units of mass or distance. A further subclassification is possible:


- Interval: the difference between measurements (the interval) has meaning but the ratio does not. Examples include relative measures, e.g. measuring walking distance after 1 km. We can say that a patient who scores 300 m (actually 1300 m) has walked 150 m more than a patient who has walked 150 m (actually 1150 m), but we cannot say that the first patient has walked twice the distance of the second.
- Ratio: these data have the added benefit that the ratio between values, as well as the interval, carries meaning. Taking the above example, measuring the actual rather than the relative walking distance gives us ratio data. This scale is only possible if the value of zero has a true meaning.
As we progress down this list of scale types from nominal to continuous, a greater extent of information is engendered by the data. We can simplify data from continuous to ordinal by grouping values together, but such compression leads to loss of information; transforming the distribution of the data limits the statistical tests that we can use.

How are my data distributed?

Having determined our scale of measurement, we can now determine the distribution of the variable through graphical assessment. Plotting the distribution allows us to make judgements regarding the most typical value of our variable and the extent of spread or dispersion around this.
If our dataset consists of categorical data, we can graphically compare counts or frequencies between the different groups on a bar chart or pie chart. From this we can determine the most commonly occurring score and the relative ratios between different scores. However, if our data have a continuous scale of measurement, we can construct a histogram (Figure 3). Histograms are not bar charts. They should be considered as frequency distribution charts with distinct mathematical properties. The width of each bar corresponds to the actual limits of that class interval, i.e. the range of values for which the cumulative frequency has been determined. Smaller class intervals better define the shape of the distribution and give an idea as to what may be happening to our variable between class intervals. This is why it is best not to compress data through widening the class interval, as this distorts the distribution. If we connect the midpoints of each bar, a best fit curve can be mapped to these points, known as a frequency polygon, enabling interpolation of values between the class intervals (Figure 3).

[Figure 3 Histogram with superimposed "best fit" frequency polygon.]

The shape of the frequency polygon is very important. Curves are mathematical functions, and statistical tests can be derived from these functions to infer the properties of a variable fitting a particular distribution. The statistical test is only valid if the dataset satisfies the assumptions of the appropriate distribution. Rather than determining the actual mathematical function of a distribution curve, a far easier approach is to look at its shape. Most curves form an approximate bell shaped distribution, with a peak flanked by two variable tails which taper off to the outliers. There are three important aspects of the shape of the curve:
1. Modality: a curve is unimodal if it has one peak, i.e. one mode. A bimodal curve has two peaks, and so on.
2. Skewness: this relates to the symmetry of the tails. A curve is positively skewed if most of the scores are clustered at the lower end of the spectrum, making the upper (right-hand) tail longer. A negatively skewed distribution is the converse of this, with most of the scores being high (Figure 4).
3. Kurtosis: this describes how flat the curve is; in essence, how the data are distributed about the peak.

[Figure 4 Frequency polygons demonstrating (a) a positively skewed distribution, (b) a negatively skewed distribution and (c) a distribution with symmetrical variance.]
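These aspects of shape can be quantified as well as inspected visually. A minimal sketch, using NumPy and SciPy on a simulated (invented) sample, might look like this:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.gamma(shape=2.0, scale=10.0, size=200)   # an invented, skewed sample

counts, edges = np.histogram(scores, bins=10)   # frequencies within each class interval
print(counts)
print("skewness:", stats.skew(scores))              # positive values suggest a longer upper tail
print("excess kurtosis:", stats.kurtosis(scores))   # 0 corresponds to the normal curve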
How do we describe our dataset?

Let us now consider the frequency distribution as a visual depiction of the variable of interest in a population rather than within a dataset. The population does not have to consist of individuals; it is simply a set of occurrences of that variable. If we can measure each and every occurrence of a particular variable we will have a frequency distribution of the population. In most cases, we are unable to obtain measurements for the entire population and restrict ourselves to a representative sample. Through random sampling, this sample should embody all of the characteristics of the population of interest. This is an important point which will be explored later when we come to consider inferring statistical findings from a sample to the population from which it was drawn.


When we analyze the population distribution of the variable of interest, we hope to discern the true value of that variable, which lies somewhere within the distribution. As the frequency distribution curve equates to a probability distribution of that variable, we can say that the true value probably lies somewhere within the central peak of a unimodal distribution. If our curve has quite flat kurtosis, then we can be less sure of this, and the question arises as to how far away from this peak the true value may lie. This neatly brings us to the two descriptive concepts used to understand where this true value may lie: measures of central tendency, which hope to determine the most typical value, and measures of variability, which describe dispersion of the variable across the population.

Measures of central tendency

The appropriate selection of this measure depends on both the scale of measurement and the sample size.

Mean: this is the arithmetic average and equates to the sum of all observations divided by the number of observations. Although means can be generated for ordinal variables, they are more appropriate when describing continuous variables.

Median: this is the value that comes halfway when the data are
ranked in order. If there is an even number of observations, the
median is taken as the average of the central two scores. This is a more
appropriate measure in ordinal data where ranks exist but we
cannot be sure of the relationship between different ranks.
Medians are also useful when we are measuring a continuous
variable but our distribution is skewed. The outliers in the tail of
skewed data tend to exert a greater effect on the mean rather than
the median.

Mode: this is the most frequently occurring observation and is usually reserved for nominal data where other measures of central tendency are not appropriate.
Analogous to the various scales of measurement, as we move down this list from mean to mode a decreasing amount of consideration and calculation is required to arrive at the measure of central tendency. Consequently, statistical tests based on means rather than medians carry greater confidence, whereas few statistical applications utilize the mode.

Measures of variability

Let us now direct our attention to quantifying the variability or dispersion of scores within our population.

Range: this is simply the lowest and highest value and may be an unsatisfactory descriptor in the presence of extreme values or "outliers."

Centiles: these values encompass most but not all of the data in an attempt to negate the effects of extreme outliers and are thus more suitable for skewed data. A centile is any value below which a given percentage of the values occur, e.g. the 90th centile encloses the first 90% of values. The median lies at the 50th centile. Intercentile ranges are usually used when the median is the most appropriate measure of central tendency, e.g. the interquartile range extends from the 25th to the 75th centile. These are usually graphically depicted as box and whisker plots (Figure 5).

[Figure 5 Box and whisker plot for a dataset.]

Standard deviation: this is a representation of how the various scores in the population are dispersed quantitatively relative to the mean and is used for continuous data. It is a function of the variance, which equates to the arithmetic mean of the squared differences of each score from the population mean. If we take the square root of the variance, a necessary pre-requisite to obtain a measure in the same units as our population scores, we arrive at the standard deviation.
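A minimal sketch of these descriptive measures, computed with NumPy on an invented set of inpatient stays, might be:

import numpy as np

length_of_stay = np.array([3, 4, 4, 5, 6, 6, 6, 7, 9, 21])   # invented data, in days

mean = length_of_stay.mean()
median = np.median(length_of_stay)
values, counts = np.unique(length_of_stay, return_counts=True)
mode = values[counts.argmax()]                    # most frequently occurring observation
sd = length_of_stay.std(ddof=1)                   # sample standard deviation
q1, q3 = np.percentile(length_of_stay, [25, 75])  # interquartile range limits

print(mean, median, mode, sd, q3 - q1)   # note how the outlier (21) pulls the mean above the median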

Are our data normally distributed?

Gauss, the famous mathematician born in 1777, discovered that many continuous variables, when depicted on a frequency polygon, assume a particular distribution that has come to be known as the Gaussian or normal distribution5 (Figure 6). There are certain key properties of this distribution which define its shape:
1. Unimodality: the distribution has a solitary peak equating to the mode.
2. Symmetrical variance: the distribution is symmetrical about the peak, in essence equating the mode to the mean and the median.
3. Defined kurtosis: 68% of observations lie within 1 standard deviation of the mean and 95% of observations lie within 2 standard deviations of the mean.

[Figure 6 The properties of the normal distribution.]

The distribution functions as a predetermined probability density function, allowing us to use statistical tests derived from this function. These are known as parametric tests. Parameters are characteristics used to describe population distributions. Parametric data imply that the data are normally distributed. These tests rely on fairly firm assumptions regarding these parameters and are usually based on the sample means. Consequently, we can be fairly confident in the robustness of the findings they generate. Many naturally occurring continuous variables, when plotted as a population distribution, approximate these parameters. Statistical tests, e.g. the Shapiro-Wilk test, can be undertaken to define the extent of normality of a dataset, but this can be done more easily by plotting the histogram or a normal plot to assess the shape of the distribution.
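Where a formal check is wanted, a minimal sketch of the Shapiro-Wilk test in SciPy, run on an invented sample, is:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=100.0, scale=15.0, size=40)   # invented measurements

w, p = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
if p > 0.05:
    print("No strong evidence against normality; parametric tests may be reasonable.")
else:
    print("Distribution looks non-normal; consider rank-based tests or transformation.")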


Often we are not looking at the variable in the entire population but rather in a sample of it. Sampling error relates to the discrepancy between the sample characteristics and the population characteristics, and if this is large the distribution of our sample data may not be normal. These sample data can be described as non-parametric.
The analysis of non-parametric data can be undertaken in three ways:
1. Non-parametric statistical analysis: generally the results of such tests contain a greater degree of uncertainty than their parametric equivalents. This is due to the lack of assumptions regarding the distribution, and such tests are based on ranks and medians rather than continuous data and means.
2. Parametric statistical analysis: if we know that the population from which the sample is drawn is normally distributed, we can use parametric tests on the sample data even though they may not be normally distributed. However, often we may not be certain or able to prove that the population data are normally distributed.
3. Transformation into normalized data: linear transformations such as multiplication or subtraction may be insufficient to normalize a dataset, necessitating non-linear methods, for example logarithms, with the drawback of increasing the complexity of results interpretation.
Other types of distribution exist, which act as the basis for statistical tests for non-normal data.6 The binomial distribution is based on the relative frequencies of all possible permutations and combinations of discrete data. The Poisson distribution is usually applied to discrete quantitative data, such as counts or incidences occurring over a period of time, for example the number of patients undergoing hip fracture surgery per day in a particular hospital. The 't' distribution is a theoretical distribution derived from the normal distribution with an additional parameter, the degrees of freedom, which determines how long the tails of the distribution are. An increase in the sample size increases the degrees of freedom, thereby shortening the tails and bringing the distribution closer to normal. Statistical tests for small samples have been derived from this distribution.

Statistical tests for parametric and non-parametric data

So far we have looked at descriptive statistics, but in order to generate conclusions through inferential statistics we need to be able to test one or more hypotheses. A hypothesis is a statement of fact generated by a research question. For example, we may be interested in the degree of association between two continuous variables, the difference between the outcomes of two groups or the level of agreement between two observers for one variable. The choice of appropriate test does not just rely on the research question we are addressing but also on the previously mentioned data properties with respect to scale of measurement and distribution.

Comparing two independent groups
Many studies involve comparing two groups, either different interventions or exposed versus non-exposed. The type of data and the distribution determine the appropriate statistical test (Figure 7).

[Figure 7 Statistical tests for comparing two independent groups: categorical data with n of at least 5 in all categories, chi squared test; n less than 5 in any category, Fisher's exact test; parametric numerical data, independent t test; non-parametric numerical data, Mann-Whitney test.]
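A minimal sketch of these comparisons in SciPy, using invented outcome scores and an invented two by two table of counts, might be:

import numpy as np
from scipy import stats

# Numerical outcome in two independent groups (invented scores)
group_a = np.array([12.1, 13.4, 11.8, 14.0, 12.7, 13.1])
group_b = np.array([10.9, 11.5, 12.0, 10.4, 11.8, 11.1])
t_stat, p_t = stats.ttest_ind(group_a, group_b)        # parametric: independent t test
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)     # non-parametric: Mann-Whitney test

# Categorical outcome: complication yes/no by treatment (invented counts)
table = np.array([[20, 5],
                  [12, 13]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)   # chi squared test
odds_ratio, p_fisher = stats.fisher_exact(table)             # preferred when expected counts are small

print(p_t, p_u, p_chi, p_fisher)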
Comparing paired data
Occasionally two groups are wrongly compared using the above analysis. This is the case when we are looking at paired data, such as repeated measures before and after an intervention. If we have continuous data, we are interested in the mean of the differences between successive readings rather than the difference in the means between the two groups, effectively reducing the data to a one sample problem. We then need to ascertain whether the distribution of the differences is normal, rather than considering the normality of the original two samples, when deciding whether to undertake the parametric test or its non-parametric equivalent (Figure 8).

[Figure 8 Statistical tests for paired data: paired t test for parametric data; Wilcoxon signed rank test for non-parametric data.]
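A minimal sketch of a paired comparison, using invented before-and-after scores, might be:

from scipy import stats

before = [48, 52, 60, 45, 58, 50]   # invented scores before an intervention
after = [44, 50, 55, 44, 51, 47]    # the same patients after the intervention

t_stat, p_paired = stats.ttest_rel(before, after)    # parametric: tests the mean of the differences
w_stat, p_wilcoxon = stats.wilcoxon(before, after)   # non-parametric equivalent (signed rank test)

print(p_paired, p_wilcoxon)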
Comparing more than two groups
When we are comparing more than two groups we have two options (Figure 9). Either we can perform multiple tests for independent groups or, more preferably, we can use a one way analysis of variance (ANOVA). This is a parametric test which simultaneously compares all of the groups' means on the basis that all of the groups are normally distributed with the same variance. If our data are non-parametric then we need to use a different test.

[Figure 9 Statistical tests for comparing more than two groups: multiple independent t/Mann-Whitney tests with a Bonferroni correction, or analysis of variance (one way ANOVA if parametric, Kruskal-Wallis test if non-parametric).]
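A minimal sketch of both options for three invented groups might be:

from scipy import stats

group_a = [5.1, 4.8, 5.5, 5.0]   # invented outcome scores in three groups
group_b = [6.2, 5.9, 6.5, 6.1]
group_c = [5.4, 5.6, 5.2, 5.8]

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)    # parametric: one way ANOVA
h_stat, p_kruskal = stats.kruskal(group_a, group_b, group_c)   # non-parametric: Kruskal-Wallis test

print(p_anova, p_kruskal)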
Correlating the results of two groups
There are many instances when we are trying to show an association between two variables. If we are trying to show that two variables correlate then we should look at using a correlation test (Figure 10), which can be useful in determining causality, concurrent validity and internal consistency. However, one must always be aware of spurious correlations with the passage of time, such as the price of butter increasing with the birth rate due to temporal trends rather than any meaningful relationship between the two. A further point to note is that correlation and association do not equal causality. To establish that A causes B, we have to prove that A always precedes B, that A and B correlate, and that if A is absent B cannot occur.
There are two tests for correlation, depending on whether our numerical data are parametric or non-parametric. Each yields a value between -1 (negative correlation) and +1 (positive correlation), with zero equating to no correlation. These results indicate the measure of scatter of the data around a best fit line when the two variables are plotted against each other. However, these tests can only be used if we expect a linear correlation.

[Figure 10 Tests of correlation of two or more groups: for two groups with an expected linear relationship, Pearson's correlation coefficient (parametric) or Spearman's rho (non-parametric); for more than two groups and/or a non-linear relationship, regression models.]
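A minimal sketch of both correlation coefficients, using invented BMI and survivorship figures, might be:

from scipy import stats

bmi = [22, 25, 27, 30, 33, 36]              # invented independent variable
years_to_revision = [15, 14, 12, 11, 9, 8]  # invented outcome

r, p_pearson = stats.pearsonr(bmi, years_to_revision)       # parametric, assumes a linear relationship
rho, p_spearman = stats.spearmanr(bmi, years_to_revision)   # rank-based, non-parametric

print(r, rho)   # both coefficients lie between -1 and +1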


Regression models deepen our knowledge of association by describing and quantifying the relationship between two variables. These models can also be used for non-linear relationships. However, our analysis will be dictated by which variables we assign as predictor/independent variables and which ones we assign as outcome/dependent variables.
The validation of diagnostic tools requires demonstration of intra- and interobserver agreement6, in essence establishing that the results of the test are independent of the observer or of extrinsic circumstances at the time of measurement. If we are comparing quantitative scales we can calculate the standard deviation of differences or the coefficient of variation. However, if the data are categorical then we can generate a kappa statistic, which measures the number of agreements occurring in excess of those expected by chance. A value of 1 indicates perfect agreement, whereas a value of less than 0.2 indicates poor agreement.
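A minimal sketch of a kappa calculation for two observers grading the same cases (invented gradings, assuming the scikit-learn library is available) might be:

from sklearn.metrics import cohen_kappa_score

observer_1 = ["grade 1", "grade 2", "grade 2", "grade 3", "grade 1", "grade 2"]   # invented gradings
observer_2 = ["grade 1", "grade 2", "grade 3", "grade 3", "grade 1", "grade 1"]

kappa = cohen_kappa_score(observer_1, observer_2)   # 1 = perfect agreement; < 0.2 = poor agreement
print(kappa)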
Hypothesis testing as the basis of statistical tests

We have learnt how to correctly utilize a statistical test based on the scale of measurement and distribution of the data we have, in conjunction with the research question we are trying to assess. However, to correctly interpret the results we have to understand the basis for these tests. The statistical tests described above all act to test hypotheses.
For example, if we were to compare the results of two groups A and B, the possible hypotheses are:
Hypothesis 1 (H1): group A is better than group B, i.e. a difference in one direction (alternative hypothesis)
Hypothesis 2 (H2): group B is better than group A, i.e. a difference in the opposite direction (alternative hypothesis)
Hypothesis 0 (H0): there is no difference between the groups, i.e. the effect of interest = zero (null hypothesis)
H1 and H2 are examples of alternative hypotheses, which is what studies are usually interested in. However, statistical tests work on the basis of deriving a probability, a 'p' value, of accepting or rejecting the null hypothesis, which is the exact opposite of the alternative hypotheses. The null hypothesis always defines the effect of interest as zero or non-existent. p values are often quoted as the probability that the results obtained were due to chance, which is an incorrect oversimplification. The correct definition is:
"the probability of obtaining the observed difference, or one more extreme, given the null hypothesis is true"7
Simply, the p value is a probability statement about the likelihood of the statistical observation, or one more extreme, given that the effect of interest is zero. If the p value is high we cannot rule out the null hypothesis, whereas if it is small we rule out the null hypothesis and thus favour the alternative hypothesis. The question arises as to what magnitude of p value can be regarded as small. Arbitrarily, the consensus opinion is for a cut off value of 5%, i.e. a p value of less than 0.05 is deemed statistically significant with respect to rejecting the null hypothesis.
However, hypothesis testing, and by extension p values, can produce errors in analysis. For example, if we are comparing two treatments in a clinical trial we may demonstrate a difference between the two groups with a significant p value and thus reject the null hypothesis when it may actually be true (type I error). Conversely, we may accept the null hypothesis that there is no difference between the two when this is actually false (type II error). These scenarios are depicted in Table 1.
We can see that a type I error results from errors in measurement causing us to detect a difference when this is not actually present, and represents errors in either the experimental technique or the level at which we have set significance. The converse situation, a type II error, occurs when we have failed to demonstrate a difference that exists. This may also relate to experimental flaws but may also be a result of sampling error. We have already mentioned that often we cannot study the entire population of interest and we have to take a representative sample, which, if randomly drawn, should share all identifiable and unidentifiable variables with the target population. The extent to which it does this is known as the sampling error.


We can see that as the sample size increases, the representation improves and this error must decrease. Therefore, our ability to detect a meaningful difference between two groups is highly dependent on sample size.

Power calculations

We can see that larger studies are more powerful with respect to their ability to detect a meaningful difference and thus reject the null hypothesis. The power of a study is defined as:
"the probability that a study of a given size will register as statistically significant a real difference of a given magnitude."8
In essence, a power calculation allows us to determine the sample size required to actually register a specific magnitude of effect. However, to determine this, we first need to establish the smallest true clinical or experimental effect that we consider as meaningful; in other words, if this magnitude of effect exists we can state that there is a difference between the two groups. The other important variables are the power (the probability, 1 - β, that our study will detect this magnitude of effect) and the level of statistical significance (α). These are typically set at 80% and 5% respectively but can be set at any level. The power calculation is based on these variables and assumes independent groups with roughly equal sample sizes that have normal distributions. A power calculation is likely to produce a very large sample size if the clinical effect, or difference between the two groups we are measuring, is very small or the variance is very large. Conversely, if we reduce the power or increase the level of significance then our sample size may be smaller, but this is at the cost of increasing the risks of type II and type I errors respectively.
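A minimal sketch of such a calculation, using the statsmodels library and invented values for the smallest meaningful difference and its variability, might be:

from statsmodels.stats.power import TTestIndPower

meaningful_difference = 5.0   # invented: smallest clinically important difference in the outcome score
standard_deviation = 12.0     # invented: expected variability of the outcome
effect_size = meaningful_difference / standard_deviation   # standardized effect size

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                           alpha=0.05,   # significance level
                                           power=0.80)   # probability of detecting the effect
print(round(n_per_group), "patients per group")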
Problems with p values and hypothesis testing

Hypothesis testing is sound in principle, but restricting analysis to the interpretation of p values alone has significant drawbacks. Let us consider some theoretical examples of how this may occur. A new perioperative regime to optimize the care of lower limb arthroplasty patients is introduced in order to reduce the inpatient stay and associated costs of treatment. A study is carried out looking at the inpatient stay of this population before and after this intervention is introduced. An analysis comparing the before and after groups demonstrates a statistically significant difference, with a p value of less than 0.05. The new intervention is heralded as a success and implemented. However, on closer scrutiny it can be seen that the actual difference between the two groups is less than 1 day, which is not clinically significant. The costs of implementing this intervention have actually exceeded the financial benefit in terms of reduced stay. Very small differences can become statistically significant if a large enough sample size is used.
Occasionally multiple tests are carried out with datasets, either purposefully or in the vain hope of demonstrating a statistically significant relationship. As significance relates to probability, the chances are that the greater the number of tests undertaken, the more likely you are to come up with a significant p value. Although multiple testing should be avoided, either by the appropriate limitation of tests to those specified in advance by the research question or through the use of regression models, occasionally it may be unavoidable. In these instances the easiest correction is to undertake a Bonferroni correction, whereby each statistically significant p value is multiplied by the number of tests undertaken to see if it is still significant.
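A minimal sketch of this correction, applied to an invented set of p values with the statsmodels library, might be:

from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.030, 0.045, 0.20, 0.51]   # invented results of five separate tests

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(p_adjusted)   # each p value multiplied by the number of tests (capped at 1.0)
print(reject)       # which comparisons remain significant after correction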

Type I and type II errors

                                        True state of affairs
Result of statistical test              Effect/difference non-existent     Effect/difference exists
No effect/difference detected           Correct decision                   Type II error (β)
Effect/difference detected              Type I error (α)                   Correct decision

Table 1


Another limitation to hypothesis testing can be exemplified by considering a theoretical study comparing a new thromboprophylactic drug with an old one, which demonstrates a fivefold reduction in the incidence of postoperative thromboembolic events. However, the p value generated is 0.15, which is deemed non-significant. On this basis, no further studies are undertaken and plans to replace the old drug with the new one are indefinitely shelved. This theoretical example highlights the very real risk that we may dismiss effective interventions on the basis of statistical significance alone. A high p value merely suggests that there is insufficient evidence to reject the null hypothesis. Arbitrarily accepting the null hypothesis due to setting significance at a particular level implies that we have found no proof of difference. However, "no proof of difference" does not equate to "proof of no difference"9, a fundamental consideration when interpreting p values. Judging a p value by setting significance at a particular level effectively reduces the answer to any research question to a yes/no status. It is more informative to look at the p value itself and its probability implications rather than looking at arbitrary cut offs. Unfortunately, this relative ease of understanding has led to publication bias towards significant results.1 The analysis of data should focus on characterizing the actual study effect under consideration rather than looking at probability statements alone.

Estimation and confidence intervals

Hypothesis testing asks us the question "is it? or isn't it?". What we are actually interested in is the answers to "how big is the difference?" and "in what direction is the difference?". As our results encompass uncertainty, we have to rely on methods to estimate where the true value of interest lies. As we have seen earlier, we can make point estimates based on measures of central tendency, such as the arithmetic mean. However, these do not take into account the variability or dispersion of the data relative to this value. Of greater interest is the estimation of an interval that we can be confident encloses the unknown true population value. These are known as confidence intervals and encompass an estimate of the true value (for example the arithmetic mean) as well as the sampling variability, with some level of assurance or confidence. Any confidence interval can be constructed, but by convention 95% confidence intervals are usually derived. It is important to note that confidence intervals are not direct probability statements. To state that a 95% confidence interval means that there is a 95% probability that the true population value of interest lies within this range is false. What it actually means is that if we took 100 similar sized samples from the population and derived 95% confidence intervals for these samples, then 95 of these intervals would contain the true population value, i.e. 95% of similar sized 95% confidence intervals will contain the true value.6 The equation for deriving a 95% confidence interval for a mean is given by the formula7:

95% CI = c ± 1.96 (d/√n)

where c = sample mean, d = standard deviation and n = sample size. We can see from the above formulation that if our standard deviation (i.e. the variance) is large and/or the sample size is small, then our confidence interval, and thus our magnitude of uncertainty, is also large. The width of confidence intervals decreases with increasing sample size, but it is always advantageous to compare intervals, no matter how large, rather than point estimates. An example of comparing confidence intervals to means is as follows: consider a theoretical study where the mean risk of developing a complication with Operation A compared to Operation B is fourfold, with a confidence interval of 0.6 to 9.7. The means alone suggest that the risk is higher with Operation A. However, looking at the confidence interval we can see that it encloses 1, i.e. equivalence. Therefore, if this is the true difference, then we can state that there is equivalent risk with both operations. There is also the possibility that there is less risk with Operation A, because the interval encloses 0.6. This finding is tempered by the fact that the confidence interval extends in the opposite direction to a greater than ninefold risk with Operation A. The reader is thus endowed with greater knowledge regarding the difference between the two groups, which is far in excess of that provided by a p value alone. For this reason, statistical analysis should always look at confidence intervals so as to show the direction and magnitude of effect. Only then can we determine whether a statistically significant p value actually has clinical significance. However, we must always remember that confidence intervals, like p values, are not immune to errors in study design or bias, and that the intervals should always be regarded as the smallest estimate for the real error in defining the true population value.
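A minimal sketch of the 95% confidence interval formula above, applied to an invented sample with NumPy, might be:

import numpy as np

sample = np.array([14.2, 15.1, 13.8, 16.0, 15.4, 14.9, 15.7, 14.5])   # invented measurements

mean = sample.mean()
sd = sample.std(ddof=1)
n = sample.size
half_width = 1.96 * sd / np.sqrt(n)   # for small samples a t value would be more appropriate than 1.96

print(f"mean = {mean:.2f}, 95% CI = {mean - half_width:.2f} to {mean + half_width:.2f}")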
Diagnostic tests

When we are looking at diagnostic tests for conditions, we need to be aware of certain statistical definitions. Usually we are interested in comparing the results of a new diagnostic test with an established reference test or standard for diagnosing the condition. If we plot a two by two table for all possible permutations of the results of these two tests, we can define several measures of the efficacy of our diagnostic test, as shown6 (Table 2).

Different outcomes for a diagnostic test compared to the reference standard

                        Disease/reference standard
Test result             Present                   Absent                    Total
Positive                a (true positives)        b (false positives)       a + b
Negative                c (false negatives)       d (true negatives)        c + d
Total                   a + c                     b + d                     n

Table 2


Sensitivity = proportion of positive results (or patients who have the condition) that are correctly identified by the test = a/(a + c) (a highly sensitive test can be used to rule a condition out)

Specificity = proportion of negative results (or patients without the condition) that are correctly identified by the test = d/(b + d)

Positive predictive value = proportion of patients with a positive test result who are correctly diagnosed = a/(a + b)

Negative predictive value = proportion of patients with a negative test result who are correctly diagnosed = d/(c + d)
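A minimal sketch computing these measures from invented counts, laid out as in Table 2, might be:

a, b = 80, 15   # invented: true positives, false positives
c, d = 10, 95   # invented: false negatives, true negatives

sensitivity = a / (a + c)
specificity = d / (b + d)
ppv = a / (a + b)   # positive predictive value
npv = d / (c + d)   # negative predictive value

print(sensitivity, specificity, ppv, npv)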


Conclusion

In summary, we can see that statistical tests can only be meaningfully and appropriately applied if we understand the properties of our dataset as well as basing our analysis on a suitable research question. However, the statistical tests themselves are only half the battle; the remainder being how to interpret the results to generate credible findings. Table 3 summarizes the common errors in both analysis and interpretation, to conclude this discourse and serve as a reminder of the main points to consider before embarking on statistical analysis.

Common mistakes in statistical analysis and interpretation

Planning/study design
- Absence of research question/hypotheses before data collection
- No sample size/power calculation
- No criteria specified for sample, i.e. inclusion/exclusion criteria
- Too many variables, primary outcome of interest not stated

Data analysis
- Compressing data/changing continuous data to ordinal data
- Using inappropriate measures of central tendency, e.g. means for skewed data
- Not assessing normality of frequency distribution
- Incorrectly applying parametric tests when assumptions not satisfied
- Paired data analyzed as independent groups
- Inappropriate methods of assessing agreement
- Inappropriate multiple testing with no correction

Results interpretation
- Significance rather than actual p values quoted
- Means compared but no estimate of variability, i.e. confidence intervals
- Statistical significance favoured over clinical significance

Table 3
REFERENCES
1 Hopewell S, Loudon K, Clarke MJ, Oxman AD, Dickersin K. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database Syst Rev 2009; Issue 1.
2 Qureshi AA, Ibrahim T. Study design in clinical orthopaedic trials. Orthop and Trauma 2010; 24: 229-40.
3 Ballentine LE. The statistical interpretation of quantum mechanics. Rev Mod Phys 1970; 42: 358-81.
4 Kirkwood BR, Sterne JAC. Essential medical statistics. 2nd edn. Blackwell, 2003.
5 Altman DG, Bland JM. The normal distribution. BMJ 1995; 310: 298.
6 Bland M. An introduction to medical statistics. 3rd edn. Oxford University Press, 2000.
7 Campbell MJ, Machin D. Medical statistics: a commonsense approach. 3rd edn. Wiley, 1999.
8 Altman DG. Statistics and ethics in medical research. III How large a sample? BMJ 1980; 281: 1336-8.
9 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.

