Statistical tests

Improvements in patient care are the anticipated sequelae of evidence-based practice. Statistical science will always remain an essential
ORTHOPAEDICS AND TRAUMA 24:6 463 © 2010 Elsevier Ltd. All rights reserved.
RESEARCH
e.g. Age of patient, BMI, postoperative infection, length of inpatient stay – which of these variables affect survivorship?
Can we determine the direction of effect when we change one of the independent variables?
e.g. Is an increased body mass index associated with enhanced survivorship or early failure?
Can we determine the magnitude of effect when we change one of the independent variables?
What is the difference in survivorship in number of years between a patient aged 60 years and a patient aged 70 years?
Can we formulate a predictive model to explain the variation of the dependent variable?
By knowing the patient age, BMI, and the nature of any postoperative complications, can we predict the length of survivorship of the implant?
Statistical science attempts to address such questions. The most comprehensive definition of statistics is given by Kirkwood4 as "the science of collecting, summarizing, presenting, and interpreting data, and of using them to test hypotheses." It is logical to surmise that the quality of statistical analysis is directly related to the quality of the data and, by extension, of the study itself. Effective study design is the crucial foundation of research. Well formulated statistical analysis can be rendered redundant by a poorly executed study, whereas a well designed study marred by poor statistical analysis can still be redeemed by repeat analysis. The research question itself drives the study design and in turn the analysis. Deficiencies in the planning stage are one of the main reasons for poor studies. These elements may be considered as sequential mechanisms within the research engine (Figure 1). The engine will not start if one of the early gears has failed, even if the subsequent ones have been furnished to perfection. Conversely, poor statistical analysis in a well designed study will permit only limited conclusions. The importance of study design will not be dealt with here, having been covered in our previous article. Let us instead consider that we have finished a study, gathered the results and are now looking to analyze the data.

What type of data do we have?

Correct statistical analysis depends on the scale of measurement of the variables, as this determines the distribution of the data and the appropriate statistical tests. Data generated from studies can be broadly split into two scale types – categorical, which is fundamentally qualitative and does not possess numerical properties, and numerical, in which quantitative analysis is possible (Figure 2).

Categorical data
This can be of two types:

Nominal: the sole property is that the data can be named, enabling distinction. For example – hair colour, blood group, type of hip prosthesis. The only further level of differentiation is that of equivalence. No ranking of this dataset is possible.

Ordinal: although different categories exist, it is possible to rank these data in a specific order. An example is the Likert scale used in questionnaires, whereby patients register their agreement to a statement by choosing one of five categories – strongly agree, agree, neither agree nor disagree, disagree and strongly disagree. Although these responses can be ranked in an order of increasing agreement, the relationships between different ranks cannot be precisely defined. Although such scales generate numerical data, we should not treat them as quantitative data, e.g. by comparing the sums of scores on such questionnaires.

Numerical data
These data are quantitative and two scales of measurement are possible.

Discrete: these data consist of counts or frequencies. The variable can only assume a finite number of possibilities where in-between values do not exist, e.g. number of operations.

Continuous: measurements can take any value within a specified range. Examples include the SI units of mass or distance. A further subclassification is possible:
Figure 1 The research engine: planning (study design) → execution (data gathering) → statistics (data analysis) → statistics (data interpretation) → presentation (publication).
Figure 2 Classification of variables: categorical (nominal or ordinal) and numerical (discrete or continuous, with continuous data further subdivided into interval and ratio scales).
Interval – the difference between measurements (the interval) has meaning but the ratio does not. Examples include relative measures, e.g. measuring walking distance after 1 km. We can say that a patient who scores 300 m (actually 1300 m) has walked 150 m more than a patient who has walked 150 m (actually 1150 m), but we cannot say that the first patient has walked twice the distance of the second.

Ratio – these data have the added benefit that the ratio between values, as well as the interval, carries meaning. Taking the above example, measuring the actual rather than the relative walking distance gives us ratio data. This scale is only possible if the value of zero has a true meaning.

As we progress down this list of scale types from nominal to continuous, a greater extent of information is engendered by the data. We can simplify data from continuous to ordinal by grouping values together, but such compression leads to loss of information; transforming the distribution of the data limits the statistical tests that we can use.

How are my data distributed?

Having determined our scale of measurement, we can now determine the distribution of the variable through graphical assessment. Plotting the distribution allows us to make judgements regarding the most typical value of our variable and the extent of spread or dispersion around this.
If our dataset consists of categorical data, we can graphically compare counts or frequencies between the different groups on a bar chart or pie chart. From this we can determine the most commonly occurring score and the relative ratios between different scores. However, if our data have a continuous scale of measurement, we can construct a histogram (Figure 3). Histograms are not bar charts. They should be considered as frequency distribution charts with distinct mathematical properties. The width of each bar corresponds to the actual limits of that class interval, i.e. the range of values for which the cumulative frequency has been determined. Smaller class intervals better define the shape of the distribution and give an idea as to what may be happening to our variable between class intervals. This is why it is best not to compress data through widening the class interval, as this distorts the distribution. If we connect the midpoints of each bar, a best fit curve can be mapped to these points, known as a frequency polygon, enabling interpolation of values between the class intervals (Figure 3).
The shape of the frequency polygon curve is very important. Curves are mathematical functions, and statistical tests can be derived from these functions to infer the properties of a variable fitting a particular distribution. The statistical test is only valid if the dataset satisfies the assumptions of the appropriate distribution. Rather than determining the actual mathematical function of a distribution curve, a far easier approach is to look at its shape. Most curves form an approximate bell shaped distribution, with a peak flanked by two variable tails which taper off to the outliers. There are three important aspects of the shape of the curve:
1. Modality – a curve is unimodal if it has one peak, i.e. one mode. A bimodal curve has two peaks, and so on.
2. Skewness – this relates to the symmetry of the tails. A curve is positively skewed if most of the scores are clustered at the lower end of the spectrum, making the upper tail longer. A negatively skewed distribution is the converse of this, with most of the scores being high and the lower tail longer (Figure 4).
3. Kurtosis – this describes how flat or peaked the curve is; in essence, how the data are distributed about the peak.

How do we describe our dataset?

Let us now consider the frequency distribution as a visual depiction of the variable of interest in a population rather than within a dataset. The population does not have to consist of individuals – it is simply a set of occurrences of that variable. If we can measure each and every occurrence of a particular variable we will have a frequency distribution of the population. In most cases, we are unable to obtain measurements for the entire population and restrict ourselves to a representative sample. Through random sampling, this sample should embody all of the characteristics of the population of interest. This is an important point which will be explored later when we come to consider inferring statistical findings from a sample to the population from which it was drawn.
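The idea that a random sample embodies the characteristics of its population can be sketched in a few lines of Python; the simulated population and all the numbers here are illustrative only, not taken from this article:

```python
import random
from statistics import mean, stdev

random.seed(2)  # reproducible illustration

# A notional "population": every occurrence of the variable of interest.
population = [random.gauss(mu=50, sigma=10) for _ in range(10_000)]

# A representative sample drawn at random from that population.
sample = random.sample(population, 200)

# With random sampling, the sample statistics should sit close to the
# population values - this is what later justifies inferring from the
# sample back to the population.
print(round(mean(population), 1), round(stdev(population), 1))
print(round(mean(sample), 1), round(stdev(sample), 1))
```

The two printed lines should be close but not identical; the gap between them is the sampling error that the later sections on confidence intervals quantify.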
variable, we can say that the true variable probably lies somewhere within the central peak of a unimodal distribution. If our curve has quite flat kurtosis, then we can be less sure of this, and the question arises as to how far from this peak the true value may lie. This neatly brings us to the two most commonly used measures of central tendency.

Mean: this is the arithmetic average and equates to the sum of all observations divided by the number of observations.
Figure 4 Frequency polygons demonstrating (a) a positively skewed distribution, (b) a negatively skewed distribution and (c) a distribution with symmetrical variance.
Median: this is the value that comes halfway when the data are ranked in order. If there is an even number of observations, the median is taken as the average of the central two scores. This is a more appropriate measure for ordinal data, where ranks exist but we cannot be sure of the relationships between different ranks. Medians are also useful when we are measuring a continuous variable but our distribution is skewed. The outliers in the tail of skewed data exert a greater effect on the mean than on the median.
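The outlier effect described above is easily demonstrated; the length-of-stay figures below are invented purely for illustration:

```python
from statistics import mean, median

# Hypothetical inpatient stays in days, with one outlier in the tail.
stays = [3, 4, 4, 5, 5, 6, 40]

# The single outlier drags the mean well above the typical value...
print(mean(stays))    # about 9.6 days
# ...while the median, based on ranks, is unaffected by its size.
print(median(stays))  # 5 days
```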
1. Non-parametric statistical analysis – generally the results of such tests contain a greater degree of uncertainty than their parametric equivalents. This is due to the lack of assumptions regarding the distribution; such tests are based on ranks and medians rather than continuous data and means.
2. Parametric statistical analysis – if we know that the population from which the sample is drawn is normally distributed, we can use parametric tests on the sample data even though they may not be normally distributed. However, often we may not be certain, or able to prove, that the population data are normally distributed.
3. Transformation into normalized data – linear transformations such as multiplication or subtraction may be insufficient to normalize a dataset, necessitating non-linear methods, for example logarithms, with the drawback of increasing the complexity of results interpretation.
Other types of distribution exist, which act as the basis for statistical tests for non-normal data.6 The binomial distribution is based on the relative frequencies of all possible permutations and combinations of discrete data. The Poisson distribution is usually applied to discrete quantitative data, such as counts or incidences occurring over a period of time, for example the number of patients undergoing hip fracture surgery per day in a particular hospital. The 't' distribution is a theoretical distribution derived from the normal distribution with an additional parameter, the degrees of freedom, which determines how long the tails of the distribution are. An increase in the sample size increases the degrees of freedom, thereby shortening the tails and bringing the distribution closer to normal. Statistical tests for small samples have been derived from this distribution.

Statistical tests for parametric and non-parametric data

So far we have looked at descriptive statistics, but in order to generate conclusions through inferential statistics we need to be able to test one or more hypotheses. A hypothesis is a statement of fact generated by a research question. For example, we may be interested in the degree of association between two continuous variables, the difference between outcomes of two groups or the level of agreement between two observers for one variable. The choice of appropriate test does not just rely on the research question we are addressing but also on the previously mentioned data properties with respect to scale of measurement and distribution.

Comparing two independent groups
Many studies involve comparing two groups – either different interventions or exposed versus non-exposed. The type of data and the distribution determine the appropriate statistical test (Figure 7).

Figure 7 Choosing a test for two independent groups: chi squared test, Fisher's exact test, independent t test or Mann Whitney U test.

Comparing paired data
Occasionally two groups are wrongly compared using the above analysis. This is the case when we are looking at paired data, such as repeated measures before and after an intervention. If we have continuous data, we are interested in the mean of the differences between successive readings rather than the difference in the means between the two groups, effectively reducing the data to a one sample problem. We then need to ascertain whether the distribution of the differences is normal, rather than considering normality of the original two samples, when deciding whether to undertake the parametric test or its non-parametric equivalent (Figure 8).

Comparing more than two groups
When we are comparing more than two groups we have two options (Figure 9). Either we can perform multiple tests for independent groups or, preferably, we can use a one way analysis of variance (ANOVA). This is a parametric test which simultaneously compares all groups' means on the basis that all of the groups are normally distributed with the same variance. If our data are non-parametric then we need to use a different test.

Correlating the results of two groups
There are many instances when we are trying to show an association between two variables. If we are trying to show that two variables correlate then we should look at using a correlation test (Figure 10), which can be useful in determining causality, concurrent validity and internal consistency. However, one must always be aware of spurious correlations with the passage of time, such as the price of butter increasing with the birth rate due to temporal trends rather than any meaningful relationship between the two. A further point to note is that correlation and association do not equal causality. To establish that A causes B, we have to prove that A always precedes B, that A and B correlate, and that if A is absent B cannot occur.
There are two tests for correlation, depending on whether our numerical data are parametric or non-parametric. This yields a value between −1 (negative correlation) and +1 (positive correlation).
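As an illustration of the parametric option, Pearson's correlation coefficient can be computed from first principles; the datasets below are invented to show the two extremes of the −1 to +1 range:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two paired lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect positive linear relationship gives r close to +1:
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
# A perfect negative relationship gives r close to -1:
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The non-parametric alternative, Spearman's test, applies the same idea to the ranks of the observations rather than their raw values.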
can see that, as the sampling size increases, the representation of the population improves and this sampling error must decrease. Therefore, our ability to detect a meaningful difference between two groups is highly dependent on sample size.

Power calculations
We can see that larger studies are more powerful with respect to their ability to detect a meaningful difference and thus reject the null hypothesis. The power of a study is defined as:
"the probability that a study of a given size will register as statistically significant a real difference of a given magnitude."8
In essence, a power calculation allows us to determine the sample size required to actually register a specific magnitude of effect. However, to determine this, we first need to establish the smallest true clinical or experimental effect that we consider meaningful. In other words, if this magnitude of effect exists we can state that there is a difference between the two groups. The other important variables are the power of the study (1 − β), i.e. the probability of detecting this magnitude of effect, and the level of statistical significance (α). These are typically set at 80% and 5% respectively but can be set at any level. The power calculation is based on these variables and assumes independent groups with roughly equal sample sizes that have normal distributions. A power calculation is likely to produce a very large sample size if the clinical effect, or difference between the two groups we are measuring, is very small or the variance is very large. Conversely, if we reduce the power or increase the level of significance then our sample size may be smaller, but this is at the cost of increasing the risks of type II and type I errors respectively.

Problems with p values and hypothesis testing

Hypothesis testing is sound in principle, but restricting analysis to the interpretation of p values alone has significant drawbacks. Let us consider some theoretical examples of how this may occur. A new perioperative regime to optimize the care of lower limb arthroplasty patients is introduced in order to reduce the inpatient stay and associated costs of treatment. A study is carried out looking at the inpatient stay of this population before and after this intervention is introduced. An analysis comparing the before and after groups demonstrates a statistically significant difference, with a p value less than 0.05. The new intervention is heralded as a success and implemented. However, on closer scrutiny it can be seen that the actual difference between the two groups is less than 1 day, which is not clinically significant. The costs of implementing this intervention have actually exceeded the financial benefit in terms of reduced stay. Very small differences can become statistically significant if a large enough sample size is used.
Occasionally multiple tests are carried out on datasets, either purposefully or in the vain hope of demonstrating a statistically significant relationship. As significance relates to probability, the greater the number of tests undertaken, the more likely you are to come up with a significant p value. Multiple testing should be avoided, either by limiting the tests to those specified in advance by the research question or through the use of regression models, but occasionally it may be unavoidable. In these instances the easiest correction is a Bonferroni adjustment, whereby each p value is multiplied by the number of tests undertaken to see if it is still significant.
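A minimal sketch of the Bonferroni correction just described, with invented p values:

```python
# Bonferroni correction: multiply each raw p value by the number of
# tests performed (capping at 1.0), then judge the adjusted values
# against the chosen significance level.
def bonferroni(p_values, alpha=0.05):
    k = len(p_values)
    adjusted = [min(p * k, 1.0) for p in p_values]
    significant = [p < alpha for p in adjusted]
    return adjusted, significant

# Three tests were run; only the smallest p value survives correction.
adjusted, significant = bonferroni([0.01, 0.04, 0.20])
print(significant)  # [True, False, False]
```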
Table 1
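The sample size logic behind the power calculation described above can be sketched as follows. This uses the common two-group normal-approximation formula, with equal group sizes and normally distributed data as the text assumes; the inpatient-stay figures are invented for illustration:

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect a mean difference
    of `delta`, given standard deviation `sd`, with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 when alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 when power = 80%
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical: detecting a 2-day reduction in inpatient stay (SD 5 days).
print(sample_size_per_group(delta=2, sd=5))  # 99 per group
# Halving the detectable difference roughly quadruples the sample size:
print(sample_size_per_group(delta=1, sd=5))  # 393 per group
```

This reproduces the behaviour described in the text: a small clinical effect or a large variance drives the required sample size sharply upwards.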
Another limitation to hypothesis testing can be exemplified by considering a theoretical study comparing a new thromboprophylactic drug with an old one, which demonstrates a fivefold reduction in the incidence of postoperative thromboembolic events. However, the p value generated is 0.15, which is deemed non-significant. On this basis, no further studies are undertaken and plans to replace the old drug with the new one are indefinitely shelved. This theoretical example highlights the very real risk that we may dismiss effective interventions on the basis of statistical significance alone. A high p value merely suggests that there is insufficient evidence to reject the null hypothesis. Arbitrarily accepting the null hypothesis because significance was set at a particular level implies only that we have found no proof of difference. However, "no proof of difference" does not equate to "proof of no difference"9 – a fundamental consideration when interpreting p values. Judging a p value against an arbitrary cut off effectively reduces the answer to any research question to a yes/no status. It is more informative to look at the p value itself and its probability implications rather than at arbitrary cut offs. Unfortunately, this relative ease of understanding has led to publication bias towards significant results.1 The analysis of data should focus on characterizing the actual study effect under consideration rather than looking at probability statements alone.

Estimation and confidence intervals

Hypothesis testing asks us the question "is it, or isn't it?". What we are actually interested in are the answers to "how big is the difference?" and "in what direction is the difference?". As our results encompass uncertainty, we have to rely on methods to estimate where the true value of interest lies. As we have seen earlier, we can make point estimates based on measures of central tendency, such as the arithmetic mean. However, these do not take into account the variability or dispersion of the data relative to this value. Of greater interest is the estimation of an interval that we can be confident encloses the unknown true population value. These are known as confidence intervals and encompass an estimate of the true value (for example the arithmetic mean) as well as the sampling variability, with some level of assurance or confidence. Any confidence interval can be constructed, but by convention 95% confidence intervals are usually derived. It is important to note that confidence intervals are not direct probability statements. To state that a 95% confidence interval means that there is a 95% probability that the true population value of interest lies within this range is false. What it actually means is that if we took 100 similar sized samples from the population and derived 95% confidence intervals for these samples, then 95 of these intervals would contain the true population value, i.e. 95% of similar sized 95% confidence intervals will contain the true value.6 The equation for deriving a 95% confidence interval is given by the formula:7

95% CI = x̄ ± 1.96(s/√n)

where x̄ = sample mean, s = standard deviation and n = sample size.

We can see from the above formulation that if our standard deviation (i.e. the variance) is large and/or our sample size is small, then our confidence interval, and thus our magnitude of uncertainty, is also large. The width of confidence intervals decreases with increasing sample size, but it is always advantageous to compare intervals, no matter how large, rather than point estimates. An example of comparing confidence intervals to means is as follows – consider a theoretical study where the mean risk of developing a complication with Operation A compared to Operation B is fourfold, with a confidence interval of 0.6 to 9.7. The means alone suggest that the risk is higher with Operation A. However, looking at the confidence interval we can see that it encloses 1, i.e. equivalence. Therefore, if this is the true value, we can state that there is equivalent risk with both operations. There is also the possibility that there is less risk with Operation A, because the interval encloses 0.6. This finding is tempered by the fact that the confidence interval extends in the opposite direction to a greater than ninefold risk with Operation A. The reader is thus endowed with far greater knowledge regarding the difference between the two groups than that provided by a p value alone. For this reason, statistical analysis should always look at confidence intervals so as to show the direction and magnitude of effect. Only then can we determine whether a statistically significant p value actually has clinical significance. However, we must always remember that confidence intervals, like p values, are not immune to errors in study design or bias, and that the intervals should always be regarded as the smallest estimate of the real error in defining the true population value.

Diagnostic tests

When we are looking at diagnostic tests for conditions, we need to be aware of certain statistical definitions. Usually we are interested in comparing the results of a new diagnostic test with an established reference test or standard for diagnosing the condition. If we plot a two by two table for all possible
Table 2
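Although Table 2 itself is not reproduced here, the definitions usually derived from such a two by two comparison against a reference standard – sensitivity and specificity – can be sketched as follows, with invented counts:

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Sensitivity and specificity of a new test against a reference standard.

    tp/fn: reference-positive patients the new test did/did not detect;
    fp/tn: reference-negative patients the new test did/did not mislabel.
    """
    sensitivity = tp / (tp + fn)  # proportion of true cases detected
    specificity = tn / (tn + fp)  # proportion of non-cases correctly excluded
    return sensitivity, specificity

# Invented counts: 90 of 100 cases detected, 80 of 100 non-cases excluded.
sens, spec = diagnostic_summary(tp=90, fp=20, fn=10, tn=80)
print(sens, spec)  # 0.9 0.8
```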
In summary, we can see that statistical tests can only be meaningfully and appropriately applied if we understand the properties of our dataset and base our analysis on a suitable research question. However, the statistical tests themselves are only half the battle; the remainder is how to interpret the results to generate credible findings. Table 3 summarizes the common errors in both analysis and interpretation, to conclude this discourse and serve as a reminder of the main points to consider before embarking on statistical analysis.

Table 3 Common errors in statistical analysis and interpretation

Data analysis
• Compressing data/changing continuous data to ordinal data
• Using inappropriate measures of central tendency, e.g. means for skewed data
• Not assessing normality of frequency distribution
• Incorrectly applying parametric tests when assumptions not satisfied
• Paired data analyzed as independent groups
• Inappropriate methods of assessing agreement
• Inappropriate multiple testing with no correction

Results interpretation
• Significance rather than actual p values quoted