
CLINICAL TRIALS
ARTICLE
Clinical Trials 2013; 10: 225–235

Central statistical monitoring: Detecting fraud in clinical trials
Janice M Pogue (a,b,c), PJ Devereaux (a,b), Kristian Thorlund (a) and Salim Yusuf (a,b)

Background Central statistical monitoring in multicenter trials could allow trialists
to identify centers with problematic data or conduct and intervene while the trial is
still ongoing. Currently, there are few published models that can be used for this
purpose.
Purpose To develop and validate a series of risk scores to identify fabricated data
within a multicenter trial, to be used in central statistical monitoring.
Methods We used a database from a multicenter trial in which data from 9 of 109
centers were documented to be fabricated. These data were used to build a series of
risk scores to predict fraud at centers. All analyses were performed at the level of the
center. Exploratory factor analysis was used to select from 52 possible predictors,
chosen from a variety of previously published methods. The final models were
selected from a total of 18 independent predictors, based on the factors identified.
These models were converted to risk scores for each center.
Results Five different risk scores were identified, and each had the ability to discriminate
well between centers with and without fabricated data (area under the curve
values ranged from 0.90 to 0.95). True- and false-positive rates are presented for
each risk score to arrive at a recommended cutoff of seven or above (high risk score).
We validated these risk scores, using an independent multicenter trial database that
contained no data fabrication and found the occurrence of false-positive high scores
to be low and comparable to the model-building data set.
Limitations These risk scores have been validated only for their false-positive rate
and require validation within another trial that contains centers that have fabricated
data. Validation in noncardiovascular trials is also required to gauge the usefulness of
these risk scores in central statistical monitoring.
Conclusions With further validation, these risk scores could become part of a series
of tools that provide evidence-based central statistical monitoring, which in turn can
improve the efficiency of trials, and minimize the need for more expensive on-site
monitoring. Clinical Trials 2013; 10: 225–235. http://ctj.sagepub.com

(a) Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada; (b) Faculty of Health Sciences, McMaster University, Hamilton, ON, Canada; (c) Population Health Research Institute, Hamilton Health Sciences, Hamilton, ON, Canada
Author for correspondence: Janice M Pogue, Population Health Research Institute, Hamilton Health Sciences, General Campus, DBCVSRI 237 Barton Street East, Hamilton, ON L8L 2X2, Canada.
Email: Janice.Pogue@phri.ca

© The Author(s), 2012
Reprints and permission: http://www.sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1740774512469312

Introduction
The goals of any clinical trial can be achieved if there are sufficient procedures in place to reduce bias and maximize precision in the outcome of interest. Center investigators and study personnel may make a variety of errors, deviations, or misconducts while undertaking a trial, including procedural errors in failing to follow the protocol, and data recording errors, including, in rare cases, falsifying data [1].

Not all of these behaviors will lead to biased trial results or increased variability in outcomes [1–8], but for those that do, trialists would like to routinely detect and stop these actions while the trial is ongoing. Trialists are then left with the question of how best to detect and correct important errors at centers within a multicenter trial.

It has been suggested that central statistical monitoring may serve as the foundation for quality assurance and center monitoring in multicenter clinical trials [1,2,4,5]. Traditionally, quality assurance for clinical trials has included trial oversight committees (trial management committee, trial steering committee, and data monitoring committee); central monitoring of center performance, including central statistical monitoring; and on-site monitoring [1]. The first two monitoring methods are used universally, but the use of on-site monitoring visits differs among trials. Morrison et al. [9] surveyed 65 research organizations involved in the conduct of clinical trials and found the use of a wide variety of monitoring practices. Of the academic, government, or cooperative research groups, only 31% always used on-site monitoring, whereas 84% of the industry/contract research organization (CRO) group relied on this approach. Among the various monitoring methods, on-site visits are expensive, yet we have little evidence that they affect the validity of the main findings of an otherwise unbiased trial. While Shapiro and Charrow [10] have documented that Food and Drug Administration (FDA) audits have identified serious deficiencies at trial centers, they did not classify these deficiencies as capable of changing individual clinical trial results or not. There are no published systematic reviews estimating the effect of on-site monitoring on trial results, but the experiences from some individual programs and trials have been published. The National Cancer Institute found that instituting an on-site auditing program did not change the agreement rate for treatment failures or the percentage of protocol deviations within their program [11]. The National Surgical Adjuvant Breast and Bowel Project (NSABP), in reviewing all its participants' data, found no additional treatment failures or deaths, and only a small number of previously unknown ineligible participants. An audit of the Global Utilization of Streptokinase and TPA for Occluded Coronary Arteries (GUSTO) trial also found no errors that would change the trial's results [12]. It has been suggested that regular on-site visits to review entire case report forms are excessive in cost and result in little gain in important data quality [4,12,13]. Given the lack of clear evidence of the value of on-site monitoring to detect biases, and their high costs, which is reflected in less use in investigator-initiated trials, it has been suggested that central statistical monitoring may be able to assist in identifying sites where there is a higher probability of concerns and for which there should be consideration for on-site monitoring [1,2,4,5].

The purpose of this article is to retrospectively evaluate the ability of central statistical monitoring to identify centers that are known to have fabricated trial data. The identification of fabricated data is an important function of trial monitoring. Although falsified data may not alter the results of a large multicenter trial, at least for blinded trials [1–3,6,7], it could reduce the precision of the trial results. Unfortunately, even isolated and small amounts of fraud within a trial can cause significant doubts about its conclusions and have the potential to lead to a lack of public confidence in the clinical trial process in general [1–3,5,6,14–18].

Many authors have suggested that central statistical monitoring of data may be used to identify fraud at centers within multicenter trials [1–3,5,6,14,18–25]. Case reports have been published describing how an individual center with fraudulent data was detected by examining center-level data during the conduct of trials including Animal Models for Protecting Ischemic Myocardium (AMPIM) [14], Multiple Risk Factor Intervention Trial (MRFIT) [6], NSABP-06 [15], Second European Stroke Prevention study [21], Clopidogrel and Metoprolol in Myocardial Infarction Trial (COMMIT-1) [19], and Perioperative Ischemic Evaluation (POISE) [26]. Al-Marzouki et al. [20] compared a diet intervention database that was thought to be fabricated to a sample from a validated study. In comparing the intervention to the control group within each data set for a variety of laboratory measurements, the authors found more statistically significant differences between the two randomized groups within the data set that was thought to be fabricated.

Any prognostic model must begin with a literature review of previously published articles within that area, and the difficulty with central statistical monitoring is that a great many factors have been suggested to predict fabricated data. A very large number of possible statistical methods have been proposed to identify unusual patterns at centers, and these are summarized in Table 1. Such a long list provides no direct guidance as to which statistical methods are most appropriate for central statistical monitoring. Yet, there is even more uncertainty in how to proceed given that within a typical trial the types of variables that these methods may be applied to include inclusion and exclusion criteria, medical history, physical measurements, laboratory tests, study-specific tests, visit or test dates, compliance, outcomes, adverse events, completeness of follow-up, data query rates, time to clean data, center enrollment, number of participants seen per day, or calendar day of visits. Given such a long list of possible methods and even a modest number of variables collected per trial participant, there is a need
for a parsimonious and efficient approach to central statistical monitoring. This article seeks to evaluate some of the methods for central statistical monitoring using statistical summaries and tests to establish a series of good prognostic models to identify centers with fabricated data within a multicenter trial. Ideally, such a model would be a simple algorithm that could be used widely in different types of multicenter trials, at modest costs.

In this article, we use data from the POISE Trial [26] to retrospectively model confirmed data fabrication at centers within a multicenter trial.

Table 1. Statistical methods suggested for identifying fabricated data

Classification                                        Type                       Method
Univariate: examining one variable at a time          Statistical summaries      Proportions [1,2,5,19,20]
                                                                                 Means [1,2,5,19,20]
                                                                                 Center event rates [1,2,14,18,19,21]
                                                                                 Variances [2,5,14,20]
                                                                                 Digit preference [1,2,20,25]
                                                                                 Calendar checks (day of week) [1,2]
                                                                                 Benford's law [1,2,25,27,28]
                                                                                 Skewness [1,2]
                                                                                 Kurtosis and outliers [1,2,5,25]
                                                                                 Inliers [2,25]
                                                      Statistical tests          t-test [2,20]
                                                                                 Chi-square tests [2,5,20]
                                                                                 F-test for variances [2,20,21]
                                                      Purely graphical methods   Histograms [2,25]
                                                                                 Q-Q plot [2,25]
                                                                                 Box plots [2,25]
Multivariate: examining combinations of variables     Statistical summaries      Date comparisons [2,5,25]
                                                                                 Correlations [2,5,25]
                                                                                 Intraclass correlations [2]
                                                                                 Autocorrelation [2]
                                                                                 Mahalanobis distance [2]
                                                                                 Cook's distance [2]
                                                      Statistical tests          Cluster analysis [2]
                                                                                 Discriminant analysis [2]
                                                                                 Hotelling's T² [2]
                                                                                 Runs tests [2]
                                                      Purely graphical methods   Scatter plots [2,5,14,25]
                                                                                 Chernoff faces [2,25]
                                                                                 Star plots [2,25]

Methods

The POISE Trial

The POISE Trial [26] examined the effect of a perioperative beta-blocker versus placebo in participants at risk of cardiovascular events who were undergoing noncardiac surgery. Initially there were 9298 trial participants randomized from 196 clinical centers from 24 countries, but data fabrication was detected in 9 centers involving a total of 947 randomized trial participants. During this trial, fraud was detected initially at the six participating centers in Iran, all managed by a common research team. Here, problems were first detected when a center called in to randomize a patient when they should not have had any available drug kits left. An on-site audit organized by the National Principal Investigator took place, and only minor problems were reported. However, a former center staff member claimed fraud had occurred in Iran.

Subsequent on-site monitoring by the Study Principal Investigator found massive amounts of data fabrication. We discovered that many submitted electrocardiograms and troponin values did not belong to the patients for whom they were submitted. Commonly, the date of the test was covered up and another date was reported. We also established that the laboratory computer had deleted the troponin results that were beyond 6 months old and that many official laboratory troponin reports did not match what was in the laboratory computer. Consult notes were frequently made up on hospital stationery, indicating that a patient had suffered an event (e.g., a myocardial infarction) when the actual hospital chart did not mention or support such an outcome. Through review of hospital charts, we also identified patients who had suffered a major perioperative cardiovascular complication, but the submitted case report forms indicated that these patients had not suffered such an event. We identified what appeared to be a fabricated death certificate based upon a phone call with the patient's daughter, who informed us that the patient was alive and vacationing in another city. We established that none of the 59 patients randomized and reported to have had surgery at one Iranian center actually had surgery at that center. These patients had surgery at the other Iranian centers. Furthermore, many patients were inappropriately randomized after they had surgery, and their date of surgery was falsified on the case report forms. All records from the randomized patients at these six centers (N = 752) were considered to be fabricated and were omitted from the POISE Trial.

Fraud also was found to have occurred at three centers in Colombia, which shared a common research assistant. Problems were first detected when another center staff member failed to locate some consent forms for patients enrolled by a specific research assistant. An on-site audit, first by the National Principal Investigator and subsequently by the Study Principal Investigator, documented the
following problems. Many of the patients this research assistant had randomized could not be tracked to an actual patient, and of those patients who could be identified, many of them denied participating in POISE. Of the patients who could be identified and had consented to POISE, many were ineligible to participate in POISE and the troponin values recorded for these patients were not consistent with the timing of their actual surgery. This audit was extended to cases not associated with this research assistant, both at the three centers at which this research assistant worked and the other eight Colombian centers, and these cases did not demonstrate any concerns. From these three centers, data from 195 patients (26% of total randomized) were considered to be fabricated and were omitted from the final trial analysis [26].

The accuracy of the data from the rest of the POISE centers was verified by on-site monitoring for hospitals that recruited 40 or more participants and for any other center that was identified as an outlier through central statistical monitoring. This on-site monitoring was completed at 77 centers that had collectively randomized 85% of all trial participants. No major discrepancies, outside of those reported above, were identified between the submitted data and the hospital records. One unreported myocardial infarction was identified out of 534 reported primary outcomes, but no other instances of unreported outcomes or fabricated data were found.

Only the 109 centers that randomized 20 trial participants or more were included in this analysis, so that we could appropriately use statistical tests to summarize data patterns within and between centers. We purposefully planned to use statistical summaries or tests within our models, so that they could potentially be generalized across different studies, regardless of the number of centers or variables collected. We avoided using purely graphical methods as these can have subjective interpretations, and we preferred models that can be validated in an objective manner.

Statistical methods

The analytic strategy for model building and validation is summarized in Table 2. There are multiple valid strategies for developing risk models, and the strategy we opted to use should be considered only as one possible option. We also decided to develop five models instead of a single one, recognizing that data fabrication may be detected in multiple ways. All analyses were performed in SAS 9.1 for Unix [29].

Table 2. Strategy for model building

1. Identify possible predictors of data fabrication at a center     Based on prior publications
2. Summarize predictors at a center level in a unit-less form       Calculate p-values for predictors comparing each center versus all others
3. Eliminate redundancy among the predictors                        Factor analysis
4. Build possible models to predict data fabrication                Logistic regression
5. Convert model regression coefficients to scores                  Utilize a points system [30]
6. Validate the scores externally                                   Apply to an independent multicenter trial database

Identifying possible predictors

Based on prior research, we selected a variety of variable types including the baseline characteristics presented in the baseline characteristics table within the original POISE publication, when the characteristic was present in at least 10% of the trial participants. Other possible predictors of fraud that we included were those with repeated physical measurements over time, compliance and outcome rates, data quality/query rates, and total numbers of participants randomized. See online Appendix A (Table A1) for a complete list of all variables initially included as potential predictors.

Summaries by center

Although it is the individual at a center who commits fraud, most trials, including POISE [26], do not record which center staff member is responsible for the origins of each data item. For this reason, the center was chosen to be the unit of analysis for these models, with all variables summarized at the center level. Furthermore, since all trials measure different variables, the summaries have to be unit-less, derived from statistical distributions and tests, rather than in their native units. For each variable, the focus was to present ways of showing how different each center was from the others. This assessment of center differences did not assume any directionality (e.g., high or low rates) but only sought to quantify how distinct the data were at individual centers. The one exception to this focus was the summary of repeated measurements over time, where the summary statistic was already unit-less (see test (7)). The following tests were used:

(1) Frequency comparison. For binary data (e.g., age >70 years), the proportion with this characteristic at each center was compared to the overall proportion from all other centers using a two-by-two Pearson chi-squared test. The probability value (p-value) from this test for each center
was used as a potential predictor for model building.

(2) Mean comparison. For continuous measurements (e.g., systolic blood pressure), which were likely to be approximately normally distributed, a two-group Student's t-test, using a pooled standard deviation, was used to compare the mean at each center to the overall mean of other centers. For each center, the two-tailed p-value from this test was used as a potential predictor.

(3) Digit preference. We compared the frequency of the last digit recorded for physical measurements over all trial participants at each center. The frequency of these digits was compared to the overall frequency from all other centers using a two-by-two Pearson chi-squared test. The p-value from this test for each center was used as a potential predictor for model building. This test was also used to assess the day of week on which randomization occurred. This method assumes no distribution for the digits or days but merely tests whether a center is different from the other centers.

(4) Variance comparison. The variability of measurements at centers was compared using a folded F-test, contrasting the variance of a possible predictor at each center to the variance of the rest of the centers. For each center, a p-value from this test was used as a potential predictor.

(5) Distance measure. Using data from the $i$th trial participant at the $j$th center, we calculated a distance measure ($d_j$) to indicate how far away one center's data are from the overall mean ($\bar{y}$) across all centers, standardized by the overall standard deviation ($s$) [25,31]. The natural logarithm of distance was used as a possible predictor.

$$ d_j = \sum_i \left( \frac{y_{ij} - \bar{y}}{s} \right)^2 $$

(6) Outcome probability. We calculated the outcome rate at each center adjusted for country. We then calculated the probability of observing an adjusted outcome rate as extreme as that observed at that center, assuming a Poisson distribution with the overall adjusted mean. This adjustment for country variation was used to control for the commonly observed patterns of different outcome rates in different countries, due to many factors, including use of concomitant medications, differing health-care systems, and a variety of other factors. The probability from the cumulative probability distribution (CDF) for each center was used as a possible predictor.

(7) Repeated measures. The correlation of repeated measurements may be a good approach to detect fabricated data. The consistency of continuous measurements recorded multiple times in follow-up (e.g., systolic blood pressure) was quantified by an intraclass correlation coefficient (ICC) calculated for each center.

Table A1 (online Appendix A) lists each possible predictor and the statistic(s) used to assess it. A total of 52 potential predictors were generated to be used to predict fabricated data.

Eliminating redundancy

Given the long list of suggestions to identify fabricated data, we have no clear a priori direction as to which predictors should be included in building a prognostic model. Additionally, in any long list of possible predictors, there is likely to be correlation among them, but since all predictors are summarized at the center level, rather than for individual trial participants, these correlations may be challenging to predict. Therefore, an initial principal components factor analysis was done to identify a subset of independent variables for centers to use in predictive modeling. The exploratory principal component analysis with varimax rotation was used to identify a subset of independent variables, with the total number of factors identified being chosen using the Kaiser criterion of eigenvalues greater than one. For each identified factor, one variable with the largest loading score was then included in the initial models. This reduced set of variables was used to predict fraudulent centers in a logistic regression.

Building possible models

We used best subsets selection with the branch and bound algorithm of Furnival and Wilson [32] to find the models with the largest score statistic for different numbers of included variables. The final series of models was selected based on no significant increase in the score test for increasing the number of variables in the model. These models were checked for lack of fit using the Hosmer and Lemeshow [33] goodness of fit test. Prediction ability was summarized as the area under the curve (AUC) with 95% confidence intervals (CIs) and partial AUC.

Converting models to risk scores

For ease of use, the resulting models were then converted into simplified risk scores using a points system [30]. Here, a reference category is selected and whole number points are given to the range of the predictor, relative to its regression coefficient from the logistic regression. We then identified cutoffs of
these scores to most efficiently identify the fraudulent centers.

Model validation

Since fabricated data sets are rare, external validation of the model was done by calculating the risk scores in an independent trial that had on-site monitoring and for which no center was identified as having fabricated data. If the fabricated data score was valid, we would anticipate low scores for all centers within this second trial. For purposes of model validation, the multicenter Heart Outcomes Prevention Evaluation (HOPE) Trial [34] data were used.

Results

A total of 8722 randomized participants from 109 clinical sites were included in this analysis, 947 participants with fabricated data from 9 sites and 7775 participants with validated data from 100 sites. From the 52 possible predictors specified, factor analysis identified 18 independent factors. From this, the variable with the largest loading for each of the 18 factors was selected for inclusion in logistic regression modeling to predict sites with fraud. See Table A2 in online Appendix A for the associations between predictors and the reduced list included in model building.

Based on the score function, any possible model of more than three predictors did not significantly add to the models. The five 3-variable models with the highest score statistics were selected as possible predictive models. For each of these models, no significant lack of fit was detected. See Table A3 of online Appendix A for the logistic regression coefficients for the five models.

These models all included the intraclass correlation coefficient for repeated systolic blood pressure measurements, with larger values indicating higher risk of fraud. The second term included in each model was the diastolic blood pressure comparison of each center's mean versus that of the other centers. Here, risk increased as the t-test p-value increased toward 1. This indicated that after adjusting for the high ICC in the first term, a center's mean diastolic blood pressure at baseline that is very similar to that of other centers predicts a risk of fabricated data. Given that there is a very high regularity in the values of the systolic blood pressure over time (high ICC), it is unusual for the same center to have a diastolic blood pressure mean that closely resembles other centers.

The third term for each model differed. These included the following: systolic blood pressure digit preference p-value (model 1), intrathoracic or intraperitoneal surgery frequency p-value (model 2), general anesthesia frequency p-value (model 3), preoperative angiotensin-converting-enzyme inhibitor (ACE-I) or angiotensin II receptor blocker (ARB) frequency p-value (model 4), or compliance rate CDF p-value (model 5). For the statistics in models 1–4, a higher p-value indicated a greater risk of fraud for that center. Again, given the high consistency of systolic blood pressure over time, and similarity of diastolic blood pressure at baseline, centers with data that were very similar to the overall summary statistics were predicted to be at risk of fraud. For the fifth model, lower CDF probability for compliance was associated with fraud. As the center compliance rate became more improbable (either high or low rates of compliance), there was a higher risk of being a center with fabricated data. For this last model, therefore, the points are reversed.

Table 3. Risk scores predicting fabricated data

Model terms                                                                     Category   Score
Predictor 1: SBP over time intraclass correlations                              1          +0
                                                                                2          +1
                                                                                3          +2
                                                                                4          +3
                                                                                5          +4
Predictor 2: DBP mean comparison t-test p-values                                1          +0
                                                                                2          +1
                                                                                3          +2
                                                                                4          +3
                                                                                5          +4
Predictor 3 (differs by model):
  Model 1: SBP digit preference χ² p-values
  Model 2: surgery intrathoracic or intraperitoneal frequency χ² p-value
  Model 3: anesthesia general frequency χ² p-value
  Model 4: ACE-I/ARB χ² p-value
  Model 5: compliance outcome probability CDF (points in brackets)
                                                                                1          +0 (+4)
                                                                                2          +1 (+3)
                                                                                3          +2 (+2)
                                                                                4          +3 (+1)
                                                                                5          +4 (+0)

SBP: systolic blood pressure; DBP: diastolic blood pressure; ARB: angiotensin II receptor blocker; ACE-I: angiotensin-converting-enzyme inhibitor; CDF: cumulative probability distribution.
Points reversed for model 5 only and provided in brackets.
Categories: intraclass correlation and p-values 1 = ≤0.20, 2 = 0.21–0.40, 3 = 0.41–0.60, 4 = 0.61–0.80, and 5 = 0.81+.

Table 3 displays these predictive models after conversion of the logistic regression coefficients to a simple scoring system. Figure 1 displays these five risk scores for each center with fraudulent centers
indicated in red. All five scores could discriminate well between fraudulent and validated centers with AUC values ranging from 0.90 (95% CI: 0.81–0.99) for model 2 to 0.95 (95% CI: 0.90–1.00) for model 1, with this latter model having a narrower 95% CI. Figure 1 shows that the majority of the centers with fraud have higher scores than those centers without fabricated data.

Figure 1. Five possible models. [Plot of center scores (0 to 12) for Models 1 to 5, with centers marked as Fraud or None. ROC area (95% CI): Model 1, 0.95 (0.90–1.00); Model 2, 0.90 (0.81–0.99); Model 3, 0.91 (0.84–0.98); Model 4, 0.90 (0.82–0.98); Model 5, 0.94 (0.88–1.00).]
ROC: receiver operating characteristic.

In using these scores within an active multicenter trial, one would have to select a cutoff value and examine or monitor all centers with this score or greater. Table 4 shows the effect of using various risk score cutoffs on the number of centers selected by fraud status. For model 1, examining centers with scores of seven or above would have detected 8 of the 9 fraudulent centers (89%) and involve detailed examination of 18 (17%) of the total centers within the trial, including false-positive scores for 10 centers (10%) with no fabricated data. For model 5, 24 centers (22%) would have high fraud scores, 8 (89%) with fraud and 16 (16%) with false-positive high scores.

Similar variables to those included within model 1 were also measured within the HOPE Trial, which randomized 9541 participants at 281 centers [34]. Using the 180 centers that had randomized at least 20 participants into the trial, we tested the various models using similar variables to those collected in POISE. Models 1, 2, 4, and 5 scores were calculated for each center, and a score of seven or greater was defined as a high fraud risk score. The score for model 1 was the only risk score that contained virtually equivalent variables to the model for POISE. For the remaining models, we selected relatively equivalent variables. We used HOPE diabetes inclusion criteria instead of surgery or anesthesia type, concomitant beta-blocker use instead of ACE-I or ARB use, and 75% compliance to 2 years instead of 80% compliance at 30 days. Figure 2 presents the model 1 scores for these HOPE centers alongside those for POISE. Within the HOPE data set, only 2 of the 180 centers had a model 1 risk score of at least seven (1.1%). For models 2, 4, and 5, the number of false-positive scores was 13 (7.2%), 11 (6.1%), and 23 (12.8%), respectively. These false-positive rates are comparable to those observed in POISE (see Table 4 for score 7), and even lower for models 1 and 5.

Table 5 shows the outcome rates and treatment effects for the POISE trial with and without the fraudulent data. The inclusion of the fraudulent data in the trial database did lead to minor variations in outcome rates, treatment estimate, 95% CIs, and p-values. However, these small differences did change the interpretation of which outcomes were statistically significant, at a p-value of 0.05, for the primary outcome and cardiovascular death.

Discussion

The process of quality assurance in multicenter trials will have multiple components including oversight by trial committees, site training and communication, data cleaning and checking, central statistical monitoring, and on-site monitoring [1,5]. All trials would benefit from the careful evaluation of how to use each of these components individually and in

Table 4. Sensitivity of score cutoffs to detecting centers with fraud (n = 9 of 109 centers)

Risk score cutoff                       4          5          6          7          8          9

Model 1   Fraud (true positive)         9 (100%)   9 (100%)   8 (89%)    8 (89%)    6 (67%)    4 (44%)
          No fraud (false positive)     42 (42%)   26 (26%)   23 (23%)   10 (10%)   4 (4%)     0 (0%)
          Total high scores             51 (47%)   35 (32%)   31 (28%)   18 (17%)   10 (9%)    4 (4%)
Model 2   Fraud (true positive)         9 (100%)   7 (78%)    6 (67%)    5 (56%)    2 (22%)    2 (22%)
          No fraud (false positive)     41 (41%)   22 (22%)   8 (8%)     4 (4%)     1 (1%)     1 (1%)
          Total high scores             50 (46%)   29 (27%)   14 (13%)   9 (8%)     3 (3%)     3 (3%)
Model 3   Fraud (true positive)         9 (100%)   8 (89%)    6 (67%)    4 (44%)    3 (33%)    3 (33%)
          No fraud (false positive)     36 (36%)   20 (20%)   14 (14%)   3 (3%)     2 (2%)     0 (0%)
          Total high scores             45 (41%)   28 (26%)   20 (18%)   7 (6%)     5 (5%)     3 (3%)
Model 4   Fraud (true positive)         9 (100%)   8 (89%)    7 (78%)    4 (44%)    4 (44%)    1 (11%)
          No fraud (false positive)     46 (46%)   25 (25%)   15 (15%)   6 (6%)     1 (1%)     1 (1%)
          Total high scores             55 (51%)   33 (30%)   22 (20%)   10 (9%)    5 (5%)     2 (2%)
Model 5   Fraud (true positive)         9 (100%)   9 (100%)   9 (100%)   8 (89%)    6 (67%)    4 (44%)
          No fraud (false positive)     65 (65%)   47 (47%)   33 (33%)   16 (16%)   7 (7%)     1 (1%)
          Total high scores             74 (68%)   56 (51%)   42 (39%)   24 (22%)   13 (12%)   5 (5%)

This table shows the true-positive and false-positive counts and percentages for each model defined at various cutoffs. A true positive is a high score (at or
above the cutoff) for a center with fraud, and a false positive is a high score for a center without fraud. Percentages for centers with fraud are out of a total
of 9 centers and false positives are from 100 centers. The total number of centers with high scores per cutoff indicates the number of centers where some
action would be required by the trial management group (e.g., on-site visits).
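
The counts in Table 4 follow from a simple tabulation of center scores against known fraud status. The sketch below is illustrative only: the function and field names are hypothetical, and the score list is invented so that it reproduces the model 1 counts at a cutoff of seven (8 of 9 fraudulent centers and 10 of 100 valid centers flagged). It does not use the actual POISE center scores, and the counts it gives at other cutoffs do not match Table 4.

```python
from typing import NamedTuple, Sequence


class CutoffSummary(NamedTuple):
    cutoff: int
    true_positives: int         # fraudulent centers scoring at or above the cutoff
    false_positives: int        # valid centers scoring at or above the cutoff
    sensitivity: float          # true_positives / number of fraudulent centers
    false_positive_rate: float  # false_positives / number of valid centers
    flagged_fraction: float     # all high-scoring centers / all centers


def summarize_cutoff(scores: Sequence[int], fraud: Sequence[bool], cutoff: int) -> CutoffSummary:
    """Count high scores (at or above the cutoff) by known fraud status."""
    if len(scores) != len(fraud):
        raise ValueError("scores and fraud must have one entry per center")
    n_fraud = sum(1 for f in fraud if f)
    n_valid = len(fraud) - n_fraud
    if n_fraud == 0 or n_valid == 0:
        raise ValueError("need at least one center of each status")
    tp = sum(1 for s, f in zip(scores, fraud) if f and s >= cutoff)
    fp = sum(1 for s, f in zip(scores, fraud) if not f and s >= cutoff)
    return CutoffSummary(cutoff, tp, fp, tp / n_fraud, fp / n_valid, (tp + fp) / len(scores))


# Invented scores (an assumption, not trial data): 9 fraudulent centers followed
# by 100 valid centers, arranged to match the model 1 row of Table 4 at cutoff 7.
scores = [7] * 8 + [6] + [7] * 10 + [3] * 90
fraud = [True] * 9 + [False] * 100

for cutoff in (6, 7, 8):
    print(summarize_cutoff(scores, fraud, cutoff))
```

Note that the two rates use different denominators, as in the table notes: sensitivity is out of the 9 fraudulent centers, the false-positive rate is out of the 100 valid centers, and the total of high scores is out of all 109 centers.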
[Figure 2 plot: center scores (vertical axis, 0 to 12) for POISE and HOPE; legend: Fraud, None.]

Figure 2. External validation of model 1 on a trial without fabricated data: A comparison of the distribution of center risk scores in POISE
(with nine fraudulent centers) and HOPE (no fraudulent centers).
POISE: Perioperative Ischemic Evaluation; HOPE: Heart Outcomes Prevention Evaluation.

combination to arrive at a database that contains minimal bias and maximizes precision for trial
outcomes.
The traditional model of frequent on-site monitoring has been estimated to take up to 30% of the total
cost of a trial, a cost that could be reduced substantially with the use of central statistical
monitoring [4,13]. However, cost is not the only issue with traditional on-site monitoring. The process
of data collection in trials is changing, with greater use of electronic data capture (EDC). Paper
records may no longer exist at sites to compare with values within the trial database. Information is
now commonly entered directly onto an electronic device during an interview with a


Table 5. Impact on the results of the POISE Trial of including or excluding centers with fraudulent data

Without fraud data

                                          Metoprolol    Placebo       Metoprolol versus placebo
                                          n (%)         n (%)         HR     95% CI       p-value

Total participants                        4174          4177
Primary: CV death, MI, cardiac arrest     244 (5.8%)    290 (6.9%)    0.84   0.70–0.99    0.040
CV death                                  75 (1.8%)     58 (1.4%)     1.30   0.92–1.83    0.137
Non-fatal MI                              152 (3.6%)    215 (5.1%)    0.70   0.57–0.86    <0.001
Non-fatal cardiac arrest                  21 (0.5%)     19 (0.5%)     1.11   0.60–2.06    0.744

With fraud data

                                          Metoprolol    Placebo       Metoprolol versus placebo
                                          n (%)         n (%)         HR     95% CI       p-value

Total participants                        4648          4650
Primary: CV death, MI, cardiac arrest     284 (6.1%)    328 (7.1%)    0.86   0.73–1.01    0.064
CV death                                  83 (1.8%)     59 (1.3%)     1.41   1.01–1.97    0.044
Non-fatal MI                              182 (3.9%)    247 (5.3%)    0.73   0.60–0.89    0.001
Non-fatal cardiac arrest                  24 (0.5%)     24 (0.5%)     1.00   0.57–1.76    0.994

POISE Trial: Perioperative Ischemic Evaluation Trial; HR: hazard ratio; CI: confidence interval; CV: cardiovascular; MI: myocardial infarction.
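
As a rough check on the primary-outcome rows of Table 5, the event rates and a crude risk ratio can be recomputed from the counts. This is only a sketch: the crude risk ratio is a simpler stand-in and is not the hazard ratio reported in the table, so small differences are expected. The counts for the fraudulent centers alone are not reported in the table; here they are inferred by subtraction, which is an assumption of this example. All function and variable names are hypothetical.

```python
def event_rate(events: int, participants: int) -> float:
    """Proportion of participants with the outcome."""
    return events / participants


def crude_risk_ratio(events_trt: int, n_trt: int, events_ctl: int, n_ctl: int) -> float:
    """Ratio of event proportions (treatment / control); not a hazard ratio."""
    return event_rate(events_trt, n_trt) / event_rate(events_ctl, n_ctl)


# Primary outcome (CV death, MI, cardiac arrest) from Table 5: (events, participants)
without_fraud = {"metoprolol": (244, 4174), "placebo": (290, 4177)}
with_fraud = {"metoprolol": (284, 4648), "placebo": (328, 4650)}

# Assumption: counts contributed by the fraudulent centers, inferred by subtraction.
fraud_only = {
    arm: (with_fraud[arm][0] - without_fraud[arm][0],
          with_fraud[arm][1] - without_fraud[arm][1])
    for arm in with_fraud
}

datasets = {
    "without fraud data": without_fraud,
    "with fraud data": with_fraud,
    "fraud centers only (inferred)": fraud_only,
}

ratios = {}
for label, data in datasets.items():
    ratios[label] = crude_risk_ratio(*data["metoprolol"], *data["placebo"])
    rate_trt = event_rate(*data["metoprolol"])
    rate_ctl = event_rate(*data["placebo"])
    print(f"{label}: metoprolol {rate_trt:.1%}, placebo {rate_ctl:.1%}, "
          f"crude risk ratio {ratios[label]:.2f}")
```

Under these assumptions, the crude ratios for the two full data sets sit close to the reported hazard ratios of 0.84 and 0.86, and the inferred counts from the fraudulent centers show little difference between arms, which is consistent with the fabricated data pulling the estimate toward no effect.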

trial participant. The EDC record may be the only source document that exists in many trials. The
future of trial quality assurance must by necessity move to a process that relies predominantly on
centralized data checking. Therefore, we need to study how best to avoid bias and excessive variability
or noise by identifying important procedural errors at sites, critical data recording errors, and
fraudulent data. This article represents a first step in developing an evidence-based set of quality
assurance tools that can be used to improve the quality of the research that we undertake.
In this article, we found that the best predictors of fraud are the combination of high similarity of
repeated measurements (e.g., systolic blood pressure) estimated through a site ICC, and a higher
similarity of center baseline characteristics (e.g., diastolic blood pressure) when compared with data
from other centers. On its own, the lack of variability for repeated measurement data at a center may
in general be the most sensitive single predictor to detect possible fraud. However, it is the
combination of great regularity within the data, accompanied by very typical baseline characteristic
measurements, that is best at indicating the presence of fabricated data. Those inventing data are
creating what they imagine to be a group of typical trial participants, whose baseline characteristics
appear very normal, but the resulting means and frequencies are too similar to all the other centers.
They also fail to create the natural variability in continuous physical measurements over time and
between individuals and make these data too regular and predictable. Although no prognostic model will
ever be 100% accurate, the scores developed here had reasonable success at discriminating between
centers with and without fabricated data.
We demonstrated, in an independent database, that the calculated risk scores for centers with valid
data are low. This finding is encouraging because the POISE and HOPE trials have some important design
differences. Both are cardiovascular trials, but POISE enrolled high-risk cardiovascular patients
undergoing surgery with short-term treatment and follow-up to 30 days. However, HOPE had a secondary
prevention population, with long-term therapy and follow-up to a mean of 4.5 years.
An important limitation of this article is that the validation data set did not contain any case of
fraud, and can only demonstrate the false-positive rates for these risk scores. Further validation of
these risk scores within data sets that do contain significant numbers of fraudulent centers will be
required to fully validate the current models. There is also a need for further validation of these
models in other types of trials, within different research areas. It may be important to tailor these
scores within specific research areas, but a common pattern across all areas should include the
investigation of centers which have high regularity in physical measurements and very typical baseline
characteristics. We are hopeful that these risk scores may have equivalents in other trials that will
also lead to more effective identification of fraud. It should be noted that standard data cleaning
must be implemented within any trial, prior to calculating any fraud risk score, as random noise in the
data can easily mask the patterns we are trying to detect. Also, the percentage of fabricated data at a
center will directly influence our ability to identify it, making the rare fabricated values almost
impossible to detect through any statistical methods.
It is important to detect fraud in trials since it could add noise to trial results and reduce the
power of the trial to detect treatment effects, if they exist. In the POISE Trial [26], we found that
the fabricated data did add some noise to trial results, likely due to the relative size of the
fabricated data. Others have found that a small proportion of fabricated data did not change their
multicenter trial results [6,7,15,19]. It is possible that trials where the treatment is not masked may
be at greater risk for bias due to fabricated data. Regardless of the effect of fraud on trial results,
it is important to detect this in a trial because


it can weaken the public's trust in the validity of clinical trials [1,2,5,7].
In the future, it would be useful to build other risk scores to identify other issues with clinical
site performance, including major protocol deviations, systematic under-reporting of adverse events or
study outcomes, or problems in the consent process. One can envision central statistical monitoring in
trials as a toolbox of indices for various suspected problems at centers, guiding the monitoring
process. Further research in this area is needed to equip trialists with the tools they need to identify
those who threaten the reputation of randomized controlled trials. Use of such scores could streamline
the process of quality assurance in multicenter trials, leading to greater effectiveness and efficiency.
It is only through systematic study of trial methodology that we can identify evidence-based best
practices and arrive at sensible guidelines for the conduct of clinical trials [8].

Funding

Funding for the conduct of the POISE trial was received from the Canadian Institutes of Health Research;
the Commonwealth Government of Australia's National Health and Medical Research Council; the Instituto
de Salud Carlos III (Ministerio de Sanidad y Consumo) in Spain; the British Heart Foundation; and
AstraZeneca, who provided the study drug and funding for drug labelling, packaging, and shipping and
helped support the cost of some national POISE investigator meetings. The HOPE trial was funded by the
Medical Research Council of Canada, Hoechst-Marion Roussel, AstraZeneca, King Pharmaceuticals, Natural
Source Vitamin E Association and Negma, and the Heart and Stroke Foundation of Ontario.

Acknowledgments

The authors would like to acknowledge the contribution of Dr David DeMets for his thoughtful review and
advice on an earlier version of this article.

Conflict of interest

POISE: Dr Yusuf has received consultancy fees, research grants, and honoraria from AstraZeneca, which
provided the study drug for the POISE trial. HOPE: Dr Yusuf was supported by a Senior Scientist Award of
the Medical Research Council of Canada and a Heart and Stroke Foundation of Ontario Research Chair. The
other authors had no competing interests.

References

1. Baigent C, Harrell F, Buyse M, Emberson J, Altman D. Ensuring trial validity by data quality assurance and diversification of monitoring methods. Clin Trials 2008; 5: 49–55.
2. Buyse M, George S, Evans S, et al. The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Stat Med 1999; 18: 3435–51.
3. DeMets D. Distinctions between fraud, bias, errors, misunderstanding, and incompetence. Control Clin Trials 1997; 18: 637–50.
4. Eisenstein E, Collins R, Cracknell B, et al. Sensible approaches for reducing clinical trials costs. Clin Trials 2008; 5: 75–84.
5. Knatterud G, Rockhold F, George S, et al. Guidelines for quality assurance in multicenter trials: A position paper. Control Clin Trials 1998; 19: 477–93.
6. Neaton J, Bartsch G, Broste S, et al. A case of data alteration in the Multiple Risk Factor Intervention Trial (MRFIT). Control Clin Trials 1991; 12: 731–40.
7. Peto R, Collins R, Sackett D, et al. The trials of Dr. Bernard Fisher: A European perspective on an American episode. Control Clin Trials 1997; 18: 1–13.
8. Yusuf S, Bosch J, Devereaux P, et al. Sensible guidelines for the conduct of large randomized trials. Clin Trials 2008; 5: 38–39.
9. Morrison B, Cochran C, White J, et al. Monitoring the quality of conduct of clinical trials: A survey of current practices. Clin Trials 2011; 8: 342–49.
10. Shapiro M, Charrow R. The role of data audits in detecting scientific misconduct: Results of the FDA program. JAMA 1989; 261: 2505–11.
11. Weiss R, Vogelzang N, Peterson B, et al. A successful system of scientific data audits for clinical trials: A report from the Cancer and Leukemia Group B. JAMA 1993; 270: 459–64.
12. Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. Springer, New York, 2010.
13. Eisenstein E, Lemons P II, Tardiff B, et al. Reducing the costs of phase III cardiovascular clinical trials. Am Heart J 2005; 149: 482–88.
14. Bailey K. Detecting fabrication of data in a multicenter collaborative animal study. Control Clin Trials 1991; 12: 741–52.
15. Christian M, McCabe M, Korn E, et al. The National Cancer Institute audit of the National Surgical Adjuvant Breast and Bowel Project B-06. N Engl J Med 1995; 333: 1469–74.
16. Ranstam J, Buyse M, George S, et al. Fraud in medical research: An international survey of biostatisticians. Control Clin Trials 2000; 21: 415–27.
17. Swazey J, Anderson M, Lewis K. Ethical problems in academic research. Am Sci 1993; 81: 542–52.
18. White C. Suspected research fraud: Difficulties of getting the truth. BMJ 2005; 331: 281–88.
19. COMMIT (CLOpidogrel and Metoprolol in Myocardial Infarction Trial) Collaborative Group. Addition of clopidogrel to aspirin in 45,852 patients with acute myocardial infarction: Randomized placebo-controlled trial. Lancet 2005; 366: 1607–21.


20. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 2005; 331: 267–70.
21. The EsPS2 Group. European Stroke Prevention Study 2. Efficacy and safety data. J Neurol Sci 1997; 151: S1–S77.
22. Schraepler J, Wagner G. Identification, characteristics and impact of faked interviews in surveys. IZA DP No. 969, 2011. The Institute for the Study of Labor (IZA), www.iza.org.
23. Murphy J, Eyerman J, McCue C, Hottinger C, Kenner J. Interviewer falsification detection using data mining. In Proceedings of Statistics Canada Symposium, Ottawa, Ontario, Canada, 2005, 11-522-XIE.
24. Svolba G, Bauer P. Statistical quality control in clinical trials. Control Clin Trials 1999; 20: 519–30.
25. Taylor R, McEntegart D, Stillman E. Statistical techniques to detect fraud and other data irregularities in clinical questionnaire data. Drug Inform J 2002; 36: 115–25.
26. Devereaux P, Yang H, Yusuf S, et al. Effects of extended-release metoprolol succinate in patients undergoing non-cardiac surgery (POISE trial): A randomised controlled trial. Lancet 2008; 371: 1839–47.
27. Benford F. The law of anomalous numbers. Proc Am Phil Soc 1938; 78: 551–72.
28. Preece D. Distribution of final digits in data. Statistician 1981; 30: 31–60.
29. SAS Institute. SAS, Version 9.1 (computer program), Cary, NC, 2002.
30. Sullivan L, Massaro J, D'Agostino R Sr. Presentation of multivariate data for clinical use: The Framingham Study risk score function. Stat Med 2004; 23: 1631–60.
31. Evans S. Statistical aspects of the detection of fraud. In Lock S, Wells F (eds). Fraud and Misconduct in Medical Research. London: BMJ Publishing Group, 1996, pp. 226–39.
32. Furnival G, Wilson R. Regression by leaps and bounds. Technometrics 1974; 16: 499–511.
33. Hosmer DWJ, Lemeshow S. Applied Logistic Regression. John Wiley & Sons, New York, 1989.
34. The Heart Outcomes Prevention Evaluation (HOPE) Study Investigators. Effect of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients. N Engl J Med 2000; 342: 145–53.



Reproduced with permission of the copyright owner. Further reproduction prohibited without
permission.