Commentary
Key words: epidemiology; oral health; outcomes assessment; quality of life; data interpretation
Correspondence: Dr Georgios Tsakos, Department of Epidemiology and Public Health, UCL, 1-19 Torrington Place, London WC1E 6BT, UK. e-mail: g.tsakos@ucl.ac.uk
Submitted 2 December 2010; accepted 1 October 2011
doi: 10.1111/j.1600-0528.2011.00651.x

Assessing the subjective dimensions of oral health has become a major focus of enquiry in dentistry, and there is now a substantial body of research documenting the self-perceived oral health of patients and populations. Early contributions to this field (1-6) were concerned with changing concepts of health and models of disease and its consequences. Together these provided a conceptual and theoretical rationale for the development of indices and scales to measure the constructs defined by those models. Indeed, a number of indices have been developed (7, 8) and continue to evolve. At the same time, there has been some debate about what these indices actually measure and what they should be called (8, 9). Although most have the same format, that is they assess the frequency and/or severity of functional and psychosocial impacts associated with oral disorders, they have been variously labelled as sociodental indicators, subjective oral health status measures, oral health outcome measures, oral health-related quality of life measures or quality of life measures. A similar debate has taken place in medicine. As it does not appear to be easily resolved, Fitzpatrick et al. (10) have suggested the umbrella term patient-based outcome measures (PBOs), on the grounds that all are dependent upon what patients have to say about their health. For consistency, this is the term used in this paper, though we acknowledge that it is inaccurate for epidemiological studies, where participants are not patients and participant-based outcomes would be more appropriate.
Tsakos et al.
- Studies exploring their potential use, in combination with clinical measures, in assessing needs for dental care.
- Clinical trials measuring the effectiveness of interventions, where PBO measures are used as either primary or secondary outcomes, in addition to clinical assessments.

The evaluation of PBOs should be based on both theoretical and technical requirements. Theoretical requirements, discussed in a previous commentary (8), refer primarily to the theoretical models employed and precise definitions of the concepts measured, and provide important background information that affects the meaning and interpretation of scores. This commentary focuses on the technical requirements of PBOs in oral health.

Technical requirements

Depending on the context in which they are used and the study design employed (cross-sectional or longitudinal), the main technical requirements of a PBO measure are reliability, validity and sensitivity to change (11-13). Some are suitable for measuring between-group differences in cross-sectional population or clinic-based studies, whereas others can be more suited to measuring change in clinical trials and intervention studies. It must not be assumed that a PBO measure can perform all these tasks equally well. In essence, the measure chosen must be suited to the purpose for which it is being used and have measurement properties to match. However, in all studies using PBO measures, the fundamental aim is to detect differences between groups, either at one point in time (e.g. differences between socioeconomic groups) or over time (e.g. pre- and post-treatment differences). In the literature to date, the most common way of presenting data from these studies is in terms of aggregate scores along with an appropriate (or sometimes inappropriate) test of the statistical significance of the differences. The use of single aggregate scores is not without limitations, as shown for generic PBOs (14, 15) and also in relation to clinical periodontal indicators (16), and they should be interpreted with caution. More importantly, there is limited guidance on what constitutes clinical relevance for PBO measurements, and this also has practical implications, for example offering little help to inform power calculations for clinical trials. We argue that reporting only aggregate scores and assessing the statistical significance of differences are insufficient and suggest instead a more comprehensive and thoughtful approach to the reporting and interpretation of data.
Different scoring formats (prevalence, extent and severity) have been calculated for the OHIP-14 (18, 19) and the OIDP (19, 20). For the OHIP-14, prevalence refers to the proportion of subjects with one or more items experienced 'fairly often' or 'very often', though this cut-off is recognized as arbitrary. Extent is the number of items experienced 'fairly often' or 'very often', while severity is a simple summation of the response codes to all 14 items. Prevalence, extent and intensity have also been suggested for the Child-OIDP score calculation (21). Prevalence refers to the proportion of subjects who reported one or more daily life performances affected by their oral conditions, extent indicates the number of performances affected and intensity is used to classify subjects into groups according to their highest score in any performance. Children with the same overall score may well vary in their extent and intensity scores; higher extent indicates more daily life performances affected, whereas higher intensity indicates a more severe effect in at least one performance.

Such different scoring formats of PBOs provide complementary information and a more sophisticated approach to scoring. Reporting findings for different scoring formats for PBO measures should be encouraged as a first step towards improving their interpretability. There may also be subjects inconsistently classified by the different scoring formats. Whether such inconsistencies can be resolved by altering case definitions is something that might be explored empirically. Conceptually, it may be argued that as the different scoring formats have a different focus, the variable definition of cases is not necessarily a limitation.
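As a sketch of how these complementary scoring formats relate, the fragment below computes prevalence, extent and severity for hypothetical OHIP-14 responses. The 0-4 response coding and the cut-off value are assumptions for illustration, and the function name is invented, not taken from any published scoring script.

```python
# Illustrative OHIP-14 scoring formats: prevalence, extent, severity.
# Responses are assumed coded 0-4 per item, with 3 ("fairly often")
# and 4 ("very often") treated as the (acknowledged arbitrary) cut-off.
CUTOFF = 3

def ohip14_scores(subjects):
    """subjects: list of 14-item response lists, one list per person."""
    n = len(subjects)
    # Prevalence: proportion with >= 1 item at or above the cut-off
    prevalence = sum(
        1 for items in subjects if any(r >= CUTOFF for r in items)
    ) / n
    # Extent: per-person count of items at or above the cut-off
    extent = [sum(1 for r in items if r >= CUTOFF) for items in subjects]
    # Severity: per-person simple summation of all 14 response codes
    severity = [sum(items) for items in subjects]
    return prevalence, extent, severity

sample = [
    [0] * 14,        # no impacts at all
    [4] + [0] * 13,  # one frequent impact, low total score
    [2] * 14,        # many occasional impacts, higher total score
]
prev, ext, sev = ohip14_scores(sample)
print(prev, ext, sev)
```

Note how the third subject has the highest severity score yet zero extent, illustrating why the formats are complementary rather than interchangeable.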
The measurement of change is complex and controversial, and the use of change scores is not free from statistical limitations (22). Furthermore, change can occur in both directions, a pattern masked by aggregate change scores (23). In a trial, some individuals in the intervention or the control group may have positive change scores indicating improvement, while others may have negative change scores indicating deterioration, and some may not change. Moreover, the same mean change score can be due to relatively small changes in the same direction for the whole group or to larger changes in one direction for some subjects while others change in the opposite direction. This applies to all outcome measures, not just PBOs. We suggest that these distinct patterns of change should be reported, rather than simply providing a mean change score that fails to recognize them.

More importantly, significant differences in scores do not provide information about the key research question, which is whether the difference between groups is meaningful either from a clinical or from the patient's perspective. In line with the previous critique, if the PBO scores are meaningless, differences or changes in scores are differences between meaningless estimates. They give the direction of the difference, but without any notion of scale or (more importantly) intrinsic meaning. Statistical significance is only relevant in refuting the null hypothesis of no difference in means. A very large sample, whether in a trial or a population study, can reveal statistical significance that has little relevance or intrinsic meaning. Turned around, when planning power calculations for trials or epidemiological studies, the critical point should be to determine what is clinically meaningful, with the sample size calculated to fit. In this context, interpretability is a key issue.
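The masking effect described above is easy to demonstrate. In this sketch (invented data, not from any trial), two groups share the same mean change score despite very different patterns of change; the sign convention is an assumption:

```python
# Two hypothetical groups of change scores (follow-up minus baseline).
# Negative change = improvement on an impact scale such as the OHIP-14.
uniform_group = [-2, -2, -2, -2, -2, -2]  # everyone improves a little
polarised_group = [-8, -8, -8, 0, 4, 8]   # big gains for some, losses for others

def change_pattern(changes):
    """Summarise direction of change, as the text recommends reporting."""
    mean = sum(changes) / len(changes)
    improved = sum(1 for c in changes if c < 0)
    unchanged = sum(1 for c in changes if c == 0)
    worsened = sum(1 for c in changes if c > 0)
    return mean, improved, unchanged, worsened

print(change_pattern(uniform_group))    # (-2.0, 6, 0, 0)
print(change_pattern(polarised_group))  # (-2.0, 3, 1, 2)
```

Both groups have a mean change of -2.0, but the second contains two people who deteriorated; reporting only the mean hides this entirely.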
Interpretability
An initial distinction that needs to be made is between responsiveness and interpretability. Responsiveness represents a measure's ability to detect change when change has, or might reasonably be expected to have, occurred. Responsiveness is important but, being a technical issue, can be defined relatively easily using statistics. Interpretability, on the other hand, refers to whether these changes are clinically significant or meaningful to a person
experiencing that change. It has been defined as the degree to which one can assign qualitative meaning (that is, clinical or commonly understood connotations) to quantitative scores (13). Assessment of responsiveness implies repeated measurements, such as in a clinical trial or clinical outcome study, while interpretability is a generic concept applying to both longitudinal and cross-sectional studies. Consequently, interpretability refers to both single scores and change scores of PBOs, while responsiveness is relevant only to the latter. For presentation purposes, we will initially focus on the interpretability of change scores in longitudinal studies and then refer separately to interpretability in cross-sectional studies. A key concept in determining interpretability is the minimally important difference (MID).
Both the effect size (ES) and the standardized response mean (SRM) are conventionally interpreted against benchmarks (32) as small (0.2), moderate (0.3-0.7) or large (0.8 or above) effects. These benchmarks are useful but do not provide an actual value for the MID. In addition, as both depend on the distribution and variability of PBO scores, these measures assume a normal distribution of change scores; this needs to be demonstrated rather than simply assumed. More importantly, they are sample dependent and can be considerably affected by the dispersion of observations. So, the ES (or SRM) can be large even if the mean difference in the change scores is modest, provided that there is not much dispersion in the baseline or change scores.

The SEM is expressed in the same (original) units as the PBO measure and is more of a fixed characteristic of a measure (31), hence not sample dependent. The value of the SEM indicates what is likely to be measurement error. Therefore, any change smaller than the SEM cannot be disassociated from measurement error (26), while larger values indicate the existence of real change. However, this does not provide concrete evidence for the MID, because differences larger than the measurement error should not de facto be considered important or meaningful. Wyrwich et al. (30, 31) have provided empirical evidence that the SEM is almost equal to the MID in patients with cancer, and the same was the case in a study on periodontal patients (33); whether this holds for other conditions or groups of patients remains to be proven.

Norman et al. (34) suggested another approach based on the distribution of PBO scores. By reviewing a large number of studies, they concluded that most MIDs were approximately half the standard deviation of baseline PBO scores. Obviously, this is entirely empirical, without any conceptual justification, but it is worth checking whether it continues to provide reasonable estimates as evidence on the MID accumulates from different studies.
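A minimal sketch of the distribution-based quantities discussed here (ES, SRM, SEM and the half-standard-deviation rule of Norman et al.), using invented baseline and follow-up scores and an assumed reliability coefficient:

```python
import statistics as st

# Invented paired scores on an impact scale (lower = better)
baseline = [12, 18, 9, 22, 15, 11, 19, 14]
followup = [9, 15, 8, 16, 13, 10, 14, 12]
changes = [f - b for b, f in zip(baseline, followup)]

sd_baseline = st.stdev(baseline)  # sample SD of baseline scores
sd_change = st.stdev(changes)     # sample SD of change scores
mean_change = st.mean(changes)

es = mean_change / sd_baseline    # effect size
srm = mean_change / sd_change     # standardized response mean

reliability = 0.85                # assumed reliability (e.g. Cronbach's alpha)
sem = sd_baseline * (1 - reliability) ** 0.5  # standard error of measurement
half_sd_mid = 0.5 * sd_baseline   # Norman et al.'s empirical MID estimate

print(round(es, 2), round(srm, 2), round(sem, 2), round(half_sd_mid, 2))
```

Note how the ES and SRM for the same data differ because their denominators differ, and how the SEM and half-SD rule yield candidate MID values in the original score units.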
While distribution-based methods are internally referenced and derived solely from PBO scores, anchor-based (or externally referenced) methods use additional information to determine an external criterion of change to compare PBO scores against. In this respect, known clinical groups, population norms or subjective global transition scales can act as the reference (anchor) point. In the latter case, the MID reflects the mean PBO change score for subjects reporting transition ratings indicative of minimal important change. Transition ratings are easy to use, mirror the kinds of
questions clinicians ask patients and offer a patient-based approach to calculating a MID. However, their use is somewhat controversial, largely because their psychometric properties have been questioned (26, 35, 36), although the progressively larger PBO change scores among groups with better global transition ratings (37) provide some evidence for their construct validity. Examples of this approach to calculating the MID include the work of Juniper et al. (37) using the Asthma Quality of Life measure, and Allen et al. (38) on the OHIP-20. Using an anchor-based method is helpful for longitudinal studies, as it facilitates clinical interpretability while still retaining the richness of the PBO measure as an outcome.
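The anchor-based logic with a global transition rating can be sketched as follows. The data, category labels and sign convention are all invented for illustration; the MID is taken as the mean change score of those reporting minimal important improvement, here the 'a little better' category:

```python
# Hypothetical (change score, global transition rating) pairs.
# Negative change = improvement on an impact scale; labels are assumed.
data = [
    (-1, "no change"), (0, "no change"), (1, "no change"),
    (-3, "a little better"), (-4, "a little better"), (-2, "a little better"),
    (-8, "much better"), (-10, "much better"),
]

def anchor_mid(pairs, anchor="a little better"):
    """Mean change score among those reporting minimal important change."""
    scores = [change for change, rating in pairs if rating == anchor]
    return sum(scores) / len(scores)

print(anchor_mid(data))  # -3.0
```

The choice of which transition category counts as "minimal important change" is itself a judgement, which is one reason these ratings remain controversial.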
MID in cross-sectional studies
The concept of interpretability also applies to cross-sectional studies where the scores of two or more groups are compared. While the differences between groups may be statistically significant, the problem remains whether they are of sufficient magnitude to be regarded as meaningful, either clinically or to the individual. One might suggest that establishing the MID is less critical in cross-sectional studies, as the PBO would rarely be set up as a single primary outcome measure. Nevertheless, we would argue that a marker of interpretability in the form of a MID can still give important clinical context, provided that it is, in turn, interpreted appropriately.

The technical issues in cross-sectional studies are slightly more challenging. Two of the three distribution-based methods for the MID, the ES and the SEM, can be used with cross-sectional data. For example, in a national population survey of Canadian adults, the mean OHIP-14 severity scores were 20.1 for those with only secondary education and 18.3 for those with post-secondary education (P < 0.001). In terms of interpretability, the ES for this comparison was 0.24, i.e. small, and the SEM was 2.7. The difference does not exceed what is likely to be error and hence cannot be considered clinically meaningful. In contrast, the respective difference in mean OHIP-14 severity scores between the lowest and highest income groups was 5.8 (P < 0.001), with an ES of 0.78; this difference exceeded the SEM and should be considered meaningful. Obviously, the use of the ES and SEM in cross-sectional
studies is subject to the same limitations as when assessing changes over time. In contrast to the distribution-based methods, no guidelines have been published for using externally referenced criteria (anchor-based methods) to calculate the MID in cross-sectional studies. Applying the same principle as for longitudinal studies, it is possible to determine anchors for the MID based on differences in mean scores between known clinical groups or oral health ratings. For example, in the same Canadian data, the mean OHIP-14 severity scores of the dentate and edentate, two clinical groups whose PBO assessments could be assumed to differ to a meaningful extent, were 18.6 and 21.8, respectively (P < 0.001). This difference exceeds the SEM and gives an ES of 0.42.

While the distinction between dentate and edentate is important, it may be too broad. The dentate are a diverse group in terms of number of teeth and levels of oral disease. Differences between more refined clinical groups would be preferable, but there is no real consensus as to which groups should be used for that purpose. Considering also the indirect theoretical relationship and relatively weak associations between clinical and PBO measures, there is currently no concrete evidence that PBOs in oral health can be linked to anything other than large clinical differences.

Global ratings of either oral health or its impact on quality of life can also be employed as anchors. These are conventionally scored on ordinal scales, and differences in mean PBO scores between adjacent categories can be used to estimate the MID. However, the choice of categories to use as anchors can have a marked effect on the estimated MID, as differences will probably vary accordingly (in the previous study, they ranged from 5.74 at the bottom of the scale to 1.18 at the top). A potential solution is to use the mean of the mean differences.
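The interpretive logic applied to the survey comparisons above can be sketched by contrasting a between-group mean difference with the SEM and a pooled-SD effect size. The group means echo the education example in the text, but the standard deviations, sample sizes and second comparison are invented for illustration:

```python
# Interpret a cross-sectional between-group difference against the SEM
# and conventional effect-size benchmarks. SDs and ns are invented.
def pooled_sd(sd1, n1, sd2, n2):
    return (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5

def interpret(mean1, sd1, n1, mean2, sd2, n2, sem):
    diff = abs(mean1 - mean2)
    es = diff / pooled_sd(sd1, n1, sd2, n2)
    meaningful = diff > sem  # difference must at least exceed likely error
    return round(es, 2), meaningful

# Small difference: does not exceed the SEM, small ES
print(interpret(20.1, 7.5, 500, 18.3, 7.5, 500, sem=2.7))
# Larger difference: exceeds the SEM, moderate ES
print(interpret(22.0, 7.5, 300, 16.2, 7.5, 300, sem=2.7))
```

A statistically significant P-value could accompany either comparison; only the ES and the comparison against the SEM distinguish them.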
Again, much more empirical evidence is needed to indicate the potential usefulness of this approach. While the challenges of measuring the MID and interpreting PBO scores in cross-sectional studies are acknowledged, there is no reason why conclusions from such studies should not be subject to the same scrutiny as clinical trials. In the case of cross-sectional studies, this paper offers a diagnosis of the problem (of the interpretability of PBO scores) and also some suggestions; we hope that further debate will lead to more robust solutions. And we acknowledge that further progress in terms of interpretability, for both longitudinal and cross-sectional studies, will
described also vary. It seems logical that the interpretability of PBO scores should be context and condition (disease) specific. For instance, it cannot be assumed that the MID for periodontal patients completing a specific PBO measure will also be relevant to patients with TMJ disorders completing the same measure. Such information should be compiled from different cross-sectional and longitudinal studies using PBOs, in the same fashion as for the re-establishment of the psychometric properties of a measure. As a result, a useful body of knowledge on the MID for different PBO measures and populations will gradually be built up and could be stored in a database to facilitate the planning of future studies.
Table 1. Minimum reporting standards for studies using patient-based outcome measures

                                                   Cross-sectional   Longitudinal
Description
  Mean                                             X                 X
  Median                                           X                 (X)
  Alternative scoring formats                      (X)               X
  Change scores distribution
  (improvement; no change; deterioration)          -                 (X)
Interpretation
  Statistical significance                         X                 X
  Effect size                                      X                 X
  Standardized response mean                       -                 X
  Standard error of measurement                    X                 X
  Global ratings (oral health/quality of life)     X                 X
  Well-established clinical groups/benchmarks      X                 X
Where the distribution of scores is skewed, medians rather than means should be used; means are disproportionately affected by outliers, and their reporting may mask the real situation and mislead the discussion. Once the MID is established, it is also important to know what proportion of the sample reported improvement equal to or greater than the MID.

Differences between groups are conventionally assessed through hypothesis tests of statistical significance. In trials, the standard practice is to compare post-treatment scores between groups using ANCOVA to account for the effect of baseline scores. Tests of statistical significance are widely used but are insufficient for decision making, as they provide no information about the magnitude of the difference or whether it has any clinical or public health importance. By addressing this issue, the MID can give meaning to otherwise meaningless PBO scores and guide the interpretation of these differences.

The MID should preferably be calculated through different methods. For distribution-based methods, this implies calculating the ES, SRM and SEM for longitudinal studies, and the ES and SEM for cross-sectional studies. For anchor-based methods, global ratings of oral health (and/or quality of life) should be included in studies using PBOs: current ratings in cross-sectional studies and ratings of change in longitudinal studies. In addition, well-established clinical groups or clinical benchmarks can also be used as anchors for the MID. However, clinical benchmarks require consensus about what constitutes minimum meaningful change in clinical status, and this is still an unresolved issue in most fields, including oral health. Furthermore, clinical benchmarks for calculating the MID in PBOs need to be relevant to, and correspond with, oral health perceptions, for example through clinical manifestations that are recognized by the person. This is not straightforward, particularly for conditions that are silent for part of their progress.
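The point about reporting the proportion of the sample improving by at least the MID reduces to a simple responder count. Change scores, sign convention and MID value here are all invented:

```python
# Proportion of a sample whose improvement meets or exceeds the MID.
# Negative change = improvement; all values invented for illustration.
changes = [-6, -4, -3, -1, 0, 2, -5, -2, -7, 1]
MID = 3  # assumed minimally important difference, in score units

responders = sum(1 for c in changes if -c >= MID)
proportion = responders / len(changes)
print(f"{responders}/{len(changes)} improved by at least the MID "
      f"({proportion:.0%})")
```

Reporting "50% improved by at least the MID" conveys considerably more than a mean change score alone.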
After using different methods, the value of the MID should be determined by triangulating on a single value or small range of values (28, 29). This is easier when the different MID estimates are close to each other, but becomes more difficult when there is larger variation between them. While not all of these recommendations are free of limitations or applicable to all cases, it is worth expanding on the current, very narrowly focused practice and applying them to the reporting of PBOs. After all, this is a way into interpreting what may otherwise be meaningless PBO scores. Future research should also focus on the psychometric properties of global transition ratings and the establishment of consensus clinical benchmarks.
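Triangulating MID estimates from different methods might, as a minimal sketch, reduce to reporting their central tendency and spread. The method labels and estimates below are invented:

```python
import statistics as st

# Hypothetical MID estimates for one PBO measure, one per method
mid_estimates = {
    "SEM": 2.7,
    "half SD": 3.1,
    "transition anchor": 3.0,
    "clinical-group anchor": 4.2,
}

values = sorted(mid_estimates.values())
print("range:", values[0], "-", values[-1])
print("median (triangulated MID):", st.median(values))
```

When the range is narrow, the median is a defensible working MID; when one method diverges sharply (as the clinical-group anchor does here), the divergence itself is worth reporting rather than averaging away.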
Acknowledgements
David Locker initiated this commentary, had primary responsibility for writing it and revised different versions of the text. The paper was finalized after his sudden death, and therefore he did not see the final version. His fellow authors wish to acknowledge the substantial contribution he made to this paper.
References
1. Cohen LK, Jago JD. Toward the formulation of sociodental indicators. Int J Health Serv 1976;6:681-98.
2. Gift HC, Atchison KA. Oral health, health, and health-related quality of life. Med Care 1995;33(11 Suppl):NS57-77.
3. Locker D. Measuring oral health: a conceptual framework. Community Dent Health 1988;5:3-18.
4. Reisine ST. The impact of dental conditions on social functioning and the quality of life. Annu Rev Public Health 1988;9:1-19.
5. Reisine ST, Locker D. Social, psychological and economic impacts of oral conditions and treatments. In: Cohen LK, Gift HC, editors. Disease prevention and oral health promotion. Socio-dental sciences in action. Copenhagen: Munksgaard; 1995; 33-71.
6. Sheiham A, Croog SH. The psychosocial impact of dental diseases on individuals and communities. J Behav Med 1981;4:257-72.
7. Slade GD, editor. Measuring oral health and quality of life. Chapel Hill: Department of Dental Ecology, School of Dentistry, University of North Carolina; 1997.
8. Locker D, Allen F. What do measures of oral health-related quality of life measure? Community Dent Oral Epidemiol 2007;35:401-11.
9. McGrath C, Bedi R. A national study of the importance of oral health to life quality to inform scales of oral health related quality of life. Qual Life Res 2004;13:813-8.
10. Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technol Assess 1998;2:i-iv, 1-74.
11. Scientific Advisory Committee of the Medical Outcomes Trust. Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res 2002;11:193-205.
12. Guyatt GH, Kirshner B, Jaeschke R. Measuring health status: what are the necessary measurement properties? J Clin Epidemiol 1992;45:1341-5.
13. Lohr KN, Aaronson NK, Alonso J, Burnam MA, Patrick DL, Perrin EB et al. Evaluating quality-of-life and health status instruments: development of scientific review criteria. Clin Ther 1996;18:979-92.
14. Osoba D. Measuring the effect of cancer on health-related quality of life. Pharmacoeconomics 1995;7:308-19.
15. Simon GE, Revicki DA, Grothaus L, Vonkorff M. SF-36 summary scores: are physical and mental health truly distinct? Med Care 1998;36:567-72.
16. Imrey PB. Considerations in the statistical analysis of clinical trials in periodontitis. J Clin Periodontol 1986;13:517-32.
17. Chobanian AV, Bakris GL, Black HR, Cushman WC, Green LA, Izzo JL Jr et al. The Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure: the JNC 7 report. JAMA 2003;289:2560-72.
18. Slade GD, Nuttall N, Sanders AE, Steele JG, Allen PF, Lahti S. Impacts of oral disorders in the United Kingdom and Australia. Br Dent J 2005;198:489-93.
19. Soe KK, Gelbier S, Robinson PG. Reliability and validity of two oral health related quality of life measures in Myanmar adolescents. Community Dent Health 2004;21:306-11.
20. Kida IA, Astrom AN, Strand GV, Masalu JR, Tsakos G. Psychometric properties and the prevalence, intensity and causes of oral impacts on daily performance (OIDP) in a population of older Tanzanians. Health Qual Life Outcomes 2006;4:56.
21. Gherunpong S, Tsakos G, Sheiham A. The prevalence and severity of oral impacts on daily performances in Thai primary school children. Health Qual Life Outcomes 2004;2:57.
22. Locker D. Issues in measuring change in self-perceived oral health status. Community Dent Oral Epidemiol 1998;26:41-7.
23. Slade GD. Assessing change in quality of life using the Oral Health Impact Profile. Community Dent Oral Epidemiol 1998;26:52-61.
24. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin Trials 1989;10:407-15.
25. Osoba D, King M. Meaningful differences. In: Fayers P, Hays RD, editors. Assessing quality of life in clinical trials, 2nd edn. Oxford: Oxford University Press; 2005; 243-57.
26. Copay AG, Subach BR, Glassman SD, Polly DW Jr, Schuler TC. Understanding the minimum clinically important difference: a review of concepts and methods. Spine J 2007;7:541-6.
27. Guyatt GH, Osoba D, Wu AW, Wyrwich KW, Norman GR. Methods to explain the clinical significance of health status measures. Mayo Clin Proc 2002;77:371-83.
28. Revicki DA, Cella D, Hays RD, Sloan JA, Lenderking WR, Aaronson NK. Responsiveness and minimal important differences for patient reported outcomes. Health Qual Life Outcomes 2006;4:70.
29. Revicki DA, Hays RD, Cella D, Sloan J. Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. J Clin Epidemiol 2008;61:102-9.
30. Wyrwich KW, Tierney WM, Wolinsky FD. Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J Clin Epidemiol 1999;52:861-73.
31. Wyrwich KW, Nienaber NA, Tierney WM, Wolinsky FD. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care 1999;37:469-78.
32. Cohen J. Statistical power analysis for the behavioral sciences, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
33. Tsakos G, Bernabé E, D'Aiuto F, Pikhart H, Tonetti M, Sheiham A et al. Assessing the minimally important difference in the Oral Impact on Daily Performances index in patients treated for periodontitis. J Clin Periodontol 2010;37:903-9.
34. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care 2003;41:582-92.
35. Guyatt GH, Norman GR, Juniper EF, Griffith LE. A critical look at transition ratings. J Clin Epidemiol 2002;55:900-8.
36. Wyrwich KW, Bullinger M, Aaronson N, Hays RD, Patrick DL, Symonds T. Estimating clinically significant differences in quality of life outcomes. Qual Life Res 2005;14:285-95.
37. Juniper EF, Guyatt GH, Willan A, Griffith LE. Determining a minimal important change in a disease-specific Quality of Life Questionnaire. J Clin Epidemiol 1994;47:81-7.
38. Allen PF, O'Sullivan M, Locker D. Determining the minimally important difference for the Oral Health Impact Profile-20. Eur J Oral Sci 2009;117:129-34.
39. Locker D, Jokovic A, Clarke M. Assessing the responsiveness of measures of oral health-related quality of life. Community Dent Oral Epidemiol 2004;32:10-8.
40. Malden PE, Thomson WM, Jokovic A, Locker D. Changes in parent-assessed oral health-related quality of life among young children following dental treatment under general anaesthetic. Community Dent Oral Epidemiol 2008;36:108-17.
41. John MT, Reissmann DR, Szentpetery A, Steele J. An approach to define clinical significance in prosthodontics. J Prosthodont 2009;18:455-60.