
JNNP Online First, published on April 14, 2015 as 10.1136/jnnp-2014-310090
Neurodegeneration

REVIEW

Using visual rating to diagnose dementia: a critical evaluation of MRI atrophy scales
Lorna Harper,1 Frederik Barkhof,2 Nick C Fox,1 Jonathan M Schott1

▸ Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/jnnp-2014-310090).

1 Dementia Research Centre, University College London Institute of Neurology, London, UK
2 Department of Radiology, VU University Medical Centre, Amsterdam, The Netherlands

Correspondence to Lorna Harper, Dementia Research Centre, University College London Institute of Neurology, 8-11 Queen Square, London WC1N 3BG, UK; lorna.harper.11@ucl.ac.uk

Received 8 December 2014
Revised 4 February 2015
Accepted 16 February 2015

To cite: Harper L, Barkhof F, Fox NC, et al. J Neurol Neurosurg Psychiatry Published Online First: [ please include Day Month Year] doi:10.1136/jnnp-2014-310090

ABSTRACT
Visual rating scales, developed to assess atrophy in patients with cognitive impairment, offer a cost-effective diagnostic tool that is ideally suited for implementation in clinical practice. By focusing attention on brain regions susceptible to change in dementia and enforcing structured reporting of these findings, visual rating can improve the sensitivity, reliability and diagnostic value of radiological image interpretation. Brain imaging is recommended in all current diagnostic guidelines relating to dementia, and recent guidelines have also recommended the application of medial temporal lobe atrophy rating. Despite these recommendations, and the ease with which rating scales can be applied, there is still relatively low uptake in routine clinical assessments. Careful consideration of atrophy rating scales is needed to verify their diagnostic potential and encourage uptake among clinicians. Determining the added value of combining scores from visual rating in different brain regions may also increase the diagnostic value of these tools.

INTRODUCTION
The diagnostic value of structural neuroimaging is reflected in its inclusion in the diagnostic guidelines for a number of dementias, including Alzheimer’s disease (AD),1 vascular dementia2 and frontotemporal dementia (FTD).3 4 Certain imaging features are suggestive of underlying pathology, such as the symmetrical and early medial temporal lobe (MTL) atrophy frequently seen in typical AD, the prominent parietal lobe atrophy associated with early onset AD, or the asymmetrical atrophy that is often evident in patients with FTDs,5 and quantification of these features can enhance their diagnostic value. Research studies have used a variety of imaging techniques to help distinguish patients with dementia from normal control participants, as well as to address the more clinically relevant problem of distinguishing between causes of dementia. However, volumetric analysis tools and image classifier algorithms, which are commonly used for this purpose, require specialist software and expertise, are often time consuming, cannot be applied to all image types, and are, therefore, seldom used in clinical practice. Conversely, visual rating scales, principally developed for research purposes, can be applied directly to clinically acquired images without the use of additional software, and with suitable training, can easily be used as an adjunct to standard clinical radiology reports.

Visual rating scales focus attention on brain regions particularly susceptible to change in specific dementias, most notably the MTLs, and offer a means of quantifying (on an ordinal scale) change at the individual patient level. Moreover, they can be used to enforce structured image reporting and provide radiologists and non-radiology clinicians with a framework for interpreting imaging findings, making visual assessment more consistent and potentially more sensitive. However, despite great diagnostic potential and widespread use in research studies, visual rating scales have not been widely adopted into routine clinical practice. In this review, we examine cerebral atrophy rating scales that have been developed for use in dementia to highlight the diagnostic potential of these clinically applicable tools. Since MRI is the imaging modality of choice in dementia, we focus on the scales designed for this purpose and consider their diagnostic utility in terms of the sensitivity and specificity for disease state, and reproducibility of the results. As an indication of the impact of each scale, the number of published studies that have subsequently applied each scale is provided, as well as an indication of their inclusion in multicentre studies or clinical trials (table 1). A full description of each scale is included in online supplementary appendix-1.

VISUAL RATING OF GLOBAL CORTICAL ATROPHY

Scale Pasquier et al12
Increments 4
Display T2-weighted axial
Reliability Inter: >0.6, intra: >0.7 (Cohen’s κ)

The Pasquier scale, also known as the global cortical atrophy (GCA) scale, was developed to evaluate atrophy in 13 brain regions, including frontal, parieto-occipital and temporal sulcal dilation and dilation of the ventricles.11 Regions are assessed separately in each hemisphere and the final score is the sum of all scores in the 13 regions. The original work was used as a tool to quantify atrophy in patients with stroke, based on the hypothesis that a greater degree of atrophy is present in patients with stroke with dementia than in those without dementia. The hypothesis was not validated in the original paper; however, subsequent studies have demonstrated that the scale may provide added value as a composite diagnostic marker for dementia, and in particular, may be positively associated with vascular burden.12
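The final Pasquier score described above is a plain sum over the rated regions. A minimal Python sketch of that summation follows. The region labels are hypothetical placeholders (the 13 regions are not all named in this review), and the 0-3 range per region is an assumption based on the four scale increments.

```python
from typing import Mapping

N_REGIONS = 13          # number of regions rated on the Pasquier scale
MAX_REGION_SCORE = 3    # assumption: four increments coded 0-3


def total_gca_score(region_scores: Mapping[str, int]) -> int:
    """Sum the per-region atrophy scores into one total score."""
    if len(region_scores) != N_REGIONS:
        raise ValueError(f"expected {N_REGIONS} regions, got {len(region_scores)}")
    for name, score in region_scores.items():
        if not 0 <= score <= MAX_REGION_SCORE:
            raise ValueError(f"score for {name!r} is outside 0-{MAX_REGION_SCORE}")
    return sum(region_scores.values())


# Hypothetical ratings: placeholder labels, not the published region names.
example = {f"region_{i:02d}": 1 for i in range(1, 14)}
example["region_01"] = 3
print(total_gca_score(example))  # 12 regions scored 1 plus one scored 3 -> 15
```

Validating the number of regions and the score range up front means an incomplete rating cannot silently produce a misleadingly low total.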
Harper L, et al. J Neurol Neurosurg Psychiatry 2015;0:1–9. doi:10.1136/jnnp-2014-310090 1
Copyright Article author (or their employer) 2015. Produced by BMJ Publishing Group Ltd under licence.

Table 1 Reliability measures and imaging parameters
Scale | Brain region | Scale increments | Imaging plane | MR contrast | Inter-rater reliability* | Intra-rater reliability*
Pasquier et al12 | Global cortical | 4 | Axial | T2-weighted | >0.6 (Cwκ)12 | >0.7 (Cwκ)12
O’Donovan et al13 | Ventricular enlargement | 4 | Axial | T1-weighted | 0.9 (ICC) | 0.92 (ICC)
Davies et al16/Kipps et al17 | Frontotemporal | 5 | Coronal | T1-weighted | >0.7/0.62–0.71 (Cκ) | 0.8/0.79–0.83 (Cκ)
Davies et al18 | Frontotemporal | 5 | Coronal | T1-weighted | 0.71 (Cwκ) | 0.75 (Cwκ)
Ambikairajah et al21 | Frontotemporal | 5 | Coronal | T1-weighted | 0.91 (Uκ) | Not reported
Chow et al22 | Frontotemporal | 5 | Coronal, axial, sagittal | T1-weighted | LAC 0.06 and 0.07, LAT 0.2 (KW) | Not reported
De Leon et al23 24 | Medial temporal | 4 | Axial | T1-weighted | 0.72 (Uκ)† | Not reported
Scheltens et al29 | Medial temporal | 5 | Coronal | T1-weighted | 0.72–0.84 (Cwκ)29 | 0.83–0.94 (Cwκ)29
Galton et al32 | Medial temporal | 4‡ | Coronal | T1-weighted | 0.36–0.49 (Fκ)‡ | 0.8 (Cκ)
Urs et al33/Duara et al34 | Medial temporal | 5 | Coronal | T1-weighted | 0.75–0.94 (Uκ) | 0.84–0.93 (Uκ)
Kaneko et al35 | Medial temporal | 4 | Coronal | STIR | 0.68 (Uκ) | 0.79 (Uκ)
Kim et al36 | Medial temporal | 5 | Axial | T1-weighted | 0.64 (Uκ) | 0.62/0.95 (Uκ)
Koedam et al38 | Posterior | 4 | Sagittal, coronal, axial | T1-weighted, FLAIR | 0.65–0.84 (Cwκ) | 0.93/0.95 (Cwκ)
[Citations and Applications (Research, Trials, Multicentre) columns not recoverable from the source layout]
*Highest reported values; citation listed if the value is not taken from the original paper.
†Based on CT images.
‡Novel aspect only.
Cwκ, Cohen’s weighted κ; Cκ, Cohen’s κ; Fκ, Fleiss’ κ; ICC, intraclass correlation coefficient; KW, Kendall’s W; STIR, short TI inversion time; Uκ, unspecified κ; N, no; Y, yes.

Scale O’Donovan et al13
Increments 4
Display T1-weighted axial
Reliability Inter: 0.9, intra: 0.92 (intraclass correlation coefficient)

As part of a study to look at the discriminatory power of established visual rating scales for distinguishing between AD and dementia with Lewy bodies (DLB), O’Donovan et al13 developed a rating scale to assess ventricular enlargement (VEn) as a marker of GCA. Each hemisphere was rated separately for enlargement of the lateral ventricles and the scores summed to get an overall value. VEn scores were found to be significantly higher in patients with AD and DLB than in control participants (p<0.003) but similar between patients with AD and DLB. Sensitivity and specificity of VEn for AD or DLB versus controls was 94% and 40%, respectively, and 36% and 74% for AD versus DLB.

Overview of GCA scales
Axial slices provide the best general overview of brain atrophy; however, specifying regions of interest (ROIs) to quantify such a large, generalised area is challenging. Using the ventricles produces excellent reliability; however, sensitivity and specificity estimates indicate the scale is less useful in terms of differential diagnosis, perhaps due to the considerable variation in ventricular size among healthy individuals. With 13 brain regions, the Pasquier scale is more extensive in its coverage, although this comes at the expense of scale reliability, which was further confounded by the inclusion of regions susceptible to partial volume effects. Simplification of the Pasquier scale to provide a more general impression of atrophy throughout the brain (figure 1) resulted in increased uptake among the scientific community (table 1); however, the scale has been primarily used as a component part of a larger diagnostic assessment.12 Owing to the large brain area assessed by GCA scales, they are likely to be more severely confounded by age than other atrophy rating scales, although their diagnostic value may be improved by using age-specific cut-offs.14

VISUAL RATING OF FRONTOTEMPORAL ATROPHY

Scale Davies (2013)/Kipps et al17 (modified from Broe et al15)
Increments 5
Display T1-weighted coronal
Reliability Inter/intra: 0.62/0.82 (frontal), 0.71/0.83 (anterior temporal), 0.64/0.79 (posterior temporal) (Cohen’s κ)

Davies et al16 devised a scale based on a postmortem staging scheme used to rate atrophy in FTD brains.15 Rating is performed at the level of the anterior temporal lobe and the lateral geniculate nucleus, and the highest recorded score is taken overall (figure 2). The scale was applied to a study population of patients clinically diagnosed with behavioural variant FTD (bvFTD), based on the hypothesis that patients with lower atrophy scores have better prognosis and prolonged survival.16 Favourable prognosis was defined as patients still living independently 3 years after diagnosis and unfavourable prognosis defined as those who had died or were in institutional care within the same time period. Sensitivity and specificity for favourable prognosis was 80% and 75%, respectively.
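The sensitivity and specificity figures quoted for these scales come from dichotomising an ordinal score at a cut-off and comparing the result with a reference classification. A minimal Python sketch of that calculation follows. The cut-off of 2, the function name and the example data are illustrative assumptions, not values taken from the studies.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[int],
    has_condition: Sequence[bool],
    cutoff: int = 2,  # assumption: scores >= cutoff count as abnormal
) -> Tuple[float, float]:
    """Dichotomise ordinal scores and compare with the reference labels."""
    if len(scores) != len(has_condition):
        raise ValueError("scores and has_condition must have the same length")
    tp = fn = tn = fp = 0
    for score, positive in zip(scores, has_condition):
        abnormal = score >= cutoff
        if positive and abnormal:
            tp += 1
        elif positive:
            fn += 1
        elif abnormal:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case with and one without the condition")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up ratings for ten scans (not study data).
scores = [0, 1, 3, 2, 4, 1, 0, 2, 3, 1]
has_condition = [False, False, True, True, True, True, False, False, True, False]
sens, spec = sensitivity_specificity(scores, has_condition)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# prints "sensitivity=0.80, specificity=0.80"
```

Moving the cut-off trades sensitivity against specificity, which is why the choice of threshold (and any age adjustment) matters when comparing scales.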


Discriminant analysis also found atrophy to be the sole variable with significant power to predict prognosis (other variables were sex, age, symptom duration and various clinical scores).

Kipps et al17 extended this scale further to include rating of the posterior temporal lobe (figure 2) and described slice selection in greater detail. The extended scale was applied to a large group of patients with FTD plus some control participants to assess the relationship of focal brain atrophy to clinical data. All control participants were rated <2; therefore, the scale was dichotomised with scores of 0–1 indicating normal scan appearance and 2–4 indicating a degree of cerebral atrophy. Sensitivity was 100% for semantic dementia (SD), 73% for progressive non-fluent aphasia and 53% for bvFTD, based on clinical diagnosis of FTD syndrome.

Figure 1 Example of the four-step (generalised) Pasquier scale for global cortical atrophy.

Scale Davies et al18
Increments 5
Display T1-weighted coronal
Reliability Inter: 0.71, intra: 0.76 (Cohen’s weighted κ)

Davies et al18 later developed a more extensive scale, which included 15 frontotemporal brain regions contained within four landmark identifiable slices. Specific scale criteria were adopted in the basal ganglia and hippocampal region (anterior, mid, posterior), and the best slice was determined individually for each hemisphere to account for variation in brain orientation. The scale is intended for use in diagnosis and localisation of function in neurodegenerative diseases and other postoperative or postencephalitic brain abnormalities. Discriminant analysis indicated rating of the anterior fusiform distinguished SD from controls, while the insula was vital to distinguishing bvFTD. Multiple regions were reported to be relevant in discriminating AD from controls (insula, anterior hippocampus, orbitofrontal gyri and temporal pole), perhaps reflecting the more diffuse pattern of atrophy associated with AD. In a subsequent study, Hornberger et al19 reported rating of the orbitofrontal cortex (OFC) as a good discriminator between AD and bvFTD, with logistic regression analysis demonstrating correct classification in 71.3% of patients. Devenney et al20 also used the scale to demonstrate a lack of atrophy in C9ORF72 mutation carriers.

Scale Ambikairajah et al21
Increments 5
Display T1-weighted coronal
Reliability Inter: 0.91 (unknown κ)

Ambikairajah et al21 adapted the Davies/Kipps scales16–18 and applied them to patients on an amyotrophic lateral sclerosis-FTD continuum.21 They scored four regions: OFC, anterior cingulate cortex (ACC), motor cortex and the anterior temporal pole. Using three landmark identifiable slices, scoring was performed separately in each hemisphere and then averaged. The study hypothesis was that there would be a gradient of cortical atrophy increasing in severity from amyotrophic lateral sclerosis (ALS) to ALS-FTD to bvFTD, and that patients with ALS and ALS-FTD would be best distinguished based on atrophy in the anterior temporal lobe and the ACC. This was partly validated with bvFTD scoring significantly higher than patients with ALS in all regions, and ALS-FTD scoring significantly higher than patients with ALS in all regions except the OFC. Correct classification, calculated using a logistic regression model with all scored regions entered as independent variables, was estimated to be 83.6% for bvFTD versus ALS and 75% for ALS-FTD versus ALS. No significant differences in atrophy rating were found between patients with ALS-FTD and bvFTD, with correct classification calculated as 78.8% between these two groups.

Scale Chow et al22
Increments 5
Display T1-weighted axial, sagittal and coronal
Reliability Inter: 0.06–0.07 (LAC), 0.2 (LAT) (Kendall’s W)

Based on previous findings from volumetric analysis, Chow et al22 adapted the five-point scale of Davies et al18 to assess atrophy in the left anterior cingulate (LAC) and left anterior temporal (LAT) regions. Rating was performed on five slices (2 axial slices, 2 sagittal slices, 1 coronal slice) by four raters. The scale was applied to a study population of normal controls, AD participants and participants with a clinical diagnosis of FTD (FTD diagnosis was not further categorised). Raters were asked to give a diagnosis immediately after rating. Based on the given diagnosis, raters averaged 63% accuracy in correctly distinguishing AD from FTD and 59.5% accuracy in distinguishing FTD from controls.

Overview of frontotemporal atrophy scales
Frontotemporal atrophy scales may be useful in the differential diagnosis of FTD syndromes, and the scales developed around these regions have been designed and validated specifically for this purpose. In particular, the Davies, Kipps and Ambikairajah scales all stem from the same postmortem staging scheme, providing a reliable basis for region selection. Furthermore, slice selection is described in detail and reference images provided, which probably contributes to the consistently high reliability among these scales (table 1). From a usability perspective, reference images may be more useful when the ROI is demarcated with a bounding box as in the Ambikairajah study. The style of

reference image provided with the second Davies scale, while informative, is perhaps somewhat complicated for use in routine practice. From close examination of all reference images, the spectrum of atrophy represented by each scale is not always uniformly distributed between scale increments. In some cases, such as the anterior temporal region,17 16 the scales may benefit by condensing the scale to four points rather than five. Asymmetry in either hemisphere is often associated with FTD5 and can be useful to help distinguish it from other causes of cognitive impairment; therefore, the decision by Chow et al to concentrate only on one hemisphere is a potential limitation. In this study, raters were also encouraged to concentrate on LAC and LAT regions and ignore hippocampal and parietal atrophy as atrophy in the latter regions nudged the rater towards a diagnosis of AD. This suggests there may be reporting bias, which could be attributed to improper slice selection; moreover, this advice may be less appropriate in a non-pathologically confirmed study population.

Figure 2 Example of the five-step Kipps/Davies scale for frontal atrophy. The posterior temporal lobe reference images were included in the Kipps study only.

VISUAL RATING OF THE MTL

Scale De Leon et al23 24
Increments 4
Display T1-weighted axial
Reliability Inter: 0.72 (unknown κ)

The De Leon scale, designed to rate hippocampal fissure dilation, was one of the first described imaging markers of AD23 and was subsequently used in a study based on both CT and MRIs.24 Cross-modality agreement of individual hippocampal rating scale values ranged between ϕ-κ values of 0.87–0.89. Using a study cohort of patients with AD, patients with mild cognitive impairment (MCI) and healthy control participants, this study found sensitivities of 85% for mild AD, 96% for moderate to severe AD and 78% for MCI, and a specificity of 71% based on the presence of hippocampal atrophy (HA) in the control participants. They also reported that HA was associated with increasing ventricular size in all but the mild AD group, and that increasing HA due to age was confined to the control group.

Scale Scheltens et al29
Increments 5
Display T1-weighted coronal
Reliability Inter: 0.72–0.84, intra: 0.83–0.94 (Cohen’s weighted κ)

The Scheltens scale focuses on three key features of MTL atrophy, namely: the width of the choroid fissure, the width of the temporal horn and the height of the hippocampus25 (figure 3). The degree of atrophy in each of these regions is combined to produce a score reflecting overall MTL atrophy (see online supplementary appendix-2). Both sides of the MTL are assessed separately and in the case of asymmetry the highest score is reported. In order to assess sensitivity and specificity, the scale is dichotomised, with scores of 0–1 indicating the absence of AD, and scores of 2–4 indicating the presence of AD. A sensitivity of 81% and specificity of 67% for AD were reported in the original study, based on a clinical diagnosis of ‘probable’ AD according to the 1984 NINCDS-ADRDA criteria26 versus age-matched control participants. Since it was introduced, the Scheltens scale

has been included in over 100 studies with several reporting improved sensitivity, specificity and reliability over the original study, even when used in a clinical setting.27 The reliability of the scale has been reported to be robust to the clinical experience of the rater28 but increases as the rater gains more experience with the scale itself.29 Improved performance of the scale may be due to advances in image acquisition and display, such as improved scanner hardware, higher field strengths (the original study was performed at 0.5 T and 0.6 T) and reporting of the images from digital display over hard copy film images. Better understanding of the pathological phenomenon measured by the scale has led to modification of the dichotomised scale to account for atrophy due to ageing,25 30 which has also helped to improve performance. The Scheltens scale is included in the research criteria for the diagnosis of AD.31

Scale Galton et al32
Increments 4 (novel aspect)
Display T1-weighted coronal
Reliability Inter: 0.36–0.49, intra: 0.52–0.69 (Cohen’s κ)

Galton et al32 extended the Scheltens scale to incorporate non-hippocampal structures. The complete scale is therefore split into two parts, the first part using the Scheltens scale and the second part designed to rate the anterior, non-hippocampal medial (parahippocampal gyrus) and lateral temporal structures (see online supplementary appendix-2); each hemisphere was assessed separately. To assess classification, the complete scale was dichotomised into normal or minimal atrophy (0–1) and moderate or severe atrophy (>1). Eleven per cent of controls demonstrated atrophy in the hippocampal region but no significant atrophy in any other regions. Only in the region of the hippocampus was atrophy significantly greater in the AD group than the control group, although only 50% of AD cases had moderate or severe HA. The SD group showed significantly greater atrophy in all regions bilaterally. The frontal variant FTD (fvFTD) group demonstrated significantly greater atrophy than controls in the temporal poles, hippocampi and right parahippocampal gyrus. Significantly greater atrophy in the temporal pole region and the left parahippocampal gyrus of the SD group helped to distinguish them from the AD and fvFTD groups. The SD group demonstrated significantly more atrophy than the AD group in all regions except the right hippocampus. There were no significant differences between the AD and fvFTD groups.

Scale Urs et al33/Duara et al34
Increments 5
Display T1-weighted coronal
Reliability Inter: 0.75–0.94, intra: 0.84–0.93 (unknown κ)

Urs et al33 also developed a visual rating system (VRS) intended to improve on the utility of the Scheltens scale through better standardisation of the technique and its application. However, Duara et al34 published a study applying the system first and are often credited with its development. The VRS focuses on a single landmark identifiable slice at the level of the mamillary bodies (MB). This slice includes the head of the hippocampus, the entorhinal cortex (ERC) and the perirhinal cortex. Using the VRS software, the MB slice is displayed along with reference images depicting the five levels of atrophy, with each of the ROIs outlined to demonstrate the anatomical boundaries of each of the structures. The study population consisted of clinically classified patients with AD, patients with MCI and normal control participants. Based on mean VRS score, AD participants had significantly (p<0.05) higher atrophy rating in all regions compared with normal control participants. MCI participants had significantly (p<0.05) greater atrophy scores in the right hippocampus and ERC bilaterally. Patients with AD were not distinguishable from patients with MCI from visual rating scores in any region. Logistic regression analysis determined that the percentage of correct classification was 70.2% for MCI versus controls and 72.9% for AD versus normal controls. Duara et al34 reported sensitivities and specificities of 71% and 88% for normal controls versus patients with amnestic MCI, and 81% and 88% for normal controls versus patients with probable AD.

Scale Kaneko et al35
Increments 4
Display short TI inversion time coronal
Reliability Inter: 0.68, intra: 0.79 (unknown κ)

Kaneko et al35 developed a scale for the evaluation of MTL atrophy on a single coronal slice in which the cerebral peduncles appear widest. The scale compares the shape and size of the hippocampus with the surrounding cerebrospinal fluid (CSF) space. Perpendicular lines were drawn on both sides of the hippocampus to divide the CSF space into three parts: an outer part (temporal horn), an upper part (choroidal fissure) and an inner part (ambient cistern). Raters were instructed “to put the hippocampus into each part of CSF space while keeping its original shape and size, as with a jigsaw puzzle piece”. The reference images demonstrate a hippocampal ROI being manipulated over the image, but it is not clear from the text if this is simply to illustrate the point or if this is representative of how the scale was applied in practice. It is also not clear how the ROI was generated. Sensitivity and specificity for patients with AD versus non-demented patients with psychiatric disorders was 88.2% and 78.9%, respectively.

Scale Kim et al36
Increments 5
Display T1-weighted axial
Reliability Inter: 0.64, intra: 0.62–0.95 (unknown κ)

Kim et al36 adapted the Scheltens scale to rate MTL atrophy in the axial plane, similar to the older CT-based scale of De Leon et al.23 The study was motivated by limited acquisition of coronal images in some centres. The three main ROIs were transposed from the coronal scale into the axial plane resulting in rating of: the width of the MTL, the perimesencephalic cistern gap (measured by the width between the brainstem and the MTL), and the width of the anterior temporal horn of the lateral ventricle. By using a score of 2 or above to indicate HA, the sensitivity and specificity of the scale based on the area under the curve (AUC) was calculated to be 76% and 80%, respectively. However, it is not clear what the gold standard indicator of HA was in this case.

Scale Lye et al37
Increments 4
Display T1-weighted coronal
Reliability Unknown

Figure 3 Example of the five-step Scheltens scale for medial temporal atrophy (images from The Radiology Assistant website, http://www.radiologyassistant.nl).

Lye et al37 rated hippocampal size on 12 slices through the hippocampus. Unlike other published scales, a higher score indicates a larger hippocampus. The scale is not described in detail, although reference images are provided. The scale was used to investigate the relationship between hippocampal size and memory performance in people over 80 years of age. It was not assessed for reliability or sensitivity/specificity for disease state.

Overview of MTL atrophy scales
MTL rating scales were first developed for use with CT imaging but are now predominantly designed for use with MRI, as the current imaging modality of choice in the diagnosis of dementia. The Scheltens scale has had the biggest impact on the field (table 1), and formed the basis of the Galton and Urs/Duara scales, which have also been used widely, as well as the recently developed scale by Kim et al. While the Scheltens scale focuses on the hippocampus and the surrounding CSF space, the Galton scale was designed to capture additional information from the surrounding sulci. However, the between-rater reliability of the scale in these regions was poor (table 1), thereby limiting the differential diagnostic gain. The software package used by Urs/Duara helps to operationalise the Scheltens scale by limiting rating to a single consistent slice and providing detailed reference information. Adopting this approach provides excellent reliability (table 1) and is well suited to use in research studies; however, the additional software overhead, and image preprocessing that is likely to be involved, make it less suitable for use in clinical practice. Similarly, the Kaneko scale also appears to use additional software, making it impractical for use clinically; this is further compounded by the decision to validate the scale on a non-standard MRI pulse sequence. Like the original CT scales, the Kim scale is applied in the axial plane, providing reasonably good reliability (table 1); however, further validation of its discriminatory power is required in a clinically relevant study population.

VISUAL RATING OF POSTERIOR ATROPHY

Scale Koedam et al38
Increments 4
Display T1-weighted sagittal, coronal; T2-FLAIR axial
Reliability Inter: 0.65–0.84, intra: 0.93–0.95 (Cohen’s weighted κ)

The Koedam scale focuses on the posterior cingulate sulcus, precuneus, parieto-occipital sulcus and the cortex of the parietal lobes.38 The left and right hemispheres are assessed separately and a separate score is given in each imaging plane. In the case of different scores in different planes, the highest score is taken. To assess sensitivity and specificity, the scale was dichotomised with scores >1 considered an abnormal finding. Based on a study population of clinically diagnosed late and early onset AD, other dementias and patients with subjective memory problems (without cognitive impairment), the sensitivity and specificity of the scale for AD was 58% and 95%, respectively. In a follow-up study of postmortem-confirmed dementias, the diagnostic accuracy of the posterior atrophy (PA) scale for distinguishing between the study population groups (AD, frontotemporal lobar degeneration (FTLD), controls) based on the average rating between two raters was assessed by estimating the area under the receiver operating characteristic curve.39 An AUC value of 0.74 was achieved between AD and control participants, 0.61 between FTLD and controls, and 0.66 between FTLD and AD. To our knowledge, this scale is the only scale designed to quantify posterior atrophy.

VISUAL RATING DESIGN, METHODOLOGY AND VALIDATION
As table 1 illustrates, several visual rating scales have made little or no impact, while others have been replicated and cited extensively, with some of the most successful scales also included in multicentre studies and clinical trials. As previously mentioned, the Scheltens MTL atrophy scale is also recommended in recent diagnostic guidelines for AD.31 The methodology employed by the more successful scales typically focuses not only on establishing the diagnostic value of the scale in a clinically relevant population, but also on the test-retest reliability of the scale and variability between raters. In general, the most successful scales provide a clear description of the rating procedure, allowing them to be easily replicated in other studies. Although not discussed in detail in this review, many studies that employ these scales demonstrate their correlation with clinical measures of cognition, adding further validation to their clinical relevance, and suggesting their potential as a marker of disease severity. Below we discuss in greater detail a number of factors implicit in the design of visual rating scales which may determine their successful adoption in research, and ultimately their ability to penetrate into clinical practice. We summarise these methodological considerations and their clinical implications in table 2.

Defining and displaying ROIs
The brain regions selected for visual rating have the greatest impact on the usefulness of the scale. Regions should be selected

Table 2 Summary of key design decisions, associated methodological considerations and clinical implications

Design decisions | Methodological considerations | Clinical implications
Defining a region of interest | Is there a good evidence base for atrophy in the region? | Dictates utility and interpretation in certain clinical populations
Defining a region of interest | Should the region be rated in >1 imaging plane? | Requires three-dimensional or multiple image acquisitions and increases time to perform rating
Defining a region of interest | Is there an imaging landmark to allow consistent slice selection? | Improves test-retest reliability
Displaying a region of interest | Is the MR contrast appropriate and in common clinical use? | Affects the appearance of atrophy and the sensitivity to artefacts
Displaying a region of interest | Is the appearance of the region badly affected by patient positioning? | Difficult to rate certain regions or to reliably assess symmetry if the head is tilted
Defining scale increments | How much variation can reliably be captured? | Truncated use of the scale may result in decreased diagnostic value
Defining scale increments | Is there a reliable cut-off between normal and abnormal scan appearance? Should the cut-off be adjusted for age? | Affects clinical interpretation
Providing training material | How is each scale increment best described and are there reference images available to demonstrate these features? | Provides a useful framework for scoring
Providing training material | Are there expert raters available to provide training sets? | Provides confidence in ratings and a means of audit
Validating the scale | What is the inter-rater/intra-rater reliability and how should it be measured? | Determines suitability for use in clinical practice and comparison with other scores
Validating the scale | Do the scores correlate with clinical measures or other measures of atrophy? | Validates clinical relevance
Validating the scale | Is there a diagnostic gold standard available for comparison? | Provides validation of diagnostic value
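
Table 2 asks how inter-rater and intra-rater reliability should be measured, and the reliability figures quoted for the scales in this review are largely Cohen's weighted κ values. As an illustration only, the following minimal Python sketch computes a weighted κ for two raters scoring the same scans on an ordinal scale. The function name `weighted_kappa`, its parameters and the example scores are hypothetical and are not taken from any of the cited studies; the choice between linear and quadratic disagreement weights is likewise an assumption, since it varies between studies and should be reported alongside the statistic.

```python
from typing import Sequence


def weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    n_categories: int,
    weights: str = "linear",
) -> float:
    """Cohen's weighted kappa for two raters on an ordinal scale 0..n_categories-1.

    Hypothetical helper for illustration; `weights` is "linear" or "quadratic".
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of scans")
    if n_categories < 2:
        raise ValueError("the scale needs at least two points")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")

    n = len(rater_a)
    k = n_categories

    # Observed proportion of scans in each (score by A, score by B) cell.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions: how often each rater used each scale point.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weights == "linear" else 2
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the distance between scale points.
            w = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += w * observed[i][j]
            expected_disagreement += w * row[i] * col[j]

    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single scale point")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical scores from two raters on a four-point (0-3) scale.
scores_a = [0, 1, 2, 3]
scores_b = [1, 1, 2, 3]
kappa = weighted_kappa(scores_a, scores_b, n_categories=4)
```

Because the weights penalise a one-point disagreement less than a three-point disagreement, a weighted κ suits ordinal rating scales better than the unweighted statistic, which treats all disagreements as equal.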

based on established findings from volumetric image analysis and/or macroscopic pathological assessment of the disease population of interest. The number of regions to be rated, the number of imaging planes to assess (axial, sagittal or coronal) and the number of slices used are likely to impact on the reliability of the scale, with reliability decreasing with increasing scale complexity. Specifying landmark-identifiable slices for rating helps to ensure consistency between raters. There is good rationale for including focal regions, such as the MTL, which are typically preferentially involved in certain conditions, for example, AD, and have been shown to correlate with clinical measures of disease severity such as mini mental state examination (MMSE).25 Choice of MR pulse sequence affects both the appearance of atrophy and the visible extent of white matter changes and should also be specified. T1-weighted images offer good grey-white matter and CSF contrast, with high resolution three-dimensional volume acquisitions (that can be reconstructed in all three planes) offering the greatest utility for rating atrophy. T2-weighted images are less reliable, since the amount of CSF can be overestimated if T2-weighting is too strong. Image quality will also affect the reliability of the scale, with rating less reliable on scans that are subject to artefacts. Consistent image slice positioning will also help to improve the reliability of the scale.

Scale increments
The number of scale increments influences the level of detail captured by the scale. A balance must be struck between detailed quantification and the degree of change that can be reliably differentiated by visual inspection. In terms of structural neuroimaging, a four-point or five-point scale is most commonly used. The scale is typically dichotomised to classify normal and abnormal scan appearance. In both four-point and five-point scales, scale points 0 and 1 typically represent the degree of variation within the normal population, with points 2 and above describing more obvious pathological change. Four-point scales force the rater to make a more definite choice of disease state (presence or absence), thereby increasing specificity at the expense of sensitivity. Five-point scales on the other hand may be more sensitive to earlier stages of disease but may also increase the number of false-positive results. In terms of the scales developed for use in the diagnosis of dementia, five-point scales may be particularly sensitive to the effects of ageing. Using age-specific cut-offs may help to improve scale accuracy.30 40

The effect of training
Training can have a significant effect on the performance of the scale. Reference images providing examples of each scale point are particularly useful and are likely to impact positively on the reliability of the scale. Reference images which include delineation of ROIs, such as those provided by Urs et al, could also help to improve reliability, particularly among less experienced raters or raters without radiology expertise. Detailed descriptions of the expected appearance for each point on the scale can also be helpful to guide raters and improve consistency. Training sets representative of the clinical or study population, prerated by ‘expert’ raters, would help to ensure high observer agreement before implementation into clinical practice or research protocols. Training sets can also be used to audit rater reliability at defined intervals or after a period of absence.2

Validation
If rating scales are used as a method of measurement to make inferences about disease state, it is important that both the measurement technique and validation of the technique are rigorous. Test-retest studies are essential to determine the (inter-rater/intra-rater) reliability of the scale. Appropriate statistical procedures should be applied and fully reported to allow clear interpretation of the results and fair comparison with other studies. However, if used routinely, the effect of training and rater experience is likely to improve the reliability of the scale. Correlations with clinical measures of cognition17 25 38 and volumetric measurements18 21 27 41 are also useful to help validate the scale. Diagnostic tests should also be validated against an established ‘gold standard’ measurement technique. Currently, with the exception of individuals with genetic mutations, postmortem examination of brain tissue is the only definitive means of establishing diagnosis in neurodegenerative dementia. In most scales described here, classification of disease groups, and therefore measures of scale sensitivity and specificity, are based on clinical diagnosis of the study population.

CONCLUSION
Visual rating scales have been developed specifically to rate several brain regions sensitive to atrophy in dementia. They can be used to provide semiquantitative measures of the degree of

atrophy in these regions, while combining scores from several rating scales can also improve classification accuracy.39 Unlike quantitative volumetric measures (manual or automatic), visual rating scales do not require specialist software or expertise, are quick to apply and are designed specifically for use with routine MRI. Moreover, unlike many diagnostic tests, they are not financially prohibitive, with brain imaging already recommended for all patients being investigated for dementia.1–4 Given the proven utility of some of these scales in clinical trials,6–9 the relative ease with which they can be applied and, in the case of the Scheltens MTL score, their incorporation into new diagnostic criteria,31 it is perhaps surprising that visual rating scales have not had greater uptake in routine radiological assessments. Further validation of the real life utility of these rating scales to improve diagnosis in a multicentre study population with postmortem-confirmed diagnosis, combined with the provision of a widely available training scan set, might help to transition these potentially useful diagnostic tools into wider use in routine clinical practice. In addition, clinical studies looking at the impact of visual rating on diagnosis and patient management are also required to determine their true potential as a diagnostic tool.

SEARCH STRATEGY
References for this review were identified by searches of PubMed and references from relevant articles. The search terms ‘Dementia’ or ‘Alzheimer’s’, ‘visual rating’, ‘visual assessment’, ‘atrophy rating’, ‘reproducibility’ or ‘rater’, ‘MRI’, ‘magnetic resonance imaging’, or ‘T1-weighted’ were used. There were no date or language restrictions. The final reference list was generated on the basis of relevance to the topics covered in this review.

Acknowledgements The Dementia Research Centre is an Alzheimer’s Research UK coordinating centre. The authors acknowledge the support of Alzheimer’s Research UK, the NIHR Queen Square Dementia Biomedical Research Unit, UCL/H Biomedical Research Centre, and Leonard Wolfson Experimental Neurology Centre. LH is supported by funding from Alzheimer’s Research UK and a UCL Impact Studentship.

Contributors JMS and LH devised the original concept of the article. LH drafted the manuscript. All authors revised and approved the final version to be published.

Funding University College London.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

REFERENCES
1 McKhann GM, Knopman DS, Chertkow H, et al. The diagnosis of dementia due to Alzheimer’s disease: recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimers Dement 2011;7:263–9.
2 Wardlaw JM, Smith EE, Biessels GJ, et al. Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration. Lancet Neurol 2013;12:822–38.
3 Rascovsky K, Hodges JR, Knopman D, et al. Sensitivity of revised diagnostic criteria for the behavioural variant of frontotemporal dementia. Brain 2011;134(Pt 9):2456–77.
4 Gorno-Tempini ML, Hillis AE, Weintraub S, et al. Classification of primary progressive aphasia and its variants. Neurology 2011;76:1006–14.
5 Harper L, Barkhof F, Scheltens P, et al. An algorithmic approach to structural imaging in dementia. J Neurol Neurosurg Psychiatry 2014;85:692–8.
6 Wilkinson D, Fox NC, Barkhof F, et al. Memantine and brain atrophy in Alzheimer’s disease: a 1-year randomized controlled trial. J Alzheimers Dis 2012;29:459–69.
7 Bastos Leite AJ, van der Flier WM, van Straaten ECW, et al. Infratentorial abnormalities in vascular dementia. Stroke 2006;37:105–10.
8 DeCarli C, Frisoni GB, Clark CM, et al. Qualitative estimates of medial temporal atrophy as a predictor of progression from mild cognitive impairment to dementia. Arch Neurol 2007;64:108–15.
9 Duara R, Loewenstein DA, Shen Q, et al. Amyloid positron emission tomography with (18)F-flutemetamol and structural magnetic resonance imaging in the classification of mild cognitive impairment and Alzheimer’s disease. Alzheimers Dement 2013;9:295–301.
10 Duara R, Loewenstein DA, Greig M, et al. Reliability and validity of an algorithm for the diagnosis of normal cognition, mild cognitive impairment, and dementia: implications for multicenter research studies. Am J Geriatr Psychiatry 2010;18:363–70.
11 Pasquier F, Leys D, Weerts JG, et al. Inter- and intraobserver reproducibility of cerebral atrophy assessment on MRI scans with hemispheric infarcts. Eur Neurol 1996;36:268–72.
12 Henneman WJP, Sluimer JD, Cordonnier C, et al. MRI biomarkers of vascular damage and atrophy predicting mortality in a memory clinic population. Stroke 2009;40:492–8.
13 O’Donovan J, Watson R, Colloby SJ, et al. Does posterior cortical atrophy on MRI discriminate between Alzheimer’s disease, dementia with Lewy bodies, and normal aging? Int Psychogeriatr 2013;25:111–19.
14 Scheltens P, Pasquier F, Weerts JG, et al. Qualitative assessment of cerebral atrophy on MRI: inter- and intra-observer reproducibility in dementia and normal aging. Eur Neurol 1997;37:95–9.
15 Broe M, Hodges JR, Schofield E, et al. Staging disease severity in pathologically confirmed cases of frontotemporal dementia. Neurology 2003;60:1005–11.
16 Davies RR, Kipps CM, Mitchell J, et al. Progression in frontotemporal dementia: identifying a benign behavioral variant by magnetic resonance imaging. Arch Neurol 2006;63:1627–31.
17 Kipps CM, Davies RR, Mitchell J, et al. Clinical significance of lobar atrophy in frontotemporal dementia: application of an MRI visual rating scale. Dement Geriatr Cogn Disord 2007;23:334–42.
18 Davies RR, Scahill VL, Graham A, et al. Development of an MRI rating scale for multiple brain regions: comparison with volumetrics and with voxel-based morphometry. Neuroradiology 2009;51:491–503.
19 Hornberger M, Savage S, Hsieh S, et al. Orbitofrontal dysfunction discriminates behavioral variant frontotemporal dementia from Alzheimer’s disease. Dement Geriatr Cogn Disord 2010;30:547–52.
20 Devenney E, Hornberger M, Irish M, et al. Frontotemporal dementia associated with the C9ORF72 mutation: a unique clinical profile. JAMA Neurol 2014;71:331–9.
21 Ambikairajah A, Devenney E, Flanagan E, et al. A visual MRI atrophy rating scale for the amyotrophic lateral sclerosis-frontotemporal dementia continuum. Amyotroph Lateral Scler Frontotemporal Degener 2014;15:226–34.
22 Chow TW, Gao F, Links KA, et al. Visual rating versus volumetry to detect frontotemporal dementia. Dement Geriatr Cogn Disord 2011;31:371–8.
23 De Leon MJ, George AE, Stylopoulos LA, et al. Early marker for Alzheimer’s disease: the atrophic hippocampus. Lancet 1989;2:672–3.
24 De Leon MJ, George AE, Golomb J, et al. Frequency of hippocampal formation atrophy in normal aging and Alzheimer’s disease. Neurobiol Aging 1997;18:1–11.
25 Scheltens P, Leys D, Barkhof F, et al. Atrophy of medial temporal lobes on MRI in “probable” Alzheimer’s disease and normal ageing: diagnostic value and neuropsychological correlates. J Neurol Neurosurg Psychiatry 1992;55:967–72.
26 McKhann G, Drachman D, Folstein M, et al. Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology 1984;34:939–44.
27 Westman E, Cavallin L, Muehlboeck JS, et al. Sensitivity and specificity of medial temporal lobe visual ratings and multivariate regional MRI classification in Alzheimer’s disease. PLoS ONE 2011;6:e22506.
28 Scheltens P, Launer LJ, Barkhof F, et al. Visual assessment of medial temporal lobe atrophy on magnetic resonance imaging: interobserver reliability. J Neurol 1995;242:557–60.
29 Cavallin L, Løken K, Engedal K, et al. Overtime reliability of medial temporal lobe atrophy rating in a clinical setting. Acta Radiol 2012;53:318–23.
30 Barkhof F, Polvikoski TM, van Straaten ECW, et al. The significance of medial temporal lobe atrophy: a postmortem MRI study in the very old. Neurology 2007;69:1521–7.
31 Dubois B, Feldman HH, Jacova C, et al. Research criteria for the diagnosis of Alzheimer’s disease: revising the NINCDS-ADRDA criteria. Lancet Neurol 2007;6:734–46.
32 Galton CJ, Gomez-Anson B, Antoun N, et al. Temporal lobe rating scale: application to Alzheimer’s disease and frontotemporal dementia. J Neurol Neurosurg Psychiatry 2001;70:165–73.
33 Urs R, Potter E, Barker W, et al. Visual rating system for assessing magnetic resonance images: a tool in the diagnosis of mild cognitive impairment and Alzheimer disease. J Comput Assist Tomogr 2009;33:73–8.
34 Duara R, Loewenstein DA, Potter E, et al. Medial temporal lobe atrophy on MRI scans and the diagnosis of Alzheimer disease. Neurology 2008;71:1986–92.
35 Kaneko T, Kaneko K, Matsushita M, et al. New visual rating system for medial temporal lobe atrophy: a simple diagnostic tool for routine examinations. Psychogeriatrics 2012;12:88–92.

36 Kim GH, Kim JE, Choi KG, et al. T1-weighted Axial Visual Rating Scale for an assessment of medial temporal atrophy in Alzheimer’s disease. J Alzheimers Dis 2014;41:169–78.
37 Lye TC, Piguet O, Grayson DA, et al. Hippocampal size and memory function in the ninth and tenth decades of life: the Sydney Older Persons Study. J Neurol Neurosurg Psychiatry 2004;75:548–54.
38 Koedam ELGE, Lehmann M, van der Flier WM, et al. Visual assessment of posterior atrophy development of a MRI rating scale. Eur Radiol 2011;21:2618–25.
39 Lehmann M, Koedam ELGE, Barnes J, et al. Posterior cerebral atrophy in the absence of medial temporal lobe atrophy in pathologically-confirmed Alzheimer’s disease. Neurobiol Aging 2012;33:627.e1–627.e12.
40 Duara R, Loewenstein DA, Shen Q, et al. The utility of age-specific cut-offs for visual rating of medial temporal atrophy in classifying Alzheimer’s disease, MCI and cognitively normal elderly subjects. Front Aging Neurosci 2013;5:47.
41 Möller C, van der Flier WM, Versteeg A, et al. Quantitative regional validation of the visual rating scale for posterior cortical atrophy. Eur Radiol 2014;24:397–404.
