
Biomedical Informatics

Coreference analysis in clinical notes: A multi-pass sieve with alternate anaphora resolution modules
Siddhartha Jonnalagadda, PhD
NLP program, Mayo Clinic, Rochester
i2b2/VA/Cincinnati NLP shared task
October 21st, 2011


Participants from Mayo Clinic

Siddhartha Jonnalagadda, PhD (Track lead)
Dingcheng Li, MS
Sunghwan Sohn, PhD
Stephen Wu, PhD
Kavishwar Wagholikar, MBBS, PhD
Manabu Torii, PhD (Georgetown)
Hongfang Liu, PhD (Principal Investigator)
2011 Mayo Clinic 2


Introduction

Track 1: Coreference resolution
  group mentions to form an entity
  understanding the links between related concepts is critical for analyzing clinical documents and compiling a patient profile

More work in general English
  heuristics-based approach based on linguistic theories
  supervised machine learning approach: classification based on ranking of all markables
  unsupervised machine learning approach

Our approach: port the best methods


Stanford coreference system

Raghunathan K, Lee H, Rangarajan S, Chambers N, Surdeanu M, Jurafsky D, Manning C. A multi-pass sieve for coreference resolution. In: EMNLP-2010. Boston, USA: Association for Computational Linguistics; 2010. p. 492-501.

Lee H, Peirsman Y, Chang A, Chambers N, Surdeanu M, Jurafsky D. Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In: CoNLL-2011 Shared Task, 2011. Portland, Oregon: Association for Computational Linguistics; 2011. p. 7379.


Pronoun anaphora resolution system

Li D, Miller T, Schuler W. A pronoun anaphora resolution system based on factorial hidden Markov models. In: Proceedings of the Association for Computational Linguistics; 2011.


Our system architecture


Clinical narrative with markables

Section Filter
Vicinity Filter

1. Right exact match
2. Relative pronoun or Abbreviation or Synonym
3. Head match and Word inclusion and Compatible modifiers
4. Head match and Word inclusion
5. Head match and Compatible modifiers
6. Relaxed head match and Word Inclusion
7. (Stemmed head match and Stemmed bag of words match) OR Related words match
8a. FHMM-based Pronoun Sieve
8b. Rule-based Pronoun Sieve

Information about entities

Set-up A
Set-up B
Set-up C


Relationship detection order

Mentions in each document are ordered by appearance
For each sieve, the coreferential relationship is tested for each pair of mentions, starting from the last appearing (probable) mention
For each mention, a probable antecedent is searched for starting from the closest mention
Assumption: given two exactly similar antecedents, the closer antecedent gets the coreferential relationship
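The search order described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's code: `run_sieves`, the cluster representation, and the two toy sieves are hypothetical names introduced here, and real sieves would work on parsed mentions rather than raw strings.

```python
from typing import Callable, Dict, List

# A sieve takes (antecedent_text, mention_text) and returns True to link them.
Sieve = Callable[[str, str], bool]


def run_sieves(mentions: List[str], sieves: List[Sieve]) -> List[List[int]]:
    """Return clusters of mention indices after applying each sieve in order."""
    cluster_of = list(range(len(mentions)))  # every mention starts alone
    for sieve in sieves:
        # Walk mentions from the last one in the document back to the second.
        for m in range(len(mentions) - 1, 0, -1):
            # Try candidate antecedents from the closest one backwards.
            for a in range(m - 1, -1, -1):
                if cluster_of[a] == cluster_of[m]:
                    continue  # already linked by an earlier pass
                if sieve(mentions[a], mentions[m]):
                    old, new = cluster_of[m], cluster_of[a]
                    cluster_of = [new if c == old else c for c in cluster_of]
                    break  # the closest accepted antecedent wins
    clusters: Dict[int, List[int]] = {}
    for index, cluster_id in enumerate(cluster_of):
        clusters.setdefault(cluster_id, []).append(index)
    return sorted(clusters.values())


# Toy sieves (assumptions for illustration only).
def same_text(antecedent: str, mention: str) -> bool:
    return antecedent.lower() == mention.lower()


def same_last_word(antecedent: str, mention: str) -> bool:
    return antecedent.lower().split()[-1] == mention.lower().split()[-1]


mentions = ["Echocardiogram", "the effusion", "echocardiogram", "effusion"]
clusters = run_sieves(mentions, [same_text, same_last_word])
print(clusters)  # [[0, 2], [1, 3]]
```

Running the stricter sieve first means a looser sieve only sees the pairs that are still unlinked, which is the point of the multi-pass design.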


Section Filter
In general English, if two mentions have the same name, more than 95% of the time the mentions corefer
However, in clinical narratives, this might not be the case
  A problem or treatment of different patients in the family medical history section
  A non-chronic problem or a test in the history of present illness
  A treatment in current medications unrelated to another one in the discharge medications section

Section names identified by SecTag (Denny et al.) are used in identifying co-referred pairs
  Example: two mentions associated with the same term appearing in the two sections history of present illness and diagnosis have a higher probability of being a co-referred pair than two mentions associated with family history and diagnosis
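A minimal Python sketch of a section filter follows. The slide speaks of higher and lower probabilities, so treating some section pairs as a hard block-list is a simplifying assumption made here; the set `UNLIKELY_SECTION_PAIRS` and the function name are hypothetical, and the section labels are assumed to have been assigned already (for example by SecTag).

```python
# Hypothetical block-list of section pairs whose same-name mentions are
# assumed unlikely to corefer; the real system's criteria are not shown.
UNLIKELY_SECTION_PAIRS = {
    frozenset({"family history", "diagnosis"}),
    frozenset({"current medications", "discharge medications"}),
}


def section_filter_allows(section_a: str, section_b: str) -> bool:
    """Return False when the two section names form a blocked pair."""
    pair = frozenset({section_a.lower(), section_b.lower()})
    return pair not in UNLIKELY_SECTION_PAIRS


print(section_filter_allows("History of Present Illness", "Diagnosis"))  # True
print(section_filter_allows("Family History", "Diagnosis"))              # False
```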


Vicinity Filter

Unlike proper mentions, nominal mentions in the same document could refer to completely different entities

  Patient underwent a total abdominal hysterectomy in 02/90 for a 4x3.6x2 cm cervical mass felt to be a fibroid at Vanor . Pathology revealed poorly differentiated squamous cell carcinoma of the cervix with spots of vaginal margins and metastatic squamous cell carcinoma in the cardinal ligaments with extensive lymphatic invasion . She underwent exploratory laparotomy and had a bilateral salpingo-oophorectomy and appendectomy . Pathology was negative for tumor and showed peritubal and periovarian adhesions .

We reject relationships if the mentions only contain a list of common noun words and are far apart
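The rejection rule can be sketched as below. The common-noun list, the distance measure (sentence index), and the threshold are all assumptions chosen for illustration; the slide does not say how "far apart" is measured.

```python
# Hypothetical list of common noun words and a hypothetical distance threshold.
COMMON_NOUNS = {"pathology", "mass", "pain", "scan", "test"}
MAX_SENTENCE_GAP = 1


def vicinity_filter_allows(antecedent: str, mention: str,
                           antecedent_sentence: int, mention_sentence: int) -> bool:
    """Reject a pair made only of common nouns when the mentions are far apart."""
    words = (antecedent + " " + mention).lower().split()
    only_common = all(word in COMMON_NOUNS for word in words)
    far_apart = abs(mention_sentence - antecedent_sentence) > MAX_SENTENCE_GAP
    return not (only_common and far_apart)


# In the example note, "Pathology" appears in the second and fourth sentences
# (indices 1 and 3) and reports on two different procedures.
print(vicinity_filter_allows("Pathology", "Pathology", 1, 3))  # False
```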


Sieves

The first sieve accepts mentions that match exactly when aligned to the right and the antecedent has more words

  Echocardiogram showed moderate anterior pericardial effusion of approximately 600 cc with diastolic indications of the right ventricle and low velocity paradox . She had a follow-up echocardiogram . Echocardiogram showed left ventricle at the upper limits of normal for size , low normal function , moderate to mild effusion with pericardial pressures exceeding right atrial pressures , and right ventricular pressures at various points of patient 's cycle without any change in the effusion from 06/11 .

The second sieve accepts a pair when the mention is a relative pronoun that is governed by the antecedent
  as detected by rules based on part-of-speech tags
  abbreviation list using UMLS
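A minimal sketch of the first sieve's test, assuming whitespace tokenization and case-insensitive comparison. Whether mentions of equal length also pass is not stated on the slide; this sketch accepts them, which is an assumption. The function name is hypothetical.

```python
def right_exact_match(antecedent: str, mention: str) -> bool:
    """True when the mention equals the right-aligned tail of the antecedent."""
    antecedent_words = antecedent.lower().split()
    mention_words = mention.lower().split()
    if not mention_words or len(mention_words) > len(antecedent_words):
        return False
    # Compare the last len(mention_words) words of the antecedent.
    tail = antecedent_words[len(antecedent_words) - len(mention_words):]
    return tail == mention_words


print(right_exact_match("a follow-up echocardiogram", "Echocardiogram"))  # True
print(right_exact_match("Echocardiogram", "a follow-up echocardiogram"))  # False
```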


Sieves (contd.)

For the second and seventh sieves, we used synonyms and other relationships extracted from the UMLS
  UMLS MRREL table
  synonym (for sieve 2), parent-child and narrow-broad (sieve 7)

The seventh sieve uses the Porter Stemmer algorithm to stem the mentions
  The mention pair is accepted if the stems of the heads are the same and the rest of the words (after stemming) in one of the mentions are in the other mention
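The stemmed part of the seventh sieve can be sketched as follows. To keep the example self-contained, `toy_stem` is a crude suffix stripper standing in for the Porter stemmer (it is not the Porter algorithm), and the head is assumed to be the last word of the mention; both are simplifications made here. The UMLS related-words branch of the sieve is not covered.

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripper used as a stand-in for the Porter stemmer."""
    lowered = word.lower()
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def stemmed_match(mention_a: str, mention_b: str) -> bool:
    """Head stems equal, and one mention's remaining stems contained in the other's."""
    stems_a = [toy_stem(word) for word in mention_a.split()]
    stems_b = [toy_stem(word) for word in mention_b.split()]
    if not stems_a or not stems_b:
        return False
    if stems_a[-1] != stems_b[-1]:  # assumed head: the last word
        return False
    rest_a, rest_b = set(stems_a[:-1]), set(stems_b[:-1])
    return rest_a <= rest_b or rest_b <= rest_a


print(stemmed_match("pericardial effusions", "moderate pericardial effusion"))  # True
print(stemmed_match("right ventricular pressures", "right atrial pressures"))   # False
```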


The Eighth Sieve: Pronoun resolution

Set-up A: Li et al.'s model
Set-up B: UNION of A and C
Set-up C: a rule-based pronoun sieve
  The first seven sieves collect information about the entity.
  Based on the entity's grammatical number, gender, and animacy, each pronoun is assigned to an antecedent entity.
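A rule-based assignment of this kind might look like the sketch below. The `Entity` fields, the small pronoun table, and the tie-break (pick the most recently mentioned agreeing entity) are assumptions for illustration; the slide only states that number, gender, and animacy are used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    name: str
    number: str         # "singular" or "plural"
    gender: str         # "female", "male", or "neutral"
    animate: bool
    last_sentence: int  # sentence index of the entity's latest mention


# Hypothetical pronoun table: (number, gender, animate).
PRONOUN_FEATURES = {
    "she": ("singular", "female", True),
    "he": ("singular", "male", True),
    "it": ("singular", "neutral", False),
}


def resolve_pronoun(pronoun: str, sentence: int,
                    entities: List[Entity]) -> Optional[Entity]:
    """Pick the most recently mentioned entity that agrees with the pronoun."""
    features = PRONOUN_FEATURES.get(pronoun.lower())
    if features is None:
        return None
    candidates = [
        entity for entity in entities
        if entity.last_sentence <= sentence
        and (entity.number, entity.gender, entity.animate) == features
    ]
    return max(candidates, key=lambda entity: entity.last_sentence, default=None)


entities = [
    Entity("the patient", "singular", "female", True, 0),
    Entity("cervical mass", "singular", "neutral", False, 0),
]
print(resolve_pronoun("She", 2, entities).name)  # the patient
```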


Training Results

Accuracy of the machine-learning based pronoun sieve

System   Beth   Partners   Discharge   Progress
FHMM     66%    63.5%      60%         61.5%

Cumulative performance on development set as sieves are added

Sieves                   Average   B3 P|R|F         MUC P|R|F        BLANC P|R|F      CEAF P|R|F
{1}                      .698      .869|.909|.889   .726|.335|.458   .931|.559|.605   .868|.654|.746
{1, 2}                   .710      .865|.916|.89    .726|.366|.487   .927|.561|.607   .865|.665|.752
{1, 2, 3}                .726      .863|.923|.892   .730|.412|.527   .926|.567|.617   .861|.681|.760
{1, 2, 3, 4}             .726      .863|.923|.892   .730|.412|.527   .926|.567|.617   .861|.681|.760
{1, 2, 3, 4, 5}          .728      .856|.927|.890   .705|.436|.539   .908|.570|.619   .843|.684|.755
{1, 2, 3, 4, 5, 6}       .729      .853|.929|.889   .701|.443|.543   .906|.570|.620   .839|.686|.755
{1, 2, 3, 4, 5, 6, 7}    .729      .852|.930|.889   .696|.447|.545   .903|.570|.620   .836|.686|.754

Set-up A = {1, 2, 3, 4, 5, 6, 7, 8a}
Set-up B = {1, 2, 3, 4, 5, 6, 7, 8a+8b}
Set-up C = {1, 2, 3, 4, 5, 6, 7, 8b}

Rows for the three set-ups (the slide layout does not preserve which row belongs to which set-up; the last row was set apart on the slide):

                         Average   B3 P|R|F         MUC P|R|F        BLANC P|R|F      CEAF P|R|F
                         .806      .883|.930|.906   .691|.716|.703   .884|.690|.754   .801|.814|.808
                         .815      .874|.933|.903   .693|.789|.738   .856|.889|.872   .770|.843|.805
                         .836      0.90|.936|.918   .739|.798|.767   .937|.808|.862   .802|.843|.822
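The slide does not define the Average column. The reported numbers are consistent with an unweighted mean of the B3, MUC, and CEAF F-scores (BLANC left out); that reading is an inference from the figures, not a statement from the authors. A short Python check, with `average_f` as a hypothetical helper name:

```python
def average_f(b3_f: float, muc_f: float, ceaf_f: float) -> float:
    """Unweighted mean of three F-scores, rounded to three decimals."""
    return round((b3_f + muc_f + ceaf_f) / 3, 3)


# (B3 F, MUC F, CEAF F, reported Average), in the order the rows are reported.
rows = [
    (0.889, 0.458, 0.746, 0.698),
    (0.890, 0.487, 0.752, 0.710),
    (0.892, 0.527, 0.760, 0.726),
    (0.892, 0.527, 0.760, 0.726),
    (0.890, 0.539, 0.755, 0.728),
    (0.889, 0.543, 0.755, 0.729),
    (0.889, 0.545, 0.754, 0.729),
    (0.906, 0.703, 0.808, 0.806),
    (0.903, 0.738, 0.805, 0.815),
    (0.918, 0.767, 0.822, 0.836),
]

mismatches = [row for row in rows
              if abs(average_f(row[0], row[1], row[2]) - row[3]) > 1e-9]
print(mismatches)  # []
```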


Test Results


Discussion

Portable, PHI-free, expert-based system

The performance of the machine-learning based pronoun sieve is 10% less than the corresponding performance for general English
  difference in the domains
  agnostic to the global properties of the entity

Contribution of sieves similar to general English

Future research work
  adding features obtained from previous sieves to FHMM


Take home

A multi-pass sieve coreference resolution system produces satisfactory performance
  using simple rules, POS tags and semantics

An entity-centered approach is better than a mention-centered approach

Watch out for an expert-based open-source system under Apache License

For any questions or comments, please email or see us at AMIA


Thanks!

Siddhartha Jonnalagadda, PhD (Track lead)
Dingcheng Li, MS
Sunghwan Sohn, PhD
Stephen Wu, PhD
Kavishwar Wagholikar, MBBS, PhD
Manabu Torii, PhD (Georgetown)
Hongfang Liu, PhD (Principal Investigator)
