
Department of Software and Computing Systems
Research Group of Language Processing and Information Systems
WSD, Textual Entailment, People Disambiguation and Affective Text
Sonia Vázquez Pérez
Wolverhampton
July 2007
2
Outline
WSD with WordNet Domains (English, Spanish, Catalan)
WSD pattern based (Spanish)
Textual Entailment (English)
Participations
Answer Validation Exercise (Textual entailment)
Affective text (Establishing sentiments from text snippets)
Web People Search (Name disambiguation)

3
WSD with WordNet Domains
Knowledge-driven system
WordNet Domains
New resource: Relevant domains


Synset     Domain             Noun      Gloss
05266809   Music              Music#1   an artistic form of auditory ...
04417946   Acoustics          Music#2   any agreeable (pleasing ...
00351993   Music, Free_time   Music#3   a musical diversion; his music ...
05105195   Music              Music#4   a musical composition in ...
04418122   Music              Music#5   the sounds produced by ...
00755322   Law                Music#6   punishment for one's actions; ...

4
WSD with WordNet Domains
Relevant Domains
Measures used to establish the relevance of each domain with respect to each word:

Mutual Information
\[
MI(w, D) = \log_2 \frac{\Pr(w \mid D)}{\Pr(w)}
\]

Association Ratio
\[
AR(w, D) = \Pr(w \mid D) \, \log_2 \frac{\Pr(w \mid D)}{\Pr(w)}
\]
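The two measures can be illustrated with a minimal Python sketch. The slides do not say how the probabilities are estimated, so the counts below (occurrences of a word in glosses labelled with a domain) are hypothetical and serve only to show the arithmetic.

```python
import math

def mutual_information(p_w_given_d: float, p_w: float) -> float:
    """MI(w, D) = log2(Pr(w|D) / Pr(w))."""
    return math.log2(p_w_given_d / p_w)

def association_ratio(p_w_given_d: float, p_w: float) -> float:
    """AR(w, D) = Pr(w|D) * log2(Pr(w|D) / Pr(w))."""
    return p_w_given_d * mutual_information(p_w_given_d, p_w)

# Hypothetical counts, for illustration only (not taken from WordNet Domains).
count_w_in_d = 8      # occurrences of word w in text labelled with domain D
count_d = 1000        # all word occurrences in text labelled with domain D
count_w = 20          # occurrences of word w overall
count_all = 20000     # all word occurrences overall

p_w_given_d = count_w_in_d / count_d   # Pr(w|D)
p_w = count_w / count_all              # Pr(w)

mi = mutual_information(p_w_given_d, p_w)
ar = association_ratio(p_w_given_d, p_w)
print(f"MI = {mi:.3f}, AR = {ar:.3f}")
```

Because AR multiplies MI by Pr(w|D), a word that is rare inside the domain gets a lower score than under MI alone.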
5
WSD with WordNet Domains
Relevant domains for music:
Noun    Domain              A.R.
Music   Music               0.240062
Music   Free_time           0.093726
Music   Acoustics           0.072362
Music   Dance               0.065254
Music   University          0.046024
Music   Radio               0.042735
Music   Art                 0.020298
Music   Telecommunication   0.006069
6
WSD with WordNet Domains
Establishing the correct sense of each word
Context vector
Collection of the most representative domains in context
Sense vector
Collection of the most representative domains for each word sense
These vectors are obtained from the information of the glosses

7
WSD with WordNet Domains
Example (context vector):
"There are a number of ways in which the chromosome structure can change, which will detrimentally change the genotype and phenotype of the organism"

[Figure: the sentence annotated with its target word and its context]

\[
CV = \begin{pmatrix}
\text{Biology} & 0.00102837 \\
\text{Ecology} & 0.00402855 \\
\text{Botany} & 3.20408\mathrm{e}{-06} \\
\text{Zoology} & 1.77959\mathrm{e}{-05} \\
\text{Anatomy} & 1.29592\mathrm{e}{-05} \\
\text{Physiology} & 0.000226531 \\
\text{Chemistry} & 0.000179857 \\
\text{Geology} & 1.66327\mathrm{e}{-05} \\
\text{Meteorology} & 0.00371308 \\
\vdots & \vdots
\end{pmatrix}
\]
8
WSD with WordNet Domains
Example (sense vector):
genotype#1: a group of organisms sharing a specific genetic constitution

\[
SV = \begin{pmatrix}
\text{Ecology} & 0.084778 \\
\text{Biology} & 0.047627 \\
\text{Bowling} & 0.019687 \\
\text{Archaeology} & 0.016451 \\
\text{Sociology} & 0.014251 \\
\text{Alimentation} & 0.006510 \\
\text{Linguistics} & 0.005297 \\
\vdots & \vdots
\end{pmatrix}
\]
9
WSD with WordNet Domains
Determination of word sense:
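Cosine measure between the context vector (CV) and each sense vector (SV):

\[
\cos(CV, SV) = \frac{CV \cdot SV}{\lVert CV \rVert \, \lVert SV \rVert}
= \frac{\sum_{i=1}^{n} CV_i \, SV_i}{\sqrt{\sum_{i=1}^{n} CV_i^2} \, \sqrt{\sum_{i=1}^{n} SV_i^2}}
\]

Selected sense: the one whose cosine is closest to 1

genotype#1 = 0.00804111
genotype#2 = 0.00340548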
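The sense selection step can be sketched in Python as follows. This is a minimal illustration, assuming that vectors are stored as sparse dictionaries from domain name to weight; the sense labels and weights are toy values, not the figures from the slides (whose vectors are truncated).

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of two sparse domain vectors (missing domains count as 0)."""
    dot = sum(weight * v.get(domain, 0.0) for domain, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_sense(context_vector: dict[str, float],
                 sense_vectors: dict[str, dict[str, float]]) -> str:
    """Return the sense whose vector has the highest cosine with the context."""
    return max(sense_vectors,
               key=lambda sense: cosine(context_vector, sense_vectors[sense]))

# Toy data (hypothetical weights).
context = {"Biology": 0.6, "Ecology": 0.8}
senses = {
    "word#1": {"Biology": 0.6, "Ecology": 0.8},
    "word#2": {"Linguistics": 1.0},
}
best = select_sense(context, senses)
print(best, round(cosine(context, senses[best]), 3))
```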
10
WSD pattern based

Syntagmatic + paradigmatic relations
[Figure: the phrase "obras para órgano" ("works for organ") illustrates syntagmatic relations; the alternatives piano, violín ("violin") and guitarra ("guitar") for the same position illustrate paradigmatic relations]
Local context
Syntagmatic patterns: L1 R L2
1) Structure:
   L1, L2 = lexical content words (N, V, Adj, Adv)
   R = lexico-syntactic pattern (functional words) (Prep, Conj)
2) Frequency in corpus
Morphosyntactic pattern list
Sense characterization
Adaptation of EWN: Sense X_i → D_i
1) Content: variants related to X_i in EWN: syno/hyper/hypo/mero/coord
2) Disjunctive condition: D_i ∩ D_j = ∅
11
WSD pattern based

Input: a sentence with X0
Step 1: Identification of the syntagmatic pattern (S0)
   Structure filtering
   Frequency filtering
Step 2: Disambiguation of X0 in the syntagmatic pattern S0
   2a) Extraction of the paradigm corresponding to the position of X0 in S0: P0 = {p_j}
   2b) Identification of paradigmatic relations inside P0: finding words related to X
   2c) Sense(s) assignment to X0 in S0
Step 3: Sense labelling
   Finding the occurrences of pattern S0
   Sense labelling of X in the occurrences of S0
Output: sense-tagged occurrences of X in S0
Resources used: morphosyntactic patterns list for N, adaptation of EuroWordNet, corpus
12
WSD pattern based
Syntactic pattern: X R Y
X, Y: Words with lexical content (N, V, Adj, Adv)
R: Functional words (Prep, Conj, or ∅)
Basic patterns:
N, N
N C N
N P N
N A
N V
A N
V N

Examples:
[X=pasaje-N R=∅ Y=aéreo-Adj] ("air ticket")
[X=canal-N R=de-Pr Y=televisión-N] ("television channel")

Tags:
N    Noun
R    Adverb
A    Adjective
V    Participle verb
C*   Conjunction
D    Determiner
Conjunctions = {y, e, o, u}
13
WSD pattern based
Each basic pattern has discontinuous realisations in texts.
We pre-establish morphosyntactic schemes for the search of
patterns; e.g.:
N (((R) R) A/V) , ((D) D) (((R) R) A/V) N
N (((R) R) A/V) C* ((D) D) (((R) R) A/V) N
N (((R) R) A/V) P ((D) D) (((R) R) A/V) N
N ((R) R) A (C* ((R) R) A/V)
N ((R) R) V (C* ((R) R) A/V)
(A/V C* ((D) D) (((R) R)) A N
(A/V C* ((D) D) (((R) R)) V N
Units in parentheses are optional; units separated by a slash
are alternatives for a position.

14
WSD pattern based
For each search scheme, we define decomposition rules
in order to extract the basic patterns.
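Example:
Scheme N A C* A decomposes into the basic patterns N A and N A:
"Coronas danesas y suecas" ("Danish and Swedish crowns") → "Corona danesa" ("Danish crown"), "Corona sueca" ("Swedish crown")
Each unit of the sequence is also considered at the lemma level.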
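A decomposition rule of this kind can be sketched in Python. The function name, the (form, tag) token representation and the single rule shown are assumptions made for illustration; the slides do not list the actual rule set.

```python
def decompose_n_a_c_a(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Split a sequence tagged N A C A into two basic N A patterns.

    tokens: (form, tag) pairs. Returns an empty list if the scheme
    does not match.
    """
    tags = [tag for _, tag in tokens]
    if tags != ["N", "A", "C", "A"]:
        return []
    forms = [form for form, _ in tokens]
    noun, first_adj, _conj, second_adj = forms
    # The noun is shared by both coordinated adjectives.
    return [(noun, first_adj), (noun, second_adj)]

# Forms normalised as on the slide ("Coronas danesas y suecas").
phrase = [("corona", "N"), ("danesa", "A"), ("y", "C"), ("sueca", "A")]
patterns = decompose_n_a_c_a(phrase)
print(patterns)
```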
15
WSD pattern based
Sense discriminators obtained from EWN:
Selection of all nouns related to each sense along the different
lexical-semantic relations.
Elimination of the common elements between different senses.
Disjunctive sets of nouns for the senses of a word.
Commutative test:
Hypothesis: If two words can commute in a given context, they
are likely to be semantically close.
Application: If the ambiguous word can be substituted with a
sense discriminator inside a syntactic pattern, then it has the
sense corresponding to that discriminator.
The algorithm operates with words from a sense-untagged corpus.
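The construction of disjunctive discriminator sets and the commutative test can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the related-word sets are toy data rather than real EuroWordNet output, and the candidate words stand for the nouns found in the corpus in the position of the ambiguous word.

```python
def disjoint_discriminators(related: dict[str, set[str]]) -> dict[str, set[str]]:
    """Drop every word that is related to more than one sense."""
    result = {}
    for sense, words in related.items():
        others = set().union(*(w for s, w in related.items() if s != sense))
        result[sense] = words - others
    return result

def commutative_test(candidates: set[str],
                     discriminators: dict[str, set[str]]) -> dict[str, set[str]]:
    """Senses whose discriminators can replace the ambiguous word in the pattern."""
    matches = {}
    for sense, sd in discriminators.items():
        common = candidates & sd
        if common:
            matches[sense] = common
    return matches

# Toy data (hypothetical, not actual EuroWordNet relations).
related = {
    "órgano#3": {"parte", "músculo", "ojo", "mecanismo"},
    "órgano#4": {"teclado", "pedal", "mecanismo"},
}
sd_sets = disjoint_discriminators(related)   # "mecanismo" is shared, so removed

# Nouns assumed to occur in the corpus in the slot "__ R Y".
corpus_candidates = {"parte", "cabeza", "programa"}
matched = commutative_test(corpus_candidates, sd_sets)
print(matched)
```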
16
WSD pattern based
Commutative Test Algorithm
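[Figure: flowchart of the commutative test. The ambiguous word X in the pattern X - R - Y is replaced by a slot (__ - R - Y); the corpus supplies the words X_k that fill the slot (X_k - R - Y); each X_k is checked (YES/NO) against the discriminators d_ij of the sense discriminator sets SD_1, ..., SD_i0, ..., SD_n; a match with a discriminator d_i0j labels the occurrence with sense i0 (X_i0 - R - Y).]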
17
WSD pattern based
WSD module has two heuristics:
H1: Commutative Test Algorithm applied on the paradigmatic
information (the nouns obtained from substituting the ambiguous
occurrence in the pattern).
H2: Commutative Test Algorithm applied on the syntagmatic
information (the nouns obtained from the sentence).
The two heuristics act as voters for the sense
assignment.

18
WSD pattern based
Example:
"Los enormes y continuados progresos científicos y técnicos de la Medicina
actual han logrado hacer descender espectacularmente la mortalidad
infantil, erradicar multitud de enfermedades hasta hace poco mortales,
sustituir mediante trasplante o implantación órganos dañados o partes del
cuerpo inutilizadas y alargar las expectativas de vida."
(English: "The enormous and sustained scientific and technical progress of
present-day medicine has managed to dramatically reduce infant mortality,
eradicate many diseases that were fatal until recently, replace damaged
organs or disabled body parts through transplant or implantation, and
extend life expectancy.")

1. Input text POS-tagging.
2. Syntactic patterns identification.
2.1. Use of search schemes.
     Scheme: "órganos dañados o partes" ("damaged organs or parts") = N A C N
2.2. Use of decomposition rules.
     Decomposition rules: N A C N → N A, N C N
     Final result: "órgano dañado" (N A), "órgano o parte" (N C N)
3. Extraction of information.
3.1. From corpus: mediador, terreno, chófer, árbol, cabeza, planeta, parte,
     incremento, totalidad, guerrilla, programa, mitad, país, temporada,
     artículo, tercio
3.2. From sentence: progreso, científico, mortalidad, multitud, enfermedad,
     mortal, trasplante, implantación, órgano, parte, cuerpo, expectativa,
     vida
4. Extraction of Sense Discriminators.
Sense Discriminators Sets:
Sense 1: órgano vegetal, espora, flor, pera, manzana, bellota, hinojo, semilla,
poro, píleo, carpóforo, ...
Sense 2: agencia, unidad administrativa, banco central, servicio secreto,
seguridad social, FBI, ...
Sense 3: parte del cuerpo, trozo, músculo, riñón, oreja, ojo, glándula, lóbulo,
tórax, dedo, articulación, rasgo, facción, ...
Sense 4: instrumento de viento, instrumento musical, mecanismo, aparato,
teclado, pedal, corneta, ...
Sense 5: periódico, publicación, medio de comunicación, método, serie, serial,
número, ejemplar, ...
5. Commutative Test.
Heuristic 1: the corpus set S1 is intersected with each discriminator set (S1 ∩ SD1, ..., S1 ∩ SD5).
Heuristic 2: the sentence set S2 is intersected with each discriminator set (S2 ∩ SD1, ..., S2 ∩ SD5).
6. Final sense assignment.
órgano#3: A fully differentiated structural and functional unit in
an animal that is specialized for some particular function.
19
Textual Entailment
Problem: Semantic Variability
Affects:
Question Answering
Information Extraction
Information Retrieval
Document Summarization

Solution:
Recognising Textual Entailment

20
Textual Entailment
Example:
Text:
Across the Atlantic, on July 13, a radical Islamic cleric named Ali Al-Timimi
was sentenced to life in prison, in Virginia, for soliciting treason.
Hypothesis:
Ali Al-Timimi is imprisoned in Virginia.

Hypothesis:
Words of the same semantic field usually appear in the same context
Needs lots of information
Can establish relations among synonym words or among words
pertaining to different subhierarchies
Hidden relations (latent)
21
Textual Entailment
Mathematical representation with a
[term × domain] matrix
Dimension reduction with Singular Value Decomposition
(SVD)
Obtaining a new semantic space

Application:
British National Corpus (BNC)
RTE2 corpus
WordNet Domains

22
Textual Entailment
British National Corpus (BNC)
A collection of 4000 documents
National newspapers, specialized articles
[term × document] matrix
Two experiments
Relevant Documents
Selection of the 20 most relevant documents
Similarity measure
Number of common documents
Relevant words
Selection of the 800 most relevant words
Similarity measure
Number of words in common
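The "relevant documents" experiment can be sketched in Python. The slides do not specify how a sentence is mapped to its most relevant documents, so this sketch assumes a hypothetical scoring: the relevance of a document is the sum of the matrix weights of the sentence words it contains. The matrix contents are toy values.

```python
def top_documents(words: list[str],
                  term_doc: dict[str, dict[str, float]],
                  k: int = 20) -> set[str]:
    """Return the k documents with the highest summed weight for the words."""
    scores: dict[str, float] = {}
    for word in words:
        for doc, weight in term_doc.get(word, {}).items():
            scores[doc] = scores.get(doc, 0.0) + weight
    # Sort by descending score, then by document id for a stable order.
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return set(ranked[:k])

def common_documents(text_words: list[str], hyp_words: list[str],
                     term_doc: dict[str, dict[str, float]], k: int = 20) -> int:
    """Similarity = number of documents shared by the two top-k sets."""
    return len(top_documents(text_words, term_doc, k)
               & top_documents(hyp_words, term_doc, k))

# Toy [term x document] matrix (hypothetical weights).
term_doc = {
    "oil": {"d1": 2.0, "d2": 1.0},
    "plant": {"d1": 1.0, "d3": 3.0},
    "music": {"d4": 5.0},
}
print(common_documents(["oil", "plant"], ["plant"], term_doc, k=2))
```

The "relevant words" experiment would follow the same shape, with the roles of terms and documents swapped and k set to 800.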

23
Textual Entailment
Corpus of RTE2 task
Text-Hypothesis pairs from the data set of RTE2
Two experiments:
Semantic space with Text sentences
Build matrix with the information provided by the Text
sentences
The 20 most relevant Text sentences
Semantic space with Hypothesis sentences
Build matrix with the information provided by the Hypothesis
sentences
The 20 most relevant Hypothesis sentences
24
Textual Entailment
WordNet Domains
Semantic space [term × domain]
Words of glosses of WordNet
Example:
Music |music#1| an artistic form of auditory communication
incorporating instrumental or vocal tones
Acoustics |music#2| agreeable sound
Free_time, Music |music#3| a musical diversion; "his music was his
central interest"
Factotum |music#4| punishment for one's actions; "you have to
face the music"
25
Textual Entailment
Application of the cosine measure
Co-occurrence vectors
Grammatical relations
Co-occurrence among words with a specific syntactic relation
Non grammatical relations
Co-occurrence among words in an n-word window
Document co-occurrence relations
Overlap in a set of documents
Two approaches:
Based on corpus information
Based on Relevant Domains resource

26
Textual Entailment
Based on documents
BNC corpus
Similarity between each T-H pair
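Vectors of around 4000 attributes

Inverse document frequency of a word w, where N is the number of documents and n_w the number of documents containing w:
\[
idf_w = \log\left(\frac{N}{n_w}\right)
\]

Cosine between the Text vector T and the Hypothesis vector H:
\[
\cos(T, H) = \frac{T \cdot H}{\lVert T \rVert \, \lVert H \rVert}
= \frac{\sum_{i=1}^{n} T_i \, H_i}{\sqrt{\sum_{i=1}^{n} T_i^2} \, \sqrt{\sum_{i=1}^{n} H_i^2}}
\]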
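A minimal Python sketch of the document-based similarity follows. The slides give the idf weight and the cosine but not how the per-document attributes are filled, so this sketch assumes (hypothetically) that the attribute for a document is the summed idf of the sentence words occurring in it. The four-document collection is toy data standing in for the BNC.

```python
import math

def idf(word: str, documents: list[set[str]]) -> float:
    """idf_w = log(N / n_w); 0 if the word occurs in no document."""
    n_w = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / n_w) if n_w else 0.0

def sentence_vector(words: list[str], documents: list[set[str]]) -> list[float]:
    """One attribute per document: summed idf of the sentence words it contains."""
    unique = set(words)
    return [sum(idf(w, documents) for w in unique if w in doc)
            for doc in documents]

def cosine(t: list[float], h: list[float]) -> float:
    dot = sum(a * b for a, b in zip(t, h))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_h = math.sqrt(sum(b * b for b in h))
    return dot / (norm_t * norm_h) if norm_t and norm_h else 0.0

# Toy collection (hypothetical); each document is a set of words.
docs = [{"oil", "plant"}, {"oil"}, {"music"}, {"plant", "jersey"}]
t_vec = sentence_vector(["oil", "plant"], docs)
h_vec = sentence_vector(["plant", "jersey"], docs)
similarity = cosine(t_vec, h_vec)
print(round(similarity, 4))
```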
27
Textual Entailment
Based on Relevant Domains
Example:
T- At the other side of the country, Linden, N.J. is part of an industrial
corridor of chemical plants and oil refineries that some federal officials
refer to as "the most dangerous two miles in America."
H- Chemical plants and oil refineries are located in New Jersey.

T Vector
engineering      3.52583e-05
mechanics        6.35667e-05
hydraulics       0.000138369
chemistry        0.000231547
geology          2.60917e-05
geography        7.75556e-05
geometry         0.000429758
physics          3.09861e-05
...

H Vector
engineering      0.000141033
mechanics        3.10667e-05
hydraulics       0.000395011
chemistry        0.000896033
geology          2.05556e-05
geography        1.82333e-05
physics          6.15556e-05
atomic_physic    0.000168833
...

cos(T,H) = 0.83
28
Affective Text

Module 1
Determine the part-of-speech tag of each word
in the headline, in order to identify the
content words (noun, verb, adjective,
adverb) and to remove stop words and
punctuation marks.
Module 2
Use different search engines to calculate
the MI score:
\[
MI(e_i, cw) = \frac{hits(e_i, cw)}{hits(e_i) \, hits(cw)}
\]
where e_i ∈ {anger, disgust, fear, joy, sadness, surprise}
and cw are the content words of a headline.
Three different searches:
1. All content words together with each emotion
2. Only the content words
3. Only each emotion
Module 3
Calculation of final results
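The MI score of Module 2 can be sketched in Python from the three kinds of hit counts. The hit counts below are invented for illustration (no search engine is queried), and the function names are hypothetical; how Module 3 turns the scores into final results is not described on the slide, so the sketch stops at the per-emotion scores.

```python
EMOTIONS = ("anger", "disgust", "fear", "joy", "sadness", "surprise")

def mi_score(hits_joint: int, hits_emotion: int, hits_words: int) -> float:
    """MI(e_i, cw) = hits(e_i, cw) / (hits(e_i) * hits(cw))."""
    if hits_emotion == 0 or hits_words == 0:
        return 0.0
    return hits_joint / (hits_emotion * hits_words)

def score_headline(hits_words: int,
                   hits_by_emotion: dict[str, int],
                   joint_hits: dict[str, int]) -> dict[str, float]:
    """One MI score per emotion for the content words of a headline."""
    return {
        emotion: mi_score(joint_hits.get(emotion, 0),
                          hits_by_emotion.get(emotion, 0),
                          hits_words)
        for emotion in EMOTIONS
    }

# Hypothetical hit counts for one headline.
hits_cw = 1000                                  # search 2: content words only
hits_emotion = {"joy": 2000, "fear": 4000}      # search 3: each emotion only
hits_joint = {"joy": 100, "fear": 40}           # search 1: words + emotion

scores = score_headline(hits_cw, hits_emotion, hits_joint)
for emotion, value in scores.items():
    print(f"{emotion}: {value:.2e}")
```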
29
Web People Search
