Tuesday, June 23rd, 2009 Digital Humanities 2009 University of Maryland, College Park
Roger Bilisoly, Ph.D. Department of Mathematical Sciences Central Connecticut State University New Britain, Connecticut
Text similarity is used in information retrieval (IR) to match queries to texts, and the same measures can serve as distances between two texts. There are many ways to perform IR:
Vector Space Models: the term-document matrix allows a geometric approach in high-dimensional spaces (e.g., $\mathbb{R}^{20{,}591}$ for Poe's short stories).
Probabilistic Models: find $\max_i P(\text{Text}_i \mid \text{Query})$.
Language Models: find $\max_i P(\text{Query} \mid \text{Language Model}_i)$.
See Grossman and Frieder (2004) for these and additional approaches. IR is well tested: search engines such as Google.com are profitable and competitive.
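To make the geometric view concrete, here is a minimal sketch of the angle between two texts in a vector space model. The whitespace tokenizer and the function names (term_vector, cosine_similarity, angle_degrees) are assumptions for illustration, not the method used on this poster.

```python
import math
from collections import Counter

def term_vector(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between the two vectors, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = min(1.0, max(-1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))
```

Two texts with no words in common are 90 degrees apart under this measure, whatever their content, which illustrates why such distances can be hard to relate to the experience of reading.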
However, the IR approaches have drawbacks:
1. The text distances above lack intuitive appeal: e.g., angles in 20,591-dimensional space are hard to comprehend directly. Hence, linking text distances to a human's experience of reading can be difficult.
2. There are problems with both the sparseness and the complexity of language.
Second Goal: Find clusters that are understandable to humans. This researcher uses formal concept theory.
Formal concepts have the form {{objects}, {attributes}}. See Carpineto and Romano (2004) for a detailed exposition. In the example below, columns represent authors (objects) and rows represent attributes. Idea: look for maximal submatrices that contain all 1s. This can be done efficiently by Ganter's algorithm. For programming details see Fu and Nguifo (2004).
                Poe  Stowe  Dickens  Eliot  Whitman
Poet             1     0       0       0       1
Short Stories    1     0       0       0       0
Novelist         0     1       1       1       0
USA              1     1       0       0       1
UK               0     0       1       1       0
Male             1     0       1       0       1
Female           0     1       0       1       0
{{Poe, Whitman}, {Poet, USA, Male}} is a formal concept (shown in red on the poster).
Other examples:
{{Poe}, {Poet, Short Stories, USA, Male}}
{{Stowe, Eliot}, {Novelist, Female}}
{{Stowe, Dickens, Eliot}, {Novelist}}
{{Poe, Dickens, Whitman}, {Male}}
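The example concepts can be checked with a short brute-force sketch. This is not Ganter's algorithm: it simply closes every subset of objects, which is only practical for tiny tables like this one. The names INCIDENCE, common_attributes, objects_having and formal_concepts are assumptions for illustration.

```python
from itertools import combinations

# Incidence relation from the author example: author -> set of attributes.
INCIDENCE = {
    "Poe": {"Poet", "Short Stories", "USA", "Male"},
    "Stowe": {"Novelist", "USA", "Female"},
    "Dickens": {"Novelist", "UK", "Male"},
    "Eliot": {"Novelist", "UK", "Female"},
    "Whitman": {"Poet", "USA", "Male"},
}
ALL_ATTRIBUTES = frozenset().union(*INCIDENCE.values())

def common_attributes(objects):
    """Attributes shared by every object in the set (all attributes if empty)."""
    result = set(ALL_ATTRIBUTES)
    for o in objects:
        result &= INCIDENCE[o]
    return frozenset(result)

def objects_having(attributes):
    """Objects that possess every attribute in the set."""
    return frozenset(o for o, attrs in INCIDENCE.items() if attributes <= attrs)

def formal_concepts():
    """Brute force: close every subset of objects and collect distinct pairs."""
    concepts = set()
    names = sorted(INCIDENCE)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)   # shared attributes
            extent = objects_having(intent)      # all objects with those attributes
            concepts.add((extent, intent))
    return concepts
```

Each returned pair is a maximal all-1s submatrix: no author can be added without losing an attribute, and no attribute can be added without losing an author.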
[Figure: Galois lattice for the author example. Recovered node labels: Novelist; English; Whitman, Poet; Stowe; Dickens; Eliot.]
Complication: Word rates can be a function of text length. Example: Compare the word diversity (an inverse rate, measured in tokens per type) of The Black Cat to The Unparalleled Adventure of One Hans Pfaall.
The top line represents The Black Cat and the bottom Hans Pfaall. Over the range of lengths the two stories share, the former is clearly higher, even though its final value is lower (3.17 < 5.61). An approximate solution is to consider sets of stories close in size. Three groups are used here: 2001 to 3000 words; 3001 to 4200; and 4201 to 6000. (This includes 44 of Poe's short stories.)
The final value for The Black Cat is 3.17 tokens per type. For Hans Pfaall it is 5.61.
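The length effect can be seen by tracking tokens per type as a text is read. The sketch below is one plausible way to compute such a curve; the function names and the choice to compare two texts at the length of the shorter one are assumptions, not the poster's procedure.

```python
def tokens_per_type_curve(tokens):
    """Running ratio: (tokens seen so far) / (distinct types seen so far)."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append(n / len(seen))
    return curve

def compare_at_common_length(tokens_a, tokens_b):
    """Tokens per type of both texts, measured at the length of the shorter one."""
    n = min(len(tokens_a), len(tokens_b))
    if n == 0:
        raise ValueError("both texts must contain at least one token")
    return tokens_per_type_curve(tokens_a)[n - 1], tokens_per_type_curve(tokens_b)[n - 1]
```

Because the ratio generally grows with text length, comparing final values of a short and a long story mixes style with size; comparing at a common length, or within groups of similar size as done here, removes most of that effect.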
See Section 4.6 and Figure 4.6 of Bilisoly (2008) and Chapter 1 of Baayen (2001).
A Galois lattice for Poe stories (as objects) and word groups (as attributes).
Words are grouped by five themes, each of which is evocative for a human.
Death: death, corpse, dead, murder, died, die, deceased
Body: eyes, head, hand, body, feet, heart, face, eye
Spiritual: soul, god, spirit, heaven, moral, angel, devil
Horror: horror, terror, fear, horrible, anxiety, fearful
Family: family, wife, mother, daughter, father, uncle
For the incidence matrix, an entry is 1 if the story is in the top 25% of stories for that word group and 0 otherwise (other percentiles have been tested).
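A minimal sketch of this thresholding step follows. It assumes each story already has a rate for each word group; the rounding of the cutoff, the handling of ties, and the name incidence_matrix are assumptions, since the poster does not specify them.

```python
def incidence_matrix(rates, top_fraction=0.25):
    """Map {story: {group: rate}} to {story: {group: 0 or 1}}.

    A story gets 1 for a group if it is among the top `top_fraction`
    of stories by rate for that group (at least one story per group).
    """
    stories = list(rates)
    groups = sorted({g for r in rates.values() for g in r})
    k = max(1, round(top_fraction * len(stories)))  # assumed cutoff rule
    matrix = {s: {} for s in stories}
    for g in groups:
        ranked = sorted(stories, key=lambda s: rates[s].get(g, 0.0), reverse=True)
        top = set(ranked[:k])
        for s in stories:
            matrix[s][g] = 1 if s in top else 0
    return matrix

# Made-up rates for four placeholder stories, purely to show the shape of the data.
example = {
    "A": {"Death": 9.0, "Body": 1.0},
    "B": {"Death": 2.0, "Body": 8.0},
    "C": {"Death": 1.0, "Body": 3.0},
    "D": {"Death": 0.5, "Body": 2.0},
}
matrix = incidence_matrix(example)
```

The resulting 0/1 table has the same form as the author example, so the same concept search applies to it.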
Stories attached to each word group in the lattice:
Horror and Spiritual (the two lists are merged here): Red Death, Amontillado, 4 Beasts, Imp of Perverse, Eleonora, Morella, Tell-Tale Heart, Eiros and Charmion, Frenchman's Sling
Body: Red Death, Amontillado, Tell-Tale Heart, Eleonora, Morella, Frenchman's Sling
Death: 4 Beasts, Red Death, Imp of Perverse, Tell-Tale Heart, Eleonora, Morella
Morella appears in all 5 word groups (making it quintessential Poe). Tell-Tale Heart and the Red Death appear in 3.
The Tell-Tale Heart is about a man who kills his older roommate and hides the body under the floor. When the police visit, he cracks and shows them the body.
This story ranks 3rd (of 13) in Death (5.12 per K), 1st in Body (18.18 per K), and 1st in Horror (9.32 per K). This story is, in fact, considered one of Poe's iconic tales. It forms a concept with Morella and The Masque of the Red Death. These three stories have the same genre: horror.
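The "per K" figures are occurrences per thousand words. A minimal sketch of that calculation is below, using the word lists given on this poster; the regular-expression tokenizer and the name rates_per_thousand are assumptions, so the numbers it produces need not match the poster's exactly.

```python
import re

# Word groups as listed on the poster.
WORD_GROUPS = {
    "Death": {"death", "corpse", "dead", "murder", "died", "die", "deceased"},
    "Body": {"eyes", "head", "hand", "body", "feet", "heart", "face", "eye"},
    "Spiritual": {"soul", "god", "spirit", "heaven", "moral", "angel", "devil"},
    "Horror": {"horror", "terror", "fear", "horrible", "anxiety", "fearful"},
    "Family": {"family", "wife", "mother", "daughter", "father", "uncle"},
}

def rates_per_thousand(text):
    """Occurrences of each word group per 1000 tokens of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    if not tokens:
        return {group: 0.0 for group in WORD_GROUPS}
    return {
        group: 1000 * sum(t in words for t in tokens) / len(tokens)
        for group, words in WORD_GROUPS.items()
    }
```

Ranking the stories of one size group by each of these rates gives the positions quoted for the individual stories.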
Morella is the narrator's wife, who is obsessed with mysticism and dies during childbirth. The daughter grows up to resemble Morella more and more, and upon baptism she cries "I am here" and then dies herself.
Poe has several stories about wives who die: Berenice, Ligeia, The Oval Portrait, The Oblong Box and Eleonora (who dies before she and the narrator can marry). So this plot is one Poe has explored several times. Morella ranks 1st in Death (8.94 per K), 3rd in Body (13.18 per K), 1st in Spiritual (8.47 per K), 2nd in Horror (5.65 per K), and 2nd in Family (5.18 per K).
This Morella paragraph has all five word groups: spiritual, body, family, death and horror.
And as years rolled away, and I gazed day after day upon her holy, and mild, and eloquent face, and poured over her maturing form, day after day did I discover new points of resemblance in the child to her mother, the melancholy and the dead. And hourly grew darker these shadows of similitude, and more full, and more definite, and more perplexing, and more hideously terrible in their aspect. For that her smile was like her mother's I could bear; but then I shuddered at its too perfect identity, that her eyes were like Morella's I could endure; but then they, too, often looked down into the depths of my soul with Morella's own intense and bewildering meaning. And in the contour of the high forehead, and in the ringlets of the silken hair, and in the wan fingers which buried themselves therein, and in the sad musical tones of her speech, and above all -- oh, above all, in the phrases and expressions of the dead on the lips of the loved and the living, I found food for consuming thought and horror, for a worm that would not die.
References
Word Frequency Distributions. R. Harald Baayen (2001).
Practical Text Mining with Perl. Roger Bilisoly (2008).
The Use of Color Words by Edgar Allan Poe. Wilson Clough (1930). PMLA, 45(2).
Concept Data Analysis: Theory and Applications. Claudio Carpineto and Giovanni Romano (2004).
A Lattice Algorithm for Data Mining. Huaiguo Fu and Engelbert Mephu Nguifo (2004). http://www.cril.univ-artois.fr/~mephu/fu-mephu_ISI_04.pdf
Information Retrieval: Algorithms and Heuristics. David A. Grossman and Ophir Frieder (2004).
Poe Poe Poe Poe Poe Poe Poe. Daniel Hoffman (1998).
The Collected Tales and Poems of Edgar Allan Poe. The Modern Library (1992).
The Works of Edgar Allan Poe, Volumes 1 through 5. Edgar Allan Poe (2000). Project Gutenberg, EText Nos. 2147-2151. http://www.gutenberg.org/browse/authors/p#a481
Note that 12 slides in a 4 by 3 rectangle fill an area of 33 by 34 inches, which easily fits in a square meter. A few useful slides to bring along (but not to make part of the poster) are included after this slide.
$X' = \{a \in A : xRa \ \ \forall x \in X\}$
$Y' = \{o \in O : oRy \ \ \forall y \in Y\}$
$X \subseteq O$, $Y \subseteq A$, $X' = Y$ and $Y' = X$