
Clustering the Short Stories of Edgar Allan Poe Using Word Groups and Formal Concept Analysis

Tuesday, June 23rd, 2009
Digital Humanities 2009
University of Maryland, College Park

Roger Bilisoly, Ph.D.
Department of Mathematical Sciences
Central Connecticut State University
New Britain, Connecticut

Why analyze Edgar Allan Poe?


Poe wrote in many styles and used many themes: see Hoffman (1998).
Some styles: Horror (The Black Cat), Detective (The Murders in the Rue Morgue), Satire (A Predicament), Science Fiction (The Unparalleled Adventure of One Hans Pfaall), etc.
Poe's styles are distinctive, so they should be easy for a computer to cluster.
There are only ~70 short stories. These take up only ~750 pages in Poe (1992).
Poe is available on the Web for free.
He lived 1809-1849, so his original publications are out of copyright.
Project Gutenberg has Poe (and many other authors) at http://www.gutenberg.org/wiki/Main_Page; see Poe (2000).

First Goal: Pick a text metric.

Text similarity is used in information retrieval (IR) to help match queries and texts. This can be used to measure distances between two texts.
There are many ways to perform IR.
Vector Space Models: the term-document matrix allows a geometric approach in high-dimensional spaces (e.g., $\mathbb{R}^{20,591}$ for Poe's short stories).
Probabilistic Models: find $\max_i P(\text{Text}_i \mid \text{Query})$.
Language Models: find $\max_i P(\text{Query} \mid \text{Language Model}_i)$.
See Grossman and Frieder (2004) for these and additional approaches.
IR is well tested. Search engines such as Google.com are profitable and competitive.
However, the IR approaches have drawbacks:
1. The text distances above lack intuitive appeal: e.g., angles in 20,591-dimensional space are hard to comprehend directly. Hence, linking text distances to a human's experience of reading can be difficult.
2. There are problems with both sparseness and complexity of language.
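The vector space approach mentioned above can be illustrated with a minimal Python sketch. It is not the author's code: the tokenizer, the function names `term_vector` and `cosine_similarity`, and the two sample sentences are all assumptions made for illustration. It computes the angle between two term-count vectors, which is the kind of distance that is easy to calculate but hard to relate to reading experience.

```python
import math
from collections import Counter

def term_vector(text):
    """Count lowercase alphabetic tokens (a crude tokenizer, for illustration only)."""
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    return Counter(words)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two made-up snippets (hypothetical data, not Poe's text).
doc1 = term_vector("the cat sat by the fire")
doc2 = term_vector("the dog sat by the door")

similarity = cosine_similarity(doc1, doc2)
# Clamp before acos to guard against floating-point values slightly above 1.
angle_degrees = math.degrees(math.acos(min(1.0, similarity)))
print(similarity, angle_degrees)
```

With a real corpus each vector would have one coordinate per word type, which is where the 20,591 dimensions come from.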

Second Goal: Find clusters that are understandable to humans. This researcher uses formal concept theory.
Formal concepts have the form {{objects}, {attributes}}. See Carpineto and Romano (2004) for a detailed exposition. In the example below, rows represent authors (objects) and columns represent attributes. Idea: look for maximal submatrices that contain only 1s. This can be done efficiently by Ganter's algorithm. For programming details see Fu and Nguifo (2004).
          Poet  Short Stories  Novelist  USA  UK  Male  Female
Poe        1         1            0       1    0    1      0
Stowe      0         0            1       1    0    0      1
Dickens    0         0            1       0    1    1      0
Eliot      0         0            1       0    1    0      1
Whitman    1         0            0       1    0    1      0

{{Poe, Whitman}, {Poet, USA, Male}} is one concept (highlighted in red on the original slide).
Other examples:
{{Poe}, {Poet, Short Stories, USA, Male}}
{{Stowe, Eliot}, {Novelist, Female}}
{{Stowe, Dickens, Eliot}, {Novelist}}
{{Poe, Dickens, Whitman}, {Male}}
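The example concepts above can be checked with a short Python sketch. This is a brute-force illustration under stated assumptions, not Ganter's algorithm and not the author's code: the helper names (`common_attributes`, `objects_having`, `all_concepts`) are hypothetical, and the incidence data is copied from the author table.

```python
from itertools import combinations

# Incidence data from the author example: each author maps to its attributes.
context = {
    "Poe": {"Poet", "Short Stories", "USA", "Male"},
    "Stowe": {"Novelist", "USA", "Female"},
    "Dickens": {"Novelist", "UK", "Male"},
    "Eliot": {"Novelist", "UK", "Female"},
    "Whitman": {"Poet", "USA", "Male"},
}
all_attributes = set().union(*context.values())

def common_attributes(objects):
    """Attributes shared by every object in the set (all attributes if it is empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared

def objects_having(attributes):
    """Objects that possess every attribute in the set."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}

def all_concepts():
    """Brute force: close every subset of objects and keep the distinct results."""
    found = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found

concepts = all_concepts()
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Trying every subset of objects is exponential in the number of objects, so it is only practical for toy examples like this one.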

Example Continued: All the concepts together form a Galois lattice.


These concepts are only true for the incidence matrix on the preceding slide. Adding authors or attributes usually changes the formal concepts.
[Figure: Galois lattice diagram for the author example. Node labels: American; Male; Female; Novelist; English; Whitman, Poet; Poe, Short Story; Stowe; Dickens; Eliot.]

We see that Dickens is Male and English (going upwards) and all English are Novelists (going upwards again). We see that Males are Whitman and Dickens (going downwards), as well as Poe (going downwards again).

Complication: Word rates can be a function of text length. Example: Compare the word diversity (an inverse rate) of The Black Cat to The Unparalleled Adventure of One Hans Pfaall.
In the plot, the top line represents The Black Cat and the bottom Hans Pfaall. The former is clearly higher over its range, even though its final value is lower (3.17 < 5.61). An approximate solution is to consider sets of stories close in size. Three groups are used here: 2001 to 3000 words; 3001 to 4200; and 4201 to 6000. (This includes 44 of Poe's short stories.)

Word diversity = (# tokens)/(# types)

The final value for The Black Cat is 3.17 tokens per type. For Hans Pfaall it is 5.61.

See Section 4.6 and Figure 4.6 of Bilisoly (2008) and chapter 1 of Baayen (2001)
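The word diversity measure defined above can be tracked as a running curve, as in the following minimal Python sketch. The tokenizer and the names `tokenize` and `diversity_curve` are assumptions for illustration, and the sample sentence is made up. The book cited above works in Perl, so this is not the author's code.

```python
def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or apostrophe."""
    cleaned = "".join(c.lower() if c.isalpha() or c == "'" else " " for c in text)
    return cleaned.split()

def diversity_curve(tokens):
    """Running (# tokens)/(# types) after each token is read."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((count, count / len(seen)))
    return curve

# Made-up sample sentence (hypothetical data): 11 tokens, 5 types.
sample = tokenize("The cat saw the dog and the dog saw the cat.")
curve = diversity_curve(sample)
final_length, final_diversity = curve[-1]
print(final_length, final_diversity)
```

Because the ratio keeps growing as a text gets longer, two stories are best compared at the same token count, for instance by reading both curves at the length of the shorter story.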

A Galois lattice for Poe stories (as objects) and word groups (as attributes).
Words are grouped by five themes, each of which is evocative for a human.
Death: death, corpse, dead, murder, died, die, deceased, ...
Body: eyes, head, hand, body, feet, heart, face, eye, ...
Spiritual: soul, god, spirit, heaven, moral, angel, devil, ...
Horror: horror, terror, fear, horrible, anxiety, fearful, ...
Family: family, wife, mother, daughter, father, uncle, ...

Word groups were formed using a thematic thesaurus.

These are available online: e.g., WordNet 2.1.
Dimensionality using groups is much lower, and frequencies are much higher, so sparseness is no longer a problem.
Word groups are easier to link to literary ideas, and words have been analyzed by human critics, e.g., see Clough (1930).

For the incidence matrix, let 1 = story in the top 25% for that word group, 0 otherwise (other percentiles have been tested).
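The thresholding rule above can be sketched in Python as follows. Several details are assumptions, not taken from the slides: the rate is taken to be occurrences per 1000 tokens, "top 25%" is implemented as the top quarter of stories ranked by rate (rounded down, at least one), and the story labels and rates are hypothetical. Only the Death words shown on the slide are used.

```python
# Only the Death words shown on the slide; the full group is longer.
DEATH_WORDS = {"death", "corpse", "dead", "murder", "died", "die", "deceased"}

def rate_per_thousand(tokens, group_words):
    """Occurrences of any word in the group per 1000 tokens."""
    hits = sum(1 for token in tokens if token in group_words)
    return 1000.0 * hits / len(tokens)

def incidence_column(rates, top_fraction=0.25):
    """Map each story to 1 if its rate ranks in the top fraction, else 0.

    Assumption: the cutoff is by rank, with the count rounded down (minimum 1).
    """
    ranked = sorted(rates, key=rates.get, reverse=True)
    n_top = max(1, int(top_fraction * len(ranked)))
    top = set(ranked[:n_top])
    return {story: int(story in top) for story in rates}

# Made-up token list: 2 Death words in 8 tokens.
sample_tokens = "the dead man died in the old house".split()
sample_rate = rate_per_thousand(sample_tokens, DEATH_WORDS)

# Hypothetical rates for eight stories labeled A through H.
hypothetical_rates = {"A": 8.9, "B": 5.1, "C": 2.0, "D": 1.5,
                      "E": 0.7, "F": 4.4, "G": 3.3, "H": 0.0}
death_column = incidence_column(hypothetical_rates)
print(sample_rate, death_column)
```

Repeating `incidence_column` for each word group gives one column of the incidence matrix per group, which is the input the lattice construction needs.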

Galois Lattice for Poe

Each node is listed as its word groups, followed by the stories it contains.

Death, Body, Spiritual, Horror, Family: Morella

Death, Body, Spiritual, Family: Eleonora, Morella

Body, Spiritual, Family: Eleonora, Morella, Frenchman's Sling
Death, Spiritual, Family: 4 Beasts, Eleonora, Morella
Body, Horror, Family: Amontillado, Morella
Death, Spiritual, Horror: Imp of the Perverse, Morella
Death, Body, Horror: Red Death, Tell-Tale Heart, Morella

Spiritual, Family: 4 Beasts, Eleonora, Morella, Frenchman's Sling
Spiritual, Horror: Imp of the Perverse, Morella, Eiros and Charmion
Body, Family: Amontillado, Eleonora, Morella, Frenchman's Sling
Body, Horror: Red Death, Amontillado, Tell-Tale Heart, Morella
Death, Horror: Red Death, Imp of the Perverse, Tell-Tale Heart, Morella
Death, Spiritual: 4 Beasts, Imp of the Perverse, Eleonora, Morella
Death, Body: Red Death, Tell-Tale Heart, Eleonora, Morella

Family: 4 Beasts, Amontillado, Eleonora, Morella, 3 Sundays, Frenchman's Sling
Horror: Red Death, Amontillado, Imp of the Perverse, Tell-Tale Heart, Morella, Eiros and Charmion
Spiritual: 4 Beasts, Imp of the Perverse, Eleonora, Morella, Eiros and Charmion, Frenchman's Sling
Body: Red Death, Amontillado, Tell-Tale Heart, Eleonora, Morella, Frenchman's Sling
Death: 4 Beasts, Red Death, Imp of the Perverse, Tell-Tale Heart, Eleonora, Morella

Morella appears in all 5 word groups (making it quintessential Poe). Tell-Tale Heart and the Red Death appear in 3.
The Tell-Tale Heart is about a man who kills his older roommate and hides the body under the floor. When the police visit, he cracks and shows them the body.
This story ranks 3rd (of 13) in Death (5.12 per K), 1st in Body (18.18 per K), and 1st in Horror (9.32 per K). This story is, in fact, considered one of Poe's iconic tales. It forms a concept with Morella and The Masque of the Red Death. These three stories have the same genre: horror.

Morella is the narrator's wife, who is obsessed with mysticism and dies during childbirth. The daughter grows up to resemble Morella more and more, and upon baptism she cries "I am here," and then dies herself.
Poe has several stories about wives who die: Berenice, Ligeia, The Oval Portrait, The Oblong Box and Eleonora (who dies before she and the narrator can marry). So this plot is one Poe has explored several times. Morella ranks 1st in Death (8.94 per K), 3rd in Body (13.18 per K), 1st in Spiritual (8.47 per K), 2nd in Horror (5.65 per K), and 2nd in Family (5.18 per K).

This Morella paragraph has all five word groups: spiritual, body, family, death and horror.
And as years rolled away, and I gazed day after day upon her holy, and mild, and eloquent face, and poured over her maturing form, day after day did I discover new points of resemblance in the child to her mother, the melancholy and the dead. And hourly grew darker these shadows of similitude, and more full, and more definite, and more perplexing, and more hideously terrible in their aspect. For that her smile was like her mother's I could bear; but then I shuddered at its too perfect identity, that her eyes were like Morella's I could endure; but then they, too, often looked down into the depths of my soul with Morella's own intense and bewildering meaning. And in the contour of the high forehead, and in the ringlets of the silken hair, and in the wan fingers which buried themselves therein, and in the sad musical tones of her speech, and above all -- oh, above all, in the phrases and expressions of the dead on the lips of the loved and the living, I found food for consuming thought and horror, for a worm that would not die.

References
Baayen, R. Harald (2001). Word Frequency Distributions.
Bilisoly, Roger (2008). Practical Text Mining with Perl.
Clough, Wilson (1930). The Use of Color Words by Edgar Allan Poe. PMLA, 45(2).
Carpineto, Claudio and Romano, Giovanni (2004). Concept Data Analysis: Theory and Applications.
Fu, Huaiguo and Mephu Nguifo, Engelbert (2004). A Lattice Algorithm for Data Mining. http://www.cril.univ-artois.fr/~mephu/fu-mephu_ISI_04.pdf
Grossman, David A. and Frieder, Ophir (2004). Information Retrieval: Algorithms and Heuristics.
Hoffman, Daniel (1998). Poe Poe Poe Poe Poe Poe Poe.
Poe, Edgar Allan (1992). The Collected Tales and Poems of Edgar Allan Poe. The Modern Library.
Poe, Edgar Allan (2000). The Works of Edgar Allan Poe, Volumes 1 through 5. Project Gutenberg, EText Nos. 2147-2151. http://www.gutenberg.org/browse/authors/p#a481

Appendix: Core Mathematica Code for Ganter's Algorithm


(* primeA: v is a 0/1 vector over objects, r is the incidence matrix.
   An attribute is kept when every selected object has it. *)
primeA[v_, r_] := Module[{maxP, product},
  (* Output is an Attribute *)
  product = v.r;
  maxP = Fold[Plus, 0, v];
  Return[Map[If[# == maxP, 1, 0] &, product]]]

(* nextA: next closed attribute vector after v.
   primeO, the matching operator from attribute vectors to object vectors,
   is not shown in this excerpt. *)
nextA[v_, r_] := Module[{i, new, first},
  (* Output is an Attribute *)
  Do[
    first = v*Table[If[i < i0, 1, 0], {i, 1, Length[v]}];
    first[[i0]] = 1;
    new = primeA[primeO[first, r], r];
    If[new == v, Continue[], Null];
    If[compare[v, new] > -1, Continue[], Null];
    If[Min[first[[1 ;; i0 - 1]] - new[[1 ;; i0 - 1]]] >= 0, Break[], Null],
    {i0, Length[v], 1, -1}];
  Return[new]]

compare[v1_, v2_] := Module[{ans, idiff = 0},
  (* Consider v1 and v2 as binary numbers. Then v1 > v2 returns 1,
     equality returns 0, and v1 < v2 returns -1 *)
  Do[If[v1[[i]] == v2[[i]], Null, idiff = i; Break[]], {i, 1, Length[v1]}];
  Return[If[idiff > 0, Sign[v1[[idiff]] - v2[[idiff]]], 0]]]
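For readers without Mathematica, the same idea can be sketched in Python. This is a textbook-style version of Ganter's NextClosure procedure applied to the author example, written under stated assumptions. It is not a line-by-line translation of the functions above, and the names `closure`, `next_closure`, and `all_intents` are hypothetical.

```python
# Author example context: rows are objects, columns are attributes.
objects = ["Poe", "Stowe", "Dickens", "Eliot", "Whitman"]
attributes = ["Poet", "Short Stories", "Novelist", "USA", "UK", "Male", "Female"]
rows = [
    [1, 1, 0, 1, 0, 1, 0],  # Poe
    [0, 0, 1, 1, 0, 0, 1],  # Stowe
    [0, 0, 1, 0, 1, 1, 0],  # Dickens
    [0, 0, 1, 0, 1, 0, 1],  # Eliot
    [1, 0, 0, 1, 0, 1, 0],  # Whitman
]

def closure(attr_idx):
    """Attribute indices shared by all objects that have every index in attr_idx."""
    extent = [row for row in rows if all(row[j] for j in attr_idx)]
    return frozenset(j for j in range(len(attributes))
                     if all(row[j] for row in extent))

def next_closure(current):
    """Next closed attribute set after `current` in lectic order, or None at the end."""
    for i in reversed(range(len(attributes))):
        if i in current:
            continue
        candidate = closure({j for j in current if j < i} | {i})
        # Accept only if the closure added no attribute with index below i.
        if all(j in current for j in candidate if j < i):
            return candidate
    return None

def all_intents():
    """List every concept intent, starting from the closure of the empty set."""
    intents = []
    current = closure(set())
    while current is not None:
        intents.append(current)
        current = next_closure(current)
    return intents

intents = all_intents()
for intent in intents:
    print(sorted(attributes[j] for j in intent))
```

Each intent is generated exactly once and only closures are computed, which is why this approach scales better than testing every subset of objects or attributes.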

Note that 12 slides in a 4 by 3 rectangle fill an area of 33 by 34 inches, which easily fits in a square meter. A few useful slides to bring along (but not to make part of the poster) are included after this slide.

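Ganter's Algorithm: The prime (') operator

Let X be a subset of O, then define

$X' = \{a \in A : x R a \ \ \forall x \in X\}$

Let Y be a subset of A, then define

$Y' = \{o \in O : o R y \ \ \forall y \in Y\}$

Concepts of the context (O, A, R) are pairs of sets (X, Y) where

$X \subseteq O, \quad Y \subseteq A, \quad X' = Y \quad \text{and} \quad Y' = X$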

Example: Concept = {{Stowe, Dickens, Eliot}, {Novelist}}


Definition from Concept Data Analysis by Carpineto and Romano

Inclusiveness of Concept Lattices

This is for a Galois lattice that includes all 70 short stories

Poe's Short Stories

1. The Unparalleled Adventure of One Hans Pfaall
2. The Gold Bug
3. Four Beasts in One
4. The Murders in the Rue Morgue
5. The Mystery of Marie Rogêt
6. The Balloon-Hoax
7. MS. Found in a Bottle
8. The Oval Portrait
9. The Purloined Letter
10. The Thousand-and-Second Tale of Scheherazade
11. A Descent into the Maelström
12. Von Kempelen and his Discovery
13. Mesmeric Revelation
14. The Facts in the Case of M. Valdemar
15. The Black Cat
16. The Fall of the House of Usher
17. Silence -- a Fable
18. The Masque of the Red Death
19. The Cask of Amontillado
20. The Imp of the Perverse
21. The Island of the Fay
22. The Assignation
23. The Pit and the Pendulum
24. The Premature Burial
25. The Domain of Arnheim
26. Landor's Cottage
27. William Wilson
28. The Tell-Tale Heart
29. Berenice
30. Eleonora
31. Ligeia
32. Morella
33. A Tale of the Ragged Mountains
34. The Spectacles
35. King Pest
36. Three Sundays in a Week
37. The Devil in the Belfry
38. Lionizing
39. X-ing a Paragrab
40. Metzengerstein
41. The System of Doctor Tarr and Professor Fether
42. How to Write a Blackwood Article
43. A Predicament
44. Mystification
45. Diddling
46. The Angel of the Odd
47. Mellonta Tauta
48. The Duc de L'Omlette
49. The Oblong Box
50. Loss of Breath
51. The Man That Was Used Up
52. The Business Man
53. The Landscape Garden
54. Maelzel's Chess-Player
55. The Power of Words
56. The Colloquy of Monos and Una
57. The Conversation of Eiros and Charmion
58. Shadow -- A Parable
59. Philosophy of Furniture
60. A Tale of Jerusalem
61. The Sphinx
62. Hop-Frog
63. The Man of the Crowd
64. Never Bet the Devil Your Head
65. Thou Art the Man
66. Why the Little Frenchman Wears his Hand in a Sling
67. Bon-Bon
68. Some Words with a Mummy
69. Literary Life of Thingum Bob, Esq.
70. Morning on the Wissahiccon
