AUTO TEXT SUMMARIZATION

15th Nov, 2008.

CONTENTS:
Theory of Text Summarization
Techniques
Demonstration
Conclusion



DEFINITION:

A summary text is a derivative of a source text, condensed by selection and/or generalization of its important content.

NEED:
The growth of the World Wide Web has spurred the need for an efficient summarization tool. It is almost impossible to read most, let alone all, of the newly published papers. When working on a research project, the time spent reviewing the literature seems endless.

The goal of this project is to design a domain-independent, automatic text extraction system to alleviate, if not totally solve, this problem.

HISTORY:
1950s: Luhn's statistical auto-extraction system
1970s: Domain-based systems
1980s: Systems inspired by cognitive science theories
1990s: Use of different methods; domain-independent summarization



WORD SCORING:

Words are scored using the following heuristics:
1. Stop Words ( a, an, of, the )
2. Cue Words ( hence, conclude, summary )
3. Basic Dictionary Words
4. Proper Nouns
5. Keywords ( if any )
6. Word Frequency
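
The heuristics above can be combined into a single score per word. The sketch below is one possible reading, not the project's actual code: the weight values, the tiny word lists, and the name `score_words` are assumptions made for illustration. Proper nouns are approximated as words that never appear in lowercase, and the basic-dictionary check is left out because it needs an external word list.

```python
import re
from collections import Counter

# All word lists and weights below are illustrative assumptions.
STOP_WORDS = {"a", "an", "of", "the"}
CUE_WORDS = {"hence", "conclude", "summary"}
CUE_WEIGHT = 3.0
PROPER_NOUN_WEIGHT = 2.0
KEYWORD_WEIGHT = 4.0


def score_words(text, keywords=()):
    """Return a dict mapping each lowercased word to a heuristic score."""
    tokens = re.findall(r"[A-Za-z]+", text)
    frequency = Counter(token.lower() for token in tokens)
    # Words seen in all-lowercase form are assumed not to be proper nouns.
    lowercase_seen = {token for token in tokens if token.islower()}
    keyword_set = {keyword.lower() for keyword in keywords}

    scores = {}
    for token in tokens:
        word = token.lower()
        if word in STOP_WORDS:
            scores[word] = 0.0  # stop words contribute nothing
            continue
        score = float(frequency[word])  # word-frequency component
        if word in CUE_WORDS:
            score += CUE_WEIGHT
        if token[0].isupper() and word not in lowercase_seen:
            score += PROPER_NOUN_WEIGHT  # crude proper-noun guess
        if word in keyword_set:
            score += KEYWORD_WEIGHT
        scores[word] = score
    return scores
```

With these assumed weights, a capitalized name that occurs twice scores higher than an ordinary word that occurs once, and a user-supplied keyword gets the largest single boost.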

SENTENCE SCORING
The Primary Score of a sentence is the sum of its individual word scores. On its own, this score depends on sentence length, since longer sentences accumulate larger sums. The Final Score corrects for this:
Final Score = Primary Score * ( avg. length / current length )
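
A minimal sketch of this length normalization follows. It assumes a precomputed `word_scores` mapping (hypothetical here), measures length in words, and uses a naive regular-expression sentence splitter; none of these details are taken from the original program.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def score_sentences(text, word_scores):
    """Return (sentence, final_score) pairs.

    primary score = sum of the scores of the words in the sentence
    final score   = primary score * (average length / current length)
    """
    sentences = split_sentences(text)
    tokenized = [re.findall(r"[A-Za-z]+", s.lower()) for s in sentences]
    lengths = [len(words) for words in tokenized]
    non_empty = [n for n in lengths if n > 0]
    if not non_empty:
        return [(s, 0.0) for s in sentences]
    average_length = sum(non_empty) / len(non_empty)

    results = []
    for sentence, words, length in zip(sentences, tokenized, lengths):
        if length == 0:
            results.append((sentence, 0.0))  # nothing to score
            continue
        primary = sum(word_scores.get(word, 0.0) for word in words)
        results.append((sentence, primary * average_length / length))
    return results
```

Two sentences with the same Primary Score end up ranked by length: the shorter one receives the higher Final Score, because the same total is packed into fewer words.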

DATA STRUCTURES USED:


1. Lists
2. Red Black Trees
3. Hash tables

Extensive use of files has been made. The times required by Hash Tables and Red Black Trees have also been compared.
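
A comparison of this kind can be set up with a small timing harness, sketched below for word-frequency counting. Python's standard library has no red-black tree, so a sorted list searched with `bisect` stands in for the ordered structure. This is an assumption made for illustration only: lookup is logarithmic as in a balanced tree, but insertion into a list is linear, so the timings will not match those of a real red-black tree. The function names are hypothetical.

```python
import bisect
import time


def count_with_hash(words):
    """Count word frequencies with a hash table (Python dict)."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_with_sorted_list(words):
    """Count word frequencies with a sorted key list (ordered stand-in)."""
    keys, values = [], []
    for word in words:
        index = bisect.bisect_left(keys, word)  # binary search for the word
        if index < len(keys) and keys[index] == word:
            values[index] += 1
        else:
            keys.insert(index, word)  # keeps the key list sorted
            values.insert(index, 1)
    return dict(zip(keys, values))


def time_ms(function, words):
    """Return the wall-clock time of function(words) in milliseconds."""
    start = time.perf_counter()
    function(words)
    return (time.perf_counter() - start) * 1000.0
```

Calling `time_ms(count_with_hash, words)` and `time_ms(count_with_sorted_list, words)` on the same word list, repeated over several runs, gives figures that can be compared for documents of different sizes.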


OVERVIEW OF TECHNIQUES:

Scoring Algorithm
  Word
    Stop Words
    Cue Words
    Basic Dictionary Words
    Proper Nouns
    Keywords
    Frequency
  Sentence
    Primary Score
    Final Score


SAMPLE INPUT:
Barack Hussein Obama II (pronounced /bəˈrɑːk hʊˈseɪn oʊˈbɑːmə/; born August 4, 1961) is the President-elect of the United States and the junior United States Senator from Illinois. Obama is the first African American to be elected President of the United States. He is a graduate of Columbia College of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. Obama worked as a community organizer and practiced as a civil rights attorney before serving three terms in the Illinois Senate from 1997 to 2004. He taught constitutional law at the University of Chicago Law School from 1992 to 2004. Following an unsuccessful bid for a seat in the United States House of Representatives in 2000, he announced his campaign for the United States Senate in January 2003, won a primary victory in March 2004, and was elected to the Senate in November 2004. Obama delivered the keynote address at the Democratic National Convention in July 2004. As a member of the Democratic minority in the 109th Congress, he helped create legislation to control conventional weapons and to promote greater public accountability in the use of federal funds. He also made official trips to Eastern Europe, the Middle East, and Africa. During the 110th Congress, he helped create legislation regarding lobbying and electoral fraud, climate change, nuclear terrorism, and care for returned United States military personnel. On February 10, 2007, he announced his candidacy for President of the United States, and on June 3, 2008, he was named the presumptive nominee of the Democratic Party after a 17-month-long primary campaign. He became the President-elect after defeating Republican presidential candidate John McCain in the general election on November 4, 2008, and is due to be sworn in as President of the United States on January 20, 2009. On November 13, Obama announced his resignation from the United States Senate effective November 16, 2008.[1]

WHAT OUR CODE SELECTED !!


Barack Hussein Obama II (pronounced /bəˈrɑːk hʊˈseɪn oʊˈbɑːmə/; born August 4, 1961) is the President-elect of the United States and the junior United States Senator from Illinois. Obama is the first African American to be elected President of the United States. He is a graduate of Columbia College of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. Obama worked as a community organizer and practiced as a civil rights attorney before serving three terms in the Illinois Senate from 1997 to 2004. He taught constitutional law at the University of Chicago Law School from 1992 to 2004. Following an unsuccessful bid for a seat in the United States House of Representatives in 2000, he announced his campaign for the United States Senate in January 2003, won a primary victory in March 2004, and was elected to the Senate in November 2004. Obama delivered the keynote address at the Democratic National Convention in July 2004. As a member of the Democratic minority in the 109th Congress, he helped create legislation to control conventional weapons and to promote greater public accountability in the use of federal funds. He also made official trips to Eastern Europe, the Middle East, and Africa. During the 110th Congress, he helped create legislation regarding lobbying and electoral fraud, climate change, nuclear terrorism, and care for returned United States military personnel. On February 10, 2007, he announced his candidacy for President of the United States, and on June 3, 2008, he was named the presumptive nominee of the Democratic Party after a 17-month-long primary campaign. He became the President-elect after defeating Republican presidential candidate John McCain in the general election on November 4, 2008, and is due to be sworn in as President of the United States on January 20, 2009. On November 13, Obama announced his resignation from the United States Senate effective November 16, 2008.

SCREEN-CASTING !!
Because nothing demonstrates the code better than the code itself !!


HASH TABLE VS. RED BLACK TREE


-FOR SMALL FILES

( <1000 WORDS )

[Chart: Hashing vs. RBTree over Run 1 to Run 6; vertical axis from 0 to 1400]

HASH TABLE VS. RED BLACK TREE


-FOR LARGE FILES

( >10000 WORDS )

[Chart: Hashing vs. RBTree over Run 1 to Run 6; vertical axis from 0 to 25000]

HASH TABLE VS. RED BLACK TREE


-FOR DIFFERENT WORD SIZES

[Chart: Time (in ms) against Words in Document (~500, ~1000, ~3000, ~6000, ~10000, ~13000) for Hashing and RBTree; vertical axis from 0 to 30000]



WHAT WAS MOST DIFFICULT:


Removing punctuation, possessive 's, etc. correctly posed a lot of trouble.
Arriving at the best weights for the different words took weeks of trial and error.
To get the best hash function, we tried several possibilities, none of which, unfortunately, was efficient.
Without any standard reference, we had to do everything from scratch, wholly on our own.

LIMITATIONS
Without the use of NLP, the generated summary suffers from a lack of cohesion and semantics.
It is difficult to relate pronouns to their corresponding nouns in the summary.
Also, for texts containing multiple topics, the generated summary might not be balanced.
Though there are some kinks, the program works fairly well, and is comparable to the summarization tool in MS Office Word 2003.

GIVEN MORE TIME


We could have implemented a GUI, which would have made the tool much easier to use.
More scoring algorithms based on the placement of sentences could have been implemented.
Different summaries, based on user requirements, could have been generated. For example, in a newspaper article, the opening and closing paragraphs are more important.
Reading from .doc and .html files could have been implemented.
An HTML interface (online tool) could have been made.

FUTURE DEVELOPMENTS

The possibilities are endless. With Natural Language Processing:
a. Newspaper headlines can be generated.
b. Forms can be filled in.
c. Bio-data can be generated.


Thank you
