CONTENTS:
1. Theory of Text Summarization
2. Techniques
3. Demonstration
4. Conclusion
DEFINITION:
A summary text is a derivative of a source text condensed by selection and/or generalization on important content.
NEED:
The growth of the World Wide Web has spurred the need for an efficient summarization tool. It is almost impossible to read most, let alone all, of the newly published papers. When working on a research project, the time spent reviewing the literature seems endless.
The goal of this project is to design a domain-independent, automatic text extraction system to alleviate, if not totally solve, this problem.
HISTORY:
1950s: Luhn's statistical auto-extraction system
1970s: Domain-based systems
1980s: Systems inspired by cognitive science theories
1990s: Use of different methods; domain-independent summarization
WORD SCORING:
Words are scored using the following heuristics:
1. Stop Words ( a, an, of, the )
2. Cue Words ( hence, conclude, summary )
3. Basic Dictionary Words
4. Proper Nouns
5. Keywords ( if any )
6. Word Frequency
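As a rough illustration of how these six heuristics could be combined, here is a minimal Python sketch. The slides name the heuristics but give no word lists or weights, so every list, constant, and function name below (`BASIC_DICTIONARY`, `CUE_BONUS`, `score_word`, and so on) is an assumption made for the example, as is the choice to treat capitalized words that do not start a sentence as proper nouns.

```python
import re
from collections import Counter

# Hypothetical word lists: only the stop-word and cue-word examples
# come from the slides; everything else is assumed for illustration.
STOP_WORDS = {"a", "an", "of", "the"}
CUE_WORDS = {"hence", "conclude", "summary"}
BASIC_DICTIONARY = {"is", "was", "he", "and", "in", "to"}

# Hypothetical weights: the slides do not state actual values.
CUE_BONUS = 2.0
PROPER_NOUN_BONUS = 1.5
KEYWORD_BONUS = 3.0
BASIC_WORD_FACTOR = 0.5


def tokenize(text):
    """Split text into alphabetic tokens, keeping the original case."""
    return re.findall(r"[A-Za-z]+", text)


def word_frequencies(text):
    """Count case-insensitive occurrences of every word in the document."""
    return Counter(w.lower() for w in tokenize(text))


def score_word(word, freq, keywords=frozenset(), sentence_initial=False):
    """Combine the six heuristics into one score for a single word."""
    lower = word.lower()
    if lower in STOP_WORDS:
        return 0.0                      # stop words carry no weight
    score = float(freq[lower])          # word frequency is the base score
    if lower in BASIC_DICTIONARY:
        score *= BASIC_WORD_FACTOR      # assumed: common words count less
    if lower in CUE_WORDS:
        score += CUE_BONUS
    if lower in keywords:
        score += KEYWORD_BONUS
    if word[0].isupper() and not sentence_initial:
        score += PROPER_NOUN_BONUS      # assumed proper-noun test
    return score


doc = "The summary of Obama. Obama won."
freq = word_frequencies(doc)
print(score_word("the", freq))      # stop word
print(score_word("Obama", freq))    # frequent proper noun
print(score_word("summary", freq))  # cue word
```

The additive bonuses keep each heuristic independent, so any one of them can be retuned or dropped without touching the others.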
SENTENCE SCORING
Primary Score is calculated by taking a sum of all individual word scores. This suffers from a dependence on sentence lengths, so it is normalized:
Final Score = Primary Score * ( avg. length / current length )
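The length normalization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: word scores are stood in for by plain document frequency with stop words scoring zero, sentence length is measured in words, and the sentence splitter and the names `split_sentences` and `final_scores` are hypothetical helpers, not part of the original program.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "of", "the"}  # examples taken from the slides


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lower-cased alphabetic tokens of one sentence."""
    return re.findall(r"[A-Za-z]+", sentence.lower())


def final_scores(text):
    """Return (sentence, primary score, final score) for each sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    # Stand-in word score: document frequency, zero for stop words.
    freq = Counter(w for s in sentences for w in words(s)
                   if w not in STOP_WORDS)
    lengths = [len(words(s)) for s in sentences]
    avg_len = sum(lengths) / len(lengths)
    results = []
    for sentence, length in zip(sentences, lengths):
        primary = sum(freq[w] for w in words(sentence))
        # Final Score = Primary Score * (avg. length / current length)
        final = primary * avg_len / length if length else 0.0
        results.append((sentence, primary, final))
    return results


for sentence, primary, final in final_scores(
        "Cats purr. Cats and dogs play in the park."):
    print(f"{final:6.2f}  (primary {primary})  {sentence}")
```

In this toy input the longer sentence has the higher primary score simply because it contains more words; after normalization the short sentence ranks first, which is the length dependence the final score is meant to remove.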
OVERVIEW OF TECHNIQUES:
Scoring Algorithm
  Word: Stop Words, Cue Words, Basic Dictionary Words, Proper Nouns, Keywords, Frequency
  Sentence: Primary Score, Final Score
SAMPLE INPUT:
Barack Hussein Obama II (pronounced /bəˈrɑːk hʊˈseɪn oʊˈbɑːmə/; born August 4, 1961) is the President-elect of the United States and the junior United States Senator from Illinois. Obama is the first African American to be elected President of the United States. He is a graduate of Columbia College of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. Obama worked as a community organizer and practiced as a civil rights attorney before serving three terms in the Illinois Senate from 1997 to 2004. He taught constitutional law at the University of Chicago Law School from 1992 to 2004. Following an unsuccessful bid for a seat in the United States House of Representatives in 2000, he announced his campaign for the United States Senate in January 2003, won a primary victory in March 2004, and was elected to the Senate in November 2004. Obama delivered the keynote address at the Democratic National Convention in July 2004. As a member of the Democratic minority in the 109th Congress, he helped create legislation to control conventional weapons and to promote greater public accountability in the use of federal funds. He also made official trips to Eastern Europe, the Middle East, and Africa. During the 110th Congress, he helped create legislation regarding lobbying and electoral fraud, climate change, nuclear terrorism, and care for returned United States military personnel. On February 10, 2007, he announced his candidacy for President of the United States, and on June 3, 2008, he was named the presumptive nominee of the Democratic Party after a 17-month-long primary campaign. He became the President-elect after defeating Republican presidential candidate John McCain in the general election on November 4, 2008, and is due to be sworn in as President of the United States on January 20, 2009. On November 13, Obama announced his resignation from the United States Senate effective November 16, 2008.[1]
SCREEN-CASTING !!
Because nothing demonstrates the code better than the code itself!
[Chart: Hashing vs. RBTree over Runs 1 to 6, documents of <1000 words]
[Chart: Hashing vs. RBTree over Runs 1 to 6, documents of >10000 words]
[Chart: Time (in ms) vs. Words in Document (~500 to ~13000), Hashing vs. RBTree]
LIMITATIONS
Without the use of NLP, the generated summary suffers from a lack of cohesion and semantics. It is difficult to relate pronouns to their corresponding nouns in the summary. Also, for texts containing multiple topics, the generated summary might not be balanced. Though there are some kinks, the program works fairly well and is comparable to the summarization tool in Microsoft Office Word 2003.
FUTURE DEVELOPMENTS
The possibilities are endless. With Natural Language Processing:
a. Newspaper headlines can be generated.
b. Forms can be filled in.
c. Bio-data can be generated.
Thank you