
A View On Natural Language Processing And Text Summarization

ABSTRACT

Text summarization plays an important role in the areas of natural language processing and text mining. Text summarization aims to create a compressed summary while retaining the main characteristics of the original set of documents. Many approaches use statistics and machine learning techniques to extract sentences from documents. This paper presents a survey of recent natural language processing and text summarization techniques. We explore evaluation strategies and metrics, features, approaches and problems in text summarization.

1. INTRODUCTION

The Internet has come to be of much use primarily because of the support given by Information Retrieval (IR) tools. However, with the exponential growth of information on the Internet, a second level of abstraction over the results of the first round of IR becomes necessary. That is, the large number of documents returned by an IR system need to be summarized. Currently this is the primary application of summarization. The many other uses of summarization are almost obvious: information extraction (as against document retrieval), automatic generation of comparison charts, just-in-time knowledge acquisition, finding answers to specific questions, information retrieval in multiple languages, and biographical profiling, to name a few. In this paper we present an introduction to various natural language processing techniques and to automatic text summarization.

2. NATURAL LANGUAGE PROCESSING

Natural Language Processing (NLP) is an area of research and application that analyzes how computers can be used for understanding and manipulating natural language text or speech to achieve desired tasks. The goal of NLP researchers is to create appropriate tools and techniques that make computer systems understand and manipulate natural languages, by gathering knowledge on how people understand and use language [1]. There are further goals for NLP, most of them related to the particular application for which it is being employed. The aim of an NLP system is to represent the exact meaning and purpose of the user's inquiry, which can be expressed in natural language as if the user were speaking to a reference librarian. Moreover, the contents of the documents being searched should be represented at all their levels of meaning, so that a true match between need and reply can be found regardless of how they are represented in their surface form [2]. Researchers mainly focus on techniques that have been developed in Information Retrieval, while most try to leverage both IR approaches and some features of NLP [13][14][15]. In recent years there has been an explosion of on-line unstructured information in multiple languages, so natural language processing technology such as automatic document summarization has become increasingly important for information retrieval applications.

3. TEXT SUMMARIZATION

Generally, text summarization [11, 12, 17-20] is the process of reducing a given text to a shorter version while keeping its main content intact, thus conveying the intended meaning [3]. The summarization task can be classified as producing either a generalized summary or a query-specific summary. A query-specific summary presents the information that is salient to the given queries, whereas a generalized summary provides an overall sense of the document's content. A summary may be a collection of sentences carefully picked from the document, or may be formed by synthesizing new sentences representing the information in the documents. Summaries may be classified by any of the following criteria [4]:

Detail: indicative/informative
Granularity: specific events/overview
Technique: extraction/abstraction
Content: generalized/query-based
Approach: domain/genre specific or independent

3.1. Evaluation Strategies and Metrics

To understand the strengths and weaknesses of various approaches to summarization, methods and metrics for the evaluation of summaries are used. Human judgment of the quality of a summary varies from person to person. For example, in a study conducted by Goldstein et al. [5], when a few people were asked to pick the most relevant sentences in a given document, there was very little overlap between the sentences picked by different persons. Human judgments also rarely concur on the quality of a given summary. Hence it is difficult to quantify the quality of a summary. However, a few indirect measures may be adopted that indicate the usefulness and completeness of a summary [4, 6-8], such as:

1. Can a user answer all the questions by reading the summary, as he would by reading the entire document from which the summary was produced?
2. What is the compression ratio between the given document and its summary?
3. If it is a summary of multiple documents with a temporal dimension, does it capture the correct temporal information?
4. Redundancy: is any information repeated in the summary?

Qualities of a summary that are usually difficult to measure are:

5. Intelligibility
6. Cohesiveness
7. Coherence
8. Readability (depends on cohesion, coherence and intelligibility)

A metric is said to be intrinsic or extrinsic depending on whether it determines quality based on the summary alone, or based on the usefulness of the summary in completing another task [9]. For example, item 1 above is an extrinsic metric. An example of an intrinsic measure is the cosine similarity of the summary to the document from which it is generated. This particular measure is not very useful, since it does not take into account the coverage of information or redundancy; with such a measure, a trivial way to improve the score would be to take the entire document as its summary. A metric commonly employed for extractive summaries is the one proposed by Edmundson [10]. Human judges hand-pick sentences from the documents to create manual extractive summaries. Automatically generated summaries are then evaluated by counting the sentences common to the automatic and manually generated summaries. In Information Retrieval terms, these measures are called precision and recall. This is currently the most used method for evaluating extractive summaries [9, 21-23]. For an experimental study of various evaluation metrics see [23].
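To make these measures concrete, here is a minimal sketch (not drawn from any system surveyed here) of Edmundson-style precision and recall for an extractive summary against a human extract, together with the bag-of-words cosine similarity discussed above as an intrinsic measure; the sentences and tokenization are deliberately toy-sized.

```python
from collections import Counter
from math import sqrt

def precision_recall(auto_sents, manual_sents):
    """Edmundson-style evaluation: overlap between automatically
    extracted sentences and human-picked ('manual') sentences."""
    common = set(auto_sents) & set(manual_sents)
    precision = len(common) / len(auto_sents) if auto_sents else 0.0
    recall = len(common) / len(manual_sents) if manual_sents else 0.0
    return precision, recall

def cosine_similarity(text_a, text_b):
    """Intrinsic measure: cosine similarity of bag-of-words vectors.
    Note the weakness discussed above: taking the whole document
    as the 'summary' trivially scores 1.0."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

document = "The cat sat. The dog ran. Rain fell all day. The cat slept."
manual = ["The cat sat.", "The cat slept."]    # human extract
auto = ["The cat sat.", "Rain fell all day."]  # system extract

p, r = precision_recall(auto, manual)
print(f"precision={p:.2f} recall={r:.2f}")     # precision=0.50 recall=0.50
print(f"cosine={cosine_similarity(' '.join(auto), document):.2f}")
```

The compression ratio mentioned above is simply the ratio of the summary length to the document length under the same tokenization.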
3.2. Features of Text Summarization

Sentence extraction methods for summarization normally work by scoring each sentence as a candidate for the summary, and then selecting the highest-scoring subset of sentences. Some features that often increase the candidacy of a sentence for inclusion in the summary are listed below ([9, 10, 31] and references therein); a minimal scoring sketch follows at the end of this subsection.

Content words or keyword occurrence: Sentences with keywords that are most often used in the document usually represent the theme of the document. Content words or keywords are usually nouns, determined using the term frequency-inverse document frequency (tf-idf) measure.

Title keyword: Sentences containing words that appear in the title are also indicative of the theme of the document.

Sentence location heuristic: In newswire articles, the first sentence is often the most important sentence; in technical articles, the last couple of sentences of the abstract, or those from the conclusions, are informative of the findings in the document [24].

Indicative phrases or cue phrases: Sentences containing a cue phrase (e.g. "in conclusion", "this letter", "this report", "summary", "argue", "purpose", "develop", "attempt", etc.) are most likely to be in summaries.

Short-length cutoff: Very long and very short sentences are usually not included in the summary.

Upper-case word feature: Sentences containing acronyms or proper names are included.

Proper noun feature: A proper noun is the name of a person, place, concept, etc. Sentences containing proper nouns have greater chances of being included in the summary.

Biased word feature: If a word appearing in a sentence comes from a biased word list, then that sentence is important. The biased word list is defined in advance and may contain domain-specific words.

Font-based feature: Sentences containing words that appear in upper case, bold, italics or underlined fonts are usually more important.

Sentence-to-sentence cohesion: For each sentence s, compute the similarity between s and every other sentence s' of the document, then add up those similarity values, obtaining the raw value of this feature for s. The process is repeated for all sentences.

Sentence-to-centroid cohesion: Compute the vector representing the centroid of the document, which is the arithmetic average over the corresponding coordinate values of all the sentences of the document; then compute the similarity between the centroid and each sentence s, obtaining the raw value of this feature for each sentence.

Discourse analysis: Discourse-level information [28] in a text is a good feature for text summarization. In order to produce a coherent, fluent summary and to determine the flow of the author's argument, it is necessary to determine the overall discourse structure of the text and then remove sentences peripheral to the main message.

Occurrence of non-essential information: Some words are indicators of non-essential information. These words are discourse markers such as "because", "furthermore" and "additionally", and typically occur at the beginning of a sentence. This is a binary feature, taking the value true if the sentence contains at least one of these discourse markers, and false otherwise [30].

These features are important because a number of text summarization methods use them; together they cover statistical and linguistic characteristics of a language. While the above features increase the score of a sentence to be included in the summary, those that reduce its score are:

Pronouns: Pronouns such as "she", "they" and "it" cannot be included in a summary unless they are expanded into the corresponding nouns.

Redundancy in summary: Anti-redundancy was not explicitly accounted for in earlier systems, but forms a part of most current summarizers. This score is computed dynamically as sentences are included in the summary, to ensure that there is no repetitive information in the summary. The following are two examples of anti-redundancy scoring when a new sentence is added to the summary: (a) scale down the scores of all the sentences not yet included in the summary by an amount proportional to their similarity to the summary generated so far [5, 22, 25]; (b) recompute the scores of all the remaining sentences after removing the words present in the summary from the query/centroid of the document [26].
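To illustrate how such features combine, here is a minimal, hypothetical sentence scorer over a few of the features above (keyword occurrence, title keywords, location, length cutoff, cue phrases) with a crude anti-redundancy check in the spirit of [5, 22, 25]. The weights, the cue-phrase list and the top-5 keyword choice are arbitrary placeholders, and stop-word removal is omitted for brevity.

```python
from collections import Counter

# Hypothetical cue-phrase list; real systems use larger, domain-tuned lists.
CUE_PHRASES = {"in conclusion", "this report", "in summary", "the purpose"}

def score_sentence(sent, idx, n_sents, keywords, title_words):
    """Weighted combination of a few of the features above.
    The weights are arbitrary placeholders, not tuned values."""
    words = sent.lower().split()
    score = 0.0
    score += 2.0 * sum(1 for w in words if w in keywords)     # keyword occurrence
    score += 1.5 * sum(1 for w in words if w in title_words)  # title keywords
    if idx == 0 or idx == n_sents - 1:                        # location heuristic
        score += 1.0
    if not (5 <= len(words) <= 30):                           # length cutoff
        score -= 2.0
    if any(cue in sent.lower() for cue in CUE_PHRASES):       # cue phrases
        score += 1.0
    return score

def summarize(sentences, title, k=2):
    all_words = [w for s in sentences for w in s.lower().split()]
    keywords = {w for w, _ in Counter(all_words).most_common(5)}
    title_words = set(title.lower().split())
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: score_sentence(sentences[i], i, len(sentences),
                                                 keywords, title_words))
    summary, seen = [], set()
    for i in ranked:
        words = set(sentences[i].lower().split())
        # Crude anti-redundancy: skip sentences mostly covered already.
        if len(words & seen) <= len(words) // 2:
            summary.append(sentences[i])
            seen |= words
        if len(summary) == k:
            break
    return summary
```

A production system would at least remove stop words before selecting keywords and would tune the feature weights on held-out data, since, as the conclusion below notes, the quality of the final summary depends heavily on those weights.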
3.3. Text Summarization Approaches

Abstraction of documents by humans is as complex to model as any other human information processing. Abstracts differ from person to person, and usually vary in style, language and detail. The process of abstraction is too complex to be formulated mathematically or logically [27]. In the last decade some systems have been developed that generate abstracts using the latest natural language processing tools. These systems extract phrases and lexical chains from the documents and fuse them together with generative tools to produce a summary (or abstract). A comparatively less complex approach is to produce an extractive summary, in which sentences from the original documents are selected and presented together as a summary.

3.3.1 Problems with Extractive Methods

Extracted sentences usually tend to be longer than average. Because of this, segments that are not essential for the summary also get included, consuming space. Important or relevant information is usually spread across sentences, and extractive summaries cannot capture this (unless the summary is long enough to hold all those sentences). Conflicting information may not be presented accurately. Pure extraction often leads to problems in the overall coherence of the summary; a frequent issue concerns dangling anaphora. Sentences often contain pronouns, which lose their referents when extracted out of context. Worse yet, stitching together decontextualized extracts may lead to a misleading interpretation of anaphora, resulting in an inaccurate representation of the source information, i.e., low fidelity. Similar issues exist with temporal expressions. These problems become more severe in the multi-document case, since extracts are drawn from different sources. A general approach to addressing these issues is to post-process extracts, for example by replacing pronouns with their antecedents or replacing relative temporal expressions with actual dates.
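As a toy illustration of this post-processing step, the sketch below substitutes pronouns using a hand-supplied antecedent table; in a real system the table would come from a coreference resolver run over the source context (the function and example text here are invented for illustration).

```python
import re

def replace_pronouns(sentence, antecedents):
    """Replace pronouns in an extracted sentence with antecedents
    resolved from the surrounding source context. A real system
    would obtain `antecedents` from a coreference resolver; here
    it is supplied by hand for illustration."""
    def swap(match):
        repl = antecedents.get(match.group(0).lower())
        return repl if repl else match.group(0)
    return re.sub(r"\b(she|he|it|they)\b", swap, sentence, flags=re.IGNORECASE)

extract = "It announced the merger after they approved the deal."
print(replace_pronouns(extract, {"it": "Acme Corp.", "they": "the shareholders"}))
# -> "Acme Corp. announced the merger after the shareholders approved the deal."
```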
3.3.2 Problems with Abstractive Methods

The biggest challenge for abstractive summarization is the representation problem [29]. Systems' capabilities are constrained by the richness of their representations and their ability to generate such structures; systems cannot summarize what their representations cannot capture. In limited domains it may be feasible to devise appropriate structures, but a general-purpose solution depends on open-domain semantic analysis, and systems that can truly understand natural language are beyond the capabilities of today's technology. It has been shown that users prefer extractive summaries to glossed-over abstractive summaries [36], because extractive summaries present the information as written by the author and allow users to read between the lines. Sentence synthesis is not yet a well-developed field, and hence machine-generated abstractive summaries can be incoherent even within a sentence; in extractive summaries, incoherence occurs only at the border between two sentences.

4. REVIEW OF RELATED WORKS

A handful of research works available in the literature deal with automatic text summarization for single and multiple documents. Here we briefly review some of the recent related works in the text summarization field.

Dragomir R. Radev et al. (2007) have presented a multi-document summarizer, MEAD, which creates summaries by employing cluster centroids generated by a topic detection and tracking system. They discussed two techniques, a centroid-based summarizer and an evaluation scheme based on sentence utility and subsumption. The assessment was applied to single-document as well as multi-document summaries. Finally, they describe two user studies that test the models of multi-document summarization.

Florian Boudin et al. (2006) have presented a technique for topic-oriented multi-document summarization. It analyzed the efficacy of employing additional information about the document set as a whole, in addition to individual documents. NEO-CORTEX, a multi-document summarization system based on the existing CORTEX system, was presented. Results are reported for experiments on document collections from the NIST DUC-2005 and DUC-2006 data. It was also shown that NEO-CORTEX is a competent system that achieves good performance on the topic-oriented multi-document summarization task.

Fu Lee Wang et al. (2006) have presented a multi-document summarization system to obtain the critical information from terrorism incidents. News articles about a terrorism event were arranged into a hierarchical tree structure, and a fractal summarization model was used to produce a summary of all the news stories.

Yan Liu et al. [16] have proposed a document summarization framework based on a deep learning model, which has demonstrated distinguished extraction ability in document summarization. The framework consists of three parts: concept extraction, summary generation, and reconstruction validation.
A query-oriented extraction technique was proposed to concentrate information distributed across multiple documents into hidden units, layer by layer. Then the whole deep architecture was fine-tuned by minimizing the information loss in the reconstruction validation part. Based on the concepts extracted from the deep architecture, dynamic programming was used to seek the most informative set of sentences as the summary. Experiments on three benchmark datasets demonstrated the effectiveness of the proposed framework and algorithms.

Shasha Xie and Yang Liu (2009) have used a supervised learning approach for the summarization task, using a classifier to determine whether to choose a sentence for the summary based on a rich set of features. They addressed two important problems related to the supervised classification approach. First, a diverse sampling technique was proposed to handle the imbalanced-data problem of the task, in which the summary sentences are the minority class. Second, a regression model was used rather than binary classification, reframing the extractive summarization task in order to deal with the problem of human annotation disagreement.
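A minimal sketch of this regression reframing follows, using scikit-learn's SVR as a stand-in model; the exact features and learner in Xie and Liu's system differ, and all feature names and numbers below are invented for illustration. Instead of a hard summary/non-summary label, each training sentence carries a real-valued importance score, e.g. the fraction of annotators who selected it, which sidesteps annotator disagreement.

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative features per sentence: [length, position_in_doc, keyword_count].
X_train = np.array([[12, 0.0, 3], [25, 0.5, 1], [8, 0.9, 0], [18, 0.1, 4]])
# Target: fraction of human annotators who picked the sentence (0..1),
# a real-valued importance score rather than a binary label.
y_train = np.array([0.75, 0.25, 0.0, 1.0])

model = SVR(kernel="rbf").fit(X_train, y_train)

X_new = np.array([[15, 0.05, 3], [9, 0.8, 0]])
scores = model.predict(X_new)
# Rank sentences by predicted importance and extract the top ones.
ranking = np.argsort(scores)[::-1]
print(ranking, scores)
```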

Allan Borra et al. (2010) aimed to develop a system able to summarize a given document while maintaining the reliability and saliency of the text. To achieve this, the system architecture incorporates two main existing methods: keyword extraction and discourse analysis based on Rhetorical Structure Theory (RST).

Rafeeq Al-Hashemi (2010) has used an extractive technique built around extracting keywords, even ones that do not appear explicitly within the text. The core of the proposed system is the keyword extraction subsystem, which supports selecting the more meaningful sentences for the summary. The model contains four stages: eliminating stop words, extracting keywords, ranking sentences based on the keywords they contain, and finally reducing the sentences using a KFIDF measurement.

Gianluca Demartini et al. (2010) have created an entity-labeled corpus with temporal information on top of the TREC 2004 Novelty collection. They developed and analyzed several features, and demonstrated that an article's history can be used to enhance its summarization. The task of Entity Summarization (ES) is: given a query, a significant document and possibly a set of previous related documents (the history of the document), retrieve a set of entities that best summarizes the document.

In popular work on natural language processing, the authors of (Collobert and Weston, 2008) developed and employed a convolutional Deep Belief Network (DBN) as a common model to simultaneously solve a number of classic problems including part-of-speech tagging, chunking, named entity tagging, semantic role identification, and similar-word identification. More recent work reported in (Collobert, 2010) further developed a fast, purely discriminative approach for parsing based on a deep recurrent convolutional architecture called the Graph Transformer Network. A similar multi-task learning technique with DBNs is used in (Deselaers et al., 2009) to attack a machine transliteration problem, which may be generalized to the more difficult machine translation problem. The most interesting recent work on applying deep learning to natural language processing appears in (Socher et al., 2011), where a recursive neural network is used to build a deep architecture. The network is shown to be capable of successfully merging natural language words based on learned semantic transformations of their original features. This deep learning approach provides excellent performance on natural language parsing, and the same authors demonstrated that it is also successful in parsing natural scene images.

DBNs and deep auto-encoders have also been applied to document indexing and information retrieval (Salakhutdinov and Hinton, 2007). It is shown that the hidden variables in the last layer are not only easy to infer but also give a much better representation of each document, based on word-count features, than the widely used latent semantic analysis. Using the compact codes produced by deep networks, documents are mapped to memory addresses in such a way that semantically similar text documents are located at nearby addresses, facilitating rapid document retrieval. This idea has been explored for audio document retrieval and some classes of speech recognition problems, with an initial exploration reported in (Deng et al., 2010).
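The semantic-hashing idea can be illustrated with a deliberate simplification: in the sketch below, a random projection stands in for the trained deep auto-encoder of Salakhutdinov and Hinton, mapping word-count vectors to short binary codes so that similar documents tend to receive codes at small Hamming distance. The toy vocabulary and counts are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_code(wordcount_vec, projection):
    """Map a word-count vector to a short binary code. Salakhutdinov
    and Hinton learn this mapping with a deep auto-encoder; a random
    projection is used here purely for illustration."""
    return (wordcount_vec @ projection > 0).astype(int)

def hamming(a, b):
    return int(np.sum(a != b))

vocab_size, code_bits = 6, 8
projection = rng.standard_normal((vocab_size, code_bits))

doc_a = np.array([3, 1, 0, 0, 2, 0])   # word counts over a toy vocabulary
doc_b = np.array([2, 1, 0, 0, 3, 0])   # similar content to doc_a
doc_c = np.array([0, 0, 4, 5, 0, 1])   # different content

code_a, code_b, code_c = (binary_code(d, projection) for d in (doc_a, doc_b, doc_c))
print(hamming(code_a, code_b), hamming(code_a, code_c))  # similar docs tend to get closer codes
```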
In the approach of (Jagadeesh J, Prasad Pingali, Vasudeva Varma, 2007), information retrieval techniques are combined with summarization techniques to produce summary extracts. This approach incorporates into the final scoring a new notion of sentence importance that is independent of the query. The sentences are scored using a set of features computed over all sentences and normalized to a maximum score, and the final score of a sentence is calculated as a weighted linear combination of the individual feature values. The top-scoring sentences are selected for the summary until the summary length reaches the desired limit. A new information measure captures the importance of a sentence based on the distribution of its constituent words in the domain corpus. The formula consists of two parts: (a) a query-dependent ranking of a document/sentence, and (b) an explicit notion of importance, or prior, of a document/sentence. This allows query-independent forms of evidence to be incorporated into the ranking process.

Given a set of training documents and their extractive summaries, the summarization process can be modeled as a classification problem: sentences are classified as summary or non-summary sentences based on the features that they possess. The classification probabilities are learnt statistically [3] from the training data, using Bayes' rule:

P(s ∈ S | F1, F2, ..., FN) = P(F1, F2, ..., FN | s ∈ S) × P(s ∈ S) / P(F1, F2, ..., FN)

where s is a sentence from the document collection, F1, F2, ..., FN are the features used in classification, S is the summary to be generated, and P(s ∈ S | F1, F2, ..., FN) is the probability that sentence s will be chosen for the summary given that it possesses features F1, F2, ..., FN.
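If the features are additionally assumed conditionally independent given the class (the usual naive Bayes simplification, which the formula above does not itself require), the rule can be computed directly. A toy sketch with invented feature names and probabilities follows.

```python
from math import prod

def p_in_summary(features, p_feat_given_s, p_feat, p_s):
    """Naive-Bayes estimate of P(s in S | F1..FN), assuming the
    features are conditionally independent given the class."""
    likelihood = prod(p_feat_given_s[f] for f in features)
    evidence = prod(p_feat[f] for f in features)
    return likelihood * p_s / evidence

# Hypothetical probabilities, as if estimated from a training corpus.
p_feat_given_s = {"title_word": 0.6, "cue_phrase": 0.3, "first_sentence": 0.5}
p_feat = {"title_word": 0.2, "cue_phrase": 0.1, "first_sentence": 0.1}
p_s = 0.2  # prior: fraction of sentences that end up in summaries

print(p_in_summary(["title_word", "first_sentence"], p_feat_given_s, p_feat, p_s))
# 0.6*0.5*0.2 / (0.2*0.1) = 3.0 (the independence assumption can push the
# estimate above 1.0, so such scores are typically used only for ranking)
```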

FastSum (Frank Schilder, Ravikumar Kondadadi, 2008) is based on word-frequency features of clusters, documents and topics. Summary sentences are ranked by a regression support vector machine. The method involves sentence splitting, filtering candidate sentences, and computing word frequencies over the documents of a cluster, the topic description and the topic title. All sentences in the topic cluster are ranked for summarizability. The topic contains a topic title and a topic description; the former is a list of key words or phrases describing the topic, and the latter contains the query or queries. The features used are word-based and sentence-based: word-based features are computed from the probabilities of words in the different containers, while sentence-based features include the length and position of the sentence in the document. By adopting Least Angle Regression, an approach for selecting features, FastSum can rely on a minimal set of features, leading to fast processing times, e.g. 1250 news documents per 60 seconds.

5. CONCLUSION

This survey concentrates on extractive summarization methods. An extractive summary is a selection of important sentences from the original text, where the importance of sentences is decided based on their statistical and linguistic features. Many variations of the extractive approach [33] have been tried in the last ten years; however, it is hard to say how much greater interpretive sophistication, at the sentence or text level, contributes to performance. Without the use of NLP, the generated summary may suffer from a lack of cohesion and semantics, and if the text contains multiple topics the generated summary might not be balanced. Deciding proper weights for the individual features is very important, as the quality of the final summary depends on them; more effort should be devoted to deciding feature weights. The biggest challenge for text summarization is to summarize content from a number of textual and semi-structured sources, including databases and web pages, in the right way (language, format, size, time) for a specific user, and summarization software should produce an effective summary quickly and with minimal redundancy. Summaries can be evaluated using intrinsic or extrinsic measures: intrinsic methods attempt to measure summary quality directly, for example through human evaluation, while extrinsic methods measure it through performance on a task [32], such as an information retrieval oriented task.
