
Humans have performed summarization manually since time immemorial. In recent years, the volume of data has grown rapidly because of the Internet and related sources, and text summarization has become indispensable for addressing the resulting information overload. Text summarization helps to manage text data by following well-defined procedures for its effective use. It is the process of taking a textual document, extracting content from it, and presenting the essential content to the user in a condensed form that is user-friendly or tailored to the needs of the application. The method eases information overload by providing an abridged version for analysis rather than the complete textual document. From its earliest phases, text summarization has helped users locate the relevant data they are seeking; in effect, it has become an intermediary between users and the data stored in various documents. Recently, a great deal of research has been carried out with the aim of developing more proficient text summary generation algorithms.

Interest in text mining began with the advent of online publishing, the growing influence of the Internet, and the rapid development of electronic government (e-government). The Internet and e-government provide access to enormous amounts of data, which raises further challenges such as making effective use of such large volumes. With the rapid growth of web data and the swift advancement of science and technology, many researchers have devoted themselves to improving web mining, text mining, information extraction, knowledge discovery, and information retrieval. Nevertheless, traditional information retrieval methods are proving inadequate for gathering useful data efficiently. As a result, the summarization of documents containing all categories of data becomes even more essential. In a scenario of information overload, there is therefore a strong need for automatic techniques for processing huge volumes of data, especially techniques for producing compact summaries of text documents.

With the exponential growth of the Internet and electronic government services, a gigantic quantity of electronic documents is accessible online. This bombardment of electronic documents leaves genuine users at sea, making it extremely hard for them to extract value from the documents. Moreover, users struggle to scan through the relevant and useful segments of a document given its colossal size. Unsurprisingly, novel methods capable of processing data effectively have been keenly sought.

Locating the most relevant data concealed in textual Web documents is often a daunting task. The sheer volume of electronic documents that users can retrieve from the Web makes it very hard to find relevant information without the assistance of automatic or semi-automatic tools. To overcome this problem, a concerted effort has been made to develop text summarization tools. Summarizers focus on creating a succinct representation of a collection of textual documents. As the Internet is used ever more heavily, the quantity of data in the public domain keeps mounting, and a major portion of it is redundant. Therefore, novel techniques capable of processing data effectively have become highly desirable, and document summarization has emerged as a vital tool for researchers to address this challenge in technological scenarios.

Automatic Document Summarization (ADS) is one such means of helping people locate data successfully and resourcefully. The goal of automatic summarization is to take a source document, extract content from it, and offer the most pertinent content to the user in a shorter form. Hence, users today require access to robust text summarization systems that can efficiently condense the data located in various documents into a concise, readable synopsis, or summary.

1.1 DATA MINING

Data mining is a multidisciplinary research field that plays a vital role in the real world today. It is an essential area of research due to the availability of abundant data in the majority of applications. This immense amount of data must be processed to extract the important information and knowledge hidden within it. Data mining is the process of discovering interesting data, knowledge, and trends from huge amounts of data. Various data sources, such as text files, spreadsheets, tables, and other storage formats, are used in the data mining process. The Knowledge Discovery in Databases (KDD) process consists of the following steps:

•	Data Cleaning
•	Data Integration
•	Data Selection
•	Data Transformation
•	Data Mining
•	Pattern Evaluation
•	Knowledge Presentation

The data cleaning task deals with missing and redundant data in the source file, since real-world data may be incomplete, inconsistent, and erroneous; a variety of techniques are applied to resolve these problems. Data integration creates a common data source by combining heterogeneous data from different sources. In the data selection process, the appropriate data is retrieved from the data source for data mining purposes. The data transformation process converts the source data into a format suitable for data mining with the help of data management tasks. Data mining itself is the intelligent technique used to extract the useful data. Pattern evaluation identifies the interesting patterns based on given measures. Finally, knowledge presentation visualizes the discovered knowledge for the user. A small sketch of the early steps follows.
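The sketch below illustrates the cleaning, selection, and transformation steps on a tiny invented table using pandas; the column names and values are hypothetical and purely illustrative.

```python
# Illustrative KDD steps on an invented table (hypothetical columns).
import pandas as pd

raw = pd.DataFrame({
    "doc_id": [1, 2, 2, 3],
    "length": [120.0, None, None, 340.0],   # missing values to clean
    "topic":  ["sports", "finance", "finance", "sports"],
})

# Data cleaning: drop duplicate rows, fill missing numeric values.
clean = raw.drop_duplicates().fillna({"length": raw["length"].mean()})

# Data selection: keep only the attributes relevant to the task.
selected = clean[["doc_id", "length"]].copy()

# Data transformation: rescale into a form suitable for mining.
selected["length_norm"] = selected["length"] / selected["length"].max()
print(selected)
```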

Data mining has various application areas, including banking, biology, and e-commerce; these are the most famous and classic applications. Newer data mining applications process spatial data, multimedia data, temporal data, and the World Wide Web. The World Wide Web is one of the largest and best known data sources. It holds a variety of documents handled by millions of people, and the total size of these documents can be measured in several terabytes. The documents are delivered to millions of computers over telephone lines, fiber optics, and wireless modems. Because of this development, the demand for retrieving valuable information from this massive amount of data increases every day. This has led the way to a new and versatile area called text mining.

1.2 TEXT MINING



The phrase "text mining" is commonly used to denote any research area that analyzes large quantities of natural language text and discovers lexical or linguistic usage patterns in order to retrieve useful information. Text mining is an emerging area with strong connections to natural language processing (NLP), data mining, machine learning, information retrieval (IR), and knowledge management. Text mining is especially promising on the World Wide Web, because almost all the data available in web pages is text, and it is a new research field in knowledge discovery. Mooney & Razvan (2007) defined text mining as the process of discovering new, previously unknown knowledge from unstructured or semi-structured textual resources.

The text mining process combines the following sequential tasks:

•	Text preprocessing
•	Text transformation
•	Attribute selection
•	Pattern discovery
•	Interpretation or evaluation

Text preprocessing is the initial step, used to remove unwanted text from the text documents. Text transformation represents each text document by its terms and their frequencies. Attribute selection chooses the attributes that are significant to the corresponding sentences of the given document. Pattern discovery extracts knowledge from the given text documents. Finally, interpretation or evaluation generates the final result. A minimal sketch of the first two tasks appears below.
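The sketch below illustrates text preprocessing and text transformation using only the Python standard library; the stopword list is a tiny invented one.

```python
# Minimal text preprocessing and transformation (standard library only).
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "of", "in", "and", "to"}  # tiny illustrative list

def preprocess(text):
    """Text preprocessing: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def transform(tokens):
    """Text transformation: represent the document by term frequencies."""
    return Counter(tokens)

doc = "Text mining is the discovery of useful knowledge in text."
print(transform(preprocess(doc)))  # e.g. Counter({'text': 2, 'mining': 1, ...})
```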

Applications that have been developed and can be employed in the text mining process include:

•	Text summarization produces a compressed text of user-specified size from a given lengthy text document.
•	Categorization identifies the important themes of the given documents based on a predefined set of topics.
•	Clustering groups related documents without relying on a predefined set of topics.
•	Concept linkage tools link associated documents based on their shared concepts.
•	Topic tracking forecasts the documents of interest to a user from the documents the user has previously viewed and from user profiles.
•	Information visualization represents large textual sources in a visual hierarchy.
•	Question answering deals with locating the best answer to a given question.

1.3 NATURAL LANGUAGE PROCESSING

Natural Language Processing (NLP) is a growing area of research and application that examines how computers can understand and manipulate natural language text or speech to accomplish desired tasks. The goal of NLP researchers is to create appropriate tools and techniques that make computer systems capable of understanding and manipulating natural languages, by gathering knowledge about how people understand and use language.

There are additional useful goals for NLP, most of them tied to the specific application in which it is being exploited. The aim of an NLP system is to capture the precise meaning and intention of a user query expressed in the user's ordinary language. Moreover, the contents of the documents being searched are represented at all their levels of meaning, so that a true match between need and reply can be found regardless of how they are represented in surface form.

Researchers generally focus on techniques developed in information retrieval, while most try to leverage both IR approaches and certain features of NLP. In recent years there has been an explosion of online unstructured information in multiple languages; consequently, natural language processing technologies such as automatic document summarization have become progressively more significant for information retrieval applications.

1.3.1 NLP Applications

Any application that makes use of text is a candidate for NLP. Some common NLP applications are listed below; some have direct real-world uses, while others more commonly serve as sub-tasks that help solve larger tasks.

•	Summarization: produces a readable summary from large text documents.
•	Information Retrieval: IR systems retrieve the documents requested in the user query.
•	Information Extraction: identifies specific data within large amounts of data; it is a sub-task of search or information retrieval.
•	Question Answering: given a user query, provides either the actual answer to the query or a list of relevant documents containing that answer.
•	Machine Translation: automatically translates text from one human language into another.
•	Dialogue Systems: converse with a human in a coherent way; a ubiquitous application of the future.

1.3.2 NLP for Automatic Document Summarization

NLP tools are used to enhance the quality of the retrieval process. ADS uses NLP tools and follows much the same procedure as information retrieval systems. The most important step in ADS is extracting sentences, which can be supported by NLP tools such as co-reference resolution, discourse analysis, and named entity recognition.

The most common NLP tools used in ADS are light stemmers, root stemmers, standard stopword lists, domain-specific stopword lists, parsers, Part-Of-Speech (POS) taggers, word segmentation, and sentence breaking. NLP tools can be integrated with other models to provide efficient tools for ADS, information retrieval, and information extraction. They are often used to eliminate redundancy, to find relationships and similarities between sentences, to generate words for producing summaries, or to connect sentences for abstractive summaries. A minimal sketch of a few of these tools appears below.
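The sketch below runs a few of these tools with NLTK, one possible toolkit among several; it assumes the 'punkt', 'stopwords', and 'averaged_perceptron_tagger' resources have already been downloaded.

```python
# Sentence breaking, word segmentation, POS tagging, stopword removal,
# and light stemming with NLTK (after the required nltk.download calls).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "Automatic summarization shortens documents. It keeps the key ideas."

sentences = nltk.sent_tokenize(text)                  # sentence breaking
tokens = [nltk.word_tokenize(s) for s in sentences]   # word segmentation
tagged = [nltk.pos_tag(t) for t in tokens]            # POS tagging

stop = set(stopwords.words("english"))                # standard stopwords
stemmer = PorterStemmer()                             # a light stemmer
stems = [[stemmer.stem(w) for w in t if w.isalpha() and w.lower() not in stop]
         for t in tokens]

print(tagged[0])
print(stems)
```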

1.4 AUTOMATIC DOCUMENT SUMMARIZATION

A summary is defined as the significant text extracted from one or more text documents; its size should be less than half that of the original documents. When the summarization task is performed automatically by a computer, it is known as automatic document summarization.

The aims of a summary are to preserve the significant information, to eliminate redundancy, and to be shorter than the input; the summary may be derived from a single document or from multiple input documents.

Document summarization is an excellent means of condensing an enormous amount of data into a shorter form by choosing the most pertinent data and discarding the surplus. Since the 1950s, automatically creating summaries from large text corpora has been studied in both information retrieval and natural language processing. It is the task of generating an abridged version of texts, handing time-starved users the significant data of the original text with reduced reading time.

Automatic document summarization is an indispensable method for coping with this problem and has long been a research topic in natural language processing. ADS is the task of reducing the size of documents while preserving their relevant semantic content. Its objective is to produce a summary of a document without requiring the user to scan the whole document. Content is extracted from a source, and the most significant data is offered to the user in a condensed form tailored to their requirements. The technique can be employed in many domains, including information retrieval, intelligence gathering, data mining, text mining, and indexing.

Automatic document summarization takes a partly structured source text from multiple texts written on the same subject, extracts the content, and offers the most valuable information to the user in the most convenient way. Nowadays, search engines such as Google, Yahoo!, and AltaVista spare users from scanning huge numbers of documents by presenting a collection of documents of interest together with a short summary of each, paving the way for easy identification of the preferred documents. Text summarization generally comprises a three-step procedure: selection of the relevant segments of text; compilation and generalization of the data in the chosen segments; and finally, presentation of the resulting summary text. This procedure can be effectively employed in many applications such as information retrieval, intelligence gathering, data mining, text mining, and indexing.

Hence, it is increasingly recognized that users require access to robust text summarization techniques capable of efficiently condensing data spread across various documents into a concise, readable synopsis, or summary.

With this end in view, automatic document summarization takes a partly structured source text from one or more texts written on the same topic, extracts data from it, and offers the most vital content to the user, tailored to their requirements.

1.5 TYPES OF SUMMARIZATION

Summarization can be classified according to the number of source documents, the summary construction method, the trigger, and the processing level of the documents.

1.5.1 Summarization Based on Number of Source Documents

According to the number of documents to be condensed, summarization is generally classified into two types: single-document and multi-document.

1.5.1.1 Single Document Summarization

Single Document Summarization (SDS) condenses a single document into a concise version. It involves extracting sentences from the original document. The sentences in a document are sequentially ordered and have logical relationships between them, which SDS exploits.

1.5.1.2 Multi-Document Summarization

Multi-Document Summarization (MDS) shortens a group of associated documents into a single summary. It can be monolingual, if the source documents are in the same language, or multilingual, if they are in different languages. MDS must identify and handle redundancy, recognize the significant differences between documents, and produce a consistent summary.

It furnishes an overview of a topic, explaining the resemblances and/or divergences between the various documents and the relationships between the diverse segments of data they contain, and it allows users to drill down for additional data on the domain of interest.

MDS is the task of producing a summary by reducing documents in size while preserving the core essence of the originals. Since information overload arises from many documents sharing the same subjects, automatic multi-document summarization has become a centre of attention, and with the tremendous growth of documents on the web, numerous summarization applications are emerging. However, MDS is a far more Herculean task than condensing a single document, however large that document may be. This difficulty arises from the unavoidable thematic variety within a large set of documents; a good summarization method aims to blend the vital topics with completeness, comprehensibility, and brevity. MDS extracts data pertinent to an implicit or explicit subject from the various documents written on it. Because it blends and assimilates data across many documents, it supports knowledge synthesis and knowledge discovery, and can thus be engaged for knowledge acquisition.

A good MDS system should have the following properties: a compressed summary, information preservation, proper syntax, and portability.

•	Compressed summary: The length of the generated summary should be adjusted according to the compression rate given by the user.
•	Information preservation: While summarizing the documents, their important features should be preserved.
•	Proper syntax: The generated summary should consist of grammatically correct sentences.
•	Portability: The MDS system will be trained on various sets of training data, so it should be portable across domains.

1.5.2 Summarization Based on Construction Methods

Document summarization can be broadly categorized into two types based on how the summary is constructed: abstractive summarization and extractive summarization.

1.5.2.1 Abstractive Summarization

Abstractive methods create new sentences from the content extracted from the source documents and presuppose the use of more advanced techniques of linguistic and semantic analysis. An abstractive summary requires linguistic data for understanding, interpreting, and analyzing the given text, and it relies heavily on the computational power of NLP. Representing natural language within the system is the biggest challenge for abstractive summarization. NLP techniques are employed for parsing and word reduction in order to produce the abstractive text summary. NLP has emerged as a cost-effective approach, although it still suffers from a lack of accuracy.

Problems with abstractive summarization methods are:

•	Users tend to prefer extractive summaries over abstractive ones, because extractive summaries present the information exactly as the author expressed it in the original document.
•	Due to the limits of linguistic representation, abstractive summaries can contain inconsistencies within a sentence.
•	The abstractive method is not simple, because it requires a semantic understanding of the given text documents.
•	Ongoing research on abstractive summarization must still deal with issues such as scarcity of training data, appropriate integration of syntax even when the input data comes from a noisy genre, and compressions involving lexical substitution and paraphrase.

1.5.2.2 Extractive Summarization

Extractive summaries are created by selecting important sentences from the original documents based on statistical analysis of surface-level features such as word or phrase frequency, location, and so on. The approach is conceptually simple and easy to implement.

Extractive summarization is flexible and takes less time than abstractive summarization. It represents the sentences of a document in a matrix form, and all the essential or pertinent sentences are then extracted according to certain feature vectors; a feature vector is an n-dimensional vector of numerical features characterizing an object. The main target of extraction-based text summarization is the selection of suitable sentences tailored to the needs of the user. Abstractive methods, although they promise summaries closer to human-written ones, are constrained by the state of natural language understanding. The most widely employed extraction methods are based on passage extraction (generally of sentences, hence the common designation sentence extraction): they score sentences, extract the sentences with the highest scores, and then assemble the summary. The output of sentence-extraction summarization systems is still far from the ideal, that is, from the coherent summary a skilled expert would create. More sophisticated summarization systems, on the other hand, call for intricate software, tend to be less efficient, and habitually impose constraints on the style and subject of the source text. An extraction-based summary is concerned only with choosing sentences or text segments from the source document, as in the sketch below.
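As a minimal illustration of sentence extraction, the sketch below scores sentences by raw word frequency alone and returns the top-scoring ones in document order; a real system would combine many more features.

```python
# Frequency-based sentence extraction: score, rank, reassemble in order.
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    scored = [(sum(freq[w] for w in re.findall(r"[a-z]+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:n_sentences]
    # Restore original document order for readability.
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

doc = ("Text summarization reduces documents. Summarization keeps the key "
       "content. The weather is nice. Extractive summarization selects "
       "whole sentences.")
print(extractive_summary(doc))
```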

Problems with extractive summarization methods are:

•	The method must identify and handle redundancy.
•	The dangling anaphora problem may arise: extracted sentences often contain pronouns that have lost the object or idea to which they refer. Solving this requires post-processing, so that pronouns can be replaced by their antecedents, relative temporal expressions can be replaced with actual dates, and so on.
•	Conflicting information cannot be extracted reliably.
•	Significant or related information is generally spread across sentences, and an extractive summary cannot capture it unless the summary is long enough to hold all those sentences.

1.5.3 Based on Triggers

Document summarization methods are normally categorized into two kinds based on the trigger: generic summarization and query-based summarization.

1.5.3.1 Query-Based Summarization

Query-relevant text summaries are useful for answering questions such as whether a given document is related to the user's query and, if so, which part(s) of the document are related. Because query-relevant summaries are query-biased, they do not offer an overall sense of the document's content and are therefore not suitable as content summaries. A query-relevant summary is biased toward a specific question or subject; query-based summarization produces summaries closely linked to the query, and a query-focused summary offers the data most pertinent to the specified queries. In contrast to generic summarization, which must capture the core content of the original documents, the main aim of query-focused document summarization is to generate from the documents a summary that answers the information need expressed in the query or explains the query topic, as in the sketch below.
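A minimal sketch of query-biased scoring follows: sentences are ranked by their word overlap with the user query, so the output is tilted toward the query topic rather than the document as a whole. The scoring function is an illustrative choice, not a method proposed in this thesis.

```python
# Rank sentences by word overlap with the query (query-biased scoring).
import re

def query_score(sentence, query):
    s_words = set(re.findall(r"[a-z]+", sentence.lower()))
    q_words = set(re.findall(r"[a-z]+", query.lower()))
    return len(s_words & q_words) / (len(q_words) or 1)

sentences = ["Deep learning improves multi-document summarization.",
             "The weather was pleasant yesterday."]
ranked = sorted(sentences,
                key=lambda s: query_score(s, "deep learning summarization"),
                reverse=True)
print(ranked[0])  # the query-relevant sentence ranks first
```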

1.5.3.2 Generic Summarization

Generic text summarization automatically forms a condensed version of one or more documents that captures their essence. It refines the summarized text and offers the significant semantic content of the documents; since a document's content may span many themes, generic summarization methods focus on extending the summary's diversity to give wider coverage of the content. Supervised approaches, which rely on algorithms trained on large numbers of human-made summaries, work best for documents that resemble the summarizer's training model; consequently, they do not necessarily create a suitable summary for documents that differ from that model. Moreover, when users change their summarization goal or the characteristics of the documents change, the training data must be rebuilt or the model retrained. Unsupervised methods, on the other hand, need no training data such as manually written summaries to guide the summarizer. A good summary is expected to preserve the topical information of the documents as far as possible while containing as little redundancy as possible; these two properties are termed information richness and diversity, respectively. In automatic document summarization, the task of selecting the distinct concepts contained in the documents is labeled diversity. Diversity is essential for limiting redundancy in the summarized output and thus for generating a more suitable summary.

1.5.4 Based on Processing Levels

Based on an evaluation of classic approaches, summarization can be further categorized into surface, entity, and discourse levels.

1.5.4.1 Surface level

This approach makes use of shallow features in order to retrieve the significant information. Some of these features are described below, followed by a small sketch:

•	The thematic feature is based on the frequency with which a word occurs in the given document. If a sentence contains frequent words, it is considered important and is extracted.
•	The location feature refers to the position of the text to be extracted for the summary. There are two variants:
  o	Lead method: assumes that the first sentences are the most important, and therefore extracts only the first sentences.
  o	Title-based method: extracts the sentences whose words match the words in the title and headings.
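The sketch below computes the three surface signals just described (thematic words, lead position, and title overlap) for each sentence; the top-k cut-off for thematic words is an arbitrary illustrative choice.

```python
# Surface-level features: thematic words, lead position, title overlap.
import re
from collections import Counter

def surface_features(sentences, title, top_k=5):
    all_words = re.findall(r"[a-z]+", " ".join(sentences).lower())
    thematic = {w for w, _ in Counter(all_words).most_common(top_k)}
    title_words = set(re.findall(r"[a-z]+", title.lower()))
    feats = []
    for i, s in enumerate(sentences):
        words = set(re.findall(r"[a-z]+", s.lower()))
        feats.append({
            "thematic": len(words & thematic),  # thematic feature
            "lead": int(i == 0),                # lead method
            "title": len(words & title_words),  # title-based method
        })
    return feats

print(surface_features(["Summarization saves time.", "It is widely used."],
                       title="Automatic Summarization"))
```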

1.5.4.2 Entity level

The entity level establishes an internal representation of the text by modeling text entities and their relationships.

Entity relationships include the following; a small sketch of two of them follows the list:

•	Similarity arises when two words share a common stem or form; it is evaluated by linguistic techniques.
•	Proximity builds relationships between entities using the distance between them.
•	Thesaural relationships cover synonymy, hyponymy, and meronymy.
•	Coreference links expressions that refer to the same entity.
•	Syntactic relations are evaluated from the parse tree.
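The sketch below illustrates two of these relations, similarity via a common stem and thesaural relations, using NLTK's Porter stemmer and WordNet interface; it assumes the 'wordnet' resource has been downloaded.

```python
# Stem similarity and thesaural relations (synonyms, hyponyms) via NLTK.
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def same_stem(w1, w2):
    """Similarity: two words share a common stem."""
    return stemmer.stem(w1) == stemmer.stem(w2)

def thesaural_relations(word):
    """Thesaural relationship: synonyms and hyponyms from WordNet."""
    synonyms, hyponyms = set(), set()
    for syn in wn.synsets(word):
        synonyms.update(l.name() for l in syn.lemmas())
        hyponyms.update(l.name() for h in syn.hyponyms() for l in h.lemmas())
    return synonyms, hyponyms

print(same_stem("summarize", "summarization"))  # True with the Porter stemmer
synonyms, hyponyms = thesaural_relations("document")
print(sorted(synonyms)[:5])
```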

1.5.4.3 Discourse level

To attain the communication goals, this level models the overall structure of the text and the relationships within it. The information needed at this level includes:

•	The format of the document, such as hypertext markup or the document outline.
•	The threads of topics as they unfold in the input text.
•	The rhetorical structure of the text, which represents its argumentative or narrative structure. The main aim is to construct the coherence structure of the text so that the centrality of a text unit manifests its importance.

1.6 FEATURES FOR SUMMARIZATION

Features are defined as the characteristics of each sentence that help to identify the sentences according to their relevance. The aim of feature extraction is to assign an importance score to each sentence under consideration. Some of the features used in text summarization are listed below, followed by a small sketch that computes a few of them:

•	Title words in the sentence: the number of title words in the sentence.
•	Initial sentence in the paragraph: whether it is the first sentence of its paragraph.
•	Final sentence in the paragraph: whether it is the last sentence of its paragraph.
•	Thematic words in the sentence: the number of thematic words in the sentence.
•	Relative position of the sentence: the position of the sentence in the document, such as first sentence, last sentence, and so on.
•	Highlighted words in the sentence: the number of highlighted words in the sentence.
•	Positive keywords in the sentence: keywords that are frequently included in summaries.
•	Negative keywords in the sentence: keywords that are unlikely to occur in summaries.
•	Sentence centrality: the vocabulary overlap between this sentence and the other sentences in the document.
•	Sentence resemblance to the title: the vocabulary overlap between this sentence and the document title.
•	Sentence inclusion of named entities: a sentence that contains numerical data is important and is usually included in the document summary.
•	Sentence relative length: used to penalize sentences that are too short, since such sentences are not expected to belong to the summary.
•	Bushy path of the node: the number of links connecting a node (sentence) to the other nodes (sentences) on the map.
•	Latent semantic feature: characterizes relationships between sentences based on their semantics.
•	Term weight of the words forming the sentence: based on the number of times each word appears in the text.
•	Concept extraction: the concept feature is extracted from the text document using mutual information and a windowing process. In the windowing process, a virtual window of size k is moved over the document from left to right, and the co-occurrence of words within the same window is recorded.
•	POS tagging of the sentence: Part-Of-Speech tagging categorizes the words of a sentence by their grammatical category, such as noun, verb, and adverb.
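The sketch below computes four of the listed features (relative position, relative length, title resemblance, and centrality) for every sentence; the normalizations are illustrative choices rather than the exact formulas used later in this thesis.

```python
# Per-sentence features: position, length, title overlap, centrality.
import re

def words(s):
    """Lowercased word set of a sentence."""
    return set(re.findall(r"[a-z]+", s.lower()))

def feature_vectors(sentences, title):
    title_w = words(title)
    max_len = max(len(s.split()) for s in sentences)
    vectors = []
    for i, s in enumerate(sentences):
        w = words(s)
        others = set().union(*(words(t) for j, t in enumerate(sentences)
                               if j != i))
        vectors.append([
            1 - i / len(sentences),                  # relative position
            len(s.split()) / max_len,                # relative length
            len(w & title_w) / (len(title_w) or 1),  # resemblance to title
            len(w & others) / (len(w) or 1),         # sentence centrality
        ])
    return vectors
```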

1.7 EVALUATION METRICS FOR TEXT SUMMARIZATION

The evaluation metrics help to assess the quality of the generated summary. Evaluation can be categorized as intrinsic or extrinsic. Intrinsic evaluation judges the system output on its own, assessing the coherence and informativeness of the summary. Extrinsic evaluation checks how the system affects the completion of other tasks, assessing qualities such as relevance and reading comprehension. The evaluation metrics used for the proposed text summarization methods are recall, precision, and F-measure; a small sketch computing them follows the definitions.
•	Recall: the fraction of the relevant sentences that are retrieved. Recall measures the reliability of the proposed text summarization method (equation (4.17)):

\[ \text{Recall} = \frac{|S_{Ret} \cap S_{Rel}|}{|S_{Rel}|} \]    (4.17)

where S_Ret and S_Rel denote the sets of retrieved and relevant sentences, respectively.

•	Precision: the fraction of the retrieved sentences that are relevant (equation (4.18)):

\[ \text{Precision} = \frac{|S_{Ret} \cap S_{Rel}|}{|S_{Ret}|} \]    (4.18)

•	F-measure: the harmonic mean of recall and precision, computed for the total dataset (equation (4.19)):

\[ F\text{-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]    (4.19)
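The sketch below is a direct translation of equations (4.17) to (4.19) for sentence sets, treating the generated summary as the retrieved set S_Ret and the reference (gold) sentences as the relevant set S_Rel.

```python
# Recall, precision, and F-measure over retrieved/relevant sentence sets.
def evaluate(retrieved, relevant):
    overlap = len(set(retrieved) & set(relevant))
    recall = overlap / len(relevant) if relevant else 0.0
    precision = overlap / len(retrieved) if retrieved else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return recall, precision, f_measure

# Sentences are identified by index here; any hashable id would do.
print(evaluate(retrieved=[1, 3, 5], relevant=[1, 2, 3, 4]))
# (0.5, 0.666..., 0.571...)
```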

1.8 GENERAL ARCHITECTURE

The figure below presents the general architecture of the automatic document summarization method, which consists of three phases: (i) preprocessing, (ii) feature vector extraction, and (iii) the summarization process.

[Figure: General architecture of automatic document summarization. Input Documents → Preprocessing → Feature Vector Extraction → Summarization Process → Summary]

Preprocessing

To achieve an excellent summarization system, preprocessing must be done efficiently. The input documents used for ADS are subjected to a set of preprocessing steps: sentence segmentation, tokenization, stop word removal, and stemming.

Feature Vector Extraction

The preprocessed input documents are used as input for feature


vector extraction. It is performed for every sentence in each input document.
The features will be defined as the characteristics for each sentence, which
will be help full for identifying the sentences according to the relevance. The
selected feature vectors will be used to construct a sentence matrix. The
matrix contains the features vectors of each sentence. The row of the matrix
represents the sentences present in the document and column represents the
features extracted from the input document.
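A minimal sketch of the sentence matrix follows, using NumPy; the numbers are invented feature values for three hypothetical sentences, and the mean-score ranking is only one simple way a summarizer might use the rows.

```python
# Sentence matrix: rows are sentences, columns are extracted features.
import numpy as np

sentence_matrix = np.array([
    [1.00, 0.80, 0.50, 0.30],   # sentence 1
    [0.67, 1.00, 0.00, 0.45],   # sentence 2
    [0.33, 0.60, 0.25, 0.40],   # sentence 3
])

# A summarizer ranks sentences by scoring the rows of the matrix.
scores = sentence_matrix.mean(axis=1)
print(scores.argsort()[::-1])   # sentence indices ranked best-first
```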

Summarization Process

The extracted feature vectors form the input to the summarization step. With the help of the summarization techniques, the resulting summary is generated as the output.

This is the general architecture for summarization followed throughout this thesis.

1.9 PROBLEM STATEMENT



At present, a plethora of data is available through the Internet and related sources. To handle the relevant data efficiently, there is an indispensable need for a mechanism that extracts an appropriate group of sentences from the specified documents. Summarization of content is a must for obtaining useful information from mammoth collections of documents. The arrival of the World Wide Web has changed our lives so profoundly that it is impossible to spend even a second devoid of data, and it is humanly impossible to memorize the details of every piece of information. As a result, summarization of text documents has begun to play a significant part in information gathering.

Content is extracted from a source, and the most noteworthy data is offered to the user in a miniature form and in the manner that is most user-friendly, or as the application requires. This procedure can be employed in several areas such as information retrieval, intelligence gathering, information extraction, text mining, and indexing.

Text summarization, in essence, is the task of reducing a given text to an abridged version while keeping its fundamental content intact and thereby conveying the intended meaning. Single-document summarization focuses on a solitary document, whereas multi-document summarization condenses not a single document but a collection of related documents into a single synopsis. This is easier said than done, and at times it becomes extremely difficult to achieve the desired objective. A major portion of the methods used in single-document summarization carries over to multi-document summarization. However, certain glaring differences surface: the degree of redundancy in a set of topically associated articles is significantly higher than the degree of redundancy within a single article, because each article presents the most pertinent points along with the requisite shared background. Consequently, anti-redundancy techniques play a critical role. The compression ratio, which is the ratio of the summary size to the size of the whole document set, tends to be appreciably lower for a gigantic collection of topically linked documents than for single-document summaries, and as compression demands rise, summarization becomes an alarmingly large challenge. The co-reference problem also poses more serious issues for multi-document than for single-document summarization. The quality of the synopsis is sensitive to the features used and to the way in which the sentences are ranked according to those features. Accordingly, evaluating the effectiveness of each feature goes a long way toward helping the system separate the attributes into higher and lower priority.

This thesis brings to light a novel technique that effectively tackles extractive, query-based multi-document summarization. The working assumption is that the input documents share a general topic, and the proposed approach must create the summary in accordance with distinct constraints such as the compression rate and the user query. A multi-document summarization methodology backed by a deep learning algorithm is then presented, in which the deep learning algorithm trains a Restricted Boltzmann Machine and retrieves the important concepts layer by layer. A fuzzy system is used to assign balanced weights to the extracted features using inference rules, and genetic and particle swarm algorithms are integrated to optimize the rules produced by the fuzzy system. A sequence of measures is integrated to achieve accuracy in summary creation by the proposed system, which generates a summary from numerous documents with diverse numbers of sentences. The proposed system mainly concentrates on (i) centrality, which is used to retrieve sentences from the wide scope of the input document set, and (ii) diversity, which gives priority to incorporating dissimilar sentences from the input document set.

1.10 ORGANIZATION OF THE THESIS

Chapter Two presents a literature survey of prominent methods related to automatic multi-document summarization.

Chapter Three describes automatic multi-document summarization through a deep learning algorithm using the Restricted Boltzmann Machine.

Chapter Four focuses on the deep learning algorithm integrated with fuzzy logic for multi-document text summarization.

Chapter Five discusses the deep learning algorithm integrated with fuzzy logic and genetic particle swarm optimization for multi-document text summarization.

Chapter Six concludes the findings of this research work and provides future directions for research.

1.11 SUMMARY
