Abstract
Automatic text summarization is a technique which compresses a large text into a shorter text that retains the important information. Hindi is the most widely used language in India and in a few neighboring countries, yet there is a lack of a proper summarization system for Hindi text. Hence, in this paper, we present an approach to the design of an automatic text summarizer for Hindi that generates a summary by extracting sentences. It deals with single-document summarization based on a machine learning approach. Each sentence in the document is represented by a set of features, namely: sentence paragraph position, sentence overall position, numeric data, presence of inverted commas, sentence length, and keywords in sentences. The sentences are classified into one of four classes: most important, important, less important, and not important. The classes in turn have ranks from 4 to 1 respectively, with 4 indicating the most important sentence and 1 the least relevant. Next, the supervised machine learning tool SVMrank is used to train the summarizer to extract important sentences based on the feature vector. The sentences are ordered according to the ranking of classes; then, based on the required compression ratio, sentences are included in the final summary. The experiments were performed on news articles of different categories such as Bollywood, politics, and sports. The performance of the technique is compared with human-generated summaries. The average result of the experiments indicates 72% accuracy at a 50% compression ratio and 60% accuracy at a 25% compression ratio.
Keywords: Hindi Text Summarization; Supervised Machine Learning; SVM; Text Mining; Sentence Extraction; Summary Generation
1. INTRODUCTION
The rapid growth of the Internet has resulted in an enormous amount of natural-language information that has become increasingly difficult to access efficiently. In a fast-moving world it is difficult to read all available text content. Summarization techniques have been under research for the last 50 years, and with the increased use of the Internet and electronic documents, the need for fast and reliable summarization has grown rapidly. Automatic text summarization is a technique which compresses a large text into a shorter text that includes the important information, subjects, and various elements of the original text. The computer is given a text and returns a summary of the original. Text summarization is useful in day-to-day life: for instance, reviews of movies, news headlines, abstracts of technical papers, and book reviews. It is also increasingly used in the commercial sector, for example in text-database mining, web-based information retrieval, the telephone communication industry, and word-processing tools. Automatic text summarization is one of the important steps in information management tasks, as it addresses the problem of selecting the most important parts of a text.
The output summary can be of two types: extractive and abstractive. Extractive summaries are produced by extracting whole sentences from the source text. The importance of sentences is determined based on statistical and linguistic features of the sentences.
Volume: 05 Issue: 06 | Jun-2016, Available @ http://ijret.esatjournals.org
3. PROPOSED METHOD
The goal of automatic text summarization is to select the most important sentences of the Hindi text document. The proposed method uses various statistical features to find the most important sentences. It also makes use of SVMrank, a machine-learning tool, to rank the sentences. The general outline of the methodology used for this task is described below.
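As a rough sketch of how sentences and their ranks can be handed to SVMrank: the tool reads a plain-text file where each line holds a target rank, a query id (`qid`) grouping the sentences of one document, and numbered feature:value pairs. The feature values below are made-up placeholders, not the paper's data; the 4..1 targets mirror the paper's four importance classes.

```python
# Sketch: writing a training file in the SVMrank (svm_rank_learn) input
# format. SVMrank only compares lines that share the same qid, so each
# document gets its own qid. Feature values here are illustrative.

def write_svmrank_file(path, documents):
    """documents: list of documents; each document is a list of
    (rank, feature_vector) pairs for its sentences."""
    with open(path, "w", encoding="utf-8") as f:
        for qid, sentences in enumerate(documents, start=1):
            for rank, features in sentences:
                feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
                f.write(f"{rank} qid:{qid} {feats}\n")

# Two toy documents, six features per sentence (f1..f6 as in the paper).
docs = [
    [(4, [1.0, 1.0, 0.0, 0.0, 0.6, 0.8]),   # most important
     (2, [0.5, 0.5, 0.0, 1.0, 0.4, 0.2]),   # less important
     (1, [0.0, 0.1, 1.0, 0.0, 0.2, 0.0])],  # not important
    [(3, [0.8, 0.9, 0.0, 0.0, 0.5, 0.6]),
     (1, [0.2, 0.3, 0.0, 0.0, 0.3, 0.1])],
]
write_svmrank_file("train.dat", docs)
```

The resulting `train.dat` can then be passed to `svm_rank_learn`, and a learned model applied to unseen documents with `svm_rank_classify`, which emits one score per sentence.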
Input: A text file (original, Og).
Output: A summarized text (S) of the original Og, as per the compression ratio.
1. Read the input text file Og.
2. Pre-process the file Og.  // Preprocessing step
   2.1 Segment the text file into sentences.
   2.2 Tokenize each sentence into words.
   2.3 Remove stop-words.
3. // Processing step
   3.1 Extract the following features from the Og file: Sentence Paragraph Position (f1), Sentence Overall Position (f2), Numeric Data (f3), Presence of Inverted Commas (f4), Sentence Length (f5), Keywords in Sentences (f6).
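The final extraction step of the outline above can be sketched as follows: given one score per sentence (as SVMrank's `svm_rank_classify` emits, one per line), keep the top-scoring sentences for the requested compression ratio and restore them to document order. The function name and the choice to emit sentences in document order are this sketch's assumptions, not the paper's exact procedure.

```python
# Sketch: select top-ranked sentences according to the compression ratio.

def extract_summary(sentences, scores, compression_ratio=0.5):
    k = max(1, round(len(sentences) * compression_ratio))
    # Rank sentence indices by score and keep the best k...
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # ...then re-sort by position so the summary reads in original order.
    return [sentences[i] for i in sorted(best)]

print(extract_summary(["S1", "S2", "S3", "S4"], [0.2, 0.9, 0.1, 0.7], 0.5))  # → ['S2', 'S4']
```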
3.1 Pre-Processing
In the pre-processing stage of the proposed technique, the text is first split into sentences, the sentences are then broken into words, and stop words are removed. The stage involves three steps: 1) segmentation, 2) tokenization, 3) stop-word removal.
3.1.1 Segmentation
In segmentation, the input text is split into individual sentences. Hindi sentence boundaries are identified by the sentence-ending delimiter purna viram (।).
3.1.2 Tokenization
In the tokenization step, sentences are broken down into words. In Hindi, sentences are tokenized by identifying the spaces and commas between words. A list of the resulting words, also called tokens, is maintained for further processing.
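The pre-processing stage described above can be sketched in a few lines. The stop-word list below is a tiny illustrative placeholder, not the paper's actual list.

```python
# Sketch of pre-processing: segmentation on the purna viram ("।"),
# tokenization on spaces and commas, and stop-word removal.
import re

STOP_WORDS = {"है", "का", "की", "के", "और", "में"}  # illustrative subset

def preprocess(text):
    # Segmentation: split on the Hindi sentence delimiter.
    sentences = [s.strip() for s in text.split("।") if s.strip()]
    # Tokenization: split each sentence on spaces and commas.
    tokenized = [[t for t in re.split(r"[ ,]+", s) if t] for s in sentences]
    # Stop-word removal.
    filtered = [[t for t in toks if t not in STOP_WORDS] for toks in tokenized]
    return sentences, filtered

sentences, tokens = preprocess("राम स्कूल जाता है। मौसम अच्छा है।")
```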
3.2 Processing

Eqn. 1 defines the sentence paragraph position feature. In Eqn. 1, Si is the ith sentence in a paragraph, n is the total number of sentences in the paragraph, and i ranges from 0 to n.
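The exact formula of Eqn. 1 did not survive in this copy, so the function below is an illustrative assumption rather than the paper's definition: a common choice scores sentences near the start of a paragraph highest and decays linearly with position.

```python
# Illustrative sentence-paragraph-position feature (f1). Assumed linear
# decay with position; Si's 0-based index is i, n sentences in paragraph.

def paragraph_position_feature(i, n):
    return (n - i) / n

print(paragraph_position_feature(0, 5))  # first sentence of a 5-sentence paragraph → 1.0
```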
[Eqns. 2-6: definitions of the remaining sentence features]

3.3 Extraction

4. EXPERIMENTS

4.2 Evaluation
It is necessary to check the accuracy, relevance, and usefulness of the summary created by an automatic summarizer. For evaluating the summarization results, accuracy has been used as the evaluation parameter. The generated summaries are evaluated against human-generated summaries: experts were asked to create summaries of our collected dataset. Accuracy is the percentage of sentences that are correctly classified for inclusion in the summary [14]:

Accuracy (%) = (Number of correctly classified sentences / Total sentences in summary) x 100
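The accuracy measure above can be computed directly; the sentence labels below are toy placeholders standing in for the real summary sentences.

```python
# Accuracy as defined above: the percentage of system-summary sentences
# that were correctly classified, i.e. that also appear in the human summary.

def accuracy(system_summary, human_summary):
    human = set(human_summary)
    correct = sum(1 for s in system_summary if s in human)
    return 100.0 * correct / len(system_summary)

# Example mirroring a row of the result tables: 8 of 11 sentences correct.
sys_sents = [f"s{i}" for i in range(11)]
human_sents = [f"s{i}" for i in range(8)]
print(round(accuracy(sys_sents, human_sents)))  # → 73
```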
Accuracy for the Bollywood articles at 25% compression ratio: 50%, 57%, 75%, 72%, 75%, 63%, 75%, 83%, 67%, 40% (average 66%).

Table: Per-article results for the Politics articles

50% compression ratio:
  Sentences in summary:   11   13   10   20   17   10   17   13   13   13
  Correctly classified:    8   10    7   15   13    7   12   10   10   10
  Accuracy:               73%  77%  70%  75%  77%  70%  71%  77%  77%  77%   (average 74%)

25% compression ratio:
  Sentences in summary:    5    7    5   10    9    5    9    7    6    7
  Correctly classified:    4    4    4    6    5    3    7    5    3    3
  Accuracy:               80%  57%  80%  60%  56%  60%  78%  72%  50%  43%   (average 64%)
Table: Per-article results for the sports articles, 50% compression ratio

Sr No.   Total sentences   Sentences in summary   Correctly classified
  1            22                   11                      7
  2            24                   12                      9
  3            19                   10                      7
  4            24                   12                      9
  5            26                   13                     11
  6            22                   11                      8
  7            25                   13                      9
  8            17                    8                      4
  9            22                   11                      8
 10            15                    8                      6

Table: Average accuracy by category

Category     50% compression ratio   25% compression ratio
Bollywood            71%                     66%
Politics             74%                     63%
Sports               71%                     50%
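The overall figures reported in the abstract (72% at a 50% compression ratio, 60% at 25%) follow from averaging the three category results above:

```python
# Averaging the per-category accuracies from the table above.
per_category = {"Bollywood": (71, 66), "Politics": (74, 63), "Sports": (71, 50)}

avg_50 = sum(a for a, _ in per_category.values()) / 3
avg_25 = sum(b for _, b in per_category.values()) / 3
print(round(avg_50), round(avg_25))  # → 72 60
```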
5. CONCLUSION
This paper discusses single-document automatic text summarization for Hindi text using a machine learning technique. As expected, it is clear from the experimental results that the performance in each of the subtasks directly affects the ability to generate high-quality summaries. It is also noteworthy that summarization becomes more difficult as more compression is required.
In future, more features such as named entity recognition, cue words, context information, and world knowledge can be added to improve the technique. The same technique can also be applied to domains other than news, and the effects of various domain characteristics on the suggested features and on the overall performance of the technique can then be studied. It can also be extended to work on multiple documents. Finally, it would be interesting to explore suitable machine learning classifiers other than SVM for the task.
ACKNOWLEDGMENT
We would like to thank Mrs. Usha Parikh and Mr. Aashish Shah, expert Hindi language teachers, for donating their valuable time to help in the evaluation of the summaries.
Per-article accuracy for the sports articles at 50% compression ratio: 64%, 75%, 70%, 75%, 85%, 73%, 69%, 50%, 73%, 75% (average 71%).

Per-article results for the sports articles at 25% compression ratio:
  Sentences in summary:    6    6    5    6    7    6    6    4    6    4
  Correctly classified:    2    2    2    3    5    4    3    2    2    3
  Accuracy:               33%  33%  40%  50%  72%  67%  50%  50%  33%  75%   (average 50%)
REFERENCES
[1]. Hans Peter Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development, 2(2):159-165, 1958.
[2]. P. B. Baxendale, Machine-made index for technical literature - an experiment, IBM Journal of Research and Development, vol. 2, no. 4, pp. 354-361, October 1958.
[3]. Harold P. Edmundson, New methods in automatic extracting, Journal of the ACM (JACM), 16(2):264-285, 1969.
[4]. Julian Kupiec, Jan Pedersen and Francine Chen, A trainable document summarizer, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 68-73, 1995.
[5]. Vishal Gupta and Gurpreet Singh Lehal, A survey of text summarization extractive techniques, Journal of Emerging Technologies in Web Intelligence, vol. 3, pp. 258-268, 2010.
[6]. Chin-Yew Lin, Training a selection function for extraction, Proceedings of the Eighth International Conference on Information and Knowledge Management, ACM, pp. 55-62, 1999.
[7]. J. M. Conroy and D. P. O'Leary, Text summarization via hidden Markov models, Proceedings of SIGIR '01, pp. 406-407, 2001.
[8]. Md. Haque, Suraiya Pervin and others, Literature review of automatic single document text summarization using NLP, International Journal of Innovation and Applied Studies, vol. 3, no. 3, pp. 857-865, 2013.
[9]. Ladda Suanmali, Mohammed Salem Binwahlan and Naomie Salim, Sentence features fusion for text summarization using fuzzy logic, Ninth International Conference on Hybrid Intelligent Systems, IEEE, vol. 1, pp. 142-146, 2009.
BIOGRAPHIES
Nikita Desai has been an Associate Professor in the IT department since July 2008. She has more than 12 years of teaching experience. Her major expertise is in compiler construction, natural language processing, and algorithm analysis.

Ms. Prachi Shah was a postgraduate student in the Masters program of the Information Technology department of Dharmsinh Desai University. She received her M.Tech degree in May 2016.