1 Introduction

A summary can be defined as a text that is produced from one or more texts, that contains a
significant portion of the information in the original text(s), and that is no longer than half of the
original text(s). Text summarization is the process of distilling the most important information
from a source (or sources) to produce an abridged version for a particular user (or users) and task
(or tasks). When this is done by means of a computer, i.e. automatically, I call this Automatic
Text Summarization (ATS).

Automatic text summarization is the technique in which a computer summarizes a text. A text is
entered into the computer and a summarized text is returned, which is a non-redundant extract
from the original text. The technique has its roots in the 1960s and has been developed over 30
years, but today, with the Internet and the WWW, it has become more important.

Today there are basically two common approaches to text summarization.
The first approach (abstraction) tries to analyze the text and to rewrite or rephrase it in a shorter
form. The second approach (extraction) tries to extract the key sentences from the text by using a
text ranking algorithm and then tries to put them together properly. The first approach has not
achieved any substantial results so far.

Automatic text summarization can be used for:

Summarizing newspaper text (for journalists, business intelligence, technology intelligence, etc.)

Summarizing reports (for parliament members, investigators, businessmen, etc.)

Search engines, to extract keywords and to obtain summaries of the found text

Searching in foreign languages and obtaining an automatic summary of the machine-translated
text

Extracting keywords and summaries of email for SMS on mobile phones

Summarizing text which has been downloaded from the Internet from a WAP mobile
phone

Letting a computer read summarized www pages on a mobile phone.

In Ethiopia, various attempts have been made to develop automatic text summarization (ATS)
systems for the Amharic language. These attempts summarize texts automatically, using
different approaches and algorithms, and display the summarized text in Amharic. This project
is intended to help users achieve their text summarization objective without difficulty.

Ethiopia is a linguistically diverse country where more than 80 languages are used in day-to-day
communication. Although many languages are spoken in Ethiopia, Amharic is dominant in that it
is spoken as a mother tongue by a substantial segment of the population and it is the most
commonly learned second language throughout the country.

The language is the official language of the federal government of the country. According to the
1998 census of the country (ECSA, 1998), Amharic is the first language of more than 17 million
people and second language for more than 5 million people.

2 Statement of problem

Information overload is a problem in this information era because of the mass production of
information in many formats, which Internet technology has accelerated. Amharic text
documents are part of this mass production.

Amharic, being the official spoken and written language of Ethiopia, is used to produce text
documents for readers. These text documents are available digitally and their number is
increasing every day. Even Google now supports searching for documents in Amharic script
using queries in Amharic script. Interested users are now spending time searching for Amharic
documents online, which provides lots of text documents and creates information overload for
users.

Few studies have been conducted locally in the area of automatic text summarization, applying
different methodologies. To find the approach that is most compatible, effective and efficient
for Amharic text, more research has to be done.

This project is part of the effort needed to fill the gap in the area. It aims to develop an
automatic text summarization system for the Amharic language using the Python programming
language.

3 Objective of the project

The main aim of this project is to develop an Automatic Text Summarization (ATS) system for
the Amharic language. The summarization process covers text documents only; other multimedia
data types are not included in the project.

4 Literature review

The first study, Amharic News Text Summarization, was done by Kamil Nuri in 2004. The
system was developed using natural language processing techniques and statistical methods.
Title words, head sentences, head sentence words, paragraph-starting sentences, cue phrases and
high-frequency keywords were used as extraction features. The evaluation of the system
showed 74.4% precision and 58% recall (Kamil, 2004). He used nine news articles that were in
printed format, since at the time the research was conducted there were no web-based Amharic
news service providers. However, since the Perl programming language he used did not
support the Amharic alphabet, he transliterated the nine news articles. That means he did not feed
Amharic characters directly to the system, so the system cannot be applied directly
to real-world Amharic text summarization.

The second work, The Application of Machine Learning Technique for Automatic Text
Summarization: The Case of Amharic News Text, was done by Teferi Andargie in 2005. He used
predefined features such as the location of a sentence in a document, title words occurring in the
sentence, and cue words occurring in the sentence, which were found to be good indicators for
producing an optimum summary. In the experiment he used a corpus of 480 news articles that had
been used by Saba (2001), Theodros (2003) and Samuel (2004) for testing different retrieval
models. A manual summary at a 30% extraction rate was prepared for the 480 articles. The naive
Bayes algorithm was used to classify sentences as summary or non-summary based on the feature
vectors (Teferi, 2005). A prototype was developed which extracts sentences to a desired
compression level. The result of the experiment shows that the location feature gives the best
result in the classification of sentences when individual features are used.

The third study, Automatic Text Summarization for Amharic Legal Judgments, was done
by Helen Adane in 2006. The study focuses on producing a prototype text summarization
system for Amharic legal judgments using the Python programming language (Helen, 2006). The
methodology employed is an extraction technique applied to Amharic legal judgments rendered by
the Supreme Court of Ethiopia. To evaluate the performance of the system, a random automatic
summary was generated at the same extraction rates (10% and 20%) used by the system for the
selected legal judgments. Using an extrinsic evaluation technique, the system summary and the
random summary were compared with an ideal summary prepared manually by legal experts. The
result showed that the sentences extracted by the system using different extraction features are
much closer to the manual summary (Helen, 2006). In the domain of the research, legal
judgments, the writing style is an inverted pyramid or downward triangle, which places the most
important part of the paragraph in its last sentence.

5 Implementation

The project mainly used Python 2.5.4 to develop and implement the system. I
prefer Python because it has many features that make it easy to write the code for the tasks
listed below. The program takes the content (source) text, performs these tasks and
automatically generates the summary.

In a text summarization system many tasks are performed on the text. The text
summarizer that I developed in my project tries to extract the key sentences from the text and
then tries to put them together properly. The tasks performed in my project to develop
the Automatic Text Summarization system are:

Segmentation of the text into paragraphs, sentences and words/tokens
Calculating the intersection of two sentences
Building the sentences dictionary
Generating the summary

1. Segmentation of the text into paragraphs, sentences and words

Segmenting the text means dividing it into discrete paragraphs, sentences and words/tokens.
My project first reads the content of the text and then splits it into its respective
paragraphs, sentences and words. The following sample code is used to split
the text into paragraphs, sentences and words/tokens.

# segmenting a text into sentences
def split_content_to_sentences(self, content):
    # Treat every line break as a sentence boundary. The replacement must
    # end with a space so that it matches the ":: " separator used below.
    content = content.replace("\n", ":: ")
    return content.split(":: ")

# segmenting a text into paragraphs
def split_content_to_paragraphs(self, content):
    return content.split("\n\n")

# segmenting the sentence into words/tokens
s1 = set(sent1.split(" "))
s2 = set(sent2.split(" "))
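
As a self-contained illustration of the segmentation step, the sketch below splits a text into paragraphs, sentences and tokens. It is not the project's code: the function names are hypothetical, it assumes Python 3 (where strings are Unicode), and it assumes that a sentence ends either with the two-colon sequence "::" that the sample code above splits on or with the Ethiopic full stop "።" (U+1362).

```python
import re

# Assumption: a sentence ends with "::" (typed form) or the Ethiopic
# full stop U+1362, optionally followed by whitespace.
SENTENCE_END = re.compile(r"(?:\u1362|::)\s*")


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split a paragraph at sentence terminators and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]


def split_words(sentence):
    """Split a sentence into tokens on any run of whitespace."""
    return sentence.split()


# Illustrative sample: two paragraphs, the first with two sentences.
sample = "ሰላም ለዓለም። ይህ ሁለተኛ ዓረፍተ ነገር ነው።\n\nይህ አዲስ አንቀጽ ነው::"
for paragraph in split_paragraphs(sample):
    for sentence in split_sentences(paragraph):
        print(split_words(sentence))
```

Dropping empty pieces at each level avoids the empty "sentences" that a plain string split produces at paragraph boundaries and after the final terminator.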

2. Calculating the intersection of two sentences

This step measures how much two sentences have in common, which is later used to rank the
sentences for summarization. A function takes two sentences as arguments and returns a score
for the intersection between them, or zero if they have nothing in common. Each sentence is
first split into words/tokens; the function then counts how many tokens the two sentences
share and normalizes the count by the average length of the two sentences to obtain the
intersection score. In this step formatted sentences are also identified, and all the sentences
are ranked according to their score. A formatted sentence is what remains after the
non-alphanumeric characters (punctuation and spaces) are removed from a sentence. The
following sample code is used to calculate the intersection of the sentences.

# Calculate the intersection between 2 sentences
def sentences_intersection(self, sent1, sent2):
    # split the sentence into words/tokens
    s1 = set(sent1.split(" "))
    s2 = set(sent2.split(" "))

    # Guard against division by zero when both token sets are empty
    if (len(s1) + len(s2)) == 0:
        return 0

    # I normalize the result by the average number of words.
    # 2.0 forces float division; with integer division (Python 2) the
    # score would be truncated to 0 or 1.
    return len(s1.intersection(s2)) / ((len(s1) + len(s2)) / 2.0)

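import re

# Format a sentence - remove all non-alphanumeric chars from the sentence
# I'll use the formatted sentence as a key in my sentences dictionary
def format_sentence(self, sentence):
    # Amharic input must be a unicode string: re.UNICODE makes \W treat
    # Amharic letters as word characters, so only punctuation and spaces go.
    sentence = re.compile(r'\W+', re.UNICODE).sub('', sentence)
    return sentence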

# ranking the sentences
def get_senteces_ranks(self, content):
    # Split the content into sentences
    sentences = self.split_content_to_sentences(content)

    # Calculate the intersection of every two sentences
    n = len(sentences)
    values = [[0 for _ in range(n)] for _ in range(n)]
    for i in range(0, n):
        for j in range(0, n):
            values[i][j] = self.sentences_intersection(sentences[i], sentences[j])
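
To make the scoring rule concrete, the following standalone sketch computes the score for short token lists and builds the pairwise matrix. It is an illustration rather than the project's code: the function names and the placeholder tokens are hypothetical, and it assumes Python 3, where `/` is true division.

```python
def intersection_score(sent1, sent2):
    """Shared distinct tokens divided by the average number of distinct tokens."""
    s1 = set(sent1.split())
    s2 = set(sent2.split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)


def score_matrix(sentences):
    """Return the n x n matrix of pairwise intersection scores."""
    return [[intersection_score(a, b) for b in sentences] for a in sentences]


# Placeholder tokens stand in for Amharic words.
sentences = ["w1 w2 w3 w4", "w3 w4 w5 w6", "w7 w8"]
matrix = score_matrix(sentences)
print(matrix[0][1])  # 2 shared tokens / average of 4 tokens = 0.5
print(matrix[0][2])  # no shared tokens = 0.0
```

The matrix is symmetric, and each diagonal entry is 1.0 because a sentence shares all of its tokens with itself, which is why the diagonal is skipped when the scores are summed.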

3. Building sentences dictionary

This step converts the content (the source text) into a dictionary that maps a key to a
rank. It receives the text as input and calculates a score for each sentence. The
calculation is performed in two passes.

In the first pass the text is split into sentences, and the intersection value between every pair
of sentences is stored in a matrix. In the second pass I calculate an individual score for each
sentence and store it in a key-value dictionary, where the (formatted) sentence is the key and the
value is the total score. The score is the sum of the sentence's intersections with all the other
sentences in the text, excluding itself. The best sentences are then selected according to the
sentences dictionary.

The following sample code is used to build the sentences dictionary.

    # Build the sentences dictionary (continuation of get_senteces_ranks)
    # The score of a sentence is the sum of all its intersections
    sentences_dic = {}
    for i in range(0, n):
        score = 0
        for j in range(0, n):
            if i == j:
                continue
            score += values[i][j]
        sentences_dic[self.format_sentence(sentences[i])] = score
    return sentences_dic

# Return the best sentence in a paragraph
def get_best_sentence(self, paragraph, sentences_dic):
    # Split the paragraph into sentences
    sentences = self.split_content_to_sentences(paragraph)

    # Ignore short paragraphs
    if len(sentences) < 2:
        return ""

    # Get the best sentence according to the sentences dictionary
    best_sentence = ""
    max_value = 0
    for s in sentences:
        strip_s = self.format_sentence(s)
        if strip_s:
            # .get avoids a KeyError if a sentence is missing from the dictionary
            value = sentences_dic.get(strip_s, 0)
            if value > max_value:
                max_value = value
                best_sentence = s
    return best_sentence
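
The two passes described in this section can be tried out with the standalone sketch below. It is a simplified illustration under stated assumptions, not the project's code: the function names and placeholder tokens are hypothetical, it uses the raw sentence (not the formatted one) as the dictionary key, and it assumes Python 3.

```python
def intersection_score(sent1, sent2):
    """Shared distinct tokens divided by the average number of distinct tokens."""
    s1, s2 = set(sent1.split()), set(sent2.split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)


def build_scores(sentences):
    """Map each sentence to the sum of its scores against all other sentences."""
    scores = {}
    for i, current in enumerate(sentences):
        scores[current] = sum(
            intersection_score(current, other)
            for j, other in enumerate(sentences)
            if i != j
        )
    return scores


def best_sentence(paragraph_sentences, scores):
    """Return the highest-scoring sentence, or "" for a paragraph of fewer than two sentences."""
    if len(paragraph_sentences) < 2:
        return ""
    return max(paragraph_sentences, key=lambda s: scores.get(s, 0.0))


# Placeholder tokens stand in for Amharic words.
sentences = ["w1 w2 w3", "w2 w3 w4", "w3 w4 w5", "w6 w7"]
scores = build_scores(sentences)
print(best_sentence(sentences[:3], scores))  # w2 w3 w4
```

The middle sentence wins because it overlaps with both of its neighbours, which is the intuition behind the ranking: a sentence that shares vocabulary with many other sentences is taken to be representative of the text.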

4. Generating the summary

The last task performed by my project is generating the final summary. The
summary is generated automatically: the text is split into paragraphs, sentences and words,
the intersections are calculated, and the sentences dictionary is built, which holds each
sentence's intersection score. Finally, the summary is generated by selecting the best sentence
from each paragraph according to the sentences dictionary. In addition to the summary, the
system also reports the ratio between the summary length and the original text length.

The following sample code is used to generate the summary.

# to generate the summary
def get_summary(self, title, content, sentences_dic):
    # Split the content into paragraphs
    paragraphs = self.split_content_to_paragraphs(content)

    # Add the title
    summary = []
    summary.append(title.strip())
    summary.append("")

    # Add the best sentence from each paragraph
    for p in paragraphs:
        sentence = self.get_best_sentence(p, sentences_dic).strip()
        if sentence:
            summary.append(sentence)
    return ("\n").join(summary)
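
The ratio between the summary length and the original text length is not shown in the sample code, so the sketch below gives one possible way to compute it. The function name is hypothetical, and measuring length in characters is an assumption; the ratio could equally be measured in words or sentences. It assumes Python 3.

```python
def summary_ratio(original, summary):
    """Return the summary length as a fraction of the original length, in characters."""
    if not original:
        return 0.0
    return len(summary) / len(original)


# Placeholder text: three short sentences, one of which is kept as the summary.
original = "w1 w2 w3:: w2 w3 w4:: w3 w4 w5::"
summary = "w2 w3 w4::"
ratio = summary_ratio(original, summary)
print("Original length:", len(original))
print("Summary length:", len(summary))
print("Summary ratio: {:.0%}".format(ratio))
```

A lower ratio means a more compressed summary; on its own it says nothing about whether the right sentences were kept.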

6 Challenges of the project

One of the biggest challenges that I encountered in doing this project was the lack
of references on text summarization for the Amharic language, together with my lack of
experience with the tools required to develop the text summarization system.
Another challenge that made the task hard was the shortage of time given to accomplish it.

7 Limitations of the project

The following are the limitations of the project, which result from
the challenges described above.

I didn't use any software package such as NLTK to do the segmentation
automatically.
My project doesn't perform stemming or stop word removal.
I didn't use any evaluation method other than calculating the ratio of the original and
summarized texts.

8 Future Work

The project accomplished the basic goals it was proposed to achieve, but because of the
challenges above some features were not developed. I recommend the following for others
who continue this work:

It is important to develop the system using software packages such as
NLTK.
It would be better to perform abstractive summarization rather than extractive
summarization.
It would be better to perform stemming and stop word removal to increase the performance
of the system.
There is no graphical user interface (GUI) for the system; I recommend that anyone
interested develop one.
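
As one possible starting point for the stop word recommendation, the sketch below filters tokens against a stop word set before computing sentence overlap. The project itself does not include this step. The stop word list is a tiny illustrative sample and not a vetted Amharic resource, the function names are hypothetical, and Python 3 is assumed.

```python
# Illustrative sample only; a real system would need a curated Amharic stop word list.
STOP_WORDS = {"እና", "ነው", "ላይ", "ወደ", "ግን"}


def content_tokens(sentence, stop_words=STOP_WORDS):
    """Return the set of tokens in a sentence that are not stop words."""
    return {token for token in sentence.split() if token not in stop_words}


def filtered_intersection_score(sent1, sent2, stop_words=STOP_WORDS):
    """Overlap score computed on content tokens only."""
    s1 = content_tokens(sent1, stop_words)
    s2 = content_tokens(sent2, stop_words)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)


# Two sentences that share only the stop word "ነው" no longer count as similar.
print(filtered_intersection_score("ይህ መጽሐፍ ነው", "ያ ቤት ነው"))  # 0.0
```

Without the filter, very frequent function words would raise the overlap score of almost every sentence pair, which blurs the ranking.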
