
School of Computing

Jimma Institute of Technology


Jimma University

Amharic Language Query Processing in Database


Using Natural Language Interface

By
Smegnew Asemie

August, 2008
Jimma, Ethiopia


A Thesis Submitted to the School of Computing of Jimma University
in Partial Fulfillment for the Degree of Master of Science in Information Technology

August, 2008
Jimma, Ethiopia

JIMMA UNIVERSITY
SCHOOL OF COMPUTING
Amharic Language Query Processing in Database
Using Natural Language Interface
By
Smegnew Asemie

Name and signature of members of the examining board


Name                                   Title         Signature            Date
1. ___________________________________ Chairperson   _______________      __________
2. Getachew Mamo                       Advisor       __________________   __________
3. ___________________________________ Examiner      _______________      __________

DEDICATION
This thesis is dedicated to my sister Mulusew Asemie and her husband Abebe Alemu. They have
always been great role models for me because of their commitment, and they always encourage me
to make my best effort to realize my dreams.

ACKNOWLEDGEMENT
Before all, I praise the almighty God and his mother St. Mary for making everything the way it
is. My greatest gratitude is extended to my advisor Mr. Getachew Mamo (assistant professor)
and my co-advisor Mr. Tefery Kebebew for their positive encouragement before the work
started and their constructive comments and guidance after it began. They sacrificed
their time for ongoing discussions during the difficult period of selecting a title for the study.
It is also my pleasure to express my thanks to Dr. Millita Luke and Debela Tesfaye for their
constructive ideas on the area at the time of title selection and afterwards.
I would also like to thank my special friend Abeba Hailemaryam for all that she did over the past
two years. She treated me as the best of sisters treats her brothers, and she taught me how to
make a friend.
My deepest thanks go to all my friends and classmates, especially Workineh Tesema, Zerihun
Olana, Andualem Chekol, and Birhanu Ambes; our relationship is more than friendship. Besides
this, I would like to acknowledge all staff members of the School of Computing for their help in
completing this research on time.
Special thanks also go to my family, mainly my mother Dejitnu Abate, my sisters Mulusew
Asemie, Tirusew Asemie, and Birtukan Asemie, and my brother Gedefaw Mola, who have been
behind me, supporting and encouraging me through difficult times.


Table of Contents
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS
ABSTRACT

CHAPTER ONE
INTRODUCTION
1.1. Background
1.2. Statement of the Problem
1.3. Objective of the Study
1.3.1. General Objective
1.3.2. Specific Objective
1.4. Methodology
1.4.1. Literature Review
1.4.2. Data Collection
1.4.3. Research Method
1.4.4. Tools and Techniques
1.4.5. Evaluation
1.5. Scope of the Study
1.6. Significance of the Study
1.7. Organization of the Study

CHAPTER TWO
LITERATURE REVIEW
2.1. Survey on Natural Language Processing (NLP) Applications
2.2. Information Retrieval (IR) System
2.3. Question Answering (QA) System
2.3.1. Open & Closed Domain QA System
2.3.2. Statistical or Semantic QA System
2.3.3. Dialogue System
2.3.4. Database System
2.4. Natural Language Interface to Database (NLIDB)
2.4.1. Components of NLIDB
2.4.2. Techniques Used for Developing NLIDB
2.4.2.1. Pattern-Matching Systems
2.4.2.2. Syntax-Based Systems
2.4.2.3. Semantic Grammar Systems
2.4.2.4. Intermediate Representation Languages
2.4.3. Advantage and Disadvantage of NLIDB
2.4.4. Most Commonly Used Architecture of NLIDB
2.5. Related Works

CHAPTER THREE
THE AMHARIC WRITING SYSTEM
3.1. Introduction
3.2. Amharic Alphabet
3.3. Amharic Punctuation
3.4. Amharic Numbers
3.5. Syntactic Structure of Amharic
3.6. Problems in Retrieving Amharic Text
3.6.1. Redundancy of Some Characters
3.6.2. Formation of Compound Words
3.6.3. Existence of Irregular Spelling
3.7. Amharic Software

CHAPTER FOUR
METHODS AND ALGORITHMS
4.1. Architecture of the System
4.2. Natural Language Interface for Text Retrieval from DB
4.3. Query Pre-processing
4.4. Semantic Analysis
4.5. Mapping of the User Query
4.6. SQL Query Generation
4.6.1. Algorithm to Handle Rules
4.7. SQL Query Execution
4.8. Result

CHAPTER FIVE
IMPLEMENTATION, RESULTS AND DISCUSSIONS
5.1. Introduction
5.2. The User Interface Startup and Operation
5.3. Experiment on Query for Selection of the Whole Table
5.4. Queries for Selection of Certain Column
5.5. Queries with a Single Condition
5.6. Queries with Multiple Condition
5.7. Join Queries
5.8. Aggregate Function
5.9. Grouping and Ordering Queries
5.10. Evaluation of the System
5.10.1. Analysis of Results
5.10.1.1. Databases Used for Evaluation
5.10.1.2. Analysis Based on Question Category
5.10.1.3. Overall Measurement

CHAPTER SIX
CONCLUSIONS, RECOMMENDATION AND FUTURE WORKS
6.1. Conclusions
6.2. Recommendation and Future Research Works

REFERENCE

LIST OF TABLES
Table 2.1: Sample database table
Table 3.1: The Amharic Character Representation
Table 4.1: Sample stemming table
Table 4.2: A Sample for compound words
Table 4.3: Table_Handling Table
Table 4.4: Column_Handling Table
Table 4.5: Conditional_Word Table
Table 5.1: Employee table structure
Table 5.2: Department table structure
Table 5.3: Employee on education structure
Table 5.4: List query request and results
Table 5.5: Accuracy of List Query
Table 5.6: Single conditional query and results
Table 5.7: Accuracy of single conditional query
Table 5.8: Multiple conditional query and results
Table 5.9: Accuracy of multiple condition query
Table 5.10: Aggregate function query and results
Table 5.11: Accuracy of aggregate function


LIST OF FIGURES
Figure 2.1: Vertical Taxonomy of Information Retrieval model
Figure 2.2: Intermediate representation language architecture
Figure 2.3: Commonly used architecture of NLIDBs
Figure 4.1: Architecture of Amharic Language Interface for database


LIST OF ACRONYMS
AI: Artificial Intelligence
ASCII: American Standard Code for Information Interchange
ATN: Augmented Transition Network
CLIR: Cross Language Information Retrieval
DB: Database
DBMS: Database Management System
EAGLi: Engine for Answers in Genomics Literature
ECoSA: Ethiopian Computer Standards Association
Elfsoft: English Language Frontend Software
eLSSNL: eLibrary Searching System by Natural Language
FAM: File Access Manager
GINLIDB: Generic Interactive Natural Language Interface to Database
HCI: Human Computer Interaction
HMM: Hidden Markov Model
IDA: Intelligent Data Access
IE: Information Extraction
INLAND: Informal Natural Language Access to Navy Data
IQR: Intermediate Query Representation
IR: Information Retrieval
MASQUE: Modular Answering System for Queries in English


NLI: Natural Language Interface


NLIDB: Natural Language Interface to Database
NLP: Natural Language Processing
PDA: Personal Digital Assistant
PLANES: Programmed Language-based Enquiry System
QA: Question Answering
SQL: Structured Query Language
START: SynTactic Analysis using Reversible Transformations


ABSTRACT
In the present computing world, computer-based information technologies are used extensively
to help organizations, private companies, and academic and educational institutions manage
their processes and information systems. Information systems are used to manage data.
A general information management system that is capable of managing several kinds of data
stored in a database is known as a Database Management System (DBMS). A DBMS is a
collection of interrelated data and a set of programs to access those data.
Database systems are designed to manage large bodies of information.
To allow access to databases without knowledge of SQL, researchers have been working on
natural language interfaces to databases (NLIDBs) since 1976. As the name suggests, an NLIDB
allows an ordinary user to query a database in natural language. In this
paper, we propose an Amharic Natural Language Interface to Database (ANLIDB). Here, making a
request is as simple as asking a human to do so in a local language (Amharic). We also deal
with the Amharic alphabet, punctuation, and syntactic structure, and with problems in
retrieving Amharic text.
We have designed and developed an interface in the local language so that users can
easily use the system without knowledge of English or SQL. To this end, we have developed
an algorithm to efficiently map Amharic language into
Structured Query Language (SQL). We divided the algorithm into three parts: an algorithm to
handle select queries, an algorithm to handle conditional queries, and an algorithm to handle
aggregation.
The algorithm has been implemented in Java and tested on a Human Resource Management
(HRM) database containing the Employee, Department, and Employee on education tables. The
prototype can handle list queries, conditional queries, and aggregate functions. The accuracy of
the system is measured in terms of precision percentage, with each query response classified as
either Correct or Incorrect. The prototype system achieves good performance, and the
overall efficiency of the system is observed to be 91%.

Keywords: NLIDB, Amharic Language Interface to Database, Natural Language Processing (NLP)


CHAPTER ONE
INTRODUCTION
1.1. Background
Information has long played an important role in our lives; most people
try to get the information they need before making a decision. Recently, with the growth of
technologies such as computers and laptops, personal digital assistants (PDAs), cellular phones,
and the Internet, information can be accessed almost anywhere, at any time, by anybody,
including those who do not necessarily have a computer background. One of the major sources of
information is the database. A database contains a collection of related data, stored in a
systematic way to model part of the world. In order to extract information from a database, one
needs to formulate a query in such a way that the computer will understand it and produce the
desired output. However, not everybody is able to write such queries, especially those who lack a
computer background [1].
Language is the primary means of communication used by humans. Natural Language
Processing (NLP) is a technique that enables a computer to understand a natural language
and communicate easily with a human being. It is also becoming one of the most active
techniques used in Human Computer Interaction (HCI). In the context of HCI, there are many
NLP applications, such as Information Retrieval Systems,
Information Extraction, Speech Recognition, Language Translation, Question Answering (QA),
Natural Language Interface to Database (NLIDB), and Dialog Systems [4].
In our day-to-day activities, the computer plays an important role in minimizing workload and
completing tasks on time. Unlike most user-computer interfaces, a Natural Language Interface
allows users to communicate fluently with a computer system with very little preparation. Even
though natural language may be the easiest symbol system for people to learn and use, it has
proved to be the hardest for a computer to master.

The Internet is the largest data provider today, and it caters to users of all kinds. The
vastness of the data makes it mandatory that data is saved in an organized manner so that it is
easy to search, retrieve and maintain [1]. For this purpose, the most logical and commonly used
storage method is the database. To access data in a database, knowledge of a
database language is required.
To enable database queries to be performed by users with little or no SQL querying ability,
companies like Elfsoft (English Language Frontend Software, which has developed SQL Tutor)
have analyzed the abilities of Natural Language Processing to develop products that let people
interact with a database in simple English. Such a product enables a user to simply enter queries
in English; this kind of application is known as a Natural
Language Interface to a Database (NLIDB) [1].
An NLIDB deals with the representation of a user request to a database in the user's native
language. The NLIDB maps the user request into standard SQL to retrieve the desired results from
the target database. The purpose of this interface is to facilitate access by hiding the
complexities of database query language syntax. The user writes a request much like an
email message and submits it to the NLIDB system. The system then interprets the request and
translates it into an accurate database query so that precise results can be retrieved. The result
is an easy-to-use interface that lets diverse users access data. There
is a need to design and develop such an interface in the local language so that a user without
knowledge of English or SQL can easily use the system.
The problem of natural language access to a database is divided into two sub-components [2]:
a linguistic component and a database component. The linguistic component translates the natural
language input sentence into a formal query and, after the database search, generates a natural
language response. The database component performs traditional database management
functions. A question entered in natural language is translated into a formal query. This query
is then processed by the database management system, and the result is returned
to the natural language component, where generation routines produce a surface-language
version of the response.
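
The division of labour between the two components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the one-entry question lookup, and the `employee` table are hypothetical and do not reproduce the thesis implementation (which is in Java); SQLite stands in for the DBMS.

```python
import sqlite3

# Hypothetical one-entry lookup standing in for the real translation step.
KNOWN_QUESTIONS = {
    "list all employees": "SELECT name FROM employee ORDER BY name",
}

def translate(question):
    """Linguistic component, step 1: natural language -> formal query."""
    return KNOWN_QUESTIONS[question.strip().lower()]

def run_query(conn, sql):
    """Database component: ordinary DBMS query execution."""
    return conn.execute(sql).fetchall()

def generate_response(rows):
    """Linguistic component, step 2: result rows -> surface-language answer."""
    return ", ".join(row[0] for row in rows)

def answer(conn, question):
    """Full round trip: question -> formal query -> rows -> response text."""
    return generate_response(run_query(conn, translate(question)))
```

The point of the sketch is the data flow: the database component never sees natural language, and the linguistic component never touches storage.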

1.2. Statement of the Problem


A database is a technology that stores data in a logical and organized manner. To access
information from a database, one needs knowledge of a database query
language such as SQL. Since novice users may not be aware of the syntax of SQL and the
structure of the database, they may not be able to write SQL queries. As a result, computer users
have always needed ways to minimize the communication gap between them and the
computer. A Natural Language Interface to Database (NLIDB), a system in which users can
access information stored in a database by typing requests in some natural language (e.g.
English), is one mechanism that supports easier human-computer
interaction [2], [7].
Building computational models with human language processing abilities requires knowledge of
how humans acquire, store and process language. It also requires knowledge of the world and of
language. Companies have approached the problem of extracting data from a Database Management
System (DBMS) by using tools like MS Access, Oracle and others. A person with no
knowledge of Structured Query Language (SQL) may find himself or herself handicapped in
working with these tools. This indicates the need to develop products that let people interact
with the database in their own native language [8].
Amharic was the national language of Ethiopia until 1983 E.C. [49]. Currently it is the official
working language of the Federal Democratic Republic of Ethiopia and thus has official status
nationwide. It is also the official or working language of several of the states/regions within the
federal system, including Amhara, Gambela, Benishangul-Gumuz, Oromia and the multi-ethnic
Southern Nations, Nationalities and Peoples region. The language is spoken as a mother tongue
by a large segment of the population in the northern and central regions of Ethiopia and as a
second language by many others. It is the second most spoken Semitic language in the world
after Arabic [50]. One of the major differences between Amharic and Semitic languages like
Arabic and Hebrew is that Amharic is written from left to right, like English [51].
Since computer operating systems and almost all application software available today
have been developed for convenient use in English only, accessing them through
other languages such as Amharic has become very difficult. For example, to list the
medicine for a particular disease from a table named information, one has to write a query like
SELECT medicine FROM information WHERE disease = 'Cold'. A user who doesn't know
SQL and can't use English may not be able to access the database in this way. So, to make
database applications easy to use for these people, we present a model and an algorithm to
convert an Amharic sentence into SQL and retrieve relevant text from a relational database. An
HRM database with Employee, Department, and Employee on education tables has been used as a
case study for developing natural language query processing for a database system. Queries are
asked as Amharic sentences to retrieve relevant information from the database.
So far, different techniques such as pattern matching, syntax-based, semantic grammar-based and
intermediate representation language systems have been used to develop NLIDBs. Among these
techniques, this study employed pattern matching and similarity checking to develop
Amharic language text retrieval from a relational database.

1.3. Objective of the Study


1.3.1. General objective
The general objective of this research was to translate Amharic sentences into Structured Query
Language (SQL) and to retrieve data from a relational database.

1.3.2. Specific Objective


To achieve the above objective, the following specific objectives have to be achieved:
To review previously implemented related works, and thereby identify and study
different approaches and architectures that have been used in language interfaces to
databases.
To understand the concept of query processing in databases using a natural language
interface.
To prepare the dictionary and map database tables and columns based on the dictionary.
To develop a model that accepts the natural language query and preprocesses it.
To construct rules for the techniques used for SQL query generation.
To develop an algorithm that can efficiently retrieve data.
To evaluate the performance of the developed system.

1.4. Methodology
1.4.1. Literature Review
For the successful completion of this research, literature related to language interfaces
to databases and data retrieval from databases was reviewed.

1.4.2. Data Collection


For this study, the data stored in the database was collected from the Debre Markos
University HRM employee database. The dictionary prepared in this study is
limited to translations related to the employee table only. For the final evaluation, questions
and sentences were collected from users on how they would like to ask the system to
retrieve data from the database.

1.4.3. Research Method


In this study, the main task of converting a natural language (Amharic) query into Structured
Query Language involved several phases. First, the researcher developed a user interface
that lets users write their queries in natural language (Amharic). Next, a component was
built that processes (tokenizes) Amharic statements or sentences
and produces tokens that can be searched for in the database for mapping. The
researcher then developed an algorithm that efficiently maps a natural language query entered in
Amharic into an SQL statement. For this mapping, on the EMPLOYEE table of the Human
Resource Management (HRM) database and on the administrative
database, the researcher used three tables: the Table Handling Table, which contains mappings
for the names of all available tables; the Column Handling Table, which contains mappings for
the names of all available columns; and the Conditional Words Table, which contains mappings
for conditional symbols (such as less than (<) and greater than (>)). Finally, an architecture of
an Amharic language interface for structured databases was designed for generating and
executing SQL queries from the user's input in natural language (Amharic).
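
The tokenize-then-map phases described above might look roughly like the following Python sketch. All dictionary entries here are illustrative assumptions, not the contents of the thesis's actual Table Handling, Column Handling, or Conditional Words tables, and the sample query is a simplified sequence of base-form words rather than a naturally inflected Amharic sentence (inflected forms would first need stemming or similarity checking).

```python
import re

# Hypothetical mapping tables (illustrative entries only).
TABLE_HANDLING = {"ሰራተኛ": "employee"}                  # Amharic word -> table name
COLUMN_HANDLING = {"ስም": "name", "ደመወዝ": "salary"}    # Amharic word -> column name
CONDITIONAL_WORDS = {"በላይ": ">", "በታች": "<"}           # Amharic word -> SQL operator

# Split on whitespace, Ethiopic punctuation (U+1361 to U+1368), and '?'.
_SEPARATORS = re.compile(r"[\s\u1361-\u1368?]+")

def tokenize(query):
    """Break an Amharic query into word tokens, dropping punctuation."""
    return [tok for tok in _SEPARATORS.split(query) if tok]

def map_tokens(tokens):
    """Look each token up in the three mapping tables; digits become values."""
    mapped = {"tables": [], "columns": [], "operators": [], "values": []}
    for tok in tokens:
        if tok in TABLE_HANDLING:
            mapped["tables"].append(TABLE_HANDLING[tok])
        elif tok in COLUMN_HANDLING:
            mapped["columns"].append(COLUMN_HANDLING[tok])
        elif tok in CONDITIONAL_WORDS:
            mapped["operators"].append(CONDITIONAL_WORDS[tok])
        elif tok.isdigit():
            mapped["values"].append(tok)
    return mapped
```

Tokens that match none of the tables are simply ignored in this sketch; how the actual system treats them is not specified in this section.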

1.4.4. Tools and Techniques


To develop a natural (Amharic) language interface to a database, different tools and techniques
are used. A parser is important for identifying column names and column values. A parser checks
that a string of words (a sentence) is well formed and breaks the sentence into a structure that
shows the syntactic relationships between words. Parsers for the Amharic language have been
developed only at a research level; no practical tool has been developed yet. Hence, to work
around this limitation, the researcher substituted an algorithm for the parser, and developed it
in such a way that it can identify column names and column values without using a parser.
Because of their advantages and their suitability for this particular research, the
researcher used the pattern matching and similarity checking techniques for generating and
executing SQL queries from the user's input in natural language (Amharic) form. One
main advantage of pattern matching is its simplicity: no elaborate parsing and
interpretation modules are needed, and the systems are easy to implement. In addition,
pattern-matching systems often manage to come up with some reasonable answer, even if the input is
out of the range of sentences the patterns were designed to handle.
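As a rough illustration of similarity checking, the sketch below matches a possibly misspelled token against known column keywords. The similarity measure (Python's standard difflib ratio), the cutoff of 0.7, and the keyword entries are all assumptions made for this example; the text does not state which measure the system uses.

```python
import difflib

# Hypothetical Amharic keywords mapped to column names (illustrative only).
COLUMN_WORDS = {"ስም": "name", "ደመወዝ": "salary"}


def closest_column(token, cutoff=0.7):
    """Return the column whose keyword is most similar to the token, or None."""
    matches = difflib.get_close_matches(token, list(COLUMN_WORDS), n=1, cutoff=cutoff)
    return COLUMN_WORDS[matches[0]] if matches else None
```

With such a check, an input word that differs from a stored keyword by one character can still be mapped to the intended column, while unrelated words are rejected.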

1.4.5. Evaluation
The study involves developing the designed system and evaluating its performance. Amharic
sentences were used to evaluate the performance.

1.5. Scope of the Study


This study aimed only to design and develop a Natural Language Query Interface for Amharic
data retrieval from a relational database. Local languages other than Amharic are not
included in the study. The program is designed to take input only in Amharic. The
program translates the whole query in the domain into the standard query language to extract the
relevant information from the database. The translated query is then executed on the database so
that the relevant answer can be retrieved. The system allows users to provide their inputs in
the form of simple Amharic sentences. It supports three basic types of query:
queries for selection of the whole table, queries for selection of certain
columns, and queries for selection of certain rows from certain columns, i.e. queries with a WHERE
condition. In addition, aggregate function, grouping, and ordering queries are
considered. The remaining database concepts are out of the scope of the current study. In
addition, the system developed in this study retrieves data only from a single table and
is domain dependent. Since an Amharic parser and POS tagger are not freely available, the
researcher used only pattern matching and similarity checking techniques for converting an Amharic
sentence into Structured Query Language (SQL).
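The supported query types can be illustrated with the kind of SQL the interface is expected to produce. The sketch below uses Python's built-in sqlite3 module; the EMPLOYEE columns and rows are hypothetical sample data, not the schema used in the study.

```python
import sqlite3

# In-memory database with a hypothetical single EMPLOYEE table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Abebe", "Finance", 5000), ("Sara", "Finance", 7000), ("Kebede", "ICT", 6000)],
)

# One example statement per supported query type.
QUERIES = {
    "whole_table": "SELECT * FROM EMPLOYEE",
    "certain_columns": "SELECT name, salary FROM EMPLOYEE",
    "rows_with_condition": "SELECT name FROM EMPLOYEE WHERE salary > 5500",
    "aggregate_group_order": (
        "SELECT department, AVG(salary) FROM EMPLOYEE "
        "GROUP BY department ORDER BY department"
    ),
}

# Run every statement and keep the rows it returns.
results = {kind: conn.execute(sql).fetchall() for kind, sql in QUERIES.items()}
```

All four statements read from one table only, which matches the single-table restriction stated above.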

1.6. Significance of the Study


The main significance of this study relates directly to the difficulty of operating
and using a database for users who do not have a good command of a database language or of the
natural language (for instance English) that the database uses for input, processing and
retrieval. In today's world, where almost all fields are computerized, activities and tasks in
many organizations are related to databases of one kind or another. However, users of such
databases may find themselves handicapped, especially if they don't possess a good knowledge
of SQL and don't have a good working knowledge of English. Database
users in Ethiopia who may not have such knowledge and skills are a good example. This is
where the current study comes in. With the aim of designing and developing a
Natural Language Query Interface for Database, the resulting system is intended to help a
novice user, without knowledge of a database language like SQL or of the English language, to
communicate with the database entirely in Amharic. The system that the study
develops can have the following advantages for the intended user:

No requirement of an artificial language: it doesn't require the user to learn an
artificial communication language like SQL to refer to or access the data stored in the
database.

No need of training: the user doesn't need any previous training on a
language interface for accessing a database. It is highly user friendly and easy to use for
end users.

Simple and easy to use: the interface is very simple and easy to use because end
users can write the query in their native (Amharic) language.

It provides a means of accessing information in the database independently of
its structure and encoding.

Moreover, the study can serve as a springboard for others, who have interest to do more research
in this area, by providing them with some basic theoretical and practical knowledge they may
need to start with.

1.7. Organization of the study


The study is organized as follows. Chapter one gives
background information on the research issues, with definitions, explanations and elaborations
of crucial concepts such as natural language processing and natural language interface to
database. The remaining sections of that chapter cover the problem statement, research method and
methodology, and the scope and significance of the study. Chapter two, a review of
literature, incorporates four sections: section one is a survey of natural language processing (NLP)
applications; section two is a review of question answering (QA) systems; section three explores
natural language interface to database (NLIDB); and the fourth and last section reviews
previously done related works. Chapter three covers the Amharic writing system and has
six sections introducing the Amharic alphabet, punctuation, numbers, syntactic structures,
problems in retrieving Amharic text, and Amharic software. Chapter four establishes the method
and algorithms and presents the architecture of the system. Chapter five presents the results,
discussion and overall evaluation. The first part of that chapter discusses the user interface
startup and operation, and the experiments on queries for selection of the whole table, selection of
certain columns, selection of certain rows from certain columns (single and multiple
conditions), join, aggregate functions, grouping and ordering. The second
part discusses the analysis of results, including the database used for evaluation and an analysis
based on four question categories: selection (list) queries, single conditional
queries, multiple conditional queries, and aggregate function queries including grouping and
ordering. The last part presents the overall measurement and discussion.
Finally, chapter six presents the conclusion, recommendations and future works.

CHAPTER TWO
LITERATURE REVIEW
2.1. Survey on Natural Language Processing (NLP) Applications
This chapter covers the area of natural language processing by reviewing different articles on the
area. We introduce information retrieval systems and question answering systems, covering open
and closed domain question answering and statistical or semantic question answering.
Among open and closed domain question answering systems, AnswerBus, START, TextMap,
EaGLi, and WolframAlpha are discussed, and different articles are discussed on statistical or
semantic question answering systems. We also introduce dialogue between humans and computers in
natural language, called a dialogue system, and discuss works done by researchers in
this area, such as ELIZA, HAPPY ASSISTANT, BIRD QUEST, CHAT 80 and
PLANES. In addition, natural language interface to database (NLIDB) is discussed, including
the techniques used to implement it: pattern matching, syntax
based, semantic grammar, and intermediate representation. Furthermore, we
discuss the advantages and disadvantages of NLIDB, the common architecture of NLIDB, and
works related to the system that we have implemented.

2.2. Information Retrieval (IR) System


Information Retrieval is a scientific discipline which deals with the analysis, design
and implementation of computerized systems that address the representation and organization of,
and access to, large amounts of heterogeneous information encoded in a digital format. The search
engine is the best-known application of IR: it accepts a query from the user and returns the
relevant documents. It returns documents, not answers; users are left to
extract answers from the returned documents. Figure 2.1 represents the Information Retrieval
model and its vertical taxonomy [5].

Figure 2.1: Vertical Taxonomy of Information Retrieval model [5].

2.3. Question Answering (QA) System


Question answering is a specialized case of information retrieval. It is the process of
finding specific answers to questions posed by a user from a large collection of text. The user
can input a question in natural language and receive a concise answer. Question
Answering systems can be classified by the sources they use to generate an answer. For example, a
system can be an open or closed domain Question Answering system, or it may be a syntactic or
semantic based Question Answering system [70].

2.3.1. Open & Closed Domain QA System


The first Question Answering System was developed in the 1960s and it was basically a natural
language interface to an expert system with a specific domain, such as BASEBALL [11]. There
are many other systems that researchers have developed in the field of the Question Answering
system such as AnswerBus, START, TextMap, EaGLi, AQUA, NSIR and WolframAlpha.
Zheng [12] developed the open domain Question Answering system AnswerBus, based on sentence
level web information retrieval. It gives a list of answers, each of which is a hyperlink to
the source page. It accepts users' natural-language questions in English, German, French,
Spanish, Italian or Portuguese and extracts possible answers in English. It uses a specific
dictionary. It classifies all words of the retrieved documents into two categories, matching and
non-matching, according to its predefined formula. It uses five search engines and directories
to retrieve web pages that are relevant to the user's question. The system uses a simple
language recognition model to determine whether the question is in English or in any of the other
five languages. If the question language is not English, AnswerBus sends the question to
AltaVista's translation tool and obtains the question translated into English.
Katz et al. [13] developed the world's first web-based open domain Question Answering system,
START (SynTactic Analysis using Reversible Transformations). It has been online and
continuously operating since December 1993. Questions are asked in English about places,
movies, people, dictionary definitions and much more. It uses semantic parsing. The authors built a
system that integrates heterogeneous data sources using an object-property-value model to
answer user questions. They identify three main challenges in getting a computer to answer
such questions: understanding the question, identifying where to find the information, and
fetching the information itself. It handles all varieties of media, including text, diagrams, images,
audio and video clips, data sets, web pages, etc. START is considered the best system at
returning good answers to the user [14]. The major problem in this system is that it accepts only
simple questions related to its domains, such as Geography, Science & Reference, Arts &
Entertainment, and History and Culture; it cannot answer questions about causes and methods
[14]. It uses templates of the form <subject relation object>. The START system gives a proper
answer when the query is asked and the answer appears in its database of frequently asked
questions. In other cases, the user has to navigate to a web page to get an exact answer.
The open domain intelligent Question Answering assistant TEXTMAP [15] focused on
developing an algorithm that automatically mines vast amounts of data in order to answer
questions posed in natural language. It handles question types such as factoid questions ("What is
the capital of Morocco?"), cause questions ("Why is there no cure for the cold?"), biography
questions ("What do you know about Dick Cheney?"), and event questions ("What do you know
about the Kobe earthquake?"). TEXTMAP employs a combination of rule-based, supervised
and unsupervised machine learning algorithms that are trained on massive amounts of data. It
supports the English, Spanish and German languages. It provides a web-based interface to the user.

Another open domain Question Answering system, EAGLi [16] (Engine for Answers in
Genomics Literature), retrieves relevant answers from a selected taxonomy. It uses a browser and
a predictive model and also includes advanced search. It supports only biology and medicine
questions. It provides a web-based interface to the user.
The computational knowledge engine WolframAlpha [17] is an online service that answers
factual queries directly by computing the answer from an external source rather than providing a
list of documents or web pages. Its toolkit includes mathematics, computer
algebra, symbolic and numerical computation, visualization and statistical capabilities. It deals
with facts, not with opinions. The computation time for each query is limited.

2.3.2. Statistical or Semantic QA System


There are many similarities between open-domain systems and semantic question answering
systems. Both need to find synonyms, and also their morphological variants, for the keywords.
Despite that, there are two major differences between these two kinds of system [18]:
a) Open domain question answering classifies the queries based on hierarchies or heuristics
to recognize named entities, whereas semantic question answering needs an ontology.
b) Semantic systems can classify the query based on the answer and an equivalent semantic
representation of the question rather than only on the type of question submitted. Thus, there
is no need for complex hierarchies of answers.
However, almost all the work in this research area has been done by applying
semantic and syntactic analysis to get a logical representation of a sentence, followed by an
appropriate conversion of that representation into a database query. The following are
the major systems developed by researchers:
Nguyen Tuan Dang and Do Thi Thanh Tuyen [14] developed a system for the free eBook library
Gutenberg called eLSSNL (eLibrary Searching System by Natural Language). The authors
propose a method to build a specific Question-Answering system which is integrated with a
search system for eBooks in a library. Users ask simple English questions to search the library
for information about the eBooks they need. The authors developed a language theory model
based on three main parts: a syntax model, a semantic model and the transformation rules
between them, which is what allows users to use simple English queries. A limitation of the
system is that it uses a predefined syntax structure for entering the query in natural language.
Rajendra Akerkar and Manish Joshi [20] discuss in their paper a natural language interface
which accepts questions in natural language and generates textual responses. It uses a keyword
matching approach and presents rules that handle the questions using a shallow parsing
technique. The experimental results show that the approach they used provides high accuracy and
produces reasonable textual responses.

2.3.3. Dialogue System


A dialogue system is a computer program that communicates with a human user in a natural
way. A dialogue system provides an interface between the user and a computer-based
application that permits interaction with the application in a relatively natural manner [74]. The
following systems have been developed by researchers in the field of dialogue systems: ELIZA,
HAPPY ASSISTANT, BIRD QUEST, CHAT 80, PLANES, etc.
Weizenbaum [21] developed the first dialogue system, ELIZA. ELIZA's task was to talk to a human
in his or her own language and appear to understand and give meaningful and appropriate replies,
as shown in figure 2.4. Input sentences are analyzed on the basis of decomposition rules which
are triggered by key words appearing in the input text. Responses are generated by reassembly
rules associated with the selected decomposition rules. ELIZA is not restricted to the recognition
of a particular set of patterns or responses. It provides a web interface to the user.
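The interplay of decomposition and reassembly rules can be sketched as follows. This is a toy illustration in Python: the two rules and the default reply are invented for the example and are not taken from Weizenbaum's original script.

```python
import re

# Each rule pairs a decomposition pattern, triggered by a keyword,
# with a reassembly template. The rules are illustrative only.
RULES = [
    (re.compile(r".*\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r".*\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def reply(sentence):
    """Apply the first matching decomposition rule and reassemble a response."""
    text = sentence.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(match.group(1))
    return DEFAULT_REPLY
```

When no keyword is found, the sketch falls back to a content-free reply, which is how such systems keep a conversation going without understanding the input.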
Joyce Chai [22] developed a dialogue system called HAPPY ASSISTANT. It helps users find
relevant information about products and services for e-commerce applications. It uses XML
to manage the domain lexicon and knowledge-based business rules.
Another system, which combines information extraction and a dialogue system, is
BIRDQUEST [23]. It is a website developed for users who watch programs on TV and ask
questions related to Nordic birds. A limitation of this system is that it does not allow free text
search. It gives only information which can be extracted from its own database.
David Warren and Fernando Pereira [24] developed a dialogue system called CHAT-80. It provides a
fascinating sample application, comprising a natural language interface to a geographical
database. The system uses semantic grammar techniques and is implemented in the Prolog
language. CHAT-80 was an impressive, efficient and sophisticated system. The database of
CHAT-80 consists of facts about oceans, major seas, major rivers and major cities for about 150 of
the world's countries, and a small English vocabulary that is sufficient for querying the
database. The basic method followed by CHAT-80 is to attach some extra control information to
the logical form of a query in order to make it an efficient piece of Prolog program that can be
executed directly to produce the answer.
D.L. Waltz [25] developed PLANES (Programmed Language-based Enquiry System) at the
University of Illinois Coordinated Science Laboratory. PLANES includes an English language
front end with the ability to understand and explicitly answer user requests. It carries out
clarifying dialogues with the user and also answers vague or poorly defined questions. The
system was developed using a database of the U.S. Navy 3-M (Maintenance
and Material Management) system, which is a database of aircraft maintenance and flight data. The
idea can be directly applied to other non-hierarchical record-based databases.

2.3.4. Database System


A database is an organized collection of interrelated data. It captures information about a
universe of discourse, called the mini world. For example, the purpose of a university database is
to keep accurate track of the academic activities of the university by storing, retrieving and
manipulating the relevant university information in the database.

Database Languages
A database system provides at least one language, which includes a Data Definition Language
(DDL) to specify the database schema, a Data Manipulation Language (DML) to express
database queries and updates, and a Data Query Language (DQL) for retrieving data. SQL is
a widely used database language.

I. Data Manipulation Language (DML): It is a language for accessing and manipulating the
data contained in a database. Manipulating the data refers to: (i) insertion of new information
into the database, (ii) deletion of information from the database, and (iii) modification of
information stored in the database [2]. DML has two classes: (i) procedural DML, in which the
user specifies what data is required and how to get those data, and (ii) non-procedural (also
referred to as declarative) DML, in which the user specifies what data is needed without
specifying how to get those data.

II. Data Definition Language (DDL): It is a language for creating and manipulating the
structure of data. The schema created by DDL is stored in the data dictionary, which
contains metadata, that is, data about data. The data values stored in the database must
satisfy certain consistency constraints, such as domain constraints, referential integrity,
assertions and authorization, as defined in the data dictionary.

III. Data Query Language (DQL): It is a language for retrieving information from a
database.
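The three language classes can be illustrated with standard SQL statements run through Python's built-in sqlite3 module. The student table and its rows are hypothetical, chosen to match the university example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema, including constraints the stored values must satisfy.
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, gpa REAL)"
)

# DML: insert, modify and delete stored information.
conn.execute("INSERT INTO student (id, name, gpa) VALUES (1, 'Almaz', 3.4)")
conn.execute("INSERT INTO student (id, name, gpa) VALUES (2, 'Dawit', 2.9)")
conn.execute("UPDATE student SET gpa = 3.1 WHERE id = 2")
conn.execute("DELETE FROM student WHERE id = 1")

# DQL: retrieve information without changing it.
rows = conn.execute("SELECT id, name, gpa FROM student").fetchall()
```

The statements are declarative: each one states what data is wanted or changed, and the database system decides how to carry it out.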

Query Interface
A query interface to a database is a system that helps the user to access the information which is
stored in a database. A Natural Language Interface (NLI) is one kind of query interface, in which
the user can input the query in natural language. Besides NLI, a number of traditional user
interfaces are used by Database Management System (DBMS) packages, such as Spreadsheet-like
Interface, Forms-based Interface, Database Query Interface, Graphical User Interface,
Query-By-Example, and Command Line Interface.

2.4. Natural Language Interface to Database (NLIDB)


One may find it difficult and frustrating to interact with a foreign person who has no knowledge of
English. A translator then has to come into the picture to allow one to communicate with
the foreigner. Companies have related this problem to extracting data from a Database
Management System (DBMS) such as MS Access, Oracle and others. A person with no
knowledge of Structured Query Language (SQL) may find himself or herself handicapped in
communicating with the database. Therefore, companies like Elfsoft (English Language Frontend
Software, which has developed SQL Tutor) have analyzed the abilities of Natural Language
Processing to develop products that let people interact with the database in simple English. This
enables a user to simply enter queries in English to the Natural Language Database Interface.
This kind of application is known as a Natural Language Interface to a Database (NLIDB). In
defining NLIDB, Abhijeet [27] says it is a communication channel between the
user and the computer; without knowledge of any programming language, a user can act as a
programmer. Through these systems, users can interact with a database in a more convenient and
flexible way.
A Natural Language Interface to Database (NLIDB) is a system that allows the user to access
information stored in a database by typing requests expressed in some natural language. In the
last few decades, many NLIDB systems have been developed through which users can interact
with a database in a more convenient and flexible way. Because of this, this application of NLP is
still very widely used today [6]. Natural language interfaces have long been a very interesting
area of research. The aim of a Natural Language Interface to Database is to provide an
interface where a user can interact with the database more easily using his/her natural language
and access or retrieve his/her information [1]. Moreover, an NLIDB is a system that converts a
query in the user's native language into SQL.

2.4.1. Components of NLIDB


Computing scientists have divided the problem of natural language access to a database into two
sub-components [1]: (a) the Linguistic Component and (b) the Database Component. The Linguistic
Component translates the natural language input into an expression of the Intermediate Query
Representation (IQR), which is subsequently passed to the Database Component for generation
of a Structured Query Language (SQL) statement. The resulting SQL statement is then executed
by the relational database management system. The Linguistic Component consists of morphological
analysis, query pre-processing and context resolution, lexical analysis, syntactical analysis and
semantic analysis. The Database Component, on the other hand, consists of SQL query generation
and SQL query execution.
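The division of labour between the two components can be sketched in Python as below. The IQR class, the two function names, the EMPLOYEE table, and the handful of English keywords recognised are all hypothetical simplifications; a real linguistic component performs the analysis stages listed above.

```python
from dataclasses import dataclass, field


@dataclass
class IQR:
    """Hypothetical intermediate query representation."""
    table: str
    columns: list = field(default_factory=lambda: ["*"])
    conditions: list = field(default_factory=list)  # (column, operator, value)


def linguistic_component(question):
    """Toy analysis: recognises only the keywords 'names' and 'above <number>'."""
    words = question.lower().rstrip("?").split()
    iqr = IQR(table="EMPLOYEE")
    if "names" in words:
        iqr.columns = ["name"]
    if "above" in words:
        idx = words.index("above")
        if idx + 1 < len(words) and words[idx + 1].isdigit():
            iqr.conditions.append(("salary", ">", int(words[idx + 1])))
    return iqr


def database_component(iqr):
    """Generate a parameterised SQL statement from the IQR."""
    sql = f"SELECT {', '.join(iqr.columns)} FROM {iqr.table}"
    params = []
    if iqr.conditions:
        sql += " WHERE " + " AND ".join(f"{col} {op} ?" for col, op, _ in iqr.conditions)
        params = [value for _, _, value in iqr.conditions]
    return sql, params
```

Because the database component sees only the IQR, the language analysis could be replaced (for example by an Amharic front end) without touching SQL generation.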

2.4.2. Techniques Used for Developing NLIDB


2.4.2.1. Pattern-Matching Systems

Pattern matching is the earliest and simplest technique for implementing a natural
language interface to database (NLIDB). The patterns and rules are fixed [7]. A rule states
that if an input word or sentence matches the given pattern, the associated action is taken.
Those actions are also stored in the database [27]. The main advantage of the pattern matching
approach is that no elaborate parsing and interpretation modules are required and the systems
are very easy to implement. Also, pattern-matching systems often manage to come up with some
reasonable answer, even if the input is out of the range of sentences the patterns were designed to
handle [26]. One of the best-known natural language processing systems that works in this style is
ELIZA [1]. For illustration, Ashish Kumar [26] presents the following example.

Countries_Table

Country    Capital    Language
France     Paris      French
Italy      Rome       Italian
India      Delhi      Hindi

Table 2.1: Sample database table


According to Ashish Kumar [26], reviewing Androutsopoulos [2], a primitive pattern-matching system
may use rules such as:
Pattern: ... "capital" ... <country>
Action: Report CAPITAL of row where COUNTRY = <country>
If the user asked "What is the capital of India?", using the above pattern rule the system would
report "Delhi". The system would also use the same rule to handle questions such as "print the
capital of India", "could you please tell me what is the capital of India?", etc.
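The rule above can be written out as a small Python sketch. The regular expression is one possible encoding of the pattern, chosen for this example; the table rows are those of the sample database table.

```python
import re

COUNTRIES_TABLE = [
    {"country": "France", "capital": "Paris", "language": "French"},
    {"country": "Italy", "capital": "Rome", "language": "Italian"},
    {"country": "India", "capital": "Delhi", "language": "Hindi"},
]


def answer(question):
    """Pattern: ... 'capital' ... <country>.  Action: report CAPITAL of that row."""
    text = question.lower()
    for row in COUNTRIES_TABLE:
        # The word "capital" must appear somewhere before a known country name.
        pattern = r"\bcapital\b.*\b" + re.escape(row["country"].lower()) + r"\b"
        if re.search(pattern, text):
            return row["capital"]
    return None
```

The same rule answers all three phrasings mentioned above, because everything outside the keyword and the country name is ignored. This also shows the weakness of the approach: any sentence containing the two words in that order triggers the rule, whatever it actually asks.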

2.4.2.2. Syntax-Based Systems

In syntax-based systems, the user's question is analyzed syntactically, i.e. it is parsed, and the
resulting syntactic tree is mapped to an expression in some database query language [1].
Syntax-based systems use a grammar that describes the possible syntactic structures of the user's
questions. Syntax-based NLIDBs usually interface to application-specific database systems that
provide database query languages carefully designed to facilitate the mapping from the parse tree
to the database query.

The main advantage of using syntax-based approaches is that they provide detailed information
about the structure of a sentence. A parse tree contains a lot of information about the sentence
structure: starting from a single word and its part of speech, how words can be grouped together
to form a phrase, and how phrases can be grouped together to form more complex phrases, until a
complete sentence is built. Having this information, we can map semantic meanings to
certain production rules (or nodes in a parse tree). One example of a syntax-based system is
LUNAR. In this system, the grammar describes the possible syntactic structures of the user's
questions.
Neelu Nihalani [28] presents the problems of the syntax-based approach. First, not all nodes should
be mapped; some nodes have to be left as they are, without adding any semantic meaning,
and it is not always clear which nodes should be mapped and which should not. Moreover, the
same node is not necessarily translated in the same way in different parse trees. The
second problem is that a sentence can have multiple correct parse trees, and if all are translated,
they may lead to different query results. The last problem is that it is difficult for a
syntax-based approach to directly map a parse tree into some general database query language, such
as SQL (Structured Query Language).

2.4.2.3. Semantic Grammar Systems

In semantic grammar systems, question answering is still done by parsing the input and
mapping the parse tree to a database query. The difference, in this case, is that the grammar's
categories do not necessarily correspond to syntactic concepts. The basic idea of a semantic
grammar system is to simplify the parse tree as much as possible, by removing unnecessary
nodes or combining some nodes together. Based on this idea, the semantic grammar system can
better reflect the semantic representation without having complex parse tree structures. Besides
producing smaller structures, the semantic grammar approach also provides a special way of
assigning a name to a certain node in the tree, thus resulting in less ambiguity compared to the
syntax-based approach [28].
The main drawback of the semantic grammar approach is that it requires some prior knowledge of
the elements in the domain, which makes it difficult to port to other domains. In addition, a
parse tree in a semantic grammar system has specific structures and unique node labels, which
could hardly be useful for other applications. Many of the systems developed till now, such as
LUNAR and LADDER, use this semantic grammar approach.
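The idea of grammar categories that name domain concepts rather than syntactic ones can be sketched as follows. The categories ATTRIBUTE and COUNTRY, the single production QUERY -> ATTRIBUTE "of" COUNTRY, and the target table are assumptions made for this illustration and do not come from any of the cited systems.

```python
# Semantic categories name domain concepts instead of noun or verb phrases.
ATTRIBUTE = {"capital": "capital", "language": "language"}
COUNTRY = {"france": "France", "italy": "Italy", "india": "India"}


def parse_query(question):
    """Apply the production QUERY -> ATTRIBUTE 'of' COUNTRY; return a tree or None."""
    words = question.lower().rstrip("?").split()
    for i in range(len(words) - 2):
        attribute, of_word, country = words[i:i + 3]
        if attribute in ATTRIBUTE and of_word == "of" and country in COUNTRY:
            return ("QUERY", ("ATTRIBUTE", ATTRIBUTE[attribute]), ("COUNTRY", COUNTRY[country]))
    return None


def to_sql(tree):
    """Map the labelled nodes of the parse tree directly to an SQL statement."""
    _, (_, attribute), (_, country) = tree
    return f"SELECT {attribute} FROM Countries_Table WHERE country = '{country}'"
```

The node labels carry the meaning needed for translation, which keeps the tree small, but they are tied to this one domain: a different database would need a new set of categories and productions.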

2.4.2.4. Intermediate Representation Languages

Due to the difficulty of directly translating a sentence into a general database query language
using a syntax-based approach, intermediate representation systems were proposed. The idea
is to map a sentence into a logical query language first, and then further translate this logical
query into a general database query language, such as SQL. Figure 2.2 shows a possible
architecture of an intermediate representation language system [26].

Figure 2.2: intermediate representation language architecture [26].


In the intermediate representation language approach, the system can be divided into two parts.
One part starts from a sentence and ends with the generation of a logical query. The other part
starts from the logical query and ends with the generation of a database query. In the first part,
the use of logic query languages makes it possible to add reasoning capabilities to the system by
embedding the reasoning part inside a logic statement. In addition, because the logic query
languages are independent of the database, the system can be ported to different database query
languages as well as to other domains, such as expert systems and operating systems [28]. An
example of the intermediate representation language architecture is MASQUE/SQL [2].

2.4.3. Advantage and Disadvantage of NLIDB


i. Some advantages of NLIDBs

No artificial language: One advantage of NLIDBs is supposed to be that the user is not required
to learn an artificial communication language. Formal query languages are difficult to learn and
master, at least for non-computer-specialists. Graphical interfaces and form-based interfaces are
easier to use for occasional users; still, invoking forms, linking frames, selecting restrictions
from menus, etc. constitute artificial communication languages that have to be learned and mastered
by the end user. In contrast, an ideal NLIDB would allow queries to be formulated in the user's
native language. This means that an ideal NLIDB would be more suitable for occasional users,
since there would be no need for the user to spend time learning the system's communication
language.
Simple, easy to use: Consider a database with a query language or a certain form designed to
display the query. While an NLIDB system requires only a single input, a form-based interface may
contain multiple inputs (fields, scroll boxes, combo boxes, radio buttons, etc.) depending on the
capability of the form. In the case of a query language, a question may need to be expressed
using multiple statements which contain one or more subqueries with some join operations as
the connector.
Better for some questions: It has been argued that there are some kinds of questions (e.g.
questions involving negation or quantification) that can be easily expressed in natural language,
but that seem difficult (or at least tedious) to express using graphical or form-based interfaces.
For example, "Which department has no programmers?" (negation), or "Which company
supplies every department?" (universal quantification), can be easily expressed in natural
language, but they would be difficult to express in most graphical or form-based interfaces.
Questions like the above can, of course, be expressed in database query languages like SQL, but
complex database query language expressions may have to be written.
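To make the point about complexity concrete, the sketch below expresses the two example questions in SQL, run through Python's built-in sqlite3 module. The three tables and their contents are hypothetical, and the NOT EXISTS formulation is only one of several ways to write these queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (name TEXT);
CREATE TABLE employee (name TEXT, job TEXT, department TEXT);
CREATE TABLE supplies (company TEXT, department TEXT);
INSERT INTO department VALUES ('Sales'), ('Research');
INSERT INTO employee VALUES ('Abebe', 'programmer', 'Research'), ('Sara', 'clerk', 'Sales');
INSERT INTO supplies VALUES ('Acme', 'Sales'), ('Acme', 'Research'), ('Beta', 'Sales');
""")

# Negation: departments for which no programmer exists.
NO_PROGRAMMERS = """
SELECT d.name FROM department d
WHERE NOT EXISTS (
    SELECT 1 FROM employee e
    WHERE e.department = d.name AND e.job = 'programmer')
"""

# Universal quantification: companies for which no unsupplied department exists.
SUPPLIES_EVERY_DEPARTMENT = """
SELECT DISTINCT s.company FROM supplies s
WHERE NOT EXISTS (
    SELECT 1 FROM department d
    WHERE NOT EXISTS (
        SELECT 1 FROM supplies s2
        WHERE s2.company = s.company AND s2.department = d.name))
"""

no_programmers = conn.execute(NO_PROGRAMMERS).fetchall()
every_department = conn.execute(SUPPLIES_EVERY_DEPARTMENT).fetchall()
```

A seven-word English question becomes a doubly nested subquery, which is the kind of expression an occasional user is unlikely to write unaided.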
Fault tolerance: Most NLIDB systems provide some tolerance of minor grammatical errors,
whereas in other computer systems, most of the time, the lexicon must be exactly the same as
defined, the syntax must correctly follow certain rules, and any error will cause the input to be
rejected automatically by the system. In the case of incomplete sentences, most computer systems
do not provide any support [2].

ii. Some disadvantages of NLIDBs

Linguistic coverage not obvious: A frequent complaint against NLIDBs is that the system's
linguistic capabilities are not obvious to the user. As already mentioned, current NLIDBs can
only cope with limited subsets of natural language. Users find it difficult to understand (and
remember) what kinds of questions the NLIDB can or cannot cope with. For example, Masque
[54] is able to understand "What are the capitals of the countries bordering the Baltic and
bordering Sweden?", which leads the user to assume that the system can handle all kinds of
conjunctions (false positive expectation). However, the question "What are the capitals of the
countries bordering the Baltic and Sweden?" cannot be handled. Similarly, a failure to answer a
particular query can lead the user to assume that equally difficult queries cannot be answered,
while in fact they can be answered (false negative expectation).
Formal query languages, form-based interfaces, and graphical interfaces typically do not suffer
from these problems. In the case of formal query languages, the syntax of the query language is
usually well-documented, and any syntactically correct query is guaranteed to be given an
answer. In the case of form-based and graphical interfaces, the user can usually understand what
sorts of questions can be input, by browsing the options offered on the screen; and any query that
can be input is guaranteed to be given an answer [2].
Linguistic vs. conceptual failures: When the NLIDB cannot understand a question, it is often
not clear to the user whether the rejected question is outside the system's linguistic coverage or
outside its conceptual coverage. Thus, users often try to rephrase
questions referring to concepts the system does not know (e.g. rephrasing questions about
salaries to a system that knows nothing about salaries), because they think that the problem
is caused by the system's limited linguistic coverage. In other cases, users do not try to rephrase
questions the system could conceptually handle, because they do not realize that the particular
phrasing of the question is outside the linguistic coverage and that an alternative phrasing of the
same question could be answered. Some NLIDBs attempt to solve this problem by providing
diagnostic messages that show the reason a question cannot be handled (e.g. unknown word,
syntax too complex, unknown concept, etc.).
Users assume intelligence: NLIDB users are often misled by the system's ability to process
natural language, and they assume that the system is intelligent, that it has common sense, or that
it can deduce facts, while in fact most NLIDBs have no reasoning abilities. This problem
does not arise in formal query languages, form-based interfaces, and graphical interfaces,
where the capabilities of the system are more obvious to the user. For example, a user who asks
"list the names of farmers who are 35 years old" does not mention the word "age",
assuming that the system will infer it automatically, but most systems cannot do so.
Inappropriate Medium: It has been argued that natural language is not an appropriate medium
for communicating with a computer system. Natural language is claimed to be too verbose or too
ambiguous for human-computer interaction. NLIDB users have to type long questions, while in
form-based interfaces only fields have to be filled in, and in graphical interfaces most of the
work can be done by mouse-clicking. In a natural language interface the user has to type a full
sentence with all the connectors (articles, prepositions, etc.), which graphical and form-based
interfaces do not require [29].

2.4.4. Most commonly used Architecture of NLIDB


The most commonly used architecture of NLIDB systems is shown in Figure 2.3. This
architecture has two major components: (a) the Linguistic Component and (b) the Database
Component. The Linguistic Component translates the natural language input into an
Intermediate Query Representation (IQR), which is then passed to the Database Component
to generate a Structured Query Language (SQL) statement. The resulting SQL statement is
then executed by the database management system.


Figure 2.3: Commonly used architecture of NLIDBs [1]

A. Syntactic Analysis
The word syntax refers to the grammatical arrangement of words in a sentence and their
relationship with each other. The objective of syntactic analysis is to find the syntactic structure
of the sentence. The sentence is first split into simpler elements called tokens. A spelling
checker then checks whether each token is spelled correctly, or whether it is available in the
system dictionary. An ambiguity reduction function reduces the ambiguity in the sentence and
simplifies the task of the parser.


B. Parse Tree
The output of syntactic analysis is a parse tree, which represents the syntactic structure of a
sentence according to some formal grammar. A parse tree is composed of nodes and branches; each
node is either a root node, a branch node, or a leaf node. In a parse tree, an interior node is a
phrase and is called a non-terminal of the grammar, while a leaf node is a word and is called a
terminal of the grammar.
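
The parse-tree structure described above can be sketched in Java as follows. This is only an illustration, not the data structure used in the thesis: the class name ParseNode, the grammar labels (S, VP, V, NP, N) and the two-word English example are assumptions chosen for readability.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal parse-tree node: interior nodes are non-terminals, leaves are words (terminals). */
final class ParseNode {
    final String label;                                   // grammar symbol, or the word itself for a leaf
    final List<ParseNode> children = new ArrayList<>();

    ParseNode(String label, ParseNode... kids) {
        this.label = label;
        for (ParseNode kid : kids) {
            children.add(kid);
        }
    }

    boolean isLeaf() {
        return children.isEmpty();
    }

    /** Collects the leaf words from left to right, which reproduces the input sentence. */
    List<String> leaves() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(ParseNode node, List<String> out) {
        if (node.isLeaf()) {
            out.add(node.label);
            return;
        }
        for (ParseNode child : node.children) {
            collect(child, out);
        }
    }

    /** Hypothetical tree for "list employees": S -> VP, VP -> V NP, NP -> N. */
    static ParseNode example() {
        ParseNode verb = new ParseNode("V", new ParseNode("list"));
        ParseNode noun = new ParseNode("NP", new ParseNode("N", new ParseNode("employees")));
        return new ParseNode("S", new ParseNode("VP", verb, noun));
    }

    public static void main(String[] args) {
        System.out.println(example().leaves());
    }
}
```

Reading the leaves from left to right gives back the words of the sentence, while the interior labels record the phrases they belong to.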

C. Semantic Analysis
Semantic analysis creates representations of the meaning of linguistic inputs. It
deals with how to determine the meaning of a sentence from the meaning of its parts, and it
generates a logical query, which is the input to the Database Query Generator.

D. Database Query Generator


The task of the Database Query Generator is to map the elements of the logical query to the
corresponding elements of the database in use. The query generator uses four routines, each of
which handles only one specific part of the query. The first routine selects the part of the query
that corresponds to the appropriate DML command with the attribute names (i.e. the SELECT
clause). The second routine selects the part of the query that is mapped to a table name or
a group of table names to construct the FROM clause. The third routine selects the part of the
query that is mapped to the WHERE clause (condition). The fourth routine selects the
part of the natural language query that corresponds to the order in which the results are displayed
(ORDER BY clause with the name of the column).
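
A minimal Java sketch of how four such routines could assemble a statement is given below. It is an assumption-laden illustration rather than the generator described in [1]: the class name SqlAssembler, its method names, and the sample table and column names are hypothetical, and the inputs are assumed to have been mapped to database names already.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Assembles a query from four parts; each routine handles exactly one clause. */
final class SqlAssembler {
    // Routine 1: SELECT clause; falls back to * when no column was identified.
    static String selectClause(List<String> columns) {
        return "SELECT " + (columns.isEmpty() ? "*" : String.join(", ", columns));
    }

    // Routine 2: FROM clause built from one or more table names.
    static String fromClause(List<String> tables) {
        return " FROM " + String.join(", ", tables);
    }

    // Routine 3: optional WHERE clause (condition).
    static String whereClause(String condition) {
        return (condition == null || condition.isEmpty()) ? "" : " WHERE " + condition;
    }

    // Routine 4: optional ORDER BY clause with the name of the column.
    static String orderByClause(String column) {
        return (column == null || column.isEmpty()) ? "" : " ORDER BY " + column;
    }

    static String build(List<String> columns, List<String> tables,
                        String condition, String orderColumn) {
        return selectClause(columns) + fromClause(tables)
                + whereClause(condition) + orderByClause(orderColumn);
    }

    public static void main(String[] args) {
        // Hypothetical table and column names, used only for illustration.
        System.out.println(build(Arrays.asList("name"), Arrays.asList("employee"),
                "age = 35", "name"));
        System.out.println(build(Collections.<String>emptyList(),
                Arrays.asList("employee"), null, null));
    }
}
```

Because the condition is concatenated as plain text, a real system would bind user-supplied values through a parameterized statement rather than embed them directly in the string.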

E. Database Management System


The purpose of this component is to obtain the required results from the database in use. To
achieve this, the generated database query is first tested for correctness. The system then
executes the query on the database and presents the results required by the user.


2.5. Related works


Himani Jain [30] developed a Hindi Language Interface to Database. A Hindi shallow parser that
uses the Shakti Standard Format is used to parse a sentence. The system was developed in
Java with MySQL as the back end. The system was tested on an employee database
containing Employee and Department tables. The system does not deal with linguistic
components. It maps user keywords directly to database entity names and displays the result
in Hindi. It handles single- and multiple-column retrieval queries,
conditional queries, and join queries.
Avinash Agarwal [19] describes a method for semantic analysis of natural language queries for a
Natural Language Interface to Database (NLIDB) using domain ontology. To
evaluate the proposed method, a domain ontology for railway inquiry was created. The
system was tested on a corpus of English language queries collected from various groups of
users of the railway inquiry domain. The Natural Language Toolkit (NLTK) with Python was used for
preprocessing. The author also discussed types of questions and types of answers.
Khaleel Al-Rababah and Safwan Shatnawi [31] propose an Arabic Natural Language Interface to
Databases (ANLIDB) that applies Arabic morphological, ontological, and syntactical analyses.
They implemented algorithms for extracting significant single and multiple phrases from Arabic
natural language questions submitted to the database and then constructing and executing SQL
queries. In addition, a lexicon derived from the database was created, and a simple
part-of-speech (PoS) tagger was implemented. They state that their system shows high rates of
success in identifying relations, correctly mapping attributes, and constructing and executing SQL
statements.
Faraj et al. [32] developed a natural language interface for database systems known as GINLIDB
(Generic Interactive Natural Language Interface to Database). The authors proposed a design
using UML and presented the architecture of GINLIDB. The queries were tested using
VB.NET 2005. The system dealt with only a limited domain and answered a small set of queries.
The limitations of the system depend on the size and content of the system's knowledge
base. When a query does not match the ATN rules, it is rejected and the user
has to rephrase the query and enter it into the system again. In addition, the user
has to specify the attribute name explicitly in the input query. For example, if the user phrases
the query as "display employee location", the system does not recognize it. However, if the same
question is rephrased as "display employee address", the system can recognize it and respond by
generating the corresponding SQL query.
Hendrix et al. [33] designed a natural language interface to database system, LIFER/LADDER,
which gives information about US Navy ships. This system uses a semantic grammar to parse
questions and queries a distributed database. The system consists of three major components: (a)
INLAND (Informal Natural Language Access to Navy Data), (b) IDA (Intelligent Data Access),
and (c) FAM (File Access Manager). It supports multiple-table queries with join conditions.
Language features that increase system usability, such as spelling correction, processing of
incomplete inputs, and run-time system personalization, are also included in the system.
Woods W. A. [29] developed LUNAR, a system that answers questions about rock samples brought
back from the moon. The system makes use of two databases, one for chemical analyses and one for
literature references. It uses an Augmented Transition Network (ATN) parser and procedural
semantics. It consists of three components: (i) a general-purpose grammar, (ii) a parser for a large
subset of natural English, and (iii) a rule-driven semantic interpretation component. The first
component is responsible for transforming natural language input into a disposable program that
carries out its intent, and the third component deals with executing programs against a database to
determine answers to queries. The performance was quite impressive: the system handled 78%
of requests without any errors, and this ratio rose to 90% when dictionary errors were corrected.
Runvanpura [34] developed the SQ-HAL system. It is platform independent and has multi-user
support. The system is written in Perl, which has powerful string manipulation capabilities.
It uses a top-down parsing methodology. It has a limited thesaurus, the user has to enter the
relationships manually, and there is no direct method of retrieving column names. Moreover, the
system cannot determine synonyms for table names and column names; hence, the user has to enter
synonym words manually.
Chauhari S. et al. [35] developed DBXplorer, a multi-step system that
answers keyword queries using relational databases. It proposes a methodology that uses a
symbol table to store the tables, columns, and rows of all data values that are looked up during the
search to identify the locations that contain all the keywords appearing in the question. The
system has been implemented using a commercial relational database and web server and allows
users to interact via a browser front end.
Rashid Ahmad et al. [36] proposed an algorithm that efficiently maps a natural language query
entered in Urdu to Structured Query Language (SQL). The system accepts the
user query either in question form or in request form. The algorithm was implemented in Visual
C#.NET and was tested on a database containing student and employee data. The dictionary is
constructed manually and is database specific. The program correctly maps 85% of the natural
language queries.
Amardeep Kaur [1] presented the design and implementation of a natural language interface
to an agricultural database in the Punjabi language. The system uses an MS Access database and
accepts input in a specified template. Table names, column names, and query conditions are mapped
manually, and the author considers only a limited vocabulary.
Anh Kim Nguyen and Phuong Hong Nguyen [37] constructed a natural language
interface to relational databases that accepts fuzzy questions as inputs and generates answers
in the form of tables or short answers. Using a derivation evaluation mechanism, the authors
constructed a set of translation rules for all possible structures in the standard trees of user
questions in order to translate them into SQL queries.
Veera Boonjing and Chang Hsu [38] proposed a metadata search approach to provide practical
solutions to the natural language database query problem. Here the metadata grew in a largely
linear manner and the search was linguistics-free. A new class of reference dictionary integrated
four types of enterprise metadata: enterprise information models, database values, user words,
and query cases. The interpretation of the input could be easily identified with the help of a
graphical representation method. The approach uses a branch-and-bound method to identify the
optimal interpretation, which leads to SQL generation. The necessary conditions were that the text
input contained at least one entry in the reference dictionary and that the input was complete and
grammatically correct, so that it led to a single correct SQL query.
Androutsopoulos et al. [39] proposed MASQUE (Modular Answering System for
Queries in English). The system is a powerful and portable natural language front end for
Prolog databases. It answers written English questions referring to certain domain knowledge,
such as geography and airplanes. Each question is transformed into a suitable database query
against the Prolog database. It uses an extraposition grammar parser, and it transforms each
question into a single SQL query.
B. Sujatha et al. [40] discussed a novel architecture for a natural language interface to database
that uses a pragmatic approach, with illustrations. Special language features
that increase system usability, such as spelling correction, processing of inputs, and runtime
system performance, were also discussed. The three-level architecture consists of the client level,
the intermediate server level, and the database level. The presentation of example queries and a
dictionary permits the user to better understand the contents of the database, which facilitates
query formulation. The table names are presented on the interface, which helps the user find out
what tables are present in the database.
H. V. Jagdish et al. [41] developed the NALIX system, a generative interactive Natural Language
Query Interface to an XML database. The system accepts an English sentence as
query input, which can include aggregation, nesting, and value joins, among other things. The
system can be classified as a syntax-based system. The transformation process has three steps: (a)
generating a parse tree, (b) validating the parse tree, and (c) translating the parse tree into an
XQuery expression. It reformulates the input query as an XQuery expression by
mapping grammatical proximity in the natural language query, and it maps parsed tokens to the
nearest corresponding elements in the resulting XML. The system makes little attempt to understand
natural language itself.
Porfirio P. Filipe et al. [42] discussed a Natural Language Interface for Database that allows the
user to formulate multimedia queries. The questions are first translated into a logic language
and then into SQL, which is processed by the database management system to answer the queries.
Rukshan et al. [43] proposed a natural language interface to database that accepts input in
the form of an English query through a convenient interface over the internet. A limited data
dictionary was used, in which all possible words related to the particular system were included.
Niculae Stratica [44] developed a querying system for CINDI (Concordia Virtual Library System).
The system takes natural language input and produces a structured representation of the answer in
the form of Structured Query Language. The study uses Link Parser to semantically parse the
query and WordNet to build the conceptual knowledge base from the database schema.
The system was tested using information contained in the virtual library. As discussed by the
author himself, the system has some limitations: values must be in double quotes, table
names and attribute names must be specified, and a template must also be specified.
Considering the limitations of the NLIDB systems developed by various researchers in
this field, we have designed and developed a Natural Language Query Interface for Amharic
Text Retrieval from Relational Database to fill this gap. Our system
improves the results retrieved from the database by using a new mapping algorithm
based on the structure of the Amharic language. The system uses a human resource
database whose tables store data in Amharic. The user
formulates the query as an Amharic sentence, and the system analyzes the query and
converts it into Structured Query Language (SQL).
We collected 120 sample user queries from ordinary people who have no knowledge of
database languages. The accuracy of the system is measured in terms of precision percentage,
with two classes that identify the query response: Correct Queries and Incorrect Queries.


CHAPTER THREE
THE AMHARIC WRITING SYSTEM
3.1. Introduction
As Betelehem [46] and Danel [47] discussed, cited in Lo [48], the Blackwell Encyclopedia of
Writing Systems defines the term writing system as "a set of visible or tactile signs used to
represent units of a language in a systematic way". Amharic is a Semitic language spoken
predominantly in Ethiopia. It is the working language of the country, which has a population of
over 90 million at the present time. Amharic was the national language of Ethiopia until 1983 E.C.
[49]. Currently it is the official working language of the Federal Democratic Republic of
Ethiopia and thus has official status nationwide. It is also the official or working language of
several of the states/regions within the federal system, including Amhara and the multi-ethnic
Southern Nations, Nationalities and Peoples region. The language is spoken as a mother tongue by a
large segment of the population in the northern and central regions of Ethiopia and as a second
language by many others. It is the second most spoken Semitic language in the world, next to
Arabic [50]. One of the major differences between Amharic and other Semitic languages like Arabic
and Hebrew is that Amharic is written from left to right, as in English [51].
According to [52] [53], Amharic is probably the second largest language in Ethiopia (after
Oromo, a Cushitic language) and possibly one of the five largest languages on the African
continent. As [54] notes, there are three Semitic languages that are found only in Ethiopia and
Eritrea: Ge'ez (or Geez), Amharic, and Tigrinya, all of which are written in the
Ethiopic system. As cited in [55], the Geez syllabary is a solely Ethiopian writing system, used
nowhere else in the world except Eritrea (which used to be part of Ethiopia) and Israel (by
Ethiopian Jews).
Geez played a significant role in the development and expansion of the Amharic language and writing
system. Several religious texts, such as the Bible and translations of Arabic Christian texts from
Egypt, and literature such as qine (ቅኔ) and poems are all written in Geez. The emergence and
expansion of Geez inscriptions in the Ethiopic script can be traced back to the 4th century AD, when
Geez was the language of the empire of Aksum in northern Ethiopia. Even though the use of Geez is
now limited to the Orthodox Church, it is still a source for the coining of Ethiopian literary and
technical terms [56]. In spite of the relatively large number of speakers, Amharic is still a
language for which very few computational linguistic resources have been developed.

3.2. Amharic Alphabet

The Ethiopic writing system, which the Amharic language uses, consists of a core of thirty-three
characters (ፊደል, Fidel), each of which occurs in one basic form and in six other forms, all known
as orders. The seven orders (the first, basic order and the other six orders) of the Ethiopic script
represent the different sounds of a consonant-vowel combination. Amharic thus has 231 (7 x 33)
core characters and nearly 40 additional characters [57]. Most labialized consonants are
basically redundant, and there are actually only 39 context-independent phonemes (monophones);
of the 275 symbols of the script, only about 231 remain if the redundant ones are
removed [57]. The 40 additional characters carry a special feature representing labialization, such
as gwe from g, qwe from q, lu from l, and gu from g [58].
There are seven vowels in the Amharic alphabet: ኧ/ä, ኡ/u, ኢ/i, ኣ/a, ኤ/e, እ/ɨ, and ኦ/o. Based on
their point of articulation, they are grouped into peripheral vowels (ኡ/u, ኢ/i, ኣ/a, ኤ/e, and ኦ/o)
and central vowels (ኧ/ä and እ/ɨ), which are used more often than the peripherals. It is also worth
mentioning that two consonants can appear as a cluster in the middle or at the end of a word,
whereas clusters at the beginning of a word are very restricted, in which case እ/ɨ is used. As an
example, the symbolic representations of the seven forms of the Amharic characters ሀ (ha),
ለ (le), and መ (me) are shown in Table 3.1.

1st order   2nd order   3rd order   4th order   5th order   6th order   7th order
ሀ (Hä)      ሁ (Hu)      ሂ (Hi)      ሃ (Ha)      ሄ (He)      ህ (Hɨ)      ሆ (Ho)
ለ (Lä)      ሉ (Lu)      ሊ (Li)      ላ (La)      ሌ (Le)      ል (Lɨ)      ሎ (Lo)
መ (Mä)      ሙ (Mu)      ሚ (Mi)      ማ (Ma)      ሜ (Me)      ም (Mɨ)      ሞ (Mo)

Table 3.1: The Amharic Character Representation
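
The regularity of the seven orders is mirrored in the Unicode Ethiopic block, where the seven orders of a basic consonant occupy consecutive code points starting at its first-order form. The Java sketch below relies on that layout to list the orders of a character. The class and method names are hypothetical, and the shortcut is assumed to hold only for basic consonants such as the three shown in Table 3.1, not for labialized series.

```java
/** Lists the seven orders of a basic Ethiopic consonant from its first-order form. */
final class FidelOrders {
    static final char HA = '\u1200';  // first-order h
    static final char LA = '\u1208';  // first-order l
    static final char MA = '\u1218';  // first-order m

    /** Returns the seven consecutive code points that start at the first-order form. */
    static String sevenOrders(char firstOrder) {
        StringBuilder orders = new StringBuilder(7);
        for (int order = 0; order < 7; order++) {
            orders.append((char) (firstOrder + order));
        }
        return orders.toString();
    }

    public static void main(String[] args) {
        // Prints one row of Table 3.1 per consonant (glyphs only).
        System.out.println(sevenOrders(HA));
        System.out.println(sevenOrders(LA));
        System.out.println(sevenOrders(MA));
    }
}
```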


3.3. Amharic Punctuation


Analysis of Amharic texts reveals that different Amharic punctuation marks are used for
different purposes. There are about 17 punctuation marks, of which only a few are
commonly used and have representations in Amharic software [59]. For example, the sentence
separator in Amharic writing is four dots arranged in a square, ።, which is
referred to as አራት ነጥብ/arat netib. The equivalent of the comma in Amharic is ነጠላ ሰረዝ/
netela serez, which is symbolized as ፣ and is used to separate lists. The equivalent of the
semicolon, which is used to separate phrasal lists or compound sentences, is referred to as
ድርብ ሰረዝ/dereb serez and is denoted by the symbol ፤. Punctuation marks like the question mark ?
and the exclamation mark ! are borrowed from the English language and used in
Amharic for the same purpose as in other foreign languages [58].

3.4. Amharic numbers


The number system in Amharic writing has 20 single characters which represent numbers: ones
(1/፩ up to 9/፱), tens (ten/፲ to ninety/፺), hundred (፻), and ten thousand (፼). These characters
are derived from Greek letters and are modified to look like the Amharic characters by adding a
horizontal line on the top and bottom of each character [58]. The Amharic number system, however,
has no symbol for zero and does not use decimal points or
commas. As a result, arithmetic computation is very complicated in the Amharic number
system, which is mainly used to show dates in the Ethiopian calendar. Ordinal numbers are formed
by adding the Amharic suffix -ኛ/-gna to the cardinal numbers; for example,
ሁለት/hult (two) becomes ሁለተኛ/hultegna (second). There are also numerals that
indicate distribution and partitions of a whole, like ሁለት ሁለት/hult hult (two
two), ግማሽ/gemash (half), ሲሶ/siso (one third), and ሩብ/rub (quarter) [47].

3.5. Syntactic Structure of Amharic


Since Amharic word formation follows its own structure, the syntax of the language also exhibits
a unique structure. The syntactic structure of Amharic is generally S+O+V (Subject + Object +
Verb). The modifiers in such a structure generally precede the words or phrases they modify.
For example, the Amharic equivalent for the English sentence she has understood


mathematics. is In the sentence is the subject and the object is


and the verb is . But usually pronouns are omitted when used as a subject. For the
above English sentence the common saying in Amharic is by implicitly
understanding the pronoun [60].

3.6. Problems in Retrieving Amharic Text


Here, some of the problems in Amharic text retrieval systems that are caused by the nature of the
writing system of the language are discussed.

3.6.1. Redundancy of Some Characters


In the Amharic alphabet there are some different symbols that have the same pronunciation
(sound). Although in the Geez language these different symbols give each word a different
meaning, in the Amharic language they are used interchangeably [61] [58]. The presence of
these redundant characters with the same sound creates a problem, especially in
term matching retrieval systems. Literally different words can be formed by combining the
different forms of the same-sound characters. For instance, the same word tsehay (sun) can
be written differently as , , , , etc.
Symbols with the same sound fall into two classes. The first class includes characters with
the same sound for the first and fourth order. These are and , and , and , and , and
and . The second class includes characters with different alphabets that share the same sound.
These characters are , and , and , and , and and .
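
One common way to reduce this problem before term matching is to map every same-sound symbol to a single canonical form. The Java sketch below illustrates the idea for the second class only. The homophone families it assumes (ሐ and ኀ mapped to ሀ, ሠ to ሰ, ዐ to አ, and ፀ to ጸ) are the ones usually normalized in Amharic text processing and are an assumption here, because the characters listed above did not survive in the text; the class name is also hypothetical. It relies on each consonant's seven orders being consecutive Unicode code points.

```java
/** Maps homophone Ethiopic consonant rows to one canonical row, order by order. */
final class HomophoneNormalizer {
    // Each pair: {first-order code point of the variant, first-order code point of the canonical form}.
    // The choice of families is an assumption; see the surrounding text.
    private static final int[][] ROWS = {
        {0x1210, 0x1200},  // ETHIOPIC SYLLABLE HHA -> HA
        {0x1280, 0x1200},  // ETHIOPIC SYLLABLE XA  -> HA
        {0x1220, 0x1230},  // ETHIOPIC SYLLABLE SZA -> SA
        {0x12D0, 0x12A0},  // PHARYNGEAL A          -> GLOTTAL A
        {0x1340, 0x1338},  // ETHIOPIC SYLLABLE TZA -> TSA
    };

    static char normalize(char c) {
        for (int[] row : ROWS) {
            int order = c - row[0];
            if (order >= 0 && order < 7) {       // only the seven basic orders are remapped
                return (char) (row[1] + order);  // keep the same order in the canonical row
            }
        }
        return c;                                 // every other character is left unchanged
    }

    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(normalize(text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two spellings of tsehay (sun) become identical after normalization.
        String a = normalize("\u1340\u1210\u12ED");
        String b = normalize("\u1338\u1200\u12ED");
        System.out.println(a.equals(b));
    }
}
```

The first class (first-order and fourth-order forms with the same sound) is not handled by this sketch and would need an additional mapping.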

3.6.2. Formation of Compound Words


In the Amharic writing system there is no agreed-upon standard for spelling compound words
[100]. They are sometimes written as a single word and at other times as two separate words.
For example, the word megneta bet which means bed room can possibly be written as
and and also the word bet mekides which means temple can be written
as and ' .
Occasionally, the constituent terms may have a completely different meaning from the compound
word formed from them. For example, the word hode-sefee (ሆደ-ሰፊ), which means tolerant,
has a different meaning from its constituent terms hode, which means stomach, and sefee,
which means wide. In literal term matching retrieval systems, the constituent terms of the
compound noun are considered independent, and a document that contains one of these
terms is treated as relevant. This phenomenon results in the retrieval of irrelevant materials for
a query that contains one of the constituent terms. However, concept-based retrieval systems, like
LSI, can partially handle this problem, as the co-occurrence frequency of the constituent terms is
taken into account in determining the relation between the terms [62] [63].

3.6.3. Existence of Irregular Spelling


A number of words in Amharic can be written with different spelling [58]. For example, the
word samtoal which means he has heard may be spelled as , and and
also the frequently used term Ethiopia can also be spelled as and [58].
Literal term matching retrieval systems also suffer from this irregularity in spelling, as the same
word can have different spellings.
Transliteration of foreign words into the Amharic writing system is one of the main causes of
this irregular spelling of words (ibid). The Amharic language lacks some basic English sounds.
As stated in [58], about six vowels and three consonant sounds common to English are absent in
Amharic. As a result, a native Amharic speaker may fail to pronounce some English
words correctly. The situation is similar for other foreign languages. Hence, each writer tends to
write a foreign word the way he/she pronounces it.

3.7. Amharic Software


Fundamentally, computers just deal with numbers. They store letters and other characters by
assigning a number to each one [64]. Amharic characters do not have a representation in the
ASCII (American Standard Code for Information Interchange) code table. Different
Amharic word processing software therefore makes use of the ASCII codes for writing Amharic by
associating the English keyboard keys with Amharic symbols. Since the number of Amharic
characters together with punctuation marks is much greater than in English, two or three keys are
used to represent a single Amharic symbol [65].


Different Amharic word processing software packages have been developed since 1987 (e.g. Power
Geez, Geez, Agafari, Visual Geez, Ethiopic, etc.) [66]. These packages use the same
English keyboard differently. That is, two Amharic word processors can use the same key to
represent two different Amharic characters. As a result, whenever data is passed between
different Amharic word processing software, that data runs the risk of corruption. ECoSA
(Ethiopian Computer Standards Association), a professional association, is working to solve the
problems that result from the inconsistency among the available Amharic software.
Most of the packages are written to work only with Microsoft Word. However, there are a few
which can work in other programs. Visual Geez is one of the exceptions. Visual Geez has
two versions, VG2 and VG2000, developed for different versions of Microsoft Office
products. Both the test collection and the sample queries used in this research are written in the
VG2 version of Visual Geez [67].


CHAPTER FOUR
METHODS AND ALGORITHMS
4.1. Architecture of the System
In this chapter, we present our architecture for an Amharic language interface to database.
The interface accepts an Amharic sentence as input and generates an
SQL query. The generated query is then executed on the actual database, and the results are
retrieved and displayed to the user. The input is analyzed semantically using a
domain-dependent dictionary.

Figure 4.1: Architecture of Amharic Language Interface for database


4.2. Natural Language Interface for Text Retrieval from DB


Natural language is the language that is used by almost all human beings for communication in
the real world. The aim of a Natural Language Interface to Database is to provide an interface
through which users can interact with a database more easily using their natural language and
access or retrieve their information in the same language [68]. In our case, a user query written in
Amharic is accepted as input to the natural language interface to retrieve
data or information from the database. This interface is the first part of the system that the user
encounters, and it is responsible for passing the user query into the application. Using this
interface, the user expresses the query in his/her native language. The interface thus acts as a
bridge through which users communicate with and access a database. No additional knowledge is
required to use this interface; knowledge of the natural language (Amharic) is enough,
because all complex processing is handled inside the system.

4.3. Query pre-processing


The natural language (Amharic) input first goes through a pre-processing phase. The
pre-processing of the input query includes tokenization, stop word removal, spelling checking,
normalization, and stemming. After pre-processing, each user token is converted into a base word,
which is then identified as a table name, column name, condition word, or
keyword.

Tokenize
In the first part of the process, the Amharic input sentence is tokenized. During
tokenization, the sentence is broken down into words called tokens. These tokens are
stored in an array list or hash map. Tokens may represent the names of tables, columns, rows,
commands, or operations, or they can be values or non-useful words. As presented in [69],
unnecessary repeated words called stop words are removed, and the remaining words are stored
in the array list.
Tokenizing a given text depends on the characteristics of the language in which it is
written. The Amharic language has its own punctuation marks which demarcate words in a
stream of characters. These include the colon ፡ (hulet netib), the four dots or double colon ።
(arat netib), the semicolon ፤ (derib serez), the comma ፣ (netela serez), the exclamation mark !
(qalagano), and the question mark ? (teyakimeleket). Nowadays, however, hulet netib (፡) has been
replaced by white space and is no longer used to separate words. To make Java read Amharic
Unicode text correctly, the system uses the following code.
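// Requires java.io.* and java.nio.charset.StandardCharsets.
// The file is decoded as UTF-8 so that Amharic (Ethiopic) characters are read correctly.
FileInputStream fileInput = new FileInputStream(filePath);
InputStreamReader inputStream = new InputStreamReader(fileInput, StandardCharsets.UTF_8);
StringBuilder stringBuilder = new StringBuilder();
String lineContent;
// try-with-resources closes the reader (and the underlying streams) when reading ends
try (BufferedReader bufferedReader = new BufferedReader(inputStream)) {
    while ((lineContent = bufferedReader.readLine()) != null) {
        stringBuilder.append(lineContent);
    }
}
String content = stringBuilder.toString();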
Algorithm for handling Unicode characters
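The tokenization step described above can be sketched in Java as follows. This is a minimal illustration and not the system's actual code: the class name AmharicTokenizer and the method tokenize are hypothetical, and the choice of delimiters (white space, the Ethiopic punctuation range U+1361 to U+1368, and a few ASCII marks) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative tokenizer; class and method names are hypothetical. */
final class AmharicTokenizer {

    // Assumed delimiters: white space, the Ethiopic punctuation range
    // U+1361..U+1368 (wordspace, full stop, comma, semicolon, colon,
    // preface colon, question mark, paragraph separator) and some ASCII marks.
    private static final String DELIMITERS = "[\\s\\u1361-\\u1368!?,;:]+";

    /** Splits a query into tokens, keeping the original word order. */
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        for (String part : query.split(DELIMITERS)) {
            if (!part.isEmpty()) {   // split() may yield a leading empty string
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

The tokens are returned in sentence order because the later rules depend on the position of each token relative to the table name.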
Stop words removal: Not all words of a user query, or of the database documents, have equal
value for mapping a database query. The least important words are called stop words. Stop words
are non-context-bearing words, also known as noisy words, which are excluded from the input
sentence to speed up processing. Stop words do not represent objects or concepts of the world,
and in our case they do not represent table names, column names, or column values. They often
belong to syntactic classes such as articles, pronouns, particles, and prepositions. These words
contribute little to query mapping and similarity identification. Thus, they can be removed from
the text by comparing each term with a list of common words developed for a particular language
and sometimes for a particular domain. Stop word removal should be done carefully; otherwise it
may degrade the results of the system.
Sample stop words in these categories are: , , , , , , etc.

Processes for removing stop words from user query



Input user keywords in Amharic language
Split user keywords into tokens = UQs
Receive stop words from hash map = STP
For (each token in UQs) do
    Compare the token with STP
    If the token matches an entry in STP
        Remove the token from the token list
    Else
        Store the token in a new hash map
    End if
End Loop
Algorithm for removing stop words
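The stop word removal procedure above could look like the following in Java. This is a sketch under stated assumptions: the class StopWordFilter is hypothetical, the stop word list is supplied by the caller, and the surviving tokens are kept in a list (instead of a hash map) so that their sentence order is preserved for the later position-based rules.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative stop word filter; the class name is hypothetical. */
final class StopWordFilter {

    private final Set<String> stopWords;

    StopWordFilter(Set<String> stopWords) {
        // Defensive copy, so later changes to the caller's set have no effect.
        this.stopWords = new HashSet<>(stopWords);
    }

    /** Returns the tokens that are not stop words, in their original order. */
    List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

In use, the filter would be constructed once from the stop word table and applied to the token list of every query.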

Spelling Checker
A spell checker is a tool that checks the spelling of the words in the user query, validates
them (i.e., checks whether they are correctly spelled), and, where it has doubts about the
spelling of a word, suggests possible alternatives. The two core functionalities provided by a
spell checker are spelling error detection and spelling error correction. Error detection
verifies the validity of a word in the language, while error correction suggests corrections for
a misspelled word.
The researchers use the spelling checker developed by Tefery Kebebew at Jimma University. This
spelling checker includes Amharic Unicode writing tools, so users can write their query without
changing the system writing mode. The tool performs well, with a few dictionary errors.

Stemmer
For grammatical reasons, documents use different forms of a word. There are families of
derivationally related words with similar meanings, for instance words that share the root
democrat but have different morphological variants: democracy, democratic, and
democratization [70]. The goal of stemming is to reduce inflectional forms, and sometimes
derivationally related forms, of a word to a common base form. Stemming is also used to reduce
the size of the dictionary (i.e., the number of distinct terms used in representing a set of
documents). A smaller dictionary requires less storage space and less processing time.
According to Atelach A. [57], the stemmer finds all possible segmentations of a given word
according to the morphological rules of the language and then selects the most likely prefix and
suffix for the word based on corpus statistics. It strips off the prefix and suffix, and then
tries to look up the remaining stem (or, alternatively, some morphologically motivated variants
of it) in a dictionary to verify that it is a possible stem of the word.
The Amharic language makes use of prefixing, suffixing, and infixing to create inflectional and
derivational word forms. In a morphologically complex language like Amharic, a stemmer plays an
important role in information retrieval. As presented in [60], stemming a given word relies on
exception list and normalization list files. Normalization is used to correct a variant of a word to
its stem after suffix is removed for some words (e.g. for a word will be removed as
affix and will be normalized to which is the stem).
Like the parser discussed in Section 1.4.4, stemmers for the Amharic language have been developed
only at the research level; no practical tool has been released yet. So, to gain the advantages
of stemming, we have developed a limited lookup-table stemmer based on the column names, table
names, and conditions of our particular system.
TOKEN

STEMMED

Table 4. 1: Sample stemming table
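A lookup-table stemmer of the kind described above can be sketched as a simple map from surface forms to stems. The class LookupStemmer and its methods are hypothetical names, and the entries in the usage note are transliterated placeholders, not the contents of Table 4.1.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative lookup-table stemmer; names are hypothetical. */
final class LookupStemmer {

    // surface form (as it may appear in a query) -> stem used by the dictionary
    private final Map<String, String> stemTable = new HashMap<>();

    void addEntry(String surfaceForm, String stem) {
        stemTable.put(surfaceForm, stem);
    }

    /** Returns the stem from the table, or the token itself when it is not listed. */
    String stem(String token) {
        return stemTable.getOrDefault(token, token);
    }
}
```

For instance, after addEntry("serategnoch", "serategna") (a transliterated placeholder pair), stem("serategnoch") gives "serategna", while any token missing from the table is passed through unchanged.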


Normalization
As discussed in Chapter 3, one of the problems in the Amharic writing system is the
variation of alphabets (fidels) used with the same pronunciation. Tessema [71] has developed an
analyzer for normalizing documents to a specific form of a letter such as and to and ,
and , to and and to as well as their orders (, , , etc. , , , etc.) [71]. In addition
to the above normalizations, Yimam [72] investigated and found that some other orders of the
letters should also be normalized. For example , , , , and should be normalized to .
Similarly the characters , , , ; , ; and so on should be normalized to one form as they are
being used interchangeably in documents.
For each word in a token list
For each character in a word
If the character is any one of , , or, any other order thereof then
Replace it by continue
Else if the character is any one of , , or any other order thereof then
Replace it by continue
Else if it is or any other order thereof
Replace it by continue
Else if it is
Replace it by continue
Else if it is or any other order thereof
Replace it by continue
End if
End for
End for
Algorithm for normalizing the input words
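The character-level normalization can be sketched in Java by shifting each variant letter to the matching order of its canonical series. The letter groups below are an assumption: they are the homophone series commonly normalized in Amharic text processing, and they may differ from the exact lists used by [71] and [72]. The class name AmharicNormalizer is hypothetical, and code points are written as numbers so that the source stays ASCII-only.

```java
/** Illustrative homophone normalizer; the class name is hypothetical. */
final class AmharicNormalizer {

    // Each row: {first code point of a variant series, first code point of
    // the canonical series}. The groups are an assumption (see lead-in).
    private static final int[][] SERIES = {
        {0x1210, 0x1200}, // HHA series -> HA series
        {0x1280, 0x1200}, // XA series -> HA series
        {0x1220, 0x1230}, // SZA series -> SA series
        {0x12D0, 0x12A0}, // pharyngeal A series -> glottal A series
        {0x1340, 0x1338}, // TZA series -> TSA series
    };

    // Only the seven basic orders of each series are mapped.
    private static final int ORDERS = 7;

    static String normalize(String word) {
        StringBuilder out = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            for (int[] pair : SERIES) {
                int offset = c - pair[0];
                if (offset >= 0 && offset < ORDERS) {
                    // Same order (vowel form), canonical consonant.
                    c = (char) (pair[1] + offset);
                    break;
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Mapping by offset within the series handles "any other order thereof" without listing every character; characters outside the listed series are copied unchanged.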


In addition, compound words are handled so that they map correctly to table names and column
names.
FIRST_WORD

SECOND_WORD

CONCATENATION

Table 4. 2: A Sample for compound words

4.4. Semantic analysis


Our system has a domain-dependent dictionary, and the user query is analyzed semantically on the
basis of this dictionary. Whenever users enter a query, the table name and column names are
analyzed semantically.

4.5. Mapping of the User Query


The aim of a natural language interface is to let the user do computing in a natural way. For
this purpose, we have designed a domain-specific dictionary that keeps the synonyms of the column
and table names. Including synonyms makes it possible for the user to write a sentence in
different natural ways. We call this dictionary a semantic dictionary [36], because it holds no
syntactic information for tokens. The semantic (system) dictionary contains the synonyms for each
column name, condition, and table name. It is manually constructed and is database specific. The
dictionary is not a huge corpus; rather, it has entries according to the number of entities in
the database. The tokens produced by pre-processing are accepted as input for mapping the user
query to SQL. Based on the system dictionary, the system maps the user query to table names,
column names, and conditional words. The system dictionary contains the Table_ Handling table,
the Column_ Handling table, and the Conditional_ Word table for mapping the natural language
input query to SQL. The Table_ Handling table contains all table names available in the actual
database together with the different ways they may be expressed in a sentence, whereas the
Column_ Handling table contains all column names of all tables in the actual database together
with the different ways they may be expressed in a sentence. Thirdly, the Conditional_ Word
table contains the conditional symbols, such as <, >, <=, and >=, together with the names by
which they are expressed in a natural language sentence. The resulting logical query is then
passed to the query generator. The overall mapping is presented in the following tables [1].
Table_ Handling Table
Token Word

Mapped words

EMPLOYEE

EMPLOYEE

EMPLOYEE_ON_EDUCATION

Table 4. 3: Table_ Handling Table


Column_ Handling Table
Token Words

Mapped words

Id

SEX

FILDE_OF_STUDY

HIRE_DATE

DEPARTMENT

COLLEG

SALARY

NAME

FILDE_OF_STUDING

POSITION

Table 4. 4: Column_ Handling Table


Conditional_ Word Table


Token words        Mapped Words
||                 <
||                 >
|| ||              <=
|| ||              >=
||                 !=

Table 4. 5: Conditional_ Word Table


The system searches for each token in the Table_ Handling table in the database. If a token
matches one of the entry words in this table, the system maps that entry to the table name and
thus finds the table with which the query is associated. If the system does not find a table
name, the query is treated as invalid.
After finding the table name, the system searches for the tokens in the Column_ Handling table.
If a token matches one of the entries in this table, the system maps that entry to the column
name and thus finds the column with which the query is associated. If no column name is mapped,
the system executes the query select * from tablename.
Furthermore, when the column handling step finds a column name, the system checks whether the
column is used in a condition that restricts the displayed result or as a selection to be
displayed. If the column name is used for selection, the system maps it as select (column
name), [column name] from table name. If the column name is used for a condition, the system
builds a conditional query by identifying the column name and the column value. A conditional
word is used to make the comparison between the column name and the column value. The mapping
algorithm is as follows:
Normalized input words = normalized_words;
Connect to MySQL database;
Retrieve token_word and mapped_word from database table and store in a new hashmap;
Get first token from normalized_words by using StringTokenizer;
While (tokens still exist in normalized_words) {
    Iterate the hashmap;
    While (iterator has next value) {
        Get token_word;
        Get mapped_word;
        If (token_word contains token) {
            TableName/ColumnName = mapped_word;
        }
    }
    token = next token;
}
Algorithm for Mapping Table Name and Column Name
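The mapping algorithm above can be sketched in Java with an in-memory map standing in for the dictionary tables. The class DictionaryMapper is hypothetical, and so is the assumption that alternative token words for one entry are stored in a single string separated by "||". Unlike the substring test in the pseudocode, this sketch matches each alternative exactly, which avoids accidental matches on short tokens.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative dictionary lookup; names and entry format are assumptions. */
final class DictionaryMapper {

    // one token word alternative -> mapped database identifier
    private final Map<String, String> lookup = new HashMap<>();

    /** Registers an entry; alternatives are assumed to be separated by "||". */
    void add(String tokenWord, String mappedWord) {
        for (String alternative : tokenWord.split("\\|\\|")) {
            String trimmed = alternative.trim();
            if (!trimmed.isEmpty()) {
                lookup.put(trimmed, mappedWord);
            }
        }
    }

    /** Returns the mapped table or column name, or null when the token is unknown. */
    String map(String token) {
        return lookup.get(token);
    }
}
```

In the real system the entries would be loaded from the Table_ Handling, Column_ Handling, and Conditional_ Word tables; here they are added by hand to keep the example self-contained.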

4.6. SQL Query Generation


This component takes its input from the query mapping step and generates the SQL query. To
develop an algorithm for generating SQL queries, the structure of natural language requests has
been analyzed, covering sentence types structured in different ways. We have categorized user
queries into three groups: queries that select the whole table, queries that select certain
columns from a table, and queries that select certain rows from certain columns (queries using a
WHERE condition). The remaining database concepts, namely aggregate functions, grouping, and
ordering, are discussed later in this section.
Once we identify which token is a column name and which is a table name, the first and second
categories are straightforward, because there is no attribute-value relation in the sentence.
The third category needs additional investigation to decide, among the columns identified in the
sentence, which are used for selection and which are used for the condition. We therefore
analyze the syntactic structure of Amharic sentences used for querying a database. The sentence
or question contains the name of a table, an attribute, or a value. For instance, when the user
wants to retrieve the whole table, the query looks like [] [] []. The equivalent English meaning
of each word is [all] [employees] [display] respectively. This structure has a table name
(/employee) and no column name is included in the sentence. So this sentence gives the SQL
query SELECT * FROM EMPLOYEE;, which displays the entire content of the table.
In the second case, to retrieve a certain column from the table, the sentence must include the
column name. For instance, when the user wants to display the names of the employees, the
appropriate Amharic query is [] [] [] [], and this structure reads [Employees] [Name] [List]
[display] respectively. In this structure /Employee is a table name, /name is a column name, and
the remaining words make no contribution to generating the query in this system, because the
principal words for generating the query are the table name and the column name. So the query
is SELECT NAME FROM EMPLOYEE;. In this sentence the column name has no column value, i.e., the
column name is used for selection only.
Likewise, if the sentence contains more than one column name and each of them is recognized as
used for selection only, the column names are listed directly after the SELECT keyword. For
example, when the user wants to retrieve the name and the salary of the employees, the request
is stated in Amharic like . From the given sentence /name and /salary are identified as column
names with no column value; that is, both column names are used for selection only. The expected
SQL query for the above sentence is SELECT NAME, SALARY FROM EMPLOYEE.
In the third case, when the user wants to retrieve certain rows from certain columns, the system
must distinguish the column names used for the condition from those used for selection. To do
so, we analyzed questions presented in different request forms. For example, for a query such as
retrieve the name and id number of employees whose sex is male, the Amharic request is
structured like [] [] [] [] [] [] [] [] []. This structure reads [their sex] [male] [been]
[employees] [name] [and] [identification] [numbers] [display]. In this sentence /employee is a
table name and /sex, /name, and _/idnumber are column names. Here /sex is a column name used
for a condition that restricts the retrieved results.
As another example, consider display the name of the employees who were hired in 2005. This
sentence in Amharic is [] [/] [] [] [] []. This structure reads [in 2005] [year] [hired]
[employees] [name] [display]. In the given sentence /employee is a table name, and /hired and
/name are column names. The token /hired is a column name used for a condition that restricts
the results to be displayed. In the sentences above, column names appear both before and after
the table name, and from this we identified a rule: a token recognized as a column name before
the table name is treated as a column name used for a condition, and a token recognized as a
column name after the table name is treated as a column name used for the selection statement.
However, there is an exception to this rule for identifying the WHERE column. For example, [ ]
[ ] [] [] [ ] [] [] [] [] [] [] means select the name and sex of the employees who earn more
than 10000 birr and work in the accounting department. The structure reads [accounting]
[department] [employees] [their salary] [10000] [more than] [have been] [name] [and] [their sex]
[display]. This structure shows that the column name /salary appears after the table name
/employees. To handle this and similar exceptions, we check for the word /have_been after the
table name. If the word or appears after the table name, a column name located after the table
name and before is treated as a column name used for a condition.
In general, we conclude that IndexNumberOf(Cc) < IndexNumberOf(TableName) < IndexNumberOf(Cs),
where Cc is a column name used for a condition and Cs is a column name used for selection.
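The position rule and its exception can be illustrated with a short Java sketch. It assumes that the tokens have already been mapped to database identifiers and are kept in sentence order; the class ColumnRoles is hypothetical, and the marker argument is a placeholder for the Amharic "have been" word discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative split of column names into condition and selection roles. */
final class ColumnRoles {

    final List<String> condition = new ArrayList<>();
    final List<String> selection = new ArrayList<>();

    /**
     * tokens: mapped tokens in sentence order; columnNames: known column names;
     * marker: placeholder for the "have been" keyword (an assumption).
     */
    static ColumnRoles classify(List<String> tokens, String tableName,
                                Set<String> columnNames, String marker) {
        int tableIndex = tokens.indexOf(tableName);
        if (tableIndex < 0) {
            // No table name: the sentence cannot be translated.
            throw new IllegalArgumentException("no table name in query");
        }
        // Columns before the table name are conditions. If the marker occurs
        // after the table name, the boundary moves to the marker (exception).
        int boundary = Math.max(tableIndex, tokens.lastIndexOf(marker));

        ColumnRoles roles = new ColumnRoles();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (!columnNames.contains(token)) {
                continue;
            }
            if (i < boundary) {
                roles.condition.add(token);
            } else {
                roles.selection.add(token);
            }
        }
        return roles;
    }
}
```

For the token sequence SEX, male, EMPLOYEE, NAME, ID the sketch puts SEX in the condition list and NAME and ID in the selection list; when the marker follows the table name, columns between the table name and the marker also become condition columns.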


KEY: Table name, Column name, Column value, Conditional word, Keywords


More Condition and Aggregate Function on SQL Concepts


Order By: - This query is used to display the results in ascending or descending order. In
natural language (Amharic), order by is expressed in a sentence with a word like . This word is
found in the sentence after the table name, i.e.:
0 <= IndexNumberOf(TableName) < IndexNumberOf(order-by keyword).
The column name used for ordering is located at:
IndexNumberOf(order-by keyword) - 1.
Based on this word we formulate the rule to handle order by queries. For example, ; display all
employees whose salary is less than 2500 ordered by name. In this sentence the table name is the
5th token (array index 4) and the order-by keyword is the 7th token (array index 6), so
0 <= 4 < 6. Based on this rule the query is converted. The above natural language (Amharic)
query is converted into:
SELECT * FROM EMPLOYEE
WHERE Salary < 2500
ORDER BY Name;
Group By: - This query is used to display the results in groups. To recognize a Group By query
in the given input, we check for the word / in the sentence. This keyword may come after or
before the table name, depending on the user request. Based on the keyword we formulate the
group by query. For example, ; this means display all employees whose salary is less than 3550
grouped by their departments. This natural language query is converted into:
SELECT * FROM EMPLOYEE
WHERE Salary < 3550
GROUP BY Department;

Count (): - This query is used to count the results that fulfill the query or the condition. To
recognize this SQL function we look for the word in the given sentence. This keyword is found
after the table name in the sentence. For example, , which means display the total number of
female employees. This query is converted into:
SELECT COUNT(*) AS TOTAL FROM EMPLOYEE
WHERE Sex = ;
SUM (): - This query is used to add up the column values that meet the specification. To
formulate the SUM() query, we identify the word in the given sentence. This keyword is found
after the table name and follows the Order By rule. For example, ; which means display the
total salary in the management department. This query is converted into:
SELECT SUM(Salary) AS TOTAL_SALARY FROM EMPLOYEE
WHERE Department = ;
MAX (), MIN (), AVG (): - These queries are used to select the maximum, minimum, and average of
the values respectively. To formulate the query, we identify the word for maximum, for minimum,
and for average in the given Amharic sentence. For example, ; this means display the maximum
salary in the computer science department. This query is converted into:
SELECT MAX(Salary) AS MAXIMUM_SALARY FROM EMPLOYEE
WHERE Department = ;
From the analysis above, we have identified the following rules for developing the algorithms.


RULE #1: If the sentence does not contain a table name, the sentence is invalid for
translation.
RULE #2: If the sentence contains a table name only, the query is a selection of the whole
table. SELECT * FROM TABLE_NAME;
RULE #3: If the sentence contains both a table name and column names, and a column name is
positioned after the table name, the column name is used for selection. SELECT
(COLUMN_NAME), [COLUMN_NAME] FROM TABLE_NAME;
RULE #4: If the sentence contains both a table name and column names, and a column name is
positioned after the table name and the token is found after that column name in the sentence,
the column name is a column name for a condition.


RULE #5: If the sentence contains both a table name and column names, and a column name is
positioned before the table name, the column name is used for the condition. SELECT * FROM
TABLE_NAME WHERE COLUMN_NAME = COLUMN_VALUE;
RULE #6: If the sentence contains both a table name and column names, and column names are
positioned both before and after the table name, a column name placed before the table name is
a column name for a condition (COLUMN_NAMEc) and a column name placed after the table name is a
column name for selection (COLUMN_NAMEs). SELECT (COLUMN_NAMEs), [COLUMN_NAMEs] FROM TABLE_NAME
WHERE COLUMN_NAMEc = COLUMN_NAMEc_VALUE;
RULE #7: If the sentence contains both a table name and column names, the column name is
positioned before the table name, and the column name is at the beginning of the sentence, then
the column value is the word or words located after the column name (stop words having been
removed). If the word placed after the column value is a conditional word, the sign is the
mapped_word of that conditional word; otherwise the sign is the equals sign.
RULE #8: If the sentence contains both a table name and column names, the column name is
positioned before the table name, and the word after the column name is a table name or column
name, then the column value is the word or words placed before the column name (stop words
having been removed), and the sign is the equals sign.
RULE #9: If the sentence contains both a table name and column names, the column name is
positioned before the table name, the word before the column name is a conditional word, and
the word after the column name is either a table name or a column name, then the word before
the column name is the condition (the sign) and the value is the word placed before the
condition.
RULE #10: If the natural language sentence contains the word , the query includes the
COUNT() function.
RULE #11: If the natural language sentence contains the word , the query includes the AVG()
function, and the column to be averaged is found after the word .
RULE #12: If the natural language sentence contains the word , the query includes the MAX()
function, and the column to be compared is found after the word .
RULE #13: If the natural language sentence contains the word , the query includes the MIN()
function, and the column to be compared is found after the word .
RULE #14: If the natural language sentence contains the word , the query includes the SUM()
function, and the column to be added is found before the word .
RULE #15: If the natural language sentence contains the word , the query includes a condition
with the keyword BETWEEN. The first value is found before the word and the second value is
found after the word .
RULE #16: If the natural language sentence contains the word , the query includes a GROUP BY
clause. The column used for grouping is found before the word .
RULE #17: If the sentence contains both table names and column names, and the column names
belong to different tables, the tables are joined with the keyword INNER JOIN.
RULE #18: If the sentence contains both a table name and column names, and the sentence
contains the keyword /, the query includes LIKE in the WHERE condition. The value used for the
condition is found before the word /, and the column name to be checked is found before that
value.
RULE #19: If the sentence contains both a table name and column names, and it contains the
keyword , the query includes an ORDER BY clause. The column name used for ordering is found
before the word .

4.6.1. Algorithm to Handle Rules


To handle the rules identified from the sentences, we have constructed the following algorithms.
A natural language input NL contains X tokens (K1, K2, ..., KX), and these tokens are stored in
an array A. In our database, a table T has N columns (C1, C2, ..., CN) and M rows (R1, R2, ...,
RM). An NL input contains a column name C, a table name T, a column value V, and a conditional
word W. Column names are further divided into column names for conditions, Cc, and column names
for selection, Cs. The dictionary D contains table names T with table mappings Tm, column names
C with column mappings Cm, and conditional words W with condition mappings Wm. The array is
separated into three groups, A1, A2, and A3. A1 runs from the beginning of the array, A[0], to
the element just before the table name, A[T-1] (T here standing for the array index of the table
name); A2 = A[T]; and A3 runs from A[T+1] to the end of the array. The column names found in A1
are column names for conditions (WHERE), and A1 contains attribute, sign, and value (column
name, condition, and column value), i.e., WHERE Cc1 W1 V1 AND/OR Cc2 W2 V2 AND/OR ... AND/OR CcN
WN VN. A2 contains the table name T, placed after the FROM keyword as FROM T. Finally, A3
contains the column names used for selection, placed after the SELECT keyword as SELECT Cs1,
Cs2, ..., CsN. A3 also contains the aggregate functions SUM, AVG, MAX, MIN, and COUNT. Additional
clauses such as group by and order by are collected in Po. The final output is therefore the
combination of A1, A2, A3, and Po, and it looks like SELECT Cs1, Cs2, ..., CsN FROM T WHERE Cc1
W1 V1 AND/OR Cc2 W2 V2 AND/OR ... AND/OR CcN WN VN Po;. To map the table name, column names, and
conditions, each token is checked against the dictionary. The column names and table name have
been identified by the mapping algorithm above and placed in token_word and mapped_word.
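The final combination of A1, A2, A3, and Po can be sketched as a small Java query builder. This is an illustration under stated assumptions, not the system's code: the class SqlAssembler and its Condition type are hypothetical, a single connective (AND or OR) is used for all conditions, and values are embedded as literals only to show the shape of the output (a PreparedStatement with parameters would be the safer choice in practice).

```java
import java.util.List;

/** Illustrative assembly of the final SQL string; names are hypothetical. */
final class SqlAssembler {

    /** One WHERE term: condition column (Cc), sign (W), and value (V). */
    static final class Condition {
        final String column;
        final String sign;
        final String value;

        Condition(String column, String sign, String value) {
            this.column = column;
            this.sign = sign;
            this.value = value;
        }
    }

    /** po holds an optional trailing clause such as GROUP BY or ORDER BY. */
    static String build(List<String> selectColumns, String table,
                        List<Condition> conditions, String connective, String po) {
        StringBuilder sql = new StringBuilder("SELECT ");
        // No selection column means the whole table is selected.
        sql.append(selectColumns.isEmpty() ? "*" : String.join(", ", selectColumns));
        sql.append(" FROM ").append(table);
        for (int i = 0; i < conditions.size(); i++) {
            Condition c = conditions.get(i);
            sql.append(i == 0 ? " WHERE " : " " + connective + " ");
            sql.append(c.column).append(' ').append(c.sign).append(' ').append(literal(c.value));
        }
        if (po != null && !po.isEmpty()) {
            sql.append(' ').append(po);
        }
        return sql.append(';').toString();
    }

    // Numbers are written as-is; other values are quoted with ' doubled.
    private static String literal(String value) {
        if (value.matches("-?\\d+(\\.\\d+)?")) {
            return value;
        }
        return "'" + value.replace("'", "''") + "'";
    }
}
```

With an empty selection list and no conditions the builder returns SELECT * FROM EMPLOYEE;, which corresponds to the whole-table case; selection columns, conditions, and the trailing clause are appended in the order given in the text.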

Algorithm to handle A2
Int Table_location = -1;
User Input = Input;
Put the input on ArrayList = input_list;
For (initial = 0; initial < input_list.size(); initial++) {
    IF (input_list.get(initial) equals (Token_word)) {
        Table_location = initial;
        Break;
    }
}
This algorithm acquires the array location of the table name, and based on this location we
formulate column name for condition and column name for selection.

Algorithm to handle select column


// column name for selection
FOR (initial = Table_Location; initial < input_list.size(); initial++) {
    IF (Array value of initial = column name) {
        IF (initial + 2 < input_list.size() AND input_list.get (initial + 2) contains ()) {
            The column name used for condition;
            The column name = input_list.get (initial);
            The column value = input_list.get (initial + 1);
            initial += 2;
        }
        ELSE {
            The column used for selection;
            The column name = input_list.get (initial);
        }
    }
}

Algorithm to handle where condition


// column name for condition
FOR (initial = 0; initial < Table_Location; initial++) {
    IF (Array value of initial = column name) {
        IF (input_list.get (initial + 1) != TableName OR ColumnName OR Keywords) {
            The column name = input_list.get (initial);
            The column value = input_list.get (initial + 1 ... until TableName OR
                ColumnName OR Keyword exists); /* initial value increased by one until table name
                or column name or keyword exists */
        }
        ELSE IF (input_list.get (initial + 1) = TableName OR ColumnName OR Keywords) {
            The column name = input_list.get (initial);
            The column value = input_list.get (initial - 1 ... until TableName OR
                ColumnName OR Keyword exists OR initial = -1);
        }
        ELSE {
            Where condition is not on the query
        }
    }
}
Algorithm to handle aggregate function
// to handle aggregate function
FOR (initial = 0; initial < input_list.size(); initial++) {
    IF (Array value of initial = column name) {
        IF (input_list.get (initial + 1) contains ()) {
            The Query = SUM ();
            The column to be added = input_list.get (initial);
            initial += 1;
        }
        ELSE IF (input_list.get (initial - 1) contains ()) {
            The Query = AVG ();
            The column to be calculated = input_list.get (initial);
        }
        ELSE IF (input_list.get (initial - 1) contains ()) {
            The Query = MAX ();
            The column to be compared = input_list.get (initial);
        }
        ELSE IF (input_list.get (initial - 1) contains ()) {
            The Query = MIN ();
            The column to be compared = input_list.get (initial);
        }
        ELSE {
            The query does not contain aggregate function
        }
    }
}

Algorithm to handle Like, Between, Group by, and Ordered by Queries


// to handle Like, Between, Group by, and Order by queries
FOR (initial = 0; initial < input_list.size(); initial++) {
    IF (Array value of initial = column name) {
        IF (input_list.get (initial + 2) contains ()) {
            The Condition Sign = BETWEEN;
            The first value = input_list.get (initial + 1);
            The second value = input_list.get (initial + 3);
            initial += 2;
        }
        ELSE IF (input_list.get (initial + 2) contains ( OR )) {
            The Condition Sign = LIKE;
            The value = input_list.get (initial + 1);
            initial += 2;
        }
        ELSE IF (input_list.get (initial + 1) contains ()) {
            The Query = ORDER BY;
            The column to be compared = input_list.get (initial);
        }
        ELSE IF (input_list.get (initial + 1) contains ()) {
            The Query = GROUP BY;
            The column to be grouped = input_list.get (initial);
        }
        ELSE {
            The query does not contain Like, Between, Group by, or Order by
        }
    }
}
We also propose joining two or more tables. To do so, we identify the table to which each column
name belongs. If the query contains column names from different tables, the tables are joined
with INNER JOIN.

4.7. SQL Query Execution


After the user's query is processed and the SQL query is generated, the next step is to execute
the query on the database. To execute the query there must be a connection between the
application program and the database. Through this connection the query is sent to, and executed
on, the database.


4.8. Result
The results retrieved from the database are sent back to the application program, which presents
them in a form that is understandable to the users. Finally, the result is displayed in the
interface so that the user can see the converted query as well as the result retrieved from the
database.


CHAPTER FIVE
IMPLEMENTATION, RESULTS and DISCUSSIONS
5.1. Introduction
As mentioned previously, in line with the main objective of this study, the researchers have
developed an Amharic language interface to database in java 8.02 with jdk 8 update 101
(1.8.0_101), with a MySQL database at the back end for storing the dictionary and the actual
database. This section of the chapter deals with the experimentation on the Amharic language
interface to database designed as discussed in 4.5. To use the system, users first enter the
query as an Amharic sentence; the system then displays the generated SQL and the results
retrieved from the database.
In the next few sections, the processes involved and the results obtained during the
researchers' experimentation with the system are discussed and presented in detail. This
includes the steps employed when users execute different types of queries, such as queries for
selection of the whole table, queries for selection of certain columns, queries with a single
condition, queries with multiple conditions, aggregate functions, join queries, and grouping
and ordering queries. In addition, the results of the researchers' evaluation of the proposed
system along different dimensions are reported. Another section of the chapter analyzes the
results of the experimentation, including the type of database used to validate the prototype,
the categories of the queries, and the forms of the queries. Finally, the overall performance
measurement of the system designed in this study is presented.

5.2. The User interface startup and operation


The user interface developed in this study is designed so that users can make their queries or
requests to the database in a simple and clear manner and can see the converted SQL query as
well as the results fetched from the database.


As displayed in Figure 5.1, to start accessing the database, users first enter their request as
a sentence in Amharic in the text box provided; they can then generate and execute the query by
clicking the GENERATE button. The system then converts the user's natural language query into a
selection of the whole table, a selection of certain columns from a table, or a selection of
certain rows from certain columns (conditions), including aggregate functions, grouping, and
ordering queries.

Figure 5. 1: User interface of Amharic language interface to database


Next, the results of the experiments with actual user queries are presented for the various
kinds of selection: list queries, queries with single and multiple conditions, aggregate
functions, and grouping and ordering queries.

5.3. Experiment on Query for Selection of the Whole Table


This type of query is executed when a user asks to display an entire table, for example all
employees. For instance, if a query such as  is entered by the user, the database retrieves and
displays the list of all employees with all of their information in rows and columns. The result
of such a whole-table selection is shown in figure 5.2.

Figure 5. 2: Example of Query for Selection of Whole Table


To describe how the SQL query and the data are obtained: first the system accepts the user query,
such as , then analyzes the input and identifies whether the tokens contain a table name, column
names, and/or conditions. From the given input the system identifies the table name  (Employee)
and finds that no column name or condition is included. The system then checks the rules
specified in the previous chapter. Because the query contains only a table name, RULE#2 applies
and the system identifies the request as a selection of the whole table. Finally, the system
converts it into SELECT * FROM Employee; and sends the query to the database, which fetches all
the columns and rows of the employee table and displays them. The query can be paraphrased in
different forms, such as: , etc.

5.4. Queries for Selection of Certain Column


Unlike the previous type, this type of query is made when a user wants to retrieve only selected
columns from the employee table.

Figure 5. 3: Example of Query for Selection of Certain Columns


For instance, to get the results displayed in figure 5.3, the user query paraphrased as  is first
entered into the text box provided. The system then analyzes the query and identifies the table
name, column names, and conditions based on the system dictionary. The given query contains the
table name /Employee and the columns /sex, /level, and /hire date. All of these column names are
positioned after the table name, and no column name is found before it. The system then checks
the rules specified in the previous chapter. According to RULE#3, if the user query contains both
a table name and column names, and the column names are positioned after the table name, the
column names are used for selection. So the system converts the query into SELECT SEX, LEVEL,
HIRE_DATE FROM Employee;. Finally, this query is sent to and executed on the database, which
retrieves and displays the sex, level, and hire date columns for all rows. Figure 5.3 shows the
result of such a query for selection of certain columns. The different paraphrases of such
queries are: , , , , , , etc.
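
The two rules used so far, RULE#2 for whole-table selection and RULE#3 for column selection, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the dictionaries `TABLES` and `COLUMNS` and the function `build_selection` are hypothetical names, and English placeholder tokens stand in for the Amharic dictionary entries.

```python
# Hypothetical dictionaries: the system maps Amharic words to schema names;
# English placeholder tokens stand in for those words here.
TABLES = {"employee": "Employee"}
COLUMNS = {"sex": "SEX", "level": "LEVEL", "hire_date": "HIRE_DATE"}


def build_selection(tokens):
    """Apply the whole-table rule (RULE#2) or the column-selection rule (RULE#3)."""
    table = None
    selected = []
    for token in tokens:
        if token in TABLES and table is None:
            table = TABLES[token]
        elif token in COLUMNS and table is not None:
            # A column named after the table name is used for selection.
            selected.append(COLUMNS[token])
    if table is None:
        raise ValueError("no table name found in the query")
    # No column names at all means the whole table is requested.
    column_list = ", ".join(selected) if selected else "*"
    return f"SELECT {column_list} FROM {table};"


print(build_selection(["employee"]))
print(build_selection(["employee", "sex", "level", "hire_date"]))
```

The first call corresponds to the whole-table example of section 5.3 and the second to the column selection example above.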

5.5. Queries with a Single Condition


This type of query is more specific than the previous ones: it retrieves data based on a single
chosen condition.

Figure 5. 4: Example of Query for Single condition


For example, a user who wants data about the employees that satisfy a single criterion, such as
their salary, can phrase the query as 5000 . This query contains the table name /Employee and the
column names /salary and /name. In this query the column name salary appears twice in the
sentence, once before the table name and once after it. The remaining column name, name, is found
after the table name. RULE#6 covers the case where column names are found both before and after
the table name: the column name placed before the table name is the column for the condition, and
the column names placed after the table name are the columns for selection. To identify the
column value the system applies RULE#7: if the column name is positioned at the beginning of the
sentence, the column value is the word or words located next to the column name; if the word
placed next to the column value is a conditional word, the operator is the mapped word of that
conditional word, otherwise the operator is the equal sign. As a result the query is changed into
SELECT SALARY, NAME FROM Employee WHERE SALARY > 5000; and the converted query and the retrieved
result are displayed on the user interface as shown in figure 5.4. The query can be written and
paraphrased as follows: , 5000 , 5000 , 4500 , etc.
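
The way RULE#6 and RULE#7 split a sentence into a condition part and a selection part can be sketched as follows. This is a simplified illustration, not the implementation used in this study: the token names, the `CONDITION_WORDS` mapping, and the function `build_single_condition` are hypothetical, and English placeholders again stand in for the Amharic words.

```python
# Hypothetical dictionaries with English placeholders for the Amharic entries.
TABLES = {"employee": "Employee"}
COLUMNS = {"salary": "SALARY", "name": "NAME"}
CONDITION_WORDS = {"greater": ">", "less": "<"}


def build_single_condition(tokens):
    """Convert a token list with one condition into SQL (RULE#6 and RULE#7)."""
    table_index = next(i for i, tok in enumerate(tokens) if tok in TABLES)
    before, after = tokens[:table_index], tokens[table_index + 1:]
    if len(before) < 2 or before[0] not in COLUMNS:
        raise ValueError("expected a column name and a value before the table name")

    # RULE#6: the column before the table name is the condition column.
    condition_column = COLUMNS[before[0]]
    # RULE#7: the value follows the column name; a conditional word after the
    # value is mapped to its operator, otherwise the operator is "=".
    value = before[1]
    operator = "="
    if len(before) > 2 and before[2] in CONDITION_WORDS:
        operator = CONDITION_WORDS[before[2]]

    # RULE#6: columns after the table name are used for selection.
    selected = [COLUMNS[tok] for tok in after if tok in COLUMNS]
    column_list = ", ".join(selected) if selected else "*"
    table = TABLES[tokens[table_index]]
    # The value is inlined only to mirror the generated text; real code
    # should bind it as a query parameter.
    return f"SELECT {column_list} FROM {table} WHERE {condition_column} {operator} {value};"


print(build_single_condition(["salary", "5000", "greater", "employee", "salary", "name"]))
```

With the placeholder tokens above, the sketch reproduces the converted query of this section; when no conditional word follows the value, the equal sign is used.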

5.6. Queries with Multiple condition


This type of query is executed to search for data that satisfies multiple conditions. In other
words, the user query targets items selected by several of their characteristics at once.

Figure 5. 5: Example of Query for Multiple Conditions


To demonstrate this, a user can phrase the query as

. From this example the system identifies /employee as the table name and /level of education,
/sex, /name, and /salary as column names. The column names level and sex are found both before
and after the table name. So, according to RULE#6, the column names found before the table name
are used for the conditions and the remaining ones are used for selection. Finally, the query is
converted into SELECT NAME, SEX, LEVEL, SALARY FROM Employee WHERE LEVEL = AND SEX = ; which
retrieves the data and displays it in the table shown in figure 5.5. The paraphrases of this type
of query are: 5000 , 5000 , etc.

5.7. Join Queries


This type of query is executed when the user wants to retrieve data from multiple tables. To
support it, we developed a small extension that joins two or more tables. We selected three
tables: employee, department, and employee_on_education. The department and
employee_on_education tables are related to the employee table. So, whenever the user asks for
departments or for information about employees' education, the query is joined with the employee
table.

Figure 5. 6: Example of join query


To demonstrate a join query, suppose a user phrases the query as , and the system analyzes it and
identifies the table name and column names. In this example two column names are included: /name
and /department. The column department belongs to the department table and the column name
belongs to the employee table. According to RULE#17, if the column names belong to different
tables, the tables are joined on their common column attribute using the keyword INNER JOIN. So,
according to this rule, the query is converted into SELECT NAME, DEP_NAME FROM department INNER
JOIN employee ON employee.DEP_ID = department.DEP_ID;. This query is sent to the database, which
retrieves the data and displays it in the table shown in figure 5.6. The paraphrases of this type
of query look like: , , , , etc.
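
The join rule can be illustrated with a small runnable sketch. It assumes an in-memory SQLite database in place of the database server used in this study, a reduced version of the employee and department tables, and made-up sample rows; `COLUMN_OWNER` and `build_join_query` are hypothetical names introduced only for this example.

```python
import sqlite3

# Which table owns each column (a subset of the employee and department tables).
COLUMN_OWNER = {"NAME": "employee", "SALARY": "employee", "DEP_NAME": "department"}


def build_join_query(columns):
    """Join employee and department when the columns come from both (RULE#17)."""
    owners = {COLUMN_OWNER[column] for column in columns}
    column_list = ", ".join(columns)
    if owners == {"employee", "department"}:
        return (f"SELECT {column_list} FROM department INNER JOIN employee "
                "ON employee.DEP_ID = department.DEP_ID;")
    (table,) = owners  # all columns belong to a single table
    return f"SELECT {column_list} FROM {table};"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (DEP_ID TEXT PRIMARY KEY, DEP_NAME TEXT);
    CREATE TABLE employee (EMP_ID TEXT PRIMARY KEY, NAME TEXT, SALARY TEXT, DEP_ID TEXT);
    INSERT INTO department VALUES ('D1', 'Mathematics');
    INSERT INTO employee VALUES ('E1', 'Sample Name', '5000', 'D1');
""")

query = build_join_query(["NAME", "DEP_NAME"])
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

When every requested column belongs to one table, the same function falls back to a plain selection without a join.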

5.8. Aggregate Function


These types of queries arise when the user wants to compute a value from the retrieved results or
to compare the results.

Figure 5. 7: Example of Aggregate Function

For example, a user who wants the total number of employees and the average salary of those who
work in the mathematics department, restricted to employees whose name starts with the letter ,
can phrase the query as . According to RULE#18, if the sentence contains the keyword /, the query
includes LIKE in the WHERE condition. In the same fashion, by RULE#10 and RULE#11, if the
sentence contains the word , the query includes the COUNT() function, and if the keyword is  the
query includes the AVG() function. Therefore, according to RULE#18, RULE#10, and RULE#11 the
query is converted into SELECT count(*) AS TOTAL, AVG(SALARY) AS AVGSALARY FROM department INNER
JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE NAME Like "%" AND DEP_NAME = ''; and
the result of the converted query is displayed on the user interface as shown in figure 5.7. The
query can be written and paraphrased as follows: , , , , , , etc.
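
The mapping from aggregate keywords to SQL functions (RULE#10, RULE#11) and the LIKE condition (RULE#18) can be sketched as follows. The example is an assumption-laden illustration rather than the implementation used in this study: it runs on an in-memory SQLite database, stores SALARY as an integer for simplicity, uses made-up sample rows, and the names `AGGREGATE_KEYWORDS` and `build_aggregate_query` are hypothetical.

```python
import sqlite3

# Hypothetical keyword mapping; English placeholders stand in for Amharic keywords.
AGGREGATE_KEYWORDS = {
    "total": "count(*) AS TOTAL",            # RULE#10
    "average": "AVG(SALARY) AS AVGSALARY",   # RULE#11
}
JOIN_CLAUSE = ("FROM department INNER JOIN employee "
               "ON employee.DEP_ID = department.DEP_ID")


def build_aggregate_query(keywords, use_prefix_filter):
    """Build an aggregate query; the values are bound later as parameters."""
    select_list = ", ".join(AGGREGATE_KEYWORDS[k] for k in keywords)
    conditions = []
    if use_prefix_filter:
        conditions.append("NAME LIKE ?")     # RULE#18
    conditions.append("DEP_NAME = ?")
    return f"SELECT {select_list} {JOIN_CLAUSE} WHERE {' AND '.join(conditions)};"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (DEP_ID TEXT PRIMARY KEY, DEP_NAME TEXT);
    CREATE TABLE employee (EMP_ID TEXT PRIMARY KEY, NAME TEXT, SALARY INTEGER, DEP_ID TEXT);
    INSERT INTO department VALUES ('D1', 'Mathematics'), ('D2', 'Physics');
    INSERT INTO employee VALUES
        ('E1', 'Abebe', 4000, 'D1'),
        ('E2', 'Almaz', 6000, 'D1'),
        ('E3', 'Bekele', 7000, 'D1'),
        ('E4', 'Aster', 9000, 'D2');
""")

query = build_aggregate_query(["total", "average"], use_prefix_filter=True)
row = conn.execute(query, ("A%", "Mathematics")).fetchone()
print(query)
print(row)
```

Binding the name prefix and department as parameters is a design choice of this sketch; the generated queries reported in this chapter inline the values as text.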

5.9. Grouping and Ordering Queries


This type of query arises when the users want to display the results in ascending or descending
order and/or display the results in groups.

Figure 5. 8: Example of Aggregate Function


For example, the query given to the system is . In this example the table name /employee and the
column names /name, /salary, and /level are present. All the column names are found after the
table name, and according to RULE#3, if the column names are positioned after the table name,
they are used for selection. Therefore, no conditional columns exist in the example presented in
figure 5.8. Next, the system identifies the keyword  and, according to RULE#19, if the sentence
contains the keyword , the query includes an ordering clause. Finally, the query becomes SELECT
NAME, SALARY, LEVEL FROM Employee Order by SALARY;. The query can be written and paraphrased as
follows: , , etc.

5.10. Evaluation of the System


We collected sample queries from novice users who have no expert knowledge of databases, and we
evaluated our system along the following dimensions.
Language character recognition: the user can type the query without worrying about ambiguous
characters that have multiple forms, such as , , , , , , , , , , since the system understands
such characters in their various alternatives.
Relaxation of grammar: the system can also understand user queries constructed with various
expressions or structured in various grammatical forms. For instance, the system recognizes the
following two sentence forms of a single query:  or .

Language understanding: the user is expected to include the column name in the query. For
instance, a query  expresses that the sex is female, but our system would not understand it, and
the user should reframe the query as .

Simplification: the user can type the query without complex expressions and without strong
knowledge of SQL.
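
The character recognition dimension can be illustrated with a short normalization sketch. The mapping below is an assumption made for illustration: it covers only the base forms of the well-known Amharic letter groups that share a sound, whereas a full table would cover every order of each letter, and it is not taken from the system's own dictionary. The names `HOMOPHONES` and `normalize` are hypothetical.

```python
# Amharic letters that share a sound; each variant is mapped to one canonical form.
# Only the base (first-order) forms are listed in this sketch.
HOMOPHONES = {
    "ሐ": "ሀ", "ኀ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ",
    "ፀ": "ጸ",
}
_TRANSLATION = str.maketrans(HOMOPHONES)


def normalize(text):
    """Replace variant characters so that alternative spellings compare equal."""
    return text.translate(_TRANSLATION)


print(normalize("ሠላም") == normalize("ሰላም"))
```

Normalizing both the user query and the dictionary entries before lookup is one way to make alternative spellings of the same word match.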
We collected 120 sample user queries from ordinary people who have no knowledge of database
languages. The accuracy of the system is measured as the percentage of correctly converted
queries, with each query response classified as either a Correct Query or an Incorrect Query.
5.10.1. Analysis of Results

5.10.1.1. Databases used for Evaluation

We used an academic employee database to validate the functionality of the developed prototype,
the ALIDB system. This database contains three tables: Employee, Department, and Employee on
education. Each table has its own columns. The structure of the tables is presented below:
COLUMN NAME | DATA TYPE | COLLATION | COMMENTS
EMP_ID | VARCHAR (10) | utf8_unicode_ci | Employee identification number
NAME | VARCHAR (30) | utf8_unicode_ci | Name of employee
SEX | VARCHAR (4) | utf8_unicode_ci | Sex of employee
FILD_STUDY | VARCHAR (50) | utf8_unicode_ci | Field of study
LEVEL | VARCHAR (15) | utf8_unicode_ci | Level of employee
DEP_ID | VARCHAR (10) | utf8_unicode_ci | Department identification number
HIRE_DATE | DATE | utf8_unicode_ci | Hire date of employee
SALARY | VARCHAR (7) | utf8_unicode_ci | Salary of employee
POSITION | VARCHAR (15) | utf8_unicode_ci | Position of employee

Table 5. 1: Employee table structure


COLUMN NAME | DATA TYPE | COLLATION | COMMENTS
DEP_ID | VARCHAR (10) | utf8_unicode_ci | Department identification number
DEP_NAME | VARCHAR (30) | utf8_unicode_ci | Name of department
COLLAGE | VARCHAR (30) | utf8_unicode_ci | Name of college

Table 5. 2: Department table structure


COLUMN NAME | DATA TYPE | COLLATION | COMMENTS
EMP_ID | VARCHAR (10) | utf8_unicode_ci | Employee identification number
UNIVERSITY | VARCHAR (30) | utf8_unicode_ci | University the employee is attending
FILED_OF_STUDING | VARCHAR (30) | utf8_unicode_ci | The employee's field of study
STARTING_YEAR | DATE | utf8_unicode_ci | The year the employee started studying
Table 5. 3: Employee on education structure
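
The three table structures can be expressed as a schema definition, sketched here with Python's built-in sqlite3 module. Several points are assumptions rather than facts from this study: SQLite replaces the original database server, the collation settings are omitted, the table name emp_on_edu is taken from the generated queries shown later in this chapter, and the primary and foreign keys are inferred from the column descriptions.

```python
import sqlite3

# Column names and types follow Tables 5.1 to 5.3; keys are assumptions.
SCHEMA = """
CREATE TABLE department (
    DEP_ID   VARCHAR(10) PRIMARY KEY,
    DEP_NAME VARCHAR(30),
    COLLAGE  VARCHAR(30)
);
CREATE TABLE employee (
    EMP_ID     VARCHAR(10) PRIMARY KEY,
    NAME       VARCHAR(30),
    SEX        VARCHAR(4),
    FILD_STUDY VARCHAR(50),
    LEVEL      VARCHAR(15),
    DEP_ID     VARCHAR(10) REFERENCES department (DEP_ID),
    HIRE_DATE  DATE,
    SALARY     VARCHAR(7),
    POSITION   VARCHAR(15)
);
CREATE TABLE emp_on_edu (
    EMP_ID           VARCHAR(10) REFERENCES employee (EMP_ID),
    UNIVERSITY       VARCHAR(30),
    FILED_OF_STUDING VARCHAR(30),
    STARTING_YEAR    DATE
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the column names SQLite recorded for the employee table.
employee_columns = [row[1] for row in conn.execute("PRAGMA table_info(employee)")]
print(employee_columns)
```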

5.10.1.2. Analysis Based on Question Category

To evaluate actual user query requests we divided them into four categories: Query for Selection,
Query for a Single Condition, Query for Multiple Conditions, and Query for Aggregate Function,
including Group by and Order by. Join queries are included in every category.

A. Selection (List) query


These queries select columns with all rows, without conditions. The converted query includes a
JOIN condition when the requested list includes data from different tables. The following table
presents the list of queries used for evaluation and the converted queries.
No. | User input query | System generated query | C/I
1 | | SELECT * FROM employee; | C
2 | | SELECT * FROM employee; | C
3 | | SELECT * FROM employee; | C
4 | | SELECT * FROM employee; | C
5 | | SELECT NAME, SEX, HIRE_DATE, LEVEL FROM employee; | C
6 | | SELECT NAME, SEX, HIRE_DATE, LEVEL FROM employee; | C
7 | | SELECT NAME, SEX, HIRE_DATE LEVEL FROM employee; | C
8 | | SELECT HIRE_DATE, NAME, SEX, LEVEL FROM employee; | C
9 | | SELECT LEVEL, HIRE_DATE, SEX, NAME FROM Employee; | C
10 | | SELECT NAME, HIRE_DATE, SALARY FROM Employee; | C
11 | | SELECT HIRE_DATE, NAME, SALARY FROM Employee; | C
12 | | SELECT NAME, SALARY, HIRE_DATE FROM Employee; | C
13 | | SELECT SALARY, NAME, HIRE_DATE FROM Employee; | C
14 | | SELECT SALARY, HIRE_DATE, NAME FROM Employee; | C
15 | | SELECT NAME, POSITION, SALARY FROM Employee; | C
16 | | SELECT NAME, POSITION, SALARY FROM Employee; | C
17 | | SELECT NAME, POSITION, SALARY FROM Employee; | C
18 | | SELECT NAME, POSITION, SALARY FROM Employee; | C
19 | | SELECT NAME, POSITION, SALARY FROM Employee; | C
20 | | SELECT POSITION, NAME, SALARY FROM Employee; | C
21 | | SELECT NAME, DEP_NAME, SALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | C
22 | | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | C
23 | | SELECT NAME, SEX, HIRE_DATE, DEP_NAME, LEVEL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | C
24 | | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | C
25 | | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | C
26 | | SELECT NAME, University, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID; | C
27 | | SELECT NAME, University, COLLAGE FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID; | C
28 | | SELECT NAME, SEX, HIRE_DATE, DEP_NAME, University, LEVEL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID; | C
29 | | SELECT NAME, POSITION, SALARY FROM Employee; | C
30 | | SELECT NAME, POSITION, SALARY FROM Employee; | C

Table 5. 4: List query request and results


Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 30 | 0 | 100%

Table 5. 5: Accuracy of List Query

[Bar chart: Correct vs. Incorrect for the select query category; vertical axis from 0 to 100]
Figure 5. 9: System performance of List Query


We gave the system 30 different sentences (queries) written in various forms, and the system
converted all of them correctly. As shown in table 5.5 and figure 5.9, the accuracy of the system
for list queries is 100%.
B. Single conditional queries

A single conditional query selects all columns or specific columns for specific rows. This
category handles only a single condition and has the form: SELECT [selection] FROM table_name
WHERE condition.
No. | User input query | System generated query | C/I
1 | | SELECT NAME, SALARY, LEVEL FROM Employee WHERE SEX = ''; | C
2 | | SELECT NAME FROM Employee WHERE SALARY >= ''; |
3 | 4500 | SELECT NAME, HIRE_DATE FROM Employee WHERE SALARY < '4500'; | C
4 | | SELECT NAME FROM Employee WHERE SALARY = ''; |
5 | | SELECT NAME, POSITION FROM Employee WHERE SALARY = ''; | C
6 | | SELECT NAME, POSITION FROM Employee WHERE SALARY = ''; | C
7 | | SELECT NAME, POSITION FROM Employee WHERE SALARY >= ''; | C
8 | | SELECT NAME FROM Employee WHERE POSITION = ''; |
9 | | SELECT NAME FROM Employee WHERE LEVEL = ' '; |
10 | | SELECT NAME FROM Employee WHERE LEVEL = ' '; |
11 | | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | I
12 | | SELECT NAME, EMP_ID FROM Employee WHERE SALARY > ''; | C
13 | | SELECT NAME FROM Employee WHERE LEVEL = ' '; |
14 | | SELECT NAME, SALARY FROM emp_on_edu INNER JOIN employee ON employee.EMP_ID = emp_on_edu.EMP_ID WHERE FILD_OF_STUDING = ''; | C
15 | | SELECT NAME, EMP_ID FROM Employee WHERE SALARY > ''; | C
16 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
17 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
18 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
19 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
20 | | SELECT POSITION, SALARY, NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
21 | | |
22 | | SELECT NAME, LEVEL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
23 | | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE LEVEL = ' '; | C
24 | | SELECT NAME, EMP_ID, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY > ''; | C
25 | | SELECT NAME, POSITION, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE COLLAGE = ''; | C
26 | | SELECT NAME, FILD_OF_STUDING FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE LEVEL_EDU = ' '; | C
27 | | SELECT University, FILD_OF_STUDING FROM emp_on_edu INNER JOIN employee ON employee.EMP_ID = emp_on_edu.EMP_ID WHERE LEVEL_EDU = ' '; | C
28 | | SELECT * FROM Employee WHERE NAME = ' '; | C
29 | | SELECT LEVEL, SALARY FROM Employee WHERE NAME = ' '; | C
30 | | SELECT * FROM Employee WHERE NAME = ' '; | C
Table 5. 6: Single conditional query and results

Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 28 | 2 | 93.33%

Table 5. 7: Accuracy of single conditional query

[Bar chart: Correct vs. Incorrect for the single query category; vertical axis from 0 to 100]
Figure 5. 10: System performance of single condition Query

As the single conditional query performance test in figure 5.10 shows, the accuracy of the system
for this type of query is 93.33%. Except for a small number of queries, only two in this case,
the system converted the queries accurately. This suggests that the prototype can convert and
execute user queries as requested in most cases.

C. Composite (multiple) conditional queries

No. | User input query | System generated query | C/I
1 | | SELECT NAME, SEX, LEVEL, SALARY FROM Employee WHERE FIELD_STUDY = ' ' AND LEVEL = ' ' AND SEX = ''; | C
2 | | SELECT * FROM Employee WHERE SALARY > '' AND SEX = ''; | C
3 | | SELECT * FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY > '' AND DEP_NAME = ' '; | C
4 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE LEVEL = ' ' AND DEP_NAME = ' '; | C
5 | | SELECT NAME, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE University = ' ' AND DEP_NAME = ''; | C
6 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE LEVEL = ' ' AND DEP_NAME = ''; | C
7 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE LEVEL = ' ' AND DEP_NAME = ''; | C
8 | | SELECT NAME FROM Employee WHERE LEVEL = ' ' AND SEX = ''; | C
9 | | SELECT NAME FROM Employee WHERE LEVEL = ' ' AND SEX = ''; | C
10 | 4500 | SELECT NAME FROM Employee WHERE SALARY < '4500' AND SEX = ''; | C
11 | 4500 | SELECT NAME FROM Employee WHERE LEVEL = ' ' AND SALARY < '4500'; | C
12 | 4500 | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY > '4500' AND DEP_NAME = '' AND SEX = ''; | C
13 | | SELECT NAME, SEX, HIRE_DATE, LEVEL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE POSITION null 'null' AND DEP_NAME = ''; | I
14 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | I
15 | | SELECT NAME, HIRE_DATE FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE LEVEL_EDU = ' ' AND COLLAGE = ' '; | C
16 | | SELECT NAME, SEX, HIRE_DATE, LEVEL FROM Employee WHERE POSITION = '' AND SEX = ''; | C
17 | | SELECT NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE LEVEL = ' ' AND LEVEL_EDU = ' ' AND COLLAGE = ' '; | C
18 | | SELECT * FROM Employee WHERE NAME = ' ' AND SEX = ''; | C
19 | | SELECT NAME, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY < '' AND COLLAGE = ' '; | C
20 | 4500 | SELECT NAME, EMP_ID FROM Employee WHERE SALARY > '4500'; | I
21 | 5000 | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY >= '5000' AND DEP_NAME = ' '; | C
22 | | SELECT NAME FROM Employee WHERE SEX != '' AND LEVEL = ' '; | C
23 | | SELECT NAME, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY < '' AND SEX != '' AND COLLAGE = ' '; | C
24 | | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY <= '' AND DEP_NAME = ''; | C
25 | 5000 | SELECT NAME, DEP_NAME, SALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY >= '5000' AND DEP_NAME = ''; | C
26 | | SELECT NAME, University FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE SEX = '' AND LEVEL_EDU = '' AND DEP_NAME = ' '; | I
27 | 5000 | SELECT NAME, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY > '5000' AND COLLAGE = ' '; | C
28 | 5000 | SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY <= '5000' AND SEX = '' AND DEP_NAME = ''; | C
29 | | SELECT NAME, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY >= '' AND COLLAGE = ' '; | C
30 | 5000 | SELECT NAME, SEX FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE SALARY <= '5000' AND COLLAGE = ' '; | I
Table 5. 8: Multiple conditional query and results
Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 25 | 5 | 83.33%

Table 5. 9: Accuracy of multiple condition queries

[Bar chart: Correct vs. Incorrect for the multiple condition query category; vertical axis from 0 to 90]
Figure 5.11: System performance of multiple conditions


Besides evaluating how the ANLIDB system handles simple query forms such as list and single
condition queries, we also measured its ability to accept queries with multiple conditions and
generate the corresponding SQL. Of the 30 queries given to the system, it generated correct SQL
for 25 (83.33%) and produced incorrect output for the remaining 5 (16.67%). Compared with list
queries (100%), single condition queries (93.33%), and aggregate function queries (86.67%), this
category has the lowest accuracy, 83.33%.
D. Aggregate function queries, including grouping and ordering

No. | User input query | System generated query | C/I
1 | | SELECT count(*) AS TOTAL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ' '; | C
2 | | SELECT count(*) AS TOTAL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ' ' AND SEX = ''; | C
3 | | SELECT AVG(SALARY) AS AVGSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
4 | | SELECT MAX(SALARY) AS MAXIMUMSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
5 | | SELECT MIN(SALARY) AS MINIMUMSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
6 | | SELECT NAME FROM Employee WHERE NAME Like "%"; | C
7 | | SELECT * FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = '' Order by SALARY; | C
8 | | SELECT * FROM Employee WHERE SALARY BETWEEN '' and ''; | C
9 | | SELECT count(*) AS TOTAL FROM Employee WHERE SALARY > ''; | C
10 | | SELECT count(*) AS TOTAL FROM Employee; | C
11 | | SELECT count(*) AS TOTAL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
12 | | SELECT MIN(SALARY) AS MINIMUMSALARY FROM Employee WHERE FIELD_STUDY = ''; | C
13 | | SELECT SUM(SALARY) AS TOTALSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
14 | | SELECT SUM(SALARY) AS TOTALSALARY FROM Employee; | C
15 | | SELECT SUM(SALARY) AS TOTALSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE COLLAGE = ''; | C
16 | 5770 | SELECT count(*) AS TOTAL FROM Employee WHERE SALARY < 5770 AND SEX = ; | C
17 | | SELECT count(*) AS TOTAL, AVG(SALARY) AS AVGSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
18 | | SELECT count(*) AS TOTAL, AVG(SALARY) AS AVGSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
19 | | SELECT count(*) AS TOTAL, AVG(SALARY) AS AVGSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE NAME Like '%' AND DEP_NAME = ''; | C
20 | | SELECT count(*) AS TOTAL FROM Employee; | I
21 | | SELECT count(*) AS TOTAL FROM Employee WHERE SEX = ''; | C
22 | | SELECT MIN(SALARY) AS MINIMUMSALARY FROM Employee; | I
23 | | SELECT count(*) AS TOTAL, MIN(SALARY) AS MINIMUMSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = ''; | C
24 | | SELECT count(*) AS TOTAL, MIN(SALARY) AS MINIMUMSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID WHERE DEP_NAME = '' Group by SEX; | C
25 | | SELECT * FROM Employee WHERE NAME Like '%'; | C
26 | | SELECT count(*) AS TOTAL FROM Employee WHERE NAME Like '%'; | C
27 | | SELECT count(*) AS TOTAL FROM Employee WHERE NAME Like '%'; | C
28 | | SELECT count(*) AS TOTAL, AVG(SALARY) AVGSALARY, MAX(SALARY) AS MAXIMUMSALARY FROM Employee WHERE NAME Like '%'; | C
29 | | SELECT SUM(SALARY) AS TOTALSALARY FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID; | I
30 | | SELECT count(*) AS TOTAL FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID INNER JOIN emp_on_edu ON emp_on_edu.EMP_ID = employee.EMP_ID WHERE LEVEL_EDU = '' AND DEP_NAME = ' '; | I

Table 5. 10: Aggregate function query and results


Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 26 | 4 | 86.67%


Table 5. 11: Accuracy of aggregate function

[Bar chart: Correct vs. Incorrect for the aggregate function category; vertical axis from 0 to 90]
Figure 5.12: System performance of aggregate function
Similarly, for this type of query, 30 user queries were presented as input to the system. The
system converted 26 of them correctly, an accuracy of 86.67%, and converted the remaining 4
incorrectly. Based on this evaluation, the system handles queries in the aggregate function
category well, apart from these few inaccuracies.

5.10.1.3. Overall Measurement

Besides evaluating the accuracy of the system separately for list, single condition, multiple
condition, and aggregate function queries, we also measured the overall accuracy of the system
across all forms of queries. The overall accuracy is the total number of queries converted
correctly by the system divided by the total number of input queries. As shown in figure 5.13,
the system has an overall accuracy of 91%, which indicates that it converts the large majority of
user queries correctly.

Overall accuracy of correct queries = 109 / 120 ≈ 0.91 = 91%

[Pie chart: Total measurement of our system; Correct vs. Incorrect]
Figure 5.13: Overall system performances
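
The per-category and overall accuracy figures can be re-derived from the counts of correct queries reported in Tables 5.5, 5.7, 5.9, and 5.11. The short sketch below only repeats that arithmetic; the names `RESULTS` and `accuracy` are introduced here for illustration.

```python
# (correct queries, total queries) per category, as reported in the accuracy tables.
RESULTS = {
    "List": (30, 30),
    "Single condition": (28, 30),
    "Multiple condition": (25, 30),
    "Aggregate function": (26, 30),
}


def accuracy(correct, total):
    """Percentage of correctly converted queries, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


per_category = {name: accuracy(c, t) for name, (c, t) in RESULTS.items()}
total_correct = sum(c for c, _ in RESULTS.values())
total_queries = sum(t for _, t in RESULTS.values())
overall = accuracy(total_correct, total_queries)

for name, value in per_category.items():
    print(f"{name}: {value}%")
print(f"Overall: {total_correct}/{total_queries} = {overall}%")
```

The unrounded overall value is 109/120, about 90.83%, which corresponds to the 91% reported above.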


Discussion
Compared with previous NLIDB systems, our ANLIDB system shares some similarities but also
incorporates features of its own. The evaluation in the previous sections showed high accuracy
for all four individual query types as well as overall. This raises the question of what led to
this performance; to answer it, we describe the features incorporated in the design and
development of the system and the aspects that contributed to its performance.
One similarity with previous NLIDB systems by other developers, such as Himani Jain [30],
Khalel Al-Rabbah and Safwan Shantnawi [31], Faraj et al. [32], and many others, is that it
provides a user interface to a database system in a natural language, which is Amharic in our
case. It addresses the difficulties of database users who have no knowledge of English, enabling
them to use Amharic for their queries and access to the database. Like other previously created
NLIDBs, including the above, our system also targets users who have no prior expert knowledge of
how to operate a database; in our case, any novice user can operate the database system whenever
they need to obtain data.
Another similarity is that, like the NLIDB system developed by Khaleel Al-Rababah and Safwan
Shatnawi [31], who proposed an Arabic Natural Language Interface to Database (ANLIDB), our ANLIDB
system implements an algorithm that maps and extracts phrases from a natural language query,
which in our case is phrased and entered in Amharic, and uses them to construct and execute
queries in SQL form. In addition, like other previous systems, we used a relational database.
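The phrase-mapping step described above can be illustrated with a minimal sketch. It is not the algorithm of this thesis: the lexicon, the table and column names, and the function `build_select` are hypothetical, and English glosses stand in for the Amharic words that a real lexicon would contain.

```python
import re

# Hypothetical lexicon mapping query words to schema elements.
# English glosses stand in for the Amharic entries of a real lexicon.
TABLE_WORDS = {"employees": "employee"}
COLUMN_WORDS = {"name": "name", "salary": "salary", "sex": "sex"}


def build_select(query: str) -> str:
    """Map known words in the query to a table and columns, then emit SQL."""
    tokens = re.findall(r"\w+", query.lower())
    table = next((TABLE_WORDS[t] for t in tokens if t in TABLE_WORDS), None)
    if table is None:
        raise ValueError("no known table word in query")
    # Keep the columns in the order the user mentioned them.
    columns = [COLUMN_WORDS[t] for t in tokens if t in COLUMN_WORDS]
    column_list = ", ".join(columns) if columns else "*"
    return f"SELECT {column_list} FROM {table}"
```

Under these assumptions, `build_select("show name and salary of employees")` yields `SELECT name, salary FROM employee`, and a query that names no column falls back to `SELECT *`.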
Turning to the features that make our ANLIDB system different from previous systems, it
incorporates unique features and aspects that gave rise to its high performance as well as its
simplicity and user-friendly nature. These structural and linguistic aspects are discussed as
follows.
To begin with, the syntactic and semantic features and the structural complexity of Amharic, the
natural language used in our NLIDB system, differ from those of other natural languages, such as
English, Arabic, Hindi and Urdu, on which previous NLIDB systems were developed. In designing
their systems, the various authors created algorithms that let the database interface understand
queries in the natural language and convert them into SQL queries. For the ANLIDB system we
designed, we likewise developed an algorithm that is specific to, and works only for, the natural
language of our database system, which is Amharic.
For instance, Himani Jain [30], in developing a Hindi Language Interface to Database, did not
fully investigate the structure of the system's natural language, Hindi, and did not create an
efficient algorithm for the design of the NLIDB. As a result, the Hindi NLIDB cannot handle
complex forms of queries such as multiple condition, aggregate function, and grouping and
ordering queries. It can execute only list and single conditional query types. This is not a
limitation of that system alone: other prior NLIDB systems, such as UNLIDB (Urdu Natural Language
Interface for Database) and PUNLIDB (Punjabi Natural Language Interface for Database), built by
Rashid Ahmad et al. [36] and Amardeep Kaur [1] respectively, also did not form algorithms for
their respective natural languages; hence their systems could execute only simple query forms. In
contrast to these and other previous NLIDB systems, we tried to overcome such limitations by
developing an algorithm for Amharic that did not exist previously and is new in its type. This
avoided difficulties that our system could have faced in understanding natural language queries
in Amharic as a result of character variations, punctuation, and other lexical and semantic
variations in the structure of the language. It also makes it simple for the database user to
communicate with the system without a language barrier. In addition, the algorithm strengthened
our NLIDB system's ability to generate and execute all the query forms we evaluated, irrespective
of their complexity: it can respond to, convert into SQL, and return data for queries in the
categories of list, single conditional, multiple conditional, and aggregate function with
grouping and ordering.
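As an illustration of the four query categories named above, the following sketch runs one SQL statement of each kind against a small in-memory SQLite database. The `employee` table, its columns and its rows are invented for the example; the statements show only the general shape of SQL that each category corresponds to, not necessarily the exact statements the ANLIDB system generates.

```python
import sqlite3

# Hypothetical employee table with invented rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (name TEXT, sex TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        ("Abebe", "male", "Computing", 9000),
        ("Sara", "female", "Computing", 12000),
        ("Kebede", "male", "Physics", 7000),
    ],
)

# One representative statement per query category.
QUERIES = {
    "list": "SELECT name FROM employee",
    "single_condition": "SELECT name FROM employee WHERE sex = 'female'",
    "multiple_condition": (
        "SELECT name FROM employee WHERE sex = 'male' AND salary > 8000"
    ),
    "aggregate_group_order": (
        "SELECT department, AVG(salary) FROM employee "
        "GROUP BY department ORDER BY department"
    ),
}

results = {kind: conn.execute(sql).fetchall() for kind, sql in QUERIES.items()}
for kind, rows in results.items():
    print(kind, rows)
```

The multiple condition and aggregate categories need more clauses (`AND`, `GROUP BY`, `ORDER BY`) than the list and single condition categories, which is one way to see why they are harder to derive from a free-form sentence.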


CHAPTER SIX
Conclusions, Recommendation and Future Works
6.1. Conclusions
In line with its aim and objectives, this study set out to develop an Amharic language interface
to a database system. The system targets users who have no knowledge or skill of database systems
and who do not have a good command of English. To address the difficulties of such users, we
designed and developed a user interface to a database system that enables users to express their
queries in a natural language, Amharic in our case, and to retrieve the corresponding data from
the database. To evaluate the system, we measured its performance using an academic employee
database and analyzed the validity and reliability of the NLIDB system we created.
The findings of the study, and the conclusions drawn from them, are as follows.

Evaluation of the proposed system revealed that novice users can operate the system
without difficulties. This is because the system can recognize various characters easily
and can understand and execute user queries with multiple forms of expression and
structure.

The system's accuracy for list (selection) queries was found to be 100%. Hence, it is
possible to conclude that the system has full functionality for selection (list) queries.

Despite a small decrease of 6.67 percentage points, the system is found to be highly
operational, with 93.33% accuracy, for single conditional queries.

Apart from a noticeable difference in performance compared to the previous two query
forms, our system has shown strong performance in generating aggregate function queries
as requested by the user, with 86.7% accuracy.

Compared to the other three query forms, multiple condition queries had the lowest
accuracy, 83.33%. The researchers attribute this decrease to query complexity: as queries
grow more complex and request data from several columns, the system may not process them
exactly as the user intends, because the columns are not always explicitly specified in
the query.

The overall analysis of the system revealed that it is highly efficient in generating and
executing user queries of all four categories. Its overall accuracy of correct queries is
found to be 91%. However, this also indicates room for future improvement to ensure the
system's full functionality.

It was noted that the strengths of the ANLIDB system we created are that it properly
executes user queries of all four types, is simple, easy to use and user friendly, and
reduces the difficulties users previously faced due to lack of SQL knowledge and of a
good command of English. These can be attributed to the unique features it incorporates,
notably the systematic algorithm developed for its natural language, Amharic.

However, this does not mean that our NLIDB system, and the way it is designed and
developed, is without drawbacks or limitations. One relates to data manipulation: the
system only lets users SELECT data and does not enable them to DELETE, UPDATE or INSERT
data in the database.

Another limitation is that whenever users want to access data belonging to a column, they
are expected to specify the column name, as the system will not understand their queries
otherwise. It has no feature for inferring the column from a value that appears in the
query. For instance, in a query that contains the Amharic word for "male", the word
indicates that the sex is male, but our system ignores it unless the query also names the
sex column explicitly.

Thirdly, the system has difficulty understanding date values expressed with general key
words, such as the Amharic words for "tomorrow", "yesterday" and "September", or a phrase
such as "on the 12th of September". This is because the researchers could not find any
previously developed parser, part-of-speech tagger or similar language tools for Amharic;
as a result, they could not incorporate pattern matching and similarity checking of date
values in the system. Hence users are expected to provide the complete value exactly as it
is stored in the corresponding column of the database. In addition, our system cannot
understand fuzzy terms such as the words for "much", "small", "bad", "good" and the like.

Lastly, our system is domain dependent: it was built for an employee database and cannot
be used for data outside the domain it was originally created for.
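The overall figure reported above can be checked against the per-type results with a short calculation. The sketch below assumes that each of the four query types carries equal weight (for example, because each was tested with the same number of queries); that weighting is an assumption made for this check, not something stated in the conclusions.

```python
# Per-type accuracy figures (percent) as reported in the conclusions above.
accuracy_by_type = {
    "list": 100.0,
    "single_condition": 93.33,
    "aggregate": 86.7,
    "multiple_condition": 83.33,
}

# Assumption: every query type contributes equally to the overall score.
overall = sum(accuracy_by_type.values()) / len(accuracy_by_type)
print(f"overall accuracy = {overall:.2f}%")  # prints overall accuracy = 90.84%
```

Under the equal-weight assumption the mean is about 90.84%, which rounds to the reported 91%.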

6.2. Recommendation and Future Research Works


Based on our findings and conclusions, as well as on the practical experience we gained during
the development of our NLIDB system, we offer the following advice and recommendations to others
who plan to design and develop NLIDBs in the future. We also indicate some areas in which our
system can be improved by addressing its current limitations and weaknesses.
Other researchers can build on the ANLIDB system by adding features that make it domain
independent, so that it can be used with databases in various domains such as agriculture,
banking, etc.

They can also extend our system with the remaining data manipulation operations, DELETE,
UPDATE and INSERT, in addition to the SELECT operation it already supports.

Thirdly, they may consider extending the ANLIDB system with further features, such as
automatic identification and comprehension of column values without the need for column
names to be specified in the user queries.


We also encourage other researchers in this area to develop parsers, POS taggers and
stemmers for the Amharic language. Such tools would make it easier to create new and more
capable algorithms than the one presented here, and so reduce the limitations our NLIDB
system has in understanding multiple and more complex values in natural language user
queries.
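One possible direction for the recommended automatic identification of column values is sketched below. It is an assumption-laden illustration, not part of the implemented system: the functions `build_value_index` and `infer_conditions`, the `employee` table and its rows are all hypothetical. The idea is to index the distinct values stored in each column so that a query word such as ወንድ ("male") can be linked to the sex column even when the column is not named.

```python
import sqlite3


def build_value_index(conn, table, columns):
    """Map each distinct stored value to the set of columns it appears in."""
    index = {}
    for column in columns:
        # Table and column names come from a trusted schema list, not user input.
        for (value,) in conn.execute(f"SELECT DISTINCT {column} FROM {table}"):
            index.setdefault(str(value), set()).add(column)
    return index


def infer_conditions(tokens, index):
    """Return (column, value) pairs for tokens that match exactly one column."""
    conditions = []
    for token in tokens:
        columns = index.get(token, set())
        if len(columns) == 1:  # skip unknown or ambiguous values
            conditions.append((next(iter(columns)), token))
    return conditions


# Hypothetical data: two employees with Amharic name and sex values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)", [("አበበ", "ወንድ"), ("ሳራ", "ሴት")]
)

index = build_value_index(conn, "employee", ["name", "sex"])
conditions = infer_conditions(["ወንድ", "ሰራተኞች"], index)
print(conditions)
```

A lookup of this kind only handles exact matches; handling inflected word forms would still require the stemmer and related tools recommended above.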


REFERENCE
[1] A. Kaur, "Punjabi Language Interface to Database," Thapar University, 2010.
[2] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch, "Natural Language Interfaces to
Databases - An Introduction," no. 709, pp. 1-50, 1995.
[3] A. Tamrakar and D. Dubey, "Query Optimisation using Natural Language Processing," Int. J.
Comput. Sci. Technol., vol. 3, no. 1, pp. 307-310, 2012.
[4] B. Manaris, "Natural Language Processing: A Human-Computer Interaction Perspective," vol. 47.
Academic Press, New York, pp. 1-55, 1998.
[5] G. Canfora and L. Cerulo, "A Taxonomy of Information Retrieval Models and Tools," J. Comput.
Inf. Technol., pp. 175-194, 2004.
[6] Y. Chandra, "Natural Language Interfaces to Databases," University of North Texas, 2006.
[7] J. Patel and J. Dave, "A Survey: Natural Language Interface to Databases," Int. J. Adv. Eng.
Res. Dev., 2015.
[8] D. Ramesh and S. K. Sanampudi, "Telugu Language Interface to Database," Int. J. Adv. Res.
Comput. Commun. Eng., vol. 2, no. 7, pp. 2903-2905, 2013.
[9] S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis, "Automated Ranking of Database Query
Results," in CIDR Conference, 2003.
[10] K. Murugan and T. Ravichandran, "Intelligent query processing in temporal database using
efficient context free grammar," Indian J. Sci. Technol., vol. 5, no. 6, pp. 2885-2890, 2012.
[11] A. Philpot, J. L. Ambite, and E. Hovy, "DGRC AskCal: Natural Language Question Answering
for Energy Time Series," 1999.
[12] Z. Zheng, "AnswerBus Question Answering System," in Proceedings of the Human Language
Technology Conference, 2002.
[13] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, A. J. Mcfarland, and B. Temelkuran,
"Omnibase: Uniform Access to Heterogeneous Data for Question Answering," in Proceedings of the
7th International Workshop on Applications of Natural Language to Information Systems (NLDB),
2002, no. June, pp. 1-5.
[14] N. T. Dang, D. Thi, and T. Tuyen, "Natural Language Question Answering Model Applied To
Document Retrieval System," World Acad. Sci. Eng. Technol., pp. 36-39, 2009.
[15] "No Title." [Online]. Available: http://www.isi.edu/natural-language/projects/TextMap/.
[16] "No Title." [Online]. Available: http://eagl.unige.ch/EAGLi/.
[17] "No Title." [Online]. Available: http://www.wolframalpha.com/index.html.
[18] J. C. Martin, Introduction to Languages and the Theory of Computation, 4th ed. McGraw-Hill
Companies, 2011.
[19] A. J. Agrawal and O. G. Kakde, "Semantic Analysis of Natural Language Queries Using Domain
Ontology for Information Access from Database," I.J. Intell. Syst. Appl., no. November,
pp. 81-90, 2013.
[20] R. Akerkar and M. Joshi, "Natural language interface using shallow parsing," Int. J.
Comput. Sci. Appl., vol. 5, no. 3, pp. 70-90, 2003.
[21] J. Weizenbaum, "ELIZA - A Computer Program for the Study of Natural Language Communication
between Man and Machine," Communications of the ACM, vol. 9, 1966.
[22] J. Chai, J. Lin, W. Zadrozny, Y. Ye, M. Budzikowska, V. Horvath, N. Kambhatla, and C. Wolf,
"Comparative Evaluation of a Natural Language Dialog Based System and a Menu Driven System for
Information Access: a Case Study," in Proceedings of the International Conference on Multimedia
Information Retrieval, 2000, no. April.
[23] "No Title." [Online]. Available: http://www.trueknowledge.com/.
[24] D. H. D. Warren and F. C. N. Pereira, "An Efficient Easily Adaptable System for
Interpreting Natural Language Queries," Am. J. Comput. Linguist., vol. 8, no. 3, 1982.
[25] D. L. Waltz, "An English Language Question Answering System for a Large Relational
Database," in Communications of the ACM, 1978, pp. 526-539.
[26] A. Kumar and K. S. Vaisla, "Natural Language Interface to Databases: Development
Techniques," Elixir Comp. Sci. Engg, no. November, 2015.
[27] A. R. Sontakke and P. A. Pimpalkar, "A Review Paper on Hindi Language Graphical User
Interface to Relational Database using NLP," Int. J. Adv. Res. Comput. Eng. Technol., vol. 3,
no. 10, pp. 3393-3397, 2014.
[28] N. Nihalani, S. Silakari, and M. Motwani, "Natural language Interface for Database: A Brief
review," IJCSI Int. J. Comput. Sci. Issues, vol. 8, no. 2, pp. 600-608, 2011.
[29] W. Woods and R. Kaplan, "Lunar rocks in natural English: Explorations in natural language
question answering," 1977.
[30] H. Jain, "Hindi Language Interface to Database," Thapar University, 2011.
[31] K. Al-Rababah and S. Shatnawi, "An Arabic Language Interface to Databases Using a
Morphologically-based Lexicon, Language Indicators and POS Tagging," Int. J. Multimed. Image
Process., vol. 2, no. June, pp. 87-95, 2012.
[32] F. A. El-Mouadib, Z. S. Zubi, A. A. Almagrous, and I. S. El-Feghi, "Generic Interactive
Natural Language Interface to Databases (GINLIDB)," Int. J. Comput., vol. 3, no. 3, 2009.
[33] G. G. Hendrix, E. D. Sacerdoti, D. Sagalowicz, and J. Slocum, "Developing a Natural
Language Interface to Complex Data," ACM Trans. Database Syst., vol. 3, no. 2, pp. 105-147,
1978.
[34] Ruwanpura, "SQ-HAL Natural language to SQL translator," Monash University, 2002.
[35] S. Agrawal, S. Chaudhuri, and G. Das, "DBXplorer: A System for Keyword-Based Search over
Relational Databases," in Proceedings of the 18th International Conference on Data Engineering,
2002.
[36] R. Ahmad, M. A. Khan, and R. Ali, "Efficient Transformation of a Natural Language Query to
SQL for Urdu," in Proceedings of the Conference on Language & Technology, 2009, pp. 53-60.
[37] A. K. Nguyen and P. H. Nguyen, "An Intelligent Natural Language Interface to Relational
Databases," in The 6th International Conference on Information Technology and Applications
(ICITA 2009), 2009, no. Icita, pp. 978-981.
[38] V. Boonjing and C. Hsu, "A New Feasible Approach to Natural Language Database Query," Int.
J. Artif. Intell. Tools, vol. 3590, no. 662, 2005.
[39] I. Androutsopoulos, G. Ritchie, and P. Thanisch, "Masque/sql - An Efficient and Portable
Natural Language Query Interface for Relational Databases," in Proc. of Sixth International
Conference on Industrial & Engineering Applications of AI & Expert System, 1993, pp. 1-7.
[40] H. S. B. Sujatha, Dr. S. Viswanadha, "A Novel Architecture for the Natural Language
Interface to Databases," Int. J. Comput. Architecture Mobil., vol. 1, no. 6, 2013.
[41] Y. Li, H. Yang, and H. V. Jagadish, "NaLIX: an Interactive Natural Language Interface for
Querying XML," in Proceedings of the SIGMOD, 2005.
[42] P. P. Filipe and N. J. Mamede, "Databases and Natural Language Interfaces," 1949,
pp. 1-11.
[43] R. Alexander and P. Rukshan, "Natural Language Web Interface for Database (NLWIDB)," in
Proceedings of the Third International Symposium, 2013, no. July, pp. 6-7.
[44] N. Stratica, "A Natural Language Processor for Querying CINDI," Concordia University,
2002.
[45] G. Rao, C. Agarwal, S. Chaudhry, and N. Kulkarni, "Natural Language Query Processing Using
Semantic," Int. J. Comput. Sci. Eng., vol. 02, no. 02, pp. 219-223, 2010.
[46] B. M. Hailemariam, "N-gram-Based Automatic Indexing for Amharic Text," Addis Ababa
University, 2002.
[47] D. G. Agonafer, "An Integrated Approach to Automatic Complex Sentence Parsing for Amharic
Text," Addis Ababa University, 2003.
[48] L. K. Lo, "Types of Writing Systems." [Online]. Available:
http://www.ancientscripts.com/ws_types.html.
[49] S. Hirpassa, "Designing an Information Extraction System for Amharic Vacancy Announcement
Text," Addis Ababa University, 2011.
[50] A. Tefera and Y. Assabie, "Automatic Construction of Amharic Semantic Networks From
Unstructured Text Using Amharic WordNet," 2010.
[51] A. K. T. Tegegnie, "Hierarchical Amharic News Text Classification," Addis Ababa
University, 2010.
[52] Z. Abebaw, "Phrase Based Amharic News Text Classification," Addis Ababa University, 2010.
[53] B. Alemu, "A Named Entity Recognition for Amharic," Addis Ababa University, 2013.
[54] A. H. Madessa, "Probabilistic Information Retrieval System for Amharic Language," Addis
Ababa University, 2012.
[55] F. T. Shebeshe, "Phrasal Translation for Amharic English Cross Language Information
Retrieval (CLIR)," Addis Ababa University, 2010.
[56] T. Bloor, "The Ethiopic Writing System: a Profile," J. Simpl. Spell. Soc., pp. 30-36, 1995.
[57] A. A. Argaw and L. Asker, "An Amharic Stemmer: Reducing Words to their Citation Forms," in
Proceedings of the 5th Workshop on Important Unresolved Matters, 2007, no. June, pp. 104-110.
[58] M. Bender, The Ethiopic Writing System. London: Oxford University Press, 1976.
[59] Y. Baye, . A: ...., 1987.
[60] S. Mekonnen, "Word Sense Disambiguation for Amharic Text: A Machine Learning Approach,"
Addis Ababa University, 2010.
[61] G. H., "The Problem of Amharic Writing System," 1976.
[62] S. T. Dumais, "Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval,"
pp. 1-19.
[63] S. Deerwester, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by Latent Semantic
Analysis."
[64] W. Stallings, Computer Organization & Architecture: Principles of Structure and Functions.
New York: Macmillan Publishing Company, 1993.
[65] W. Alemu, "The application of OCR Techniques to the Amharic Script," Addis Ababa
University, 1997.
[66] Z. Sintayehu, "Automatic Classification of Amharic News Items: The Case of Ethiopian News
Agency," Addis Ababa University, 2001.
[67] T. H. Gebermariam, "Amharic Text Retrieval: An Experiment Using Latent Semantic Indexing
(LSI) with Singular Value Decomposition (SVD)," Addis Ababa University, 2003.
[68] A. Kumar and K. S. Vaisla, "Hindi Language Interface to Database using Semantic Matching,"
An Int. Open Free Access, Peer Rev. Res. J., vol. 6, no. JUNE 2013, 2015.
[69] B. Demelash, "Linguistically Motivated Amharic IR (LM-IR)," Addis Ababa University, 2013.
[70] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, vol. 9. A division of
the Association for Computing Machinery, 1999.
[71] M. Tessema, "Design and implementation of Amharic Search Engine," Addis Ababa University,
2007.
[72] S. Yimam, "Amharic Question Answering For Factoid Questions," Addis Ababa University, 2009.
[73] M. Wordofa, "Semantic Indexing and Document Clustering for Amharic Information Retrieval,"
Addis Ababa University, 2013.
[74] S. Arora, K. Batra, and S. Singh, "Dialogue System: A Brief Review."

Appendix
Appendix I: The Amharic character set [58]
[Table of the Amharic characters arranged by order, 1st to 7th; the characters are not
reproduced here.]

Appendix II: Amharic Numbers

[Table of the Amharic numerals for 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100; the numerals are
not reproduced here.]

Appendix III: Sample source code
