
KHULNA UNIVERSITY OF ENGINEERING & TECHNOLOGY

COMPUTER SCIENCE & ENGINEERING


Report Submission Sheet
(No assessment will be accepted without this)

Student to Complete

Actual Hand-In Date: 22.06.2008

Student Names:
M. S. A. SHAHNAWAZ CHOWDHURY (ROLL-0507014)
S. M. ABU SALEH SHAWON (ROLL-0507016)

This submission is the result of our own work. Primary and secondary sources of
information and any contributions to the work by third parties, other than our tutors,
have been fully and properly attributed. Should this statement prove to be untrue, we
recognise the right and duty of the University to take appropriate action in keeping with
the regulations regarding candidates’ use of unfair means during assessment.

Academic Staff to Complete

Supervisor Name: RUSHDI SHAMS

Course Title: SOFTWARE DEVELOPMENT PROJECT-2
Course No: CSE-3100

Report Title: A TOOL FOR CORPUS ANALYSIS

Date Issued: 25.02.2008
Hand-In Date: 22.06.2008


Contents

Chapter Title

1. Acknowledgements

2. Introduction

3. Analysis

3.1 Corpus
3.1.1 Corpus requirement
3.1.2 Corpus creation
3.1.3 Type of corpus
3.1.4 Why corpus is needed
3.2 Corpus Analysis Tools
3.2.1 Analysis Tools
3.2.1.1 Design a parser
3.2.1.2 Save corpus into database
3.2.1.3 Search & Show any data
3.2.1.4 Representation of corpus as TREE
3.3 System requirements
4. Snapshots
5. Future plan
6. Conclusion
7. References

Chapter-1
Acknowledgements
With due homage and honor, we wish to express our gratitude to
Almighty Allah.
We offer our special thanks to RUSHDI SHAMS Sir (Lecturer,
Department of Computer Science & Engineering), who gave us the idea for this
project.
We acknowledge with reverence our indebtedness to our
honorable supervisor, RUSHDI SHAMS Sir (Lecturer, Department of Computer
Science & Engineering), for his friendly and excellent guidance.

We also express our gratitude to all our teachers, senior students and our
cordial batch mates for their invaluable support.

Chapter-2
Introduction
Corpus-based approaches to dialogue have become an increasingly
important part of dialogue agent design, providing a scope of the real issues that
need to be dealt with in order to engage in natural dialogue with humans, as well as
providing the basic data for statistical methods for language processing.
This paper describes the development of a multi-modal corpus based
on language interaction.
Our software is developed to analyze a corpus and represent it multimodally.
Here we use an XML file as the corpus and represent it.

Chapter-3

Analysis
3.1 Corpus:
A corpus is a large body of machine-readable texts. In linguistics and
lexicography, a corpus is a body of texts, utterances, or other specimens considered
more or less representative of a language, and usually stored as an electronic
database.
The main purpose of a corpus is to verify a hypothesis about language.
Corpora are ideal for functionally based analyses of language, but they have other
uses as well. Computer corpora may now store many millions of running words,
whose features can be analyzed by means of tagging and the use of concordance
programs.
3.1.1 Corpus requirement:

• Corpus creation
o Import of existing data
o Support of state of the art linguistic software
• Corpus analysis
o Linguistically relevant queries
o Generation of sub-corpora
• Corpus extension
o Simple corpus extension
o Revision mechanisms
• Corpus dissemination
o Within a working group and extensibility

3.1.2 Corpus creation:

• Electronic form: adaptation of material already in electronic form, scanning,
keyboarding.
• Permissions: copyright, safeguard against exploitation and piracy.
• Design: types and proportions of material in it (spoken and written language,
from formal to informal, from literary to ordinary, etc).
• Characteristics:
o quantity => large
o quality => authentic
o simplicity => plain text
o documented => document

3.1.3 Type of Corpus:

There are many types of corpus, such as:

• XML
• RDF
• TOPIC MAP
• CONCEPT MAP

3.1.4 Why Corpus is needed:

There are many reasons why a corpus is necessary. They are as follows:

• To verify a hypothesis about language.
• To make representative decisions.
• To study computational linguistics.
• To analyze a natural, knowledge-based topic.

3.2 Corpus Analysis Tools

We used the NetBeans 6.0 IDE to implement our software and MySQL Server as
the DBMS. We needed a connector JAR file (the JDBC driver) to connect the
program to the database, and we copied that connector file into the JDK library.
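
One way to confirm that the connector file is visible to the program is to try loading the driver class at start-up. The sketch below assumes MySQL Connector/J 5.x, whose driver class is com.mysql.jdbc.Driver; the class name DriverCheck and its method are illustrative names and are not taken from the project.

```java
// Illustrative helper (hypothetical name): reports whether a JDBC driver
// class can be loaded from the current classpath.
class DriverCheck {

    // Returns true if the named class can be loaded, false if it is missing.
    static boolean isDriverAvailable(String driverClassName) {
        try {
            Class.forName(driverClassName);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: Connector/J 5.x. Newer releases use com.mysql.cj.jdbc.Driver.
        String driver = "com.mysql.jdbc.Driver";
        System.out.println(driver + " available: " + isDriverAvailable(driver));
    }
}
```

If the check reports false, the connector JAR is probably not on the classpath that the program actually runs with.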

3.2.1 Analysis Tools:

Our first job was to parse the corpus (an XML file), so we had to design a
PARSER to do that. For that we first determined the TAGS of the XML file. After
getting the tags, we parse the corpus and save it to the database, using MySQL
as the DBMS. Then we can search and show any data within the corpus. After that
we represent the whole corpus in a TREE view. In short, our implementation is
divided into 4 parts:

1. Design a parser.
2. Save corpus into database.
3. Search & Show any data.
4. Representation of corpus as a TREE.

3.2.1.1 Design a parser:

To design a parser, we first implemented a program that collects the
tags of the corpus. These tags are then used to parse the corpus.
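
The tag-collection step could look roughly like the following sketch, which uses the JDK's SAX parser to record each distinct element name in order of first appearance. The class name TagCollector and the sample XML are made up for illustration; the real tag names depend on the corpus file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the distinct element names of an XML corpus.
class TagCollector extends DefaultHandler {
    // LinkedHashSet drops duplicates but keeps first-seen order.
    private final Set<String> tags = new LinkedHashSet<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        tags.add(qName); // qName is the tag name as written in the file
    }

    static List<String> collectTags(String xml) throws Exception {
        TagCollector handler = new TagCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return new ArrayList<String>(handler.tags);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample corpus.
        String xml = "<corpus><entry><word>tree</word></entry>"
                   + "<entry><word>leaf</word></entry></corpus>";
        System.out.println(collectTags(xml)); // prints [corpus, entry, word]
    }
}
```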

Efficient parsing of XML documents is more and more critical as XML gets
adopted more widely. It is very important to have an efficient way to
parse XML data, especially in applications that are intended to handle large
volumes. Improper parsing can result in excessive memory usage and processing
times that can hurt scalability. Several types of XML parsers are available.

An XML parser takes as input a raw serialized string and performs certain
operations on it. First it checks the syntactic well-formedness of
the XML data, making sure that the start tags have matching end tags and that there
are no overlapping elements. Most parsers also implement validation against the
Document Type Definition (DTD) or the XML Schema to verify that the structure
and content are as you specified. Finally, the parsing output provides access to the
content of the XML document via programmatic APIs.
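
The well-formedness check described above can be observed directly: the JDK's SAX parser raises a SAXParseException when tags do not match or elements overlap. The following is a minimal sketch under that assumption; the class name WellFormedCheck is hypothetical, and DTD or Schema validation is not covered here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns true when the XML text is well-formed.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // Raised for syntax problems such as a missing or overlapping end tag.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O failure, unrelated to the document's syntax.
            throw new IllegalStateException("could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b>x</b></a>")); // prints true
        System.out.println(isWellFormed("<a><b>x</a></b>")); // prints false (overlapping elements)
    }
}
```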

There are three popular XML parsing techniques for Java:

• Document Object Model (DOM), a mature standard from W3C
• Simple API for XML (SAX), the first widely adopted API for XML in Java
and a de facto standard
• Streaming API for XML (StAX), a promising new parsing model introduced
in JSR-173

Each of these techniques has benefits and drawbacks.
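
To make the trade-off a little more concrete, the sketch below counts the elements of the same document in two ways. DOM first loads the whole document into an in-memory tree, which is convenient but memory-hungry for large files. StAX lets the program pull one event at a time without building a tree. The class name ParserStyles and the sample XML are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical comparison of two parsing styles on the same task.
class ParserStyles {

    // DOM: the whole document becomes a tree in memory before we query it.
    static int countWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("*").getLength(); // "*" matches every element
    }

    // StAX: the application pulls events one by one; no tree is kept.
    static int countWithStax(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<corpus><entry/><entry/></corpus>"; // made-up sample
        System.out.println("DOM: " + countWithDom(xml) + ", StAX: " + countWithStax(xml));
        // prints DOM: 3, StAX: 3
    }
}
```

SAX sits between the two: like StAX it does not keep a tree, but the parser pushes events to a handler instead of the program pulling them.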

In this project we worked with the SAX parsing technique.
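
A SAX-based parser is written as a handler whose methods are called as the parser meets each start tag, text run, and end tag. The sketch below is one possible shape of such a handler, not the project's actual code: it records a (tag, text) pair for every element that directly contains text. The class name CorpusHandler and the sample XML are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler: collects {tag, text} pairs from a simple corpus.
// It assumes text appears only inside leaf elements (no mixed content).
class CorpusHandler extends DefaultHandler {
    final List<String[]> records = new ArrayList<String[]>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several pieces, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            records.add(new String[] {qName, value});
        }
        text.setLength(0);
    }

    static List<String[]> parse(String xml) throws Exception {
        CorpusHandler handler = new CorpusHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<corpus><entry><word>tree</word><sense>plant</sense></entry></corpus>";
        for (String[] record : parse(xml)) {
            System.out.println(record[0] + " = " + record[1]);
        }
        // prints "word = tree" and then "sense = plant"
    }
}
```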

3.2.1.2 Save corpus into database:

To save the corpus into the database, we first connect Java with MySQL. Then
we use SQL queries to save the corpus into the database. After the corpus has
been saved, a confirmation message is shown in the window.
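
As a rough illustration of this step, the sketch below inserts (tag, text) pairs through JDBC with a prepared statement. The table corpus_entry, its columns, the database name corpus_db, and the credentials are all placeholders invented for the example; the report does not give the real schema. It uses try-with-resources, which needs Java 7 or later.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical storage helper. Assumed table:
//   CREATE TABLE corpus_entry (tag VARCHAR(64), content TEXT)
class CorpusStore {
    static final String INSERT_SQL =
            "INSERT INTO corpus_entry (tag, content) VALUES (?, ?)";

    // Inserts every {tag, content} pair in one batch; returns how many were sent.
    static int save(Connection conn, List<String[]> records) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            for (String[] record : records) {
                stmt.setString(1, record[0]);
                stmt.setString(2, record[1]);
                stmt.addBatch();
            }
            return stmt.executeBatch().length;
        }
    }

    public static void main(String[] args) throws SQLException {
        List<String[]> records = new ArrayList<String[]>();
        records.add(new String[] {"word", "tree"});
        records.add(new String[] {"sense", "plant"});

        // Placeholder URL, user, and password.
        String url = "jdbc:mysql://localhost:3306/corpus_db";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            int saved = save(conn, records);
            // Stand-in for the confirmation message shown in the window.
            System.out.println(saved + " entries saved");
        }
    }
}
```

Using placeholders (?) instead of building the SQL string by concatenation keeps corpus text that contains quotes from breaking the query.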

3.2.1.3 Search & Show any data:

To search for data within the corpus, the user selects which content of the
corpus to look for. The matching data is then shown in a window.
To see all the data in the corpus, the user pushes a button. A window with
all the data is then shown.
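
Both actions can be served by one query method: a filtered SELECT when the user has chosen a tag, and an unfiltered one for the "show all" button. The sketch below assumes a hypothetical table corpus_entry(tag, content) and placeholder connection details; none of these names come from the report. It uses try-with-resources, which needs Java 7 or later.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical search helper over an assumed table corpus_entry(tag, content).
class CorpusSearch {
    static final String SELECT_ALL = "SELECT tag, content FROM corpus_entry";
    static final String SELECT_BY_TAG = SELECT_ALL + " WHERE tag = ?";

    // Pass a tag to filter, or null to fetch every row ("show all").
    static List<String> search(Connection conn, String tag) throws SQLException {
        String sql = (tag == null) ? SELECT_ALL : SELECT_BY_TAG;
        List<String> rows = new ArrayList<String>();
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            if (tag != null) {
                stmt.setString(1, tag);
            }
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getString("tag") + ": " + rs.getString("content"));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder URL, user, and password.
        String url = "jdbc:mysql://localhost:3306/corpus_db";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            for (String row : search(conn, "word")) {
                System.out.println(row); // the real tool shows these in a window
            }
        }
    }
}
```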

3.2.1.4 Representation of Corpus as a TREE:

Finally, we represent the whole corpus as a TREE. In the TREE, clicking a
head node shows its child nodes.
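
One plausible way to build such a view in Java is to turn the XML into Swing tree nodes while parsing and hand the root to a JTree, which then expands and collapses child nodes on click. The sketch below assumes Swing is the GUI toolkit (the report does not say) and uses a SAX handler with a stack of open elements. The class name CorpusTreeBuilder and the sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTree;
import javax.swing.SwingUtilities;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical builder: one tree node per element, one leaf node per text run.
class CorpusTreeBuilder extends DefaultHandler {
    private final Deque<DefaultMutableTreeNode> open = new ArrayDeque<DefaultMutableTreeNode>();
    private final StringBuilder text = new StringBuilder();
    private DefaultMutableTreeNode root;

    // Attaches any pending text as a leaf under the innermost open element.
    private void flushText() {
        String value = text.toString().trim();
        if (!value.isEmpty() && !open.isEmpty()) {
            open.peek().add(new DefaultMutableTreeNode(value));
        }
        text.setLength(0);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        flushText();
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(qName);
        if (open.isEmpty()) {
            root = node;            // first element is the head node
        } else {
            open.peek().add(node);  // otherwise it is a child of the open element
        }
        open.push(node);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        open.pop();
    }

    static DefaultMutableTreeNode build(String xml) throws Exception {
        CorpusTreeBuilder handler = new CorpusTreeBuilder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.root;
    }

    public static void main(String[] args) throws Exception {
        DefaultMutableTreeNode root = build(
                "<corpus><entry><word>tree</word></entry><entry><word>leaf</word></entry></corpus>");
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Corpus tree");
            frame.add(new JScrollPane(new JTree(root)));
            frame.setSize(300, 400);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}
```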

3.3 System Requirements

Operating Systems:

Supported operating systems are:

• Windows XP Professional or Home editions; all language versions.
• Microsoft Windows Millennium edition.
• Microsoft Windows 98 all editions.

Windows 95 or earlier is not supported.

Software Requirements:

• NetBeans IDE 6.0
• MySQL 5.0
• JDK 1.6

Snapshots

Future Plan
Our future plan is to make the “Text to knowledged mapping prototype”
software friendlier, easier and more comfortable for the user. It should offer
not only a theoretical but also an attractive representation.

Conclusion
The interface of the software is user friendly. The software runs on the
Windows XP operating system, which most computer users are comfortable
with. Our main target was to develop software that can really
help people make decisions or understand the subject of the corpus. Had we been
provided with sufficient technical help and enough time, we could have developed
this software more effectively. We look forward to improving our software to
make it truly platform independent.
