
1. INTRODUCTION
A similarity coefficient is a statistical measure of how similar two given documents
are. This project assesses similarity coefficients between research publications using
keyword matching, in which the search procedure starts by matching an input keyword
against the original document.
Searching can be difficult, so many search engines are available today to help
internet users. The most important factor in an effective search, however, is the
keyword: a well-chosen keyword leads the user to the related documents he requires.
A similarity coefficient method can better identify the similarity between keywords
and the words contained in the index when the characters of the words are compared.
Keyword search is the simplest and most popular query method for search engines in
information systems. A query contains a single keyword, multiple keywords, or a short
phrase. In a single-keyword search, the occurrences of that particular word in the
document are displayed.
The search process begins by taking the user's query and comparing it with the
database. If an input keyword matches the index of words in the database, the matching
words are treated as the main keywords displayed for that search. A keyword matching
system does not analyze the whole user input; it searches only for the words or
phrases defined in what the user enters.

System Architecture:

Fig:1 System Architecture

The Fig:1 shows the design of the system. The similarity check compares two
documents and also measures the similarity of the given keywords with the required
input file. When the user selects the option to compute the similarity with the source,
the system retrieves the data from the dataset, which is stored in a .csv file, runs
the code on the dataset, and generates the output as a .pdf file. A MySQL database is
used to store the research paper details.

1.1 EXISTING SYSTEM AND PROPOSED SYSTEM

1.1.1 Existing System:
The similarity of search words is measured by determining the association between
two words with the Jaccard coefficient, also known as the Jaccard index, which is
widely used for comparing similarity. A similarity measurement between keywords and
index terms is performed so that searchers can reach the required results promptly.
Jaccard's method measures the similarity between words by means of the Jaccard
coefficient. Technically, the system is responsible for document operations (creating
a document representation or index), query operations and representation, and
searching documents by comparing the similarity (Similarity Computation) of a keyword
and the document representations.
The result is a list of documents ranked (Ranking) by similarity and displayed to the
user. The Jaccard model therefore measures keyword similarity using the Jaccard
coefficient, which was developed to measure the similarity between documents.
Disadvantages of Existing System:
• It is restricted to relatively small datasets.
• It does not work well when users have very sparse or no data.
• If a large number of keywords must be searched, a huge amount of time is
consumed.
1.1.2 Proposed System:
To meet this need, we propose a system named Assessment of similarity coefficients
between research publications using keyword matching. Comparing documents such as
research papers based on their keywords is a challenging problem, because it requires
a similarity measure between two sets of keywords.
We present a new measure based on matching the words of two groups, assuming that
a similarity measure between two individual words is available. The proposed system
detects similarity keyword by keyword. The keyword is central to any searching
activity: the ultimate goal of information searching is to find the documents the user
requires that relate to the input keyword, and the search procedure starts by matching
the input keyword against the word index.
This project aims at comparing research publications by using keyword separation and
keyword matching techniques. In addition, the user can specify his own keywords
and/or phrases of interest for comparison.
The main logic of the system separates the keywords of the research publications and
compares them one by one using an iterative structure.
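The keyword separation and one-by-one comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's code: the function names `separate_keywords` and `compare_keywords` and the two sample texts are hypothetical.

```python
import re


def separate_keywords(text):
    """Split a text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compare_keywords(source_text, target_text):
    """Compare keywords one by one and return those found in both texts."""
    target_words = set(separate_keywords(target_text))
    common = []
    for word in separate_keywords(source_text):  # iterative comparison
        if word in target_words and word not in common:
            common.append(word)
    return common


# Hypothetical sample texts, for illustration only.
paper_a = "Keyword matching, similarity coefficients."
paper_b = "Similarity of documents by keyword search."
print(compare_keywords(paper_a, paper_b))
```

The result keeps the order in which the shared words first appear in the source text, which makes a report easier to read against the source paper.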
Advantages of Proposed System:
• It is useful to research scholars for finding the similarity between research
papers in their area of interest.
• It is useful to publishers for finding the similarity between newly proposed
research papers and existing papers.
• It ultimately helps to enhance the quality of research papers.
1.2. Requirements Analysis:
Ian Sommerville's book "Software Engineering" gives the following definition of a
requirement:
"The requirements for a system are the descriptions of the services provided
by the system and its operational constraints."
Before designing the similarity coefficient system, an analysis of the requirements of
the system to be built is carried out. The requirements are divided into two groups,
namely functional requirements and non-functional requirements.
1.2.1. Functional Requirements:
Functional requirements are statements of what the system should provide and
how the system should react to particular inputs.

• Manually reviewing copied passages: The user can compare the copied text in
real time using two windows.

• Adding contents and reviewing the list of contents in the database: The user can
add contents to the database and review the list of contents using a form and a table.

• Viewing the history of comparisons: The user can view the history of
comparisons.

1.2.2 Non-Functional requirements:
• Compatibility: The system should be compatible.

• Ease of use: The system should be easy to use.

• Users will interact with the system through a user-friendly graphical user
interface to generate the similarity report. The generated reports will contain a
textual representation of the results.

• Safety: The system should guarantee relevant safety. User passwords stored in the
database should be hashed.
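The password requirement above can be illustrated with a short sketch. The report does not say which hashing scheme the project uses, so the choice of PBKDF2-HMAC-SHA256 from Python's standard `hashlib` module, the iteration count, and the function names `hash_password` and `verify_password` are all assumptions made for this example.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each stored password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=200_000):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)


# Only the salt and the digest would be stored in the database, never the password.
salt, stored = hash_password("example-password")
print(verify_password("example-password", salt, stored))
print(verify_password("wrong-password", salt, stored))
```

Storing a per-user salt next to the digest means that two users with the same password still get different stored values.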
1.2.3. Hardware Requirements:
• A processor of at least 2.16 GHz.
• At least 4 GB of RAM.
• A 64-bit architecture.
• At least 500 GB of storage.
1.2.4. Software Requirements:
• A 64-bit Ubuntu operating system.
• Python Qt Designer for designing the user interface.
• MySQL server for storing the database entities.
• Pyuic for converting the designed user interface (UI) layout to Python code.
User Requirements:
The system for identifying similarity coefficients shall generate reports
showing the keyword similarity with the source file and the word similarities
between the source file and the target file.

System Requirements Specification:

1.1 The system shall generate a report showing the similarity of words
between two documents.
1.2 A report shall be generated showing the count of words repeated in
both documents.
1.3 A report shall be created showing the keyword similarity with the
source file in which the user specifies
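The second specification item, the count of words repeated in both documents, can be sketched with `collections.Counter`. This is an illustration only; the function name `repeated_word_counts` and the sample texts are hypothetical, and whitespace splitting is a simplifying assumption.

```python
from collections import Counter


def repeated_word_counts(source_text, target_text):
    """Map each word found in both texts to its (source, target) counts."""
    source_counts = Counter(source_text.lower().split())
    target_counts = Counter(target_text.lower().split())
    shared = source_counts.keys() & target_counts.keys()
    return {w: (source_counts[w], target_counts[w]) for w in sorted(shared)}


# Hypothetical sample texts, for illustration only.
source = "keyword search uses keyword matching"
target = "matching a keyword against an index"
print(repeated_word_counts(source, target))
```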

2. LITERATURE SURVEY
G. Varelas et al. [VGS05] and Lingling Meng et al. [5] worked on semantic similarity
measures using WordNet at different times. The former developed a similarity method
using WordNet and believed that it would perform even better on a larger collection,
since a larger number of indexed documents could help the retrieval of relevant
documents that match the expanded term model. The latter reviewed the principles,
features, advantages and disadvantages of various state-of-the-art semantic similarity
measures in WordNet based on the is-a relation, namely path-based measures,
information-content-based measures, feature-based measures and hybrid measures. They
stated that no measure performs well in absolute terms.
Sahami et al. [6] presented a method to measure the semantic similarity between two
queries using the small text snippets returned for those queries by a search engine.
This approach overcomes a problem of earlier approaches such as cosine similarity
when measuring the similarity between short texts. The proposed system contains a
kernel function to determine semantic similarity and a query expansion technique. For
each query, they collect snippets from a web search engine and represent each snippet
as a TF-IDF weighted term vector. Each vector is L2-normalized and the centroid of the
set of vectors is computed. The semantic similarity between two queries is then defined
as the inner product between the associated centroid vectors. They did not compare
their similarity measure with taxonomy-based similarity measures.
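The centroid-based query similarity summarized above (normalize each snippet vector, average the vectors, take the inner product of the two centroids) can be illustrated with a small sketch. It follows only the steps as summarized in this survey and is not the authors' implementation. The term weights below are hypothetical stand-ins for TF-IDF values, and sparse vectors are represented as plain dictionaries.

```python
import math


def l2_normalize(vector):
    """Scale a sparse term-weight vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else dict(vector)


def centroid(vectors):
    """Average a list of sparse vectors term by term."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {t: w / len(vectors) for t, w in total.items()}


def inner_product(u, v):
    """Dot product of two sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def query_similarity(snippets_a, snippets_b):
    """Inner product of the centroids of L2-normalized snippet vectors."""
    ca = centroid([l2_normalize(s) for s in snippets_a])
    cb = centroid([l2_normalize(s) for s in snippets_b])
    return inner_product(ca, cb)


# Hypothetical term weights standing in for TF-IDF vectors of snippets.
snippets_q1 = [{"jaguar": 3.0, "car": 4.0}, {"car": 1.0}]
snippets_q2 = [{"car": 2.0}, {"engine": 1.0}]
print(round(query_similarity(snippets_q1, snippets_q2), 3))
```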
Hung Chim and Xiaotie Deng [8] developed an approach to compute document
similarity. The main objective of their work was to derive a sentence-based document
similarity that calculates the pairwise similarities of documents based on the Suffix Tree
Document (STD) model. By mapping every node in the suffix tree of the STD model to a
unique feature term in the Vector Space Document (VSD) model, the sentence-based
document similarity naturally inherits the tf-idf term weighting scheme when calculating
document similarity with phrases.
The Jaccard index [1], also known as the Jaccard similarity coefficient (originally named
coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity
and diversity of sample sets. The Jaccard coefficient measures the similarity between sample
sets and is defined as the size of the intersection divided by the size of the union of the
sample sets:
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
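The definition of the Jaccard coefficient translates directly into a few lines of Python. The sketch below is illustrative: the function name `jaccard`, the sample keyword lists, and the convention of returning 1.0 for two empty sets are assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical keyword lists, for illustration only.
keywords_a = ["similarity", "keyword", "matching"]
keywords_b = ["keyword", "matching", "search", "index"]
print(jaccard(keywords_a, keywords_b))
```

In this example the two lists share two keywords out of five distinct keywords overall.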
Elias Iosif and Alexandros Potamianos [77] proposed web-oriented metrics that
compute the semantic similarity between words or terms and compared them with the state
of the art. The fundamental assumption is that similarity of context implies similarity of
meaning: related web documents are downloaded via a web search engine, and the
contextual information of the words of interest is compared. The proposed unsupervised
context-based similarity computation algorithms appear to be competitive with
state-of-the-art supervised semantic similarity algorithms that are based on
language-specific knowledge resources. They included a data pre-processing task to remove
unwanted terms, special characters and symbols before operation. Data pre-processing is an
important activity in knowledge-discovery-oriented operation, because it reduces search
time and the memory needed to store the information content. They evaluated their
proposed system using the Miller-Charles data set and a medical term data set.

3. SYSTEM DESIGN

3.1 Database:
A database is a collection of information that is organized so that it can be easily
accessed, managed and updated. Data is organized into rows, columns and tables. Each
database has one or more distinct APIs for creating, accessing, managing, searching and
replicating the data it holds.
Other kinds of data stores can be used, such as files on the file system or large hash
tables in memory but data fetching and writing would not be so fast and easy with those types
of systems. The database used is MySQL. MySQL is an open-source relational database
management system.
Terminology:
• Database: A database is a collection of tables with related data.
• Table: A database consists of one or more tables. Each table is made up of
rows and columns.
• Column: Each table has multiple columns, and each column has a unique
name. The columns of the table correspond to the attributes of an entity.
• Row: Each table has multiple rows. A row is a tuple, entry or record.
• Primary Key: A primary key is unique. It is a candidate key chosen by the
database designer as the principal means of identifying entities in the
entity set.
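The terms above can be made concrete with a small sketch. To keep the example self-contained it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL server. The table name `research_paper` and its column names are hypothetical, chosen to mirror the paper details (ID, title, authors, contact) that the application stores.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE research_paper (
        paper_id INTEGER PRIMARY KEY,  -- primary key: unique for every row
        title    TEXT NOT NULL,
        authors  TEXT NOT NULL,
        contact  TEXT
    )
    """
)
conn.execute(
    "INSERT INTO research_paper (paper_id, title, authors, contact) VALUES (?, ?, ?, ?)",
    (1, "Sample paper title", "A. Author", "author@example.com"),
)
conn.commit()

# A second row with the same primary key is rejected by the database.
try:
    conn.execute(
        "INSERT INTO research_paper (paper_id, title, authors) VALUES (?, ?, ?)",
        (1, "Another title", "B. Author"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute("SELECT paper_id, title FROM research_paper").fetchall()
print(rows, duplicate_rejected)
```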
3.2 UML DIAGRAMS:
The Unified Modelling Language allows the software engineer to express an analysis
model using a modelling notation that is governed by a set of syntactic, semantic and
pragmatic rules.
A UML system is represented using five different views that describe the system
from distinctly different perspectives. Each view is defined by a set of diagrams, as
follows.

• User Model View:
This view represents the system from the user's perspective. The analysis
representation describes a usage scenario from the end user's perspective.

• Structural Model View:
In this view the data and functionality are viewed from inside the system. This view
models the static structures.

• Behavioral Model View:
It represents the dynamic behavior of the parts of the system, depicting the
interactions or collaborations between the various structural elements described in the
user model and structural model views.

• Implementation Model View:
In this view the structural and behavioral parts of the system are represented as
they are to be built.

The UML diagrams are as follows:
i. Use case Diagram
ii. Sequence Diagram
iii. Activity Diagram
iv. Component Diagram
v. Deployment Diagram

Structural Diagrams:
The UML's four structural diagrams exist to visualize, specify,
construct, and document the static aspects of a system. You can think of the
static aspects of a system as representing its relatively stable skeleton and
scaffolding. Just as the static aspects of a house encompass the existence
and placement of such things as walls, doors, windows, pipes, wires, and
vents, so too do the static aspects of a software system encompass the
existence and placement of such things as classes, interfaces,
collaborations. The UML's structural diagrams are roughly organized
around the major groups of things you'll find when modeling a system.
1. Class diagram
2. Object diagram
3. Component diagram

4. Deployment diagram.
Structural model represents the framework for the system and this
framework is the place where all other components exist. Hence, the class
diagram, component diagram and deployment diagrams are part of
structural modeling. They all represent the elements and the mechanism to
assemble them.
The structural model never describes the dynamic behavior of the
system. Class diagram is the most widely used structural diagram.
Behavioral Diagrams
The UML's five behavioral diagrams are used to visualize, specify,
construct, and document the dynamic aspects of a system. You can think of
the dynamic aspects of a system as representing its changing parts. Just as
the dynamic aspects of a house encompass airflow and traffic through the
rooms of a house, so too do the dynamic aspects of a software system
encompass such things as the flow of messages over time and the physical
movement of components across a network.
The UML's behavioral diagrams are roughly organized around the
major ways you can model the dynamics of a system.
1. Use case diagram - Organizes the behaviors of the system.
2. Sequence diagram - Focused on the time ordering of messages.
3. Collaboration diagram - Focused on the structural organization of
objects that send and receive messages.
4. Statechart diagram - Focused on the changing state of a system
driven by events.
5. Activity diagram - Focused on the flow of control from activity
to activity.

3.2.1 Use Case Diagram:


A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a Use-case
analysis. It shows a set of use cases and actors and their relationships.

The purpose of a use case diagram is to capture the dynamic aspect of a
system. Use case diagrams are used to gather the requirements of a system,
including internal and external influences. Hence, when a system is analyzed to
gather its functionalities, use cases are prepared and actors are identified. When
this initial task is complete, use case diagrams are modelled to present the
outside view. These diagrams are especially important in organizing and
modelling the behavior of a system.
Use case diagrams are used for high-level requirement analysis of a system.
When the requirements of a system are analysed, the functionalities are
captured in use cases. Use cases are the system functionalities written in an
organized manner. The second element relevant to use cases is the actors. An
actor is something that interacts with the system: a human user, an internal
application, or an external application.

Fig:3.2.1 Use case Diagram

The Fig:3.2.1 shows the use case diagram. It has one actor, the user. The user can give
input data such as research paper details and the keywords that he wants to
compare with the document.

3.2.2 Sequence Diagram:
A sequence diagram in the Unified Modeling Language (UML) is a kind of
interaction diagram that shows how processes operate with one another and in
what order. It emphasizes the time-ordering of messages.

Fig:3.2.2 Sequence Diagram

The Fig:3.2.2 shows the sequence diagram. The user inputs data into the system; the
data is stored in the database and the system sends an acknowledgement. When the user
clicks the similarity with source button, the system provides the similarity results
between the source file and the target file. When the user then checks the keyword
similarity with the source file, the system generates the output in the form of a
.csv file.

3.2.3 Activity Diagram:
An activity diagram captures the dynamic behaviour of the system. It is used to
show the flow of control from one activity to another, where an activity is a
particular operation of the system. Activity diagrams are not only used for
visualizing the dynamic nature of a system, but also to construct the executable
system by using forward and reverse engineering techniques. The only thing
missing from the activity diagram is the message part: it does not show any
message flow from one activity to another. An activity diagram is sometimes
considered a flowchart. Although the diagrams look like flowcharts, they are not;
they show different flows such as parallel, branched, concurrent, and single.
The purpose of an activity diagram can be described as:
• Draw the activity flow of a system.
• Describe the sequence from one activity to another.
• Describe the parallel, branched and concurrent flow of the system.

Fig:3.2.3 Activity Diagram

3.2.4 Component Diagram:
A component diagram is a special kind of UML diagram whose purpose differs from that
of the diagrams discussed so far. It does not describe the functionality of the
system; it describes the components used to provide that functionality. From that
point of view, component diagrams are used to visualize the physical components in a
system, such as libraries, packages and files. Component diagrams can also be
described as a static implementation view of a system. A static implementation
represents the organization of the components at a particular moment.
A single component diagram cannot represent the entire system; a collection of
diagrams is used to represent the whole.
The purpose of the component diagram can be summarized as:
• Visualize the components of a system.
• Construct executables by using forward and reverse engineering.
• Describe the organization and relationships of the components.
This diagram is important because the application cannot be implemented efficiently
without it. A well-prepared component diagram also matters for other aspects such
as application performance and maintenance.

Fig: 3.2.4 Component Diagram

The Fig:3.2.4 shows the components of our application. Components such as the list
and details components communicate with the server, and the server communicates
with the database and extracts the data on behalf of the user.

3.2.5 Deployment Diagram:
The term deployment itself describes the purpose of the diagram. Deployment diagrams
describe the hardware components on which software components are deployed.
Component diagrams and deployment diagrams are closely related: component diagrams
describe the components, and deployment diagrams show how they are deployed on
hardware. UML is mainly designed to focus on the software artefacts of a system.

The purpose of deployment diagrams can be described as:

• Visualize the hardware topology of a system.
• Describe the hardware components used to deploy software components.
• Describe the runtime processing nodes.

Fig:3.2.5 Deployment Diagram

The Fig:3.2.5 describes the whole system, in which the application, the server and the
database are connected to each other and pass requests and responses between them.

The process is:

1) The application sends a request to the server.
2) The server accepts the request and passes it to the database.
3) The database receives the request and returns a response.
4) The server receives the response from the database.
5) The server sends the response back to the client.

4. IMPLEMENTATION
4.1 Module Description:
A similarity coefficient is a statistical measure of how similar two given documents
are. This project aims at comparing research publications by using keyword separation
and keyword matching techniques. In addition, the user can specify his own keywords
and/or phrases of interest for comparison through the front-end graphical user
interfaces. The GUI screens in this project are created with a Python tool called
PyQt.

MODULES:
The project has three modules.

1. KPI (Keywords & Phrases Input):
This module is used to provide the keywords and phrases that serve as input to the
system.
2. KPD (Keywords & Phrases Detection):
This module analyzes and compares the two input files for similarities; it also
determines the keyword similarities through keyword and phrase detection.
3. Report:
This module reports the matched keywords and phrases along with their statistics.
The project uses the ReportLab tool to generate the output report in .pdf form.
The report contains the keywords and phrases that occur in both documents.

4.2 Algorithm:

Pandas is an open-source, BSD-licensed Python library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming language.
Algorithm:
import csv

import pandas as pd  # imported in the original listing; not used below

# Common words that are skipped when comparing the two papers.
STOP_WORDS = {'a', 'is', 'of', 'in', 'the', 'that', 'to', 'without', 'on',
              'with', 'data', 'being', 'give', 'was', 'by', 'from', 'and',
              'can', 'or', 'such'}


def read_text(path):
    # Read a paper; newlines become spaces so that words on adjacent lines
    # are not joined, then commas and semicolons are removed.
    with open(path, 'r') as myfile:
        text = myfile.read().replace('\n', ' ')
    return text.replace(',', '').replace(';', '')


text = read_text('paper1')
text2 = read_text('paper2')
data1 = text.split()  # split string into a list of words

with open('occu1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    list1 = []  # words already written to the report
    for temp in data1:
        if temp in STOP_WORDS or temp in list1:
            continue
        # str.count counts substring occurrences, as in the original listing
        if text2.count(temp) > 1:
            list1.append(temp)
            # word, count in paper2, count in paper1, ratio in percent
            writer.writerow([temp, text2.count(temp), text.count(temp),
                             float(text.count(temp)) / float(text2.count(temp)) * 100.0])

print("Similarity file occu1 is generated")

4.3 Language Used:
Python Language:
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance.

4.4 Tools Used:


The GUI screens in this project are created with a Python tool called PyQt.
PyQt is a Python binding of the cross-platform GUI toolkit Qt, implemented as a Python
plug-in. PyQt is developed by the British firm Riverbank Computing. It supports
Microsoft Windows as well as various flavors of UNIX, including Linux and macOS.

PyQt implements many classes and methods, including classes for accessing SQL
databases (ODBC, MySQL, PostgreSQL, Oracle, SQLite), a Scintilla-based rich text editor
widget, data-aware widgets that are automatically populated from a database, an XML
parser, and SVG support. Scalable Vector Graphics (SVG) is an XML-based vector image
format for graphics with support for interactivity.
These PyQt features are used extensively in this project to create the required
graphical user interfaces.
The PyUic tool is used to automatically generate the code for the front-end user
interfaces designed with PyQt. All the front-end Python code is generated by this tool,
which converts the user interface (.ui) files into .py files.

Python Qt Designer:
Qt:
1. Qt is designed for developing applications and user interfaces once and deploying
them across several desktop and mobile operating systems.

2. The easiest way to start application development with Qt is to download and install
Qt. It contains Qt libraries, examples, documentation, and the necessary development
tools, such as the Qt Creator integrated development environment (IDE).

3. The PyQt installer comes with a GUI builder tool called Qt Designer.

.PY File Extension:


A PY file is a program file or script written in Python, an interpreted object-oriented
programming language. It can be created and edited with a text editor, but requires a
Python interpreter to run. PY files are often used for programming web servers and other
administrative computer systems.
Python is designed to be easy to read and simple to implement. It is open source and used
for developing a wide variety of free and commercial applications, such as Bazaar,
Blender, Pylons, and Panda3D.

.UI File Extension:


What is a UI file?
A UI file stores the user interface configuration for a program. It is saved in an XML
format and contains definitions of Qt widgets with slots and signals. It can be viewed
in a basic text editor or opened with a UI designer program. UI files can be created
using Qt Designer and Qt Creator, which are part of the Qt SDK.
.CSV File Extension:
What is a CSV file?
A CSV file is a comma separated values file commonly used by spreadsheet programs
such as Microsoft Excel or OpenOffice Calc. It contains plain text data sets separated by
commas with each new line in the CSV file representing a new database row and each
database row consisting of one or more fields separated by a comma. CSV files are often
opened by spreadsheet programs to be organized into cells or used for transferring data
between databases.
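Reading and writing such a file from Python can be sketched with the standard `csv` module. The column names and the sample rows below are hypothetical, and an in-memory buffer is used in place of a real .csv path so that the example is self-contained.

```python
import csv
import io

# Hypothetical similarity rows: word, count in target, count in source.
rows = [["keyword", 4, 2], ["similarity", 3, 3]]

buffer = io.StringIO(newline="")  # in-memory file; a real run would open a .csv path
writer = csv.writer(buffer)
writer.writerow(["word", "target_count", "source_count"])  # header row
writer.writerows(rows)

# Read the same data back; each line becomes one record keyed by the header.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["word"], records[0]["target_count"])
```

Note that values read back from a CSV file are strings, so counts must be converted with `int()` before they are used in arithmetic.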
.PYC File Extension:

What is a PYC file?


A PYC file contains compiled bytecode generated from source code written in the Python
programming language.

4.5 MySQL Connectivity:
MySQL is an open-source database management system, commonly installed as part
of the popular LAMP (Linux, Apache, MySQL, PHP/Python/Perl) stack. It uses a
relational database and SQL (Structured Query Language) to manage its data.
The short version of the installation is simple: update your package index, install the
mysql-server package, and then run the included security script.

Step 1 — Installing MySQL


On Ubuntu 16.04, only the latest version of MySQL is included in the APT package
repository by default. At the time of writing, that's MySQL 5.7.

To install it, update the package index on your server with sudo apt-get update and
install the default package with sudo apt-get install mysql-server.

You'll be prompted to create a root password during the installation. Choose a
secure one and make sure you remember it, because you'll need it later. Next, we'll
finish configuring MySQL.

Step 2 — Configuring MySQL


Step 3 — Testing MySQL
Regardless of how you installed it, MySQL should have started running automatically.
To test this, check its status with systemctl status mysql.service.

You'll see output similar to the following:

If MySQL isn't running, you can start it with sudo systemctl start mysql.
For an additional check, you can try connecting to the database using the mysqladmin
tool, which is a client that lets you run administrative commands. For example, this
command says to connect to MySQL as root (-u root), prompt for a password (-p), and
return

Sample backend tables:

5. TESTING
The purpose of testing is to discover errors. Testing is the process of trying to
discover every conceivable fault or weakness in a work product. It provides a way to check
the functionality of components, sub-assemblies, assemblies and/or a finished product.
It is the process of exercising software with the intent of ensuring that the Software system
meets its requirements and user expectations and does not fail in an unacceptable manner.
There are various types of test. Each test type addresses a specific testing requirement.

5.1 Types of Testing:
Unit Testing:
Unit testing focuses verification effort on the smallest unit of software design—the software
component or module. It is done after the completion of an individual unit before integration.
Unit tests ensure that each unique path of a business process performs accurately to the
documented specifications and contains clearly defined inputs and expected results.
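A unit test of this kind can be sketched with Python's standard `unittest` module. The function under test, `similarity_percentage`, is a hypothetical stand-in for one small unit of the project. It is not taken from the project's code, and the three cases simply show clearly defined inputs with expected results.

```python
import unittest


def similarity_percentage(source_words, target_words):
    """Share of distinct source words that also occur in the target, in percent."""
    source = set(source_words)
    if not source:
        return 0.0
    return 100.0 * len(source & set(target_words)) / len(source)


class SimilarityPercentageTest(unittest.TestCase):
    def test_identical_documents(self):
        self.assertEqual(similarity_percentage(["a", "b"], ["a", "b"]), 100.0)

    def test_partial_overlap(self):
        self.assertEqual(similarity_percentage(["a", "b"], ["b", "c"]), 50.0)

    def test_empty_source(self):
        self.assertEqual(similarity_percentage([], ["a"]), 0.0)


# Run the test case explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SimilarityPercentageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```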

Integration Testing:
Integration testing is a systematic technique for constructing the program structure
while at the same time conducting tests to uncover errors associated with interfacing.
The objective is to take unit tested components and build a program structure that has
been dictated by design.
System Testing:
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions
and flows, emphasizing pre-driven process links and integration points.

White Box Testing:


White box testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is
used to test areas that cannot be reached from a black box level.
Black Box Testing:
Black box testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. Black box tests, like most other kinds
of tests, must be written from a definitive source document, such as a specification or
requirements document. The software under test is treated as a black box.

5.2. TEST CASES:

The project is tested thoroughly by exercising each text box and push button of the
GUI screens and verifying the corresponding results in the database.

Following is the Research Paper Details screen along with Data.

The following screenshot of the MySQL database confirms that the above data is stored
in the database.

The keyword search input module is tested by invoking it from the command prompt using
the following command:

python simco1.py

Following is the Keywords screen along with Data.

The following screenshot of the MySQL database confirms that the above data is stored
in the database.

File acceptance is tested by invoking it from the command prompt using the
following command:

python simco1.py

Following is the ‘Accept Files’ screen along with Data.

The following screenshot shows the result of clicking the three push buttons in the
above screen.

6. SCREENSHOTS

6.1 Creating Files:

Fig 6.1: Screen shot showing the files created during the project

The Fig:6.1 shows the different files created during this project. There are four
types of files: (1) .pyc files, (2) .ui files, (3) .py files and (4) .txt files.
1. .pyc files: Python automatically compiles a Python script to compiled code,
so-called byte code, before running it. When a module is imported for the first time,
or when the source is more recent than the current compiled file, a .pyc file is
created.
2. .ui files are the user interface files, created by using the PyQt layout editor.
3. .py files are Python program files created either manually or automatically. For
instance, each .ui file has a corresponding .py file that is created automatically by
the PyUIC tool.
4. .txt files contain generic useful information about the project.
simco1.py is the entry program for this project. Executing this Python program leads
to the entry screen as follows:

6.2. Home Page:

Fig:6.2. Home Page

This entry screen consists of five push buttons. Clicking the first button runs the
rpaper1.py program, which opens a screen where the user can enter the details of the
research papers.
The second button invokes the kywrds1.py program, which opens a screen where the user
can enter the research keywords.
Clicking the third button runs the files1.py program, which opens a screen where the
user can enter the names and locations of the research papers.
Clicking the fourth button runs the pap2.py program, which calculates the similarity
between two given research papers.
The fifth button invokes the pap1.py program, which calculates the similarity between
a given set of keywords and a given research paper.

• The rpaper1.py program leads to the following screen.

6.3. Research paper details:

Fig: 6.3. Research paper details

The user can enter the paper ID, title, authors and contact details of the primary
author by using the above screen, and can retrieve this information from the database
whenever required.

The research paper details cover the document's authors, title and their email IDs.

6.4. Keywords to Search:


The user can enter and store the research keywords, by using the following screen.

Fig:6.4 Keywords to search

6.5. Accept Files:
The file names of the research papers and keywords can be entered and stored in the
system by using the following screen.

Fig:6.5 Accept Files

6.6 Output Showing Similarity between papers:

Following is the result of comparison between two given research papers.

Fig:6.6 Output showing similarity between papers

6.7 Output showing keyword similarity:

Following is the result of comparison between a given research paper, and a given set of
keywords.

Fig:6.7 Output showing keyword similarity

7. CONCLUSION AND FUTURE SCOPE
A new measure called matching similarity was proposed for comparing two
groups of words; it can be implemented using two algorithms, keyword matching
and keyword separation. It has simple, intuitive logic and avoids the problems
of the minimum, maximum and average similarity measures that were considered.
With it, the repetition of words in research publications can be identified easily.
Keyword search is the simplest and most popular query method for search engines
in information systems. A query contains a single keyword, multiple keywords, or
a short phrase. In a single-keyword search, the occurrences of that particular
word in the document are displayed. The search process begins by taking the
user's query and comparing it with the database. If an input keyword matches
the index of words in the database, the matching words are treated as the main
keywords displayed for that search.
