Understand Short Texts by Harvesting and Analyzing Semantic Knowledge

Abstract:

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing tools, ranging from part-of-speech tagging to dependency parsing, cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text mining such as topic modeling. Third, short texts are more ambiguous and noisy, and are generated in enormous volume, which further increases the difficulty of handling them. We argue that semantic knowledge is required in order to better understand short texts. In this work, we build a prototype system for short text understanding which exploits semantic knowledge provided by a well-known knowledge base and automatically harvested from a web corpus. Our knowledge-intensive approaches disrupt traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that semantic knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are both effective and efficient in discovering semantics of short texts.

Architecture diagram: (figure not reproduced)

EXISTING SYSTEM:

Many problems in natural language processing, data mining, information retrieval, and bioinformatics can be formalized as string transformation, which is the following task: given an input string, the system generates the k most likely output strings corresponding to the input string. The existing system takes a probabilistic approach to string transformation that is both accurate and efficient. The approach includes a log-linear model, a method for training the model, and an algorithm for generating the top k candidates, whether or not there is a predefined dictionary. The log-linear model is defined as a conditional probability distribution of an output string and a rule set for the transformation, conditioned on an input string. The learning method employs maximum likelihood estimation for parameter estimation. The string generation algorithm, which is based on pruning, is guaranteed to generate the optimal top k candidates. The method is applied to the correction of spelling errors in queries as well as the reformulation of queries in web search. Experimental results on large-scale data show that the approach is very accurate and efficient, improving upon existing methods in terms of accuracy and efficiency in different settings.
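To make the top k idea concrete, the following is a minimal Java sketch, not the existing system's actual algorithm. It assumes each rule is a plain substring replacement with a caller-supplied weight, lets each candidate apply at most one rule, and omits the pruning step and the dictionary entirely. The names `TopKTransformer`, `Rule`, and `topK` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class, field, and method names are hypothetical.
class TopKTransformer {
    // A rewrite rule "from -> to" with a weight. In a trained model the weight
    // would be learned; here it is simply supplied by the caller.
    static final class Rule {
        final String from;
        final String to;
        final double weight;

        Rule(String from, String to, double weight) {
            this.from = from;
            this.to = to;
            this.weight = weight;
        }
    }

    // Returns up to k output strings, highest score first.
    // Simplification: each candidate applies at most one rule at one position,
    // and its score is that rule's weight (an unnormalized log-linear score).
    static List<String> topK(String input, List<Rule> rules, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put(input, 0.0); // identity candidate: no rule applied
        for (Rule r : rules) {
            if (r.from.isEmpty()) {
                continue; // an empty pattern would match at every position
            }
            int at = input.indexOf(r.from);
            while (at >= 0) {
                String out = input.substring(0, at) + r.to
                        + input.substring(at + r.from.length());
                // Keep the best score if several rules yield the same string.
                scores.merge(out, r.weight, Double::max);
                at = input.indexOf(r.from, at + 1);
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : ranked) {
            if (result.size() == k) {
                break;
            }
            result.add(e.getKey());
        }
        return result;
    }
}
```

A full implementation would combine several rule applications per candidate and prune the search so that the returned candidates are provably the best k; this sketch only shows the scoring and ranking step.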

PROPOSED SYSTEM:

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing methods cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text processing such as topic modeling. Third, short texts are usually more ambiguous. We argue that knowledge is needed in order to better understand short texts. In this work, we use lexical-semantic knowledge provided by a well-known semantic network for short text understanding. Our knowledge-intensive approach disrupts traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are effective in harvesting semantics of short texts.

ADVANTAGES:

• Users can search for related words.
• Admins and owners can view a chart of the most-searched words.

Module description:

Number of Modules:

After careful analysis, the system has been identified to have the following modules:

Modules:

1. User module
2. Owner module
3. Admin module
4. Chart module
5. Word search module

User module:

In the user module, a new user must fill in the registration form before entering the site. After logging in, the user creates a profile for that account. The user can then search for any word and view its related words.

For example, all annotators regard instances such as “dog” as unambiguous although they belong to multiple concept clusters. These concept clusters (e.g., predator, animal, creature, etc.) actually constitute a hierarchy, which we denote as a Sense in this work.

Owner module:

In the owner module, a new owner must fill in the registration form before entering the site. After logging in, the owner creates a profile for that account. The owner can add new words and their related words. When users search for a particular word, the owner can add it and its related words promptly. The owner can also view the chart of the most-searched words.

Ambiguity level 0 refers to instances that most people regard as unambiguous. These instances contain only one sense, such as “dog” (animal) and “california” (state). Ambiguity level 1 refers to instances for which both "ambiguous" and "unambiguous" make sense. These instances usually contain more than one sense, but all of these senses are related to some extent, such as “google” (company & search engine) and “nike” (brand & company).

Admin module:

The admin is a super user who can view all user and owner details. The admin can view the chart of the most-searched words and can add related words, so that users can easily find the words mapped to their search.

Ambiguity level 2 refers to instances that most people regard as ambiguous. These instances contain two or more unrelated senses, such as “apple” (fruit & company) and “jaguar” (animal & company). In this work, we only focus on disambiguation of instances.
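The three ambiguity levels described in the owner and admin modules can be sketched as a small Java function. This is only an illustration: the document does not say how relatedness between senses is decided, so the sketch assumes the caller supplies a set of sense pairs already judged to be related. The names `AmbiguityLevels`, `level`, and `pairKey` are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: names and the relatedness encoding are assumptions.
class AmbiguityLevels {
    // Builds an order-independent key for a pair of senses.
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // senses: the senses of one instance, e.g. ["fruit", "company"].
    // relatedPairs: keys (from pairKey) of sense pairs judged to be related.
    // Returns 0 for a single sense, 1 if every pair of senses is related,
    // and 2 if at least one pair of senses is unrelated.
    static int level(List<String> senses, Set<String> relatedPairs) {
        if (senses.size() <= 1) {
            return 0;
        }
        for (int i = 0; i < senses.size(); i++) {
            for (int j = i + 1; j < senses.size(); j++) {
                String key = pairKey(senses.get(i), senses.get(j));
                if (!relatedPairs.contains(key)) {
                    return 2; // found two unrelated senses
                }
            }
        }
        return 1; // several senses, all related to some extent
    }
}
```

With the document's examples, “dog” with the single sense animal falls in level 0, “google” falls in level 1 if company and search engine are marked as related, and “apple” falls in level 2 when fruit and company are not.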

Word search module:

In the word search module, a user can search for any word and easily see the related words mapped to it. The chart in this project is built solely from these searches. The user can also view string-manipulation variants of a word, such as the string itself, substrings, shortcut words, and alias words.
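The related-word lookup in this module can be sketched in Java as follows. The project stores its data in MySQL; the in-memory map here stands in for that storage purely for illustration, and the names `RelatedWordIndex`, `addMapping`, and `search` are hypothetical, as is the choice to ignore case and surrounding whitespace.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: an in-memory stand-in for the word mapping store.
class RelatedWordIndex {
    // word -> related words, in the order they were added
    private final Map<String, Set<String>> related = new HashMap<>();

    // Called by an owner or admin to map a related word to a word.
    void addMapping(String word, String relatedWord) {
        related.computeIfAbsent(normalize(word), k -> new LinkedHashSet<>())
               .add(normalize(relatedWord));
    }

    // Called when a user searches; returns an empty set for unknown words.
    Set<String> search(String word) {
        Set<String> hits = related.getOrDefault(normalize(word), Collections.emptySet());
        return Collections.unmodifiableSet(hits);
    }

    // Assumption: lookups ignore case and surrounding whitespace.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

For instance, after an owner maps "dog" to "animal" and "predator", a user search for "Dog" returns both related words in that order.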

Chart module:

In this project, we generate a chart based on the most-searched words and related-word mappings. The admin and owner can view which words users search for most, so the owner can easily add mappings for those words.
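The data behind such a chart can be sketched in Java as a simple search counter. This is a minimal sketch under assumptions: counts are kept in memory rather than in the project's MySQL database, ties are broken alphabetically, and the names `SearchChart`, `recordSearch`, `count`, and `topWords` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: counts searches and ranks the most-searched words.
class SearchChart {
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once for every user search.
    void recordSearch(String word) {
        counts.merge(word.trim().toLowerCase(Locale.ROOT), 1, Integer::sum);
    }

    // Number of times a word has been searched so far.
    int count(String word) {
        return counts.getOrDefault(word.trim().toLowerCase(Locale.ROOT), 0);
    }

    // The n most-searched words, most frequent first; ties sorted alphabetically.
    List<String> topWords(int n) {
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }
}
```

The list returned by `topWords`, together with `count`, gives the labels and bar heights that a JSP page or a JavaScript charting script could render for the admin and owner.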

System Requirements

H/W System Configuration:

Processor - Pentium III
Speed - 1.1 GHz
RAM - 256 MB (min)
Hard Disk - 20 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA

S/W System Configuration:

• Operating System : Windows 95/98/2000/XP/7
• Application Server : Tomcat 5.0/6.x/8.x
• Front End : HTML, Java, JSP
• Scripts : JavaScript, jQuery, Ajax
• Server-side Script : Java Server Pages
• Database Connectivity : MySQL


Conclusion:

In this work, we propose a generalized framework to understand short texts effectively and efficiently. More specifically, we divide the task of short text understanding into three subtasks: text segmentation, type detection, and concept labeling. We formulate text segmentation as a weighted Maximal Clique problem, and propose a randomized approximation algorithm to maintain accuracy and improve efficiency at the same time. We introduce a Chain Model and a Pairwise Model which combine lexical and semantic features to conduct type detection. They achieve better accuracy than traditional POS taggers on the labeled benchmark. We employ a Weighted Vote algorithm to determine the most appropriate semantics for an instance when ambiguity is detected. The experimental results demonstrate that our proposed framework outperforms existing state-of-the-art approaches in the field of short text understanding. As future work, we will attempt to analyze and incorporate the impact of spatial-temporal features into our framework for short text understanding.
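The conclusion describes text segmentation as a weighted Maximal Clique problem solved by a randomized approximation algorithm. The Java sketch below illustrates that general idea only and is not the framework's actual algorithm. It assumes that candidate terms are given with token offsets and a per-term weight, that two terms are compatible when they do not overlap, and that a segmentation is scored by the sum of its term weights. The names `CliqueSegmenter`, `Term`, and `segment` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: randomized greedy search for a heavy maximal clique
// in a graph whose nodes are candidate terms and whose edges join
// non-overlapping terms.
class CliqueSegmenter {
    static final class Term {
        final String text;
        final int start;     // index of the first token (inclusive)
        final int end;       // index after the last token (exclusive)
        final double weight; // assumed per-term score

        Term(String text, int start, int end, double weight) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.weight = weight;
        }

        boolean overlaps(Term other) {
            return start < other.end && other.start < end;
        }
    }

    // Tries `rounds` random orderings; each ordering greedily grows a maximal
    // set of mutually compatible terms. The heaviest set found is returned,
    // sorted by position in the text.
    static List<Term> segment(List<Term> candidates, int rounds, long seed) {
        Random rng = new Random(seed);
        List<Term> order = new ArrayList<>(candidates);
        List<Term> best = new ArrayList<>();
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (int r = 0; r < rounds; r++) {
            Collections.shuffle(order, rng);
            List<Term> clique = new ArrayList<>();
            double weight = 0.0;
            for (Term t : order) {
                boolean compatible = true;
                for (Term chosen : clique) {
                    if (t.overlaps(chosen)) {
                        compatible = false;
                        break;
                    }
                }
                if (compatible) {
                    clique.add(t);
                    weight += t.weight;
                }
            }
            if (weight > bestWeight) {
                bestWeight = weight;
                best = clique;
            }
        }
        best.sort((a, b) -> Integer.compare(a.start, b.start));
        return best;
    }
}
```

More rounds raise the chance of finding the heaviest segmentation at the cost of more time, which is the accuracy and efficiency trade-off a randomized approximation is meant to control.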

Future enhancement:
In the future, we will further develop our algorithm in the following aspects:

• In this work, we propose a generalized framework to understand short texts effectively and efficiently. More specifically, we divide the task of short text understanding into three subtasks: text segmentation, type detection, and concept labeling.
• We formulate text segmentation as a weighted Maximal Clique problem, and propose a randomized approximation algorithm to maintain accuracy and improve efficiency at the same time. We introduce a Chain Model and a Pairwise Model which combine lexical and semantic features to conduct type detection. They achieve better accuracy than traditional POS taggers on the labeled benchmark.
