Understand Short Texts by Harvesting & Analyzing Semantic Knowledge

Abstract:
Understanding short texts is crucial to many applications, but challenges
abound. First, short texts do not always observe the syntax of a written language.
As a result, traditional natural language processing tools, ranging from
part-of-speech tagging to dependency parsing, cannot be easily applied. Second,
short texts usually do not contain sufficient statistical signals to support many
state-of-the-art approaches for text mining such as topic modeling. Third, short
texts are more ambiguous and noisy, and are generated in enormous volume, which
further increases the difficulty of handling them. We argue that semantic knowledge
is required in order to better understand short texts. In this work, we build a
prototype system for short text understanding which exploits semantic knowledge
provided by a well-known knowledge base and automatically harvested from a web
corpus. Our knowledge-intensive approaches disrupt traditional methods for tasks
such as text segmentation, part-of-speech tagging, and concept labeling, in the
sense that we focus on semantics in all these tasks. We conduct a comprehensive
performance evaluation on real-life data. The results show that semantic
knowledge is indispensable for short text understanding, and our
knowledge-intensive approaches are both effective and efficient in discovering
semantics of short texts.
Architecture diagram: [figure not included]
EXISTING SYSTEM:
Many problems in natural language processing, data mining,
information retrieval, and bioinformatics can be formalized as string
transformation, which is a task as follows. Given an input string, the system
generates the k most likely output strings corresponding to the input string. This
paper proposes a novel and probabilistic approach to string transformation, which
is both accurate and efficient. The approach includes the use of a log linear model,
a method for training the model, and an algorithm for generating the top k
candidates, whether there is or is not a predefined dictionary. The log linear model
is defined as a conditional probability distribution of an output string and a rule set
for the transformation conditioned on an input string. The learning method
employs maximum likelihood estimation for parameter estimation. The string
generation algorithm based on pruning is guaranteed to generate the optimal top k
candidates. The proposed method is applied to correction of spelling errors in
queries as well as reformulation of queries in web search. Experimental results on
large-scale data show that the proposed approach is very accurate and efficient,
improving upon existing methods in terms of accuracy and efficiency in different
settings.
PROPOSED SYSTEM:
Understanding short texts is crucial to many applications, but
challenges abound. First, short texts do not always observe the syntax of a written
language. As a result, traditional natural language processing methods cannot be
easily applied. Second, short texts usually do not contain sufficient statistical
signals to support many state-of-the-art approaches for text processing such as
topic modeling. Third, short texts are usually more ambiguous. We argue that
knowledge is needed in order to better understand short texts. In this work, we use
lexical-semantic knowledge provided by a well-known semantic network for short
text understanding. Our knowledge-intensive approach disrupts traditional methods
for tasks such as text segmentation, part-of-speech tagging, and concept labeling,
in the sense that we focus on semantics in all these tasks. We conduct a
comprehensive performance evaluation on real-life data. The results show that
knowledge is indispensable for short text understanding, and our knowledge-
intensive approaches are effective in harvesting semantics of short texts.
ADVANTAGES:
• Users can search for related words.
• Admins and owners can view a chart of the most frequently searched words.
Module description:
Number of Modules:
After careful analysis, the system has been identified to have the following
modules:
1. User module
2. Owner module
3. Admin module
4. Chart module
5. Word search module
User module:
In the user module, a new user must complete the registration form before
entering the site. After logging in, the user creates a profile for that
account. The user can then search for any word and view its related words.
For example, all annotators regard instances such as “dog” as unambiguous
although they belong to multiple concept clusters. These concept clusters
(e.g., predator, animal, creature) actually constitute a hierarchy, which we
denote as a Sense in this work.
Owner module:
In the owner module, a new owner must complete the registration form before
entering the site. After logging in, the owner creates a profile for that
account. The owner can add new words and their related words. When users
search for a particular word, the owner can add a mapping for it as soon as
possible. The owner can also view a chart of the most frequently searched
words. Ambiguity level 0 refers to instances that most people regard as
unambiguous. These instances contain only one sense, such as “dog” (animal)
and “california” (state). Ambiguity level 1 refers to instances that can
reasonably be regarded as either ambiguous or unambiguous. These instances
usually contain more than one sense, but all of these senses are related to
some extent, such as “google” (company & search engine) and “nike” (brand &
company).
Admin module:
The admin is a super user who can view the details of all users and owners.
The admin can view a chart of the most frequently searched words and can add
related words, so that users can easily find the related words for a search
term. For example, ambiguity level 2 refers to instances that most people
regard as ambiguous. These instances contain two or more unrelated senses,
such as “apple” (fruit & company) and “jaguar” (animal & company). In this
work, we only focus on disambiguation of instances.
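The three ambiguity levels described in the modules above can be illustrated with a small sketch. It assumes an instance is given as a list of senses and that some test of whether two senses are "related to some extent" is available. The class name `AmbiguityLevels`, the method `classify`, and the `related` predicate are hypothetical names introduced here for illustration; they are not part of the described system.

```java
import java.util.List;
import java.util.function.BiPredicate;

/** Illustrative only: maps the senses of one instance to ambiguity level 0, 1, or 2. */
final class AmbiguityLevels {

    private AmbiguityLevels() {}

    /**
     * @param senses  senses of one instance, e.g. ["fruit", "company"]; must be non-empty
     * @param related hypothetical test of whether two senses are related to some extent
     * @return 0 for a single sense, 1 if all senses are pairwise related, 2 otherwise
     */
    static int classify(List<String> senses, BiPredicate<String, String> related) {
        if (senses.isEmpty()) {
            throw new IllegalArgumentException("an instance needs at least one sense");
        }
        if (senses.size() == 1) {
            return 0; // level 0: only one sense, e.g. "dog" (animal)
        }
        for (int i = 0; i < senses.size(); i++) {
            for (int j = i + 1; j < senses.size(); j++) {
                if (!related.test(senses.get(i), senses.get(j))) {
                    return 2; // level 2: at least two unrelated senses
                }
            }
        }
        return 1; // level 1: several senses, all related to some extent
    }
}
```

How relatedness is decided is left open on purpose; the text only states that level 1 senses are related and level 2 senses are not.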
Word search module:
In the word search module, a user can search for any word and easily find the
words mapped to it as related words. The chart in this project is built from
these searches. Users can also view string variants of a word, such as the
string itself, substrings, shortcut words, and alias words.
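A minimal sketch of the related-word lookup is given below, assuming the mappings added by owners and admins are held in memory. In the described system they would presumably live in the MySQL database; the class `RelatedWordIndex` and its methods are hypothetical and serve only to show the lookup logic.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative in-memory index from a word to its related words. */
final class RelatedWordIndex {

    // Key: normalized word. Value: related words in insertion order, without duplicates.
    private final Map<String, Set<String>> related = new TreeMap<>();

    /** Called when an owner or admin maps a related word to a word. */
    void addMapping(String word, String relatedWord) {
        related.computeIfAbsent(normalize(word), k -> new LinkedHashSet<>())
               .add(relatedWord.trim());
    }

    /** Returns the related words for a query, or an empty list if none are mapped. */
    List<String> search(String query) {
        Set<String> hits = related.get(normalize(query));
        return hits == null ? List.of() : List.copyOf(hits);
    }

    // Searches ignore case and surrounding whitespace (an assumption of this sketch).
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

For example, after `addMapping("apple", "fruit")` and `addMapping("apple", "company")`, a search for `"Apple"` returns both related words in the order they were added.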
Chart module:
In this project, a chart is generated from the most frequently searched words
and their related-word mappings. The admin and the owner can see which words
users search for most, so the owner can easily add mappings for those words.
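The data behind such a chart can be sketched as a simple search counter. The example assumes each search is recorded as it happens and that the chart shows the top N words by count. `SearchStats`, `recordSearch`, and `topWords` are hypothetical names, and the in-memory map stands in for whatever storage the real application uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative counter that yields the rows for a "most searched words" chart. */
final class SearchStats {

    private final Map<String, Integer> counts = new HashMap<>();

    /** Records one search; case and surrounding whitespace are ignored. */
    void recordSearch(String word) {
        counts.merge(word.trim().toLowerCase(Locale.ROOT), 1, Integer::sum);
    }

    /** Returns up to n (word, count) rows, highest count first, ties in alphabetical order. */
    List<Map.Entry<String, Integer>> topWords(int n) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            rows.add(Map.entry(e.getKey(), e.getValue())); // immutable copy of each row
        }
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.min(Math.max(n, 0), rows.size());
        return new ArrayList<>(rows.subList(0, limit));
    }
}
```

The returned rows can be passed to any charting component; words that appear near the top but have no related-word mapping yet are the ones an owner would want to add first.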
System Requirements
H/W System Configuration:
Processor - Pentium III
Speed - 1.1 GHz
RAM - 256 MB (min)
Hard Disk - 20 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA
S/W System Configuration:
• Operating System : Windows 95/98/2000/XP/7
• Application Server : Tomcat 5.0/6.x/8.x
• Front End : HTML, Java, JSP
• Scripts : JavaScript, jQuery, Ajax
• Server-side Script : JavaServer Pages (JSP)
• Database Connectivity : MySQL

Conclusion:
In this work, we propose a generalized framework to understand short texts
effectively and efficiently. More specifically, we divide the task of short text
understanding into three subtasks: text segmentation, type detection, and concept
labeling. We formulate text segmentation as a weighted Maximal Clique problem,
and propose a randomized approximation algorithm to maintain accuracy and
improve efficiency at the same time. We introduce a Chain Model and a Pairwise
Model which combine lexical and semantic features to conduct type detection.
They achieve better accuracy than traditional POS taggers on the labeled
benchmark. We employ a Weighted Vote algorithm to determine the most
appropriate semantics for an instance when ambiguity is detected. The
experimental results demonstrate that our proposed framework outperforms
existing state-of-the-art approaches in the field of short text understanding.
As future work, we will attempt to analyze and incorporate the impact of
spatial-temporal features into our framework for short text understanding.
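The idea of a weighted vote can be sketched as follows. The text does not say how votes or weights are produced, so this example simply assumes that each piece of context casts a vote for one candidate sense with a numeric weight, and that the sense with the largest total wins. The `Vote` class, the method `pickSense`, and the alphabetical tie-break are assumptions of this sketch, not the algorithm as specified in this work.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative weighted vote over candidate senses of an ambiguous instance. */
final class WeightedVote {

    private WeightedVote() {}

    /** One vote: some context signal supports a sense with a given weight (hypothetical). */
    static final class Vote {
        final String sense;
        final double weight;

        Vote(String sense, double weight) {
            this.sense = sense;
            this.weight = weight;
        }
    }

    /** Sums the weights per sense and returns the sense with the highest total, if any. */
    static Optional<String> pickSense(List<Vote> votes) {
        Map<String, Double> totals = new HashMap<>();
        for (Vote v : votes) {
            totals.merge(v.sense, v.weight, Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            double score = e.getValue();
            // Ties are broken alphabetically so the result is deterministic.
            if (best == null || score > bestScore
                    || (score == bestScore && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With votes of 0.4 and 0.3 for "fruit" and 0.5 for "company", the totals are 0.7 and 0.5, so "fruit" is selected for the instance “apple”.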
Future enhancement:
In the future, we will further develop our algorithm in the
following aspects:
• In this work, we propose a generalized framework to understand short texts
effectively and efficiently. More specifically, we divide the task of short text
understanding into three subtasks: text segmentation, type detection, and
concept labeling.
• We formulate text segmentation as a weighted Maximal Clique problem, and
propose a randomized approximation algorithm to maintain accuracy and
improve efficiency at the same time. We introduce a Chain Model and a
Pairwise Model which combine lexical and semantic features to conduct type
detection. They achieve better accuracy than traditional POS taggers on the
labeled benchmark.
