NIT KURUKSHETRA
MAJOR PROJECT REPORT
SUBMITTED TO:
Dr R.M Sharma
SUBMITTED BY:
Manisha Singh(111497)
Sneha Bairagi(111717)
Abhinav Rai(511004)
CONTENTS
1. Introduction
2. Motivation
3. Problem Statement
4. Description
   4.1 Steps involved
       4.1.1 Tokenization
       4.1.2 Stemming
       4.1.3 POS Tagging
       4.1.4 Annotating corpora and searching patterns
5. Java
6. JDBC
7. Conclusion
8. Future work
9. References
Acknowledgments
ACKNOWLEDGMENTS
We express our profound gratitude and indebtedness to Prof. R.M. Sharma, Department of
Computer Science and Engineering, NIT Kurukshetra, for supporting the present topic and for
his inspiring intellectual guidance, constructive criticism, and valuable suggestions
throughout the project work.
Date - 4/05/2015
Kurukshetra
Manisha Singh
Sneha Bairagi
Abhinav Rai
ABSTRACT
The machine learning field has gained thrust in almost every domain of research and has
recently become a reliable tool in the medical domain. The experimental domain of
automatic learning is used in tasks such as medical decision support, medical imaging,
protein-protein interaction, extraction of medical knowledge, and overall patient
management care. Machine learning is envisaged as a tool by which computer-based systems
can be integrated in the healthcare field in order to deliver better, well-organised medical care.
This report describes an ML-based methodology for building an application that is capable of
identifying and disseminating healthcare information. It extracts sentences from published
medical papers that mention diseases and treatments, and identifies the semantic relations that
exist between diseases and treatments. Our evaluation results for these tasks show that the
proposed methodology obtains reliable outcomes that could be integrated in an application to
be used in the medical care domain. The potential value of this work stands in the ML settings
that we propose and in the fact that we outperform previous results on the same data set.
1. Introduction
Because of the enormous increase in research in the medical domain, information
extraction tools are becoming more and more important for practitioners.
Finding relevant information in the medical domain is still very problematic because most of
the data on the internet is poorly structured, amorphous, and hard to process
algorithmically. Most of the data is contained in journals of medicine and biology, which
makes this type of textual mining a central and core problem. In this project, we have focused
on disease-medicine co-occurrence relationship extraction from the text of the literature.
Automatically identifying the relationship between disease and treatment from medical
records to support the process of diagnosis will be a very valuable contribution to the field
of public health.
In this project we present a methodology for extracting useful information from large
medical data sets. We apply techniques of data mining to extract the
treatments corresponding to a disease from a huge corpus of data. The system tries to identify
the relationships of an active disease and extract the relevant medicine for the patient. With the
growing number of medical theses, research papers, and research articles, researchers face
the difficulty of reading a large number of papers to gain knowledge in their field of
interest. This system helps the user extract disease-treatment relationships without
reading the whole document. From the extracted file, the treatment of the particular disease is
filtered and displayed to the user. Thus the user gets only the required information, which
saves time and improves the quality of the result. This text-mined document can be used
in the medical healthcare domain, where a doctor can analyse the various kinds of treatment that can
be given to a patient with a particular medical disorder. The doctor can update his knowledge
related to a particular disease or its treatment methodology. A large-scale and accurate list of
drug-disease treatment pairs derived from published biomedical literature can be used for
drug repurposing [1]. First, the extracted pairs themselves contain many interesting drug-disease
repurposing pairs with evidence from case studies or small-scale clinical studies. Second,
these pairs can be used in network-based systems approaches to drug repurposing. For
example, if drug 1 is similar to drug 2, and disease 1 can be treated by drug 1, then we can
hypothesize that disease 1 can also be treated by drug 2. Here drug-disease relationships
are what connect drugs to diseases.
2. Motivation
There is a huge volume of data growing on the internet in the form of research papers and
web documents. The amount of medical literature continues to grow and specialize. The
traditional healthcare system is also becoming one that embraces the internet and the electronic
world. Electronic Health Records (EHR) are becoming the standard in the healthcare domain.
Research and studies show that the potential benefits of having an EHR system are:
Health information recording and clinical data repositories : immediate access to patient
diagnoses, allergies, and lab test results, enabling better and more time-efficient medical
decisions;
Medication management : rapid access to information regarding potential adverse drug
reactions, immunizations, supplies, etc.;
Decision support : the ability to capture and use quality medical data for decisions in the
workflow of healthcare.
3. Problem Statement
Problem: Finding semantic relationships among associated medical terms using patterns.
A semantic relationship among terms basically refers to the hidden meaning between the
terms; for example, between a drug and a disease the hidden meaning is treatment. Here we
are trying to find out the treatments for diseases by processing the relevant documents using
NLP (natural language processing) techniques, which can be used by doctors to improve their
knowledge of the latest treatments discovered, and can also be used in drug repurposing.
4. Description
In this project we are coming out with a system that can be used to identify the various
medicines available for a particular disease. The input will be a disease name, and the system
will extract the medicines available for that disease from text documents available in
unstructured format. So basically we are processing the text documents to get the
disease-treatment pairs available in them.
Proposed Algorithm:
Input: Disease, Rules.
Output: Medicine, Semantic Relationship.
1. For any disease, extract papers from Medline.
2. Tokenize the documents.
3. Remove all stop words.
4. Perform stemming.
5. Perform POS tagging to separate the required parts of speech.
6. Convert this corpus to an annotated corpus.
7. From the annotated sentences, search for patterns between disease and drug entities.
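Steps 2-4 of the algorithm can be sketched in Java. This is only an illustrative sketch: the stop-word list and the toy suffix stripper below are placeholders, not the project's actual resources (a real system would use a full stop list and a stemmer such as Porter's).

```java
import java.util.*;
import java.util.stream.*;

public class ExtractionPipeline {
    // Illustrative stop-word list (a real list would be much larger).
    static final Set<String> STOPWORDS = Set.of("the", "of", "in", "is", "a", "an", "and");

    // Step 2: split the document into lowercase word tokens.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }

    // Step 3: drop stop words.
    static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOPWORDS.contains(t))
                     .collect(Collectors.toList());
    }

    // Step 4: a toy suffix stripper standing in for a real stemmer.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"})
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2)
                return word.substring(0, word.length() - suffix.length());
        return word;
    }

    public static void main(String[] args) {
        String doc = "Aspirin is used in the treatment of fevers.";
        List<String> stems = removeStopwords(tokenize(doc)).stream()
                .map(ExtractionPipeline::stem)
                .collect(Collectors.toList());
        System.out.println(stems); // prints [aspirin, used, treatment, fever]
    }
}
```

Steps 5-7 (POS tagging and pattern search) are described in the subsections that follow.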
Stop word removal: Consider a user searching for "how to develop information retrieval
applications". The search engine tries to find web pages that contain the terms how, to,
develop, information, retrieval, applications. It is going to find far more pages that contain
the terms how and to than pages that contain information about developing information
retrieval applications, because the terms how and to are so commonly used in the English
language. So, if we disregard these two terms, the search engine can focus on retrieving
pages that contain the keywords develop, information, retrieval, applications, which more
closely brings up pages that are really of interest. This is the basic intuition for using stop
words. Stop words can be used in a whole range of tasks; these are just a few:
Supervised machine learning : removing stop words from the feature space;
Clustering : removing stop words prior to generating clusters;
Information retrieval : preventing stop words from being indexed;
Text summarization : excluding stop words from contributing to summarization scores, and
removing stop words when computing ROUGE scores.
Types of stop words: Stop words are generally thought of as a single set of words, but the
term can mean different things to different applications. For some applications, removing all
stop words, from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to
some adjectives (e.g. good, nice), is appropriate. For other applications, however, this can be
detrimental. For instance, in sentiment analysis, removing adjectives such as good and nice,
as well as negations such as not, can throw algorithms off their tracks. In such cases, one can
choose a minimal stop list consisting of just determiners, or determiners with prepositions, or
just coordinating conjunctions, depending on the needs of the application.
Examples of minimal stop word lists:
Determiners : determiners tend to mark nouns; a determiner is usually followed by a noun.
Examples: the, a, an, another.
Coordinating conjunctions : coordinating conjunctions connect words, phrases, and clauses.
Examples: for, and, nor, but, or, yet, so.
Prepositions : prepositions express temporal or spatial relations. Examples: in, under,
towards, before.
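The point about minimal stop lists can be made concrete with a small sketch that filters the same sentence with an aggressive list and with a determiner-only list (both word lists are illustrative):

```java
import java.util.*;

public class StopLists {
    // Remove every token that appears in the given stop list.
    static List<String> filter(String sentence, Set<String> stopList) {
        List<String> kept = new ArrayList<>();
        for (String tok : sentence.toLowerCase().split("\\s+"))
            if (!stopList.contains(tok)) kept.add(tok);
        return kept;
    }

    public static void main(String[] args) {
        // Aggressive list: drops the negation and the adjective too.
        Set<String> aggressive = Set.of("the", "a", "an", "is", "not", "good");
        // Minimal list: determiners only, safer for sentiment analysis.
        Set<String> minimal = Set.of("the", "a", "an");

        String review = "the movie is not good";
        System.out.println(filter(review, aggressive)); // prints [movie]
        System.out.println(filter(review, minimal));    // prints [movie, is, not, good]
    }
}
```

With the aggressive list the sentiment-bearing phrase "not good" is lost entirely, which is exactly the failure mode described above.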
4.1.2 Stemming:
Stemming is the term used in linguistic morphology and information retrieval to describe the
process of reducing inflected words to their word stem, base, or root form, generally a written
word form. The stem need not be identical to the morphological root of the word; it is
usually sufficient that related words map to the same stem, even if this stem is not in itself a
valid root. Stemming is a pre-processing step in text mining applications as well as a very
common requirement of natural language processing functions. In fact it is very important in
most information retrieval systems. The main purpose of stemming is to reduce the
different grammatical forms/word forms of a word (its noun, adjective, verb, adverb forms, etc.)
to its root form [2]. We can say that the goal of stemming is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form. For example,
'reader' and 'reading' are reduced to 'read', so that the terms can lead to similarity detection.
Stemming does not seem to depend on the domain, but it does depend on the language of the
text. Our findings show that stemming affects the semantics of terms. It has been seen that
most of the time the morphological variants of words have similar semantic interpretations
and can be considered equivalent for the purpose of IR applications. Since the meaning is the
same but the word form is different, it is necessary to identify each word form with its base
form. In stemming, the morphological forms of a word are converted to its stem on the
assumption that each one is semantically related. There are mainly two errors in stemming:
over-stemming and under-stemming. Over-stemming is when two words with different stems
are stemmed to the same root; this is also known as a false positive. Under-stemming is when
two words that should be stemmed to the same root are not; this is also known as a false
negative. Paice has shown that light stemming reduces over-stemming errors but increases
under-stemming errors, while heavy stemmers reduce under-stemming errors at the cost of
increasing over-stemming errors.
Various stemming algorithms are available:
Truncate(n): The most basic stemmer is Truncate(n), which truncates a word at the nth
symbol, i.e. keeps the first n letters and removes the rest. Words shorter than n are kept as
they are. The chance of over-stemming increases when the word length is small.
S-Stemmer: An algorithm conflating the singular and plural forms of English nouns,
proposed by Donna Harman. The algorithm has rules to remove suffixes in plurals so as to
convert them to singular forms.
Lovins Stemmer: This was the first popular and effective stemmer, proposed by Lovins in
1968. It performs a lookup on a table of 294 endings, 29 conditions, and 35 transformation
rules, which are arranged on a longest-match principle [6]. The Lovins stemmer
removes the longest suffix from a word. Once the ending is removed, the word is recoded
using a different table that makes various adjustments to convert the stem into a valid
word. It always removes a maximum of one suffix from a word, owing to its nature as a
single-pass algorithm. The advantages of this algorithm are that it is very fast, can handle the
removal of double letters in words like 'getting' being transformed to 'get', and also handles
many irregular plurals like mouse and mice, index and indices, etc.
Drawbacks of the Lovins approach are that it is time and data consuming. Furthermore, many
suffixes are not available in its table of endings. It is sometimes highly unreliable and
frequently fails to form words from the stems or to match the stems of like-meaning words,
the reason being the technical vocabulary used by the author.
Porter's Stemmer: Porter's stemming algorithm, proposed in 1980, is as of now one of the
most popular stemming methods. Many modifications and enhancements have been made
and suggested on top of the basic algorithm. It is based on the idea that the suffixes in the
English language (approximately 1200) are mostly made up of combinations of smaller and
simpler suffixes. It has five steps, and within each step, rules are applied until one of them
passes its conditions. If a rule is accepted, the suffix is removed accordingly, and the next
step is performed. The resultant stem at the end of the fifth step is returned.
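The two simplest stemmers above can be sketched directly. Note this is a simplified rendering: Harman's published S-stemmer has exactly these three rules, but edge cases in her condition lists may differ slightly from what is shown here.

```java
public class SimpleStemmers {
    // Truncate(n): keep the first n letters; shorter words are left as-is.
    static String truncate(String word, int n) {
        return word.length() <= n ? word : word.substring(0, n);
    }

    // S-stemmer (after Harman): conflate common English plural endings.
    static String sStem(String word) {
        if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies"))
            return word.substring(0, word.length() - 3) + "y";   // therapies -> therapy
        if (word.endsWith("es") && !word.endsWith("aes")
                && !word.endsWith("ees") && !word.endsWith("oes"))
            return word.substring(0, word.length() - 1);         // diseases -> disease
        if (word.endsWith("s") && !word.endsWith("us") && !word.endsWith("ss"))
            return word.substring(0, word.length() - 1);         // drugs -> drug
        return word;
    }

    public static void main(String[] args) {
        System.out.println(truncate("treatment", 5)); // prints treat
        System.out.println(sStem("therapies"));       // prints therapy
        System.out.println(sStem("diseases"));        // prints disease
    }
}
```

Truncate(5) illustrates the over-stemming risk discussed above: "treatment" and "treaty" would both truncate to "treat".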
4.1.3 POS Tagging:
E.g.: <s>Come September, and the UJF campus is abuzz with new and returning students.</s>
After POS tagging the sentence will be:
<s>Come_VB September_NNP ,_, and_CC the_DT UJF_NNP campus_NN is_VBZ abuzz_JJ
with_IN new_JJ and_CC returning_VBG students_NNS ._.</s>
These labels come from the Penn Treebank tagset, developed at the University of
Pennsylvania, a famous centre for natural language processing work.
The foundation is the noisy channel model:

correct sequence (tm, tm-1, ..., t1)  --noisy transformation-->  (wn, wn-1, ..., w1)

Here the noisy channel is a metaphor for a computation where the input is subjected to noise
at every stage of processing and an output is generated. The tagger works in reverse: on its
input side we have the word sequence, and on its output side we have the tag sequence.
Argmax computation:
Let y = f(x) be a function. Then y* = max f(x) over all x is the maximum value the function
attains. Compare max with argmax: x* = argmax f(x) over all x is the value of x at which
that maximum is attained.
Each tag here is considered a state; ^ (hat) is the starting state and . (dot) is the ending state.
If there are N words in a sentence, we get a tag sequence with N+2 states, because the hat
and dot states are also included. All possible tag sequences are tried, and the tag sequence
with the maximum probability is assigned to the word sequence. Finding the tag sequence
has thus been reduced to graph traversal.
Best tag sequence
= T*
= argmax P(T|W)
= argmax P(T) P(W|T)   (by Bayes' rule, since P(W) is constant over T)
P(T) = P(t0) P(t1|t0) P(t2|t1 t0) P(t3|t2 t1 t0) ... P(tn|tn-1 tn-2 ... t0) P(tn+1|tn tn-1 ... t0)
     = P(t0) P(t1|t0) P(t2|t1 t0) P(t3|t2 t1) ... P(tn|tn-1 tn-2) P(tn+1|tn tn-1)
(trigram assumption)
P(tn|tn-1 tn-2) = (number of times the sequence tn-2 tn-1 tn occurs) / (number of times the
sequence tn-2 tn-1 occurs)
P(W|T) = P(w0|t0..tn+1) P(w1|w0, t0..tn+1) P(w2|w1 w0, t0..tn+1) ... P(wn|w0..wn-1, t0..tn+1)
P(wn+1|w0..wn, t0..tn+1)
Assumption: a word is completely determined by its tag. This is inspired by speech
recognition.
P(W|T) = P(w0|t0) P(w1|t1) P(w2|t2) ... P(wn|tn) P(wn+1|tn+1)   (lexical probability)
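Under these assumptions, scoring a candidate tag sequence reduces to two look-ups per position. A sketch, using the bigram simplification of P(T) (as in the worked example that follows) and the lexical-probability assumption; the probability maps and the tiny sentence are illustrative, not from the report's data:

```java
import java.util.*;

public class TagSequenceScore {

    // P(T): product of bigram tag probabilities P(t_i | t_{i-1}).
    static double pT(List<String> tags, Map<String, Double> bigram) {
        double p = 1.0;
        for (int i = 1; i < tags.size(); i++)
            p *= bigram.getOrDefault(tags.get(i - 1) + "->" + tags.get(i), 0.0);
        return p;
    }

    // P(W|T): product of lexical probabilities P(w_i | t_i).
    static double pWgivenT(List<String> words, List<String> tags,
                           Map<String, Double> lexical) {
        double p = 1.0;
        for (int i = 0; i < words.size(); i++)
            p *= lexical.getOrDefault(words.get(i) + "|" + tags.get(i), 0.0);
        return p;
    }

    public static void main(String[] args) {
        // Illustrative probabilities in the spirit of the worked example.
        Map<String, Double> bigram = Map.of(
            "^->N", 1.0, "N->V", 0.4, "V->A", 0.5, "A->.", 1.0 / 3);
        Map<String, Double> lexical = Map.of(
            "ram|N", 0.2, "got|V", 0.5, "lucky|A", 0.1);

        List<String> tags = List.of("^", "N", "V", "A", ".");
        List<String> words = List.of("ram", "got", "lucky");
        double score = pT(tags, bigram)
                     * pWgivenT(words, tags.subList(1, 4), lexical);
        System.out.println(score);
    }
}
```

The tagger would evaluate this score for every candidate tag sequence and keep the argmax; the Viterbi algorithm below does the same thing without enumerating all sequences.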
Example of calculation from actual data.
Corpus: ^ Ram got many NLP books . ^ He found them all very interesting .
POS tagged: ^ N V A N N . ^ N V N A R A .
(each sentence is preceded by the start marker ^)
Recording numbers (bigram assumption); row = previous tag, column = next tag:

      ^   N   V   A   R   .
^     0   2   0   0   0   0
N     0   1   2   1   0   1
V     0   1   0   1   0   0
A     0   1   0   0   1   1
R     0   0   0   1   0   0
.     1   0   0   0   0   0
Probabilities (each count divided by its row total):

      ^    N    V    A    R    .
^     0    1    0    0    0    0
N     0   1/5  2/5  1/5   0   1/5
V     0   1/2   0   1/2   0    0
A     0   1/3   0    0   1/3  1/3
R     0    0    0    1    0    0
.     1    0    0    0    0    0
P(ram|N) = P(wi = ram | ti = N) = (number of times 'ram' occurs as a noun) / (total number of nouns).
Lexical probabilities are tabulated similarly: P(word|tag) for each corpus word (Ram, got,
many, NLP, books, He, found, them, all, very, interesting) against each tag (^, N, V, A, R, .).
Example (urns and balls): consider three urns containing coloured balls:
U1: 30 red, 50 green, 20 blue
U2: 10 red, 40 green, 50 blue
U3: 60 red, 10 green, 30 blue

Transition probabilities between the urns (row = current urn, each row summing to 1):

      U1   U2   U3
U1   0.5  0.2  0.3
U2   0.2  0.5  0.3
U3   0.5  0.4  0.1
Let the observation sequence be RRGGBRGR. We have to find the state (urn) sequence that
produced it. Many problems in AI fall into this class: predict the hidden from the observed.
If a Hidden Markov Model with the trigram assumption is used to compute the tags, i.e. the
current tag depends upon the previous two tags, the complexity is greater; the Viterbi
algorithm is applied to perform the tagging efficiently.
Viterbi Algorithm
Given: the HMM, which means:
a) Start state: s1
b) Alphabet A = {a1, a2, ..., ap}
c) Set of states S = {s1, s2, ..., sN}
d) Transition probabilities P(si, ak | sj) for all i, j, k: the probability of moving from sj to si
while emitting ak.
Given an output string a1 a2 ... aT, find the most likely sequence of states c1 c2 ... cT that
produces it, i.e. c1 c2 ... cT = argmax P(c | a1 a2 ... aT).
Data structures:
a) An N*T array called SEQSCORE that always maintains the winning score for each state
at each step (N = number of states, T = length of the output sequence).
b) Another N*T array called BACKPTR, used to recover the path.
Three distinct steps in the Viterbi implementation:
a) Initialization
b) Iteration
c) Sequence identification

Initialization:
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0

Iteration:
For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = max over j = 1..N of [ SEQSCORE(j, t-1) * P(si, at | sj) ]
        BACKPTR(i,t) = the index j that gives the maximum above

Sequence identification:
C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), i+1]
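The three steps translate into compact Java. The transition probabilities in main are illustrative, and states and symbols are indexed from 0 rather than 1:

```java
public class Viterbi {
    /**
     * State 0 is the start state s1. trans[j][i][k] = P(s_i, a_k | s_j),
     * i.e. probability of moving j -> i while emitting symbol k.
     * obs is the output symbol sequence; returns the most likely state sequence.
     */
    static int[] decode(double[][][] trans, int[] obs, int nStates) {
        int T = obs.length;
        double[][] seqScore = new double[nStates][T + 1];
        int[][] backPtr = new int[nStates][T + 1];

        seqScore[0][0] = 1.0; // initialization: all mass on the start state
        for (int t = 1; t <= T; t++) {
            for (int i = 0; i < nStates; i++) {
                double best = 0.0; int arg = 0;
                for (int j = 0; j < nStates; j++) {
                    double cand = seqScore[j][t - 1] * trans[j][i][obs[t - 1]];
                    if (cand > best) { best = cand; arg = j; }
                }
                seqScore[i][t] = best;
                backPtr[i][t] = arg;
            }
        }
        // Sequence identification: follow back-pointers from the best final state.
        int[] path = new int[T + 1];
        for (int i = 1; i < nStates; i++)
            if (seqScore[i][T] > seqScore[path[T]][T]) path[T] = i;
        for (int t = T; t >= 1; t--) path[t - 1] = backPtr[path[t]][t];
        return path;
    }

    public static void main(String[] args) {
        // Two states s1, s2 and two symbols a1, a2; probabilities are illustrative.
        double[][][] trans = {
            { {0.1, 0.2}, {0.3, 0.4} },  // from s1: to s1 via a1/a2, to s2 via a1/a2
            { {0.3, 0.2}, {0.2, 0.3} },  // from s2
        };
        int[] obs = {0, 1, 0};           // a1 a2 a1
        System.out.println(java.util.Arrays.toString(decode(trans, obs, 2)));
    }
}
```

Only the winning score per state survives each column, so the cost is O(N^2 T) rather than the O(N^T) of enumerating all state sequences.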
Understand the Viterbi algorithm with an example: a two-state HMM (states S1, S2) emitting
symbols a1 and a2. The SEQSCORE table is filled column by column; the surviving scores in
this run include 0.3, 0.06, 0.027, 0.012, 0.0081, and 0.0054.

BACKPTR table (columns: start, a1, a2, a1, a2):

     start  a1  a2  a1  a2
S1     0     1   2   2   2
S2     0     1   2   1   2

By following the BACKPTR table the state sequence is obtained; the best state sequence here
is S1 S2 S1 S2 S1. This reduces the complexity considerably compared with enumerating all
state sequences.
4.1.4 Annotating Corpora and Searching Patterns:
In this step the corpus is annotated with disease and medical terms. Sentences are tagged with
disease entities from the clean disease lexicon and drug entities from the drug list. The
tagging is based on case-insensitive exact string matching, for high precision and efficiency.
Then a pattern is searched for between the disease and the drug. The pattern can be of the
form "drug <pattern> disease" if the drug entity precedes the disease entity, or
"disease <pattern> drug" if the disease precedes the drug.
The patterns we use for "drug <pattern> disease" are: in; in the treatment of; for; in patients
with; for the treatment of; treatment of; therapy for; therapy in; for treatment of; against; in
the management of; therapy of; treatment for; treatment in; in a patient with; in treatment of;
in children with; to cure; is used to cure; is used for curing; is used to manage; reduces; in
the treatment of patients; prevents; is used to prevent; to prevent; for the management of; to
treat; can be used to control symptoms of; can be used as medication for; can be used to
improve symptoms; can be used as an antibiotic for; can be used to relieve sign symptoms;
can be used to relieve symptoms; can be used to reduce symptoms; can be effective for; may
be effective in the treatment of.
The patterns used for "disease <pattern> drug" are: can be treated with; interventions to
control disease are; symptoms can be improved with; symptoms can be controlled with;
antibiotics for the disease are; antibiotics that can be used are; your doctor may recommend;
symptoms can be reduced with; can be prevented with.
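A minimal rendering of the pattern search between tagged entities. The entity mark-up (`<drug>`, `<disease>`) and the example sentence are illustrative choices, not the project's actual annotation format:

```java
import java.util.*;
import java.util.regex.*;

public class PatternSearch {
    // A few of the "drug <pattern> disease" connectors listed above.
    static final List<String> PATTERNS = List.of(
        "in the treatment of", "for the treatment of", "is used to cure",
        "in patients with", "to treat", "against");

    // Assumes entities are already tagged as <drug>...</drug> and <disease>...</disease>.
    static Optional<String[]> findPair(String sentence) {
        for (String p : PATTERNS) {
            Matcher m = Pattern.compile(
                "<drug>(.+?)</drug>\\s+" + Pattern.quote(p) + "\\s+<disease>(.+?)</disease>",
                Pattern.CASE_INSENSITIVE).matcher(sentence);
            if (m.find())
                return Optional.of(new String[] { m.group(1), p, m.group(2) });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String s = "<drug>Methotrexate</drug> is used to cure <disease>psoriasis</disease>.";
        findPair(s).ifPresent(r ->
            System.out.println(r[0] + " --[" + r[1] + "]--> " + r[2]));
        // prints Methotrexate --[is used to cure]--> psoriasis
    }
}
```

The "disease <pattern> drug" direction works the same way with the entity tags swapped and the second pattern list substituted.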
How to check the quality of tagging: three parameters are used. Let A be the answer
(gold-standard) set of tagged words and O the output set produced by the tagger:
Precision P = |A ∩ O| / |O|
Recall R = |A ∩ O| / |A|
F-score = harmonic mean of precision and recall = 2PR / (P + R)
If every word is given a tag and no word is left out, then sizeof(A) = sizeof(O), and
therefore precision = recall = F-score.
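The three measures can be computed directly from the overlap of the answer and output sets; a sketch over "word/tag" pairs (the example sets are illustrative):

```java
import java.util.*;

public class TaggingQuality {
    static double precision(Set<String> answer, Set<String> output) {
        Set<String> inter = new HashSet<>(answer);
        inter.retainAll(output);           // A ∩ O
        return (double) inter.size() / output.size();
    }

    static double recall(Set<String> answer, Set<String> output) {
        Set<String> inter = new HashSet<>(answer);
        inter.retainAll(output);           // A ∩ O
        return (double) inter.size() / answer.size();
    }

    static double fScore(double p, double r) {
        return 2 * p * r / (p + r);        // harmonic mean of P and R
    }

    public static void main(String[] args) {
        Set<String> answer = Set.of("ram/N", "got/V", "many/A", "books/N");
        Set<String> output = Set.of("ram/N", "got/V", "many/N", "books/N");
        double p = precision(answer, output), r = recall(answer, output);
        // Every word got exactly one tag, so |A| = |O| and P = R = F here.
        System.out.println(p + " " + r + " " + fScore(p, r));
    }
}
```

In the example, one of four words is mistagged, so P = R = F = 0.75, illustrating the equality noted above.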
5. JAVA
Java is a programming language and a platform. Java is a high-level, robust, secure,
object-oriented programming language. Any hardware or software environment in which a
program runs is known as a platform. Since Java has its own runtime environment (the JRE)
and API, it is itself called a platform.
A simple Java example:

class Simple
{
    public static void main(String args[])
    {
        System.out.println("Hello Java");
    }
}
According to Sun, 3 billion devices run java. There are many devices where java is currently
used. Some of them are as follows:
Mobile
Embedded System
Smart Card
Robotics
Games etc.
Features of JAVA
Simple : Java is a simple language because its syntax is based on C++ (so it is easier for
programmers to learn after C++). It has removed many confusing and/or rarely-used
features, e.g., explicit pointers and operator overloading. There is no need to free
unreferenced objects because Java has automatic garbage collection.
Secured : Java is secure because it has no explicit pointers and programs run inside the
virtual machine sandbox.
Robust : Robust simply means strong. Java uses strong memory management. The lack of
pointers avoids security problems, there is automatic garbage collection, and there are
exception-handling and type-checking mechanisms. All these points make Java robust.
High Performance : Java is faster than traditionally interpreted languages, since bytecode is
"close" to native code, though it is still somewhat slower than a compiled language (e.g., C++).
Multithreaded : the main advantage of multithreading in Java is that threads share the same
memory. Threads are important for multimedia, web applications, etc.
Distributed : distributed applications can also be created in Java. RMI and EJB are used for
creating distributed applications. We may access files by calling methods from any machine
on the internet.
6. JDBC
Java JDBC is a Java API to connect to a database and execute queries against it. The JDBC
API uses JDBC drivers to connect to the database. Before JDBC, the ODBC API was the
database API used to connect and execute queries, but the ODBC driver is written in the C
language (i.e. platform dependent and unsecured). That is why Java defined its own API (the
JDBC API), which uses JDBC drivers written in Java. An API (Application Programming
Interface) is a document that contains a description of all the features of a product or piece of
software. It represents the classes and interfaces that software programs can follow to
communicate with each other. An API can be created for applications, libraries, operating
systems, etc.
There are five steps to connect a Java application to a database using JDBC:
Registering the driver class
Creating the connection
Creating the statement
Executing queries
Closing the connection
Register the driver class: The forName() method of Class class is used to register the driver
class. This method is used to dynamically load the driver class.
Syntax of forName() method
public static void forName(String className)throws ClassNotFoundException
Create the Connection object: The getConnection() method of DriverManager class is used to
establish connection with the database.
Syntax of getConnection method
public static Connection getConnection(String url)throws SQLException
public static Connection getConnection(String url,String name,String password)
throws SQLException
Create the statement object: The createStatement() method of Connection interface is used to
create statement. The object of statement is responsible to execute queries with the database.
Syntax of createStatement method
public Statement createStatement()throws SQLException
Execute the query: The executeQuery() method of the Statement interface is used to execute
queries against the database. This method returns a ResultSet object that can be used to get
all the records of a table.
Syntax of executequery() method
public ResultSet executeQuery(String sql)throws SQLException
Close the connection object: By closing connection object statement and ResultSet will be
closed automatically. The close() method of Connection interface is used to close the
connection.
Syntax of close method
public void close()throws SQLException
Connectivity with Access using a DSN
Connectivity with the type-1 driver is not considered good. To connect a Java application
with the type-1 driver, create a DSN first; here the DSN name is mydsn.
import java.sql.*;
class Test
{
    public static void main(String ar[])
    {
        try
        {
            String url = "jdbc:odbc:mydsn";
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            Connection c = DriverManager.getConnection(url);
            Statement st = c.createStatement();
            ResultSet rs = st.executeQuery("select * from login");
            while (rs.next())
            {
                System.out.println(rs.getString(1));
            }
            c.close();
        }
        catch (Exception ee) { System.out.println(ee); }
    }
}
7. Conclusion
The task we tackle in this project has applications in information retrieval, information
extraction, and text summarization. We identify potential improvements in results when more
information is brought into the representation technique for the task of classifying short
medical texts. Experimental results show that the technique used in the proposed work
minimizes the time and workload of doctors in analysing information about a certain disease
and treatment in order to make decisions about patient monitoring and treatment. This system
helps users, especially doctors, to save time: they can easily learn about a disease, its
treatment, and its symptoms, and can further analyse the various treatments associated with a
particular disease. The text-mined document can be used in the medical healthcare domain,
where a doctor can analyse the various kinds of treatment that can be given to a patient with
a particular medical disorder. The doctor can update his knowledge related to a particular
disease, its treatment methodology, or the details of medicines that are in research for that
disease. The doctor can also learn about medicines that are effective for some patients but
cause side effects in patients with an additional medical disorder. Patients can likewise use
the extracted document to get a clear understanding of a particular disease: its symptoms,
side effects, medicines, and treatment methodologies.
8. Future Scope
A wide future scope exists for this project. We can make it more user-friendly by allowing
the user to also extract information regarding the cure, symptoms, and prevention of a
disease. The project can be expanded to finding the root cause of a disease and then, by
taking the patient's history or condition into account, providing the appropriate dose. A
further idea is to examine the composition of a medicine against a patient's report in order to
identify whether it will suit that patient.
9. References
[1] Rong Xu and QuanQiu Wang, "Large-scale extraction of accurate drug-disease treatment
pairs from biomedical literature for drug repurposing", 2013.
[2] Fadi Yamout, "Further Enhancement to the Porter's Stemming Algorithm", 2006.
[3] S. Ray and M. Craven, "Representing sentence structure in Hidden Markov Models for
information extraction", Proceedings of IJCAI-2001.
[4] M. S. Ryan and G. R. Nudd, "The Viterbi Algorithm", Department of Computer Science,
University of Warwick, Coventry, England, 1993.
[5] Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC
Curves", Department of Computer Sciences and Department of Biostatistics and Medical
Informatics, University of Wisconsin-Madison, USA.