www.jatit.org
ABSTRACT

We propose an architectural model for detecting plagiarism in natural language text and present an analysis of the various detection processes followed for effective plagiarism detection. Other plagiarism detection mechanisms are based on parsing techniques in which sentence and word chunking is performed to extract phrases that are then searched on the Internet. We also perform sentence and word chunking, but our method is more detailed and comprehensive: we further bifurcate these two techniques on the basis of the rearrangement of the text or word pattern. We emphasize the principle behind the performance of any search result and develop an efficient plagiarism detection mechanism based on a unitization process, which has been tested on various input data and has shown considerable output. Our proposed unitization processes differ from those already available. The challenges faced during plagiarism detection are discussed along with their proposed solutions. The results obtained from the experiments constitute an empirical study of the performance factors.
Journal of Theoretical and Applied Information Technology
…level based on the document size. This paper focuses on the architectural details of the plagiarism detection mechanism, along with the type of functionality we choose for higher performance of the system.

Natural language processing is a field related to artificial intelligence, and it is of great importance here because detecting plagiarized phrases in a given document becomes easier after semantic analysis of the document. Various natural language principles make it easier to carry out text analysis and parser generation for the given document. Plagiarism is defined as the practice of claiming or implying original authorship of (or incorporating material from) someone else's written or creative work, in whole or in part, into one's own without adequate acknowledgement.

We have considered the problem of plagiarism, which is one of the most publicized forms of text reuse around us. The ultimate goal of this research is to devise an automatic plagiarism detector which can distinguish between derived and non-derived texts. Most techniques have concentrated on finding unlikely structural patterns or vocabulary overlap between texts, finding texts from large collections, and collaboration between texts. Some plagiarists are very clever: while copying data they make complex edits, so that even sophisticated methods of analysis are unable to detect the plagiarism. The main procedures for text analysis are semantic, parsing, structural or discourse, and morphological analysis. The dominant methods which can advance research in plagiarism detection are lexical overlap, syntax, semantics and structural features.

…cut-and-paste with minor or major changes from Web-based sources, and identifying the same content when paraphrased. This is reflected in the growth of online services such as Turnitin and plagiarism.org. Services to track and monitor commercial content have also gained popularity as the media report more cases of stolen digital content (e.g. contentguard.com). A few researchers exploit distinctive markers, such as paraphrased comments and misspelled identifiers, to detect plagiarism easily. The process of automatic plagiarism detection involves finding quantifiable discriminators which can be used to measure the similarity between texts. The complete design process rests on a set of assumptions, including that the text assumed to be the source is plagiarism-free or original, and that greater similarity means a greater chance of possible plagiarism. So far the focus has been on lexical and structural similarity in both program code and natural language, but these become less effective when the degree of rewriting or the form of plagiarism becomes more complex. The process of automatic plagiarism detection includes finding suitable discriminators of plagiarism which can be quantified, finding suitable measures of similarity, and developing suitable methods to compare those discriminators.

The goal of an automatic plagiarism detection system is to assist manual detection by reducing the amount of time spent comparing texts, making comparison between large numbers of texts feasible, and finding possible source texts from the electronic resources available to the system. The system must minimize the number of texts incorrectly classed as plagiarized and those incorrectly classed as non-plagiarized, and maximize the number of true positives.
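The minimization of false classifications described above corresponds to the standard true/false positive counts. As a rough, hypothetical illustration (the function and variable names are ours, not the paper's), such an evaluation could be computed as:

```python
# Illustrative sketch only: scoring a plagiarism classifier against labelled data.
# "evaluate" and its inputs are our own names, not part of the paper's system.

def evaluate(predicted, actual):
    """Count true/false positives and negatives for binary plagiarism labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

print(evaluate([True, True, False, False], [True, False, True, False]))
```

A system that "minimizes incorrect classes" is one that drives `fp` and `fn` toward zero while keeping `tp` high.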
2.2 Evaluation of the Existing Models and Methodology

Evaluation plays an important role in the design process for a plagiarism detection system. Evaluating the existing models or commercial tools helps in determining their performance and the underlying process they follow. It also helps in designing a new model, by improving existing algorithms on the basis of empirical analysis through comparison of performance, the number of valid links obtained, and so on.

Two major properties of natural language are ambiguity and unconstrained vocabulary. For example, words may be replaced by their synonyms, but because word senses are ambiguous, selection of the correct term is often non-trivial. The flexibility and complexity of natural language has driven researchers in many language engineering tasks, not just plagiarism detection, to apply simpler methods of similarity involving a minimal amount of natural language processing. As with detection between different programs, various methods of comparing texts have been investigated, as well as ways of defining suitable discriminators. Natural language plagiarism detection is very difficult and involves Web-based sources which need to be investigated using methods of copy detection. Methods of detection originating from file comparison, information retrieval, authorship attribution, compression and copy detection have all been applied to the problem of plagiarism detection. Plagiarism detection across multiple texts involves finding similarities which are more than coincidence and more likely to be the result of copying or collaboration. The second stage is then to find possible source texts, using tools such as web search engines for unknown on-line sources, or manual search for known non-digital material.

The typical discriminator factors that indicate an act of plagiarism in a text document include the following:

1. If the text stems from a known source and is not cited properly, this is an obvious case of plagiarism.
2. Use of advanced or technical vocabulary beyond that expected of the writer.
3. If the references in documents overlap significantly, the bibliography and other parts may be copied. Dangling references (e.g. a reference appears in the text but not in the bibliography) and a changing citation style may be signs of plagiarism.
4. Use of inconsistent referencing in the bibliography, suggesting cut-and-paste.
5. Incoherent text, where the flow is not consistent or smooth, which may signal that a passage has been cut-and-pasted from an existing electronic source.
6. Inconsistencies within the written text itself, e.g. changes in vocabulary, style or quality.
7. A large improvement in writing style compared to previously submitted work.
8. Shared spelling mistakes or errors between texts.

2.3 Proposed Prototype Model

This model is based on a comprehensive approach to detecting plagiarism. We analyzed and studied the various plagiarism detectors available and performed intensive practical testing of these systems to understand where they fall short and where they excel. On the basis of our analysis we collected all the requirements that we need to put into the form of an advanced detection system.

2.4 Experiment and Comparisons

While studying and designing the architecture of the plagiarism detection mechanism, we found that parsing, or document analysis, is one of the critical areas upon which the performance of the system depends. We therefore focused on the type of document analysis, the details of the searching module, and the report generation section, which is sufficient for the formal verification of a prototype and for the experimental verification of our approach. The prototype that we are
going to discuss is helpful in carrying out a set of experiments on various sets of documents.

3. CHALLENGES AND LANGUAGE RELATED ISSUES

Identical sentences are easily modified to look different. This is a very common phenomenon in any research and teaching community. According to www.thefreedictionary.com, paraphrasing is defined as "a restatement of a text or passage in another form or other words, often to clarify meaning; the restatement of texts in other words as a studying or teaching device". Paraphrasing leads to serious cases of plagiarism. General guidelines issued by various research groups, universities and educational institutes are as follows:

a. If the author has taken any help from sources such as the Internet, books or newspapers, he/she needs to put those phrases in quotes and add a reference number, thereby acknowledging the actual work of the original author.
b. Paraphrasing and writing in your own words must be clearly differentiated: paraphrasing is a reformation of the text, whereas writing in your own words means that you take help only to understand the meaning of the phrase and then use that idea in your own work. Even when using your own words you need to acknowledge the author whose work you consulted, as it often happens that authors blame one another and raise questions about who plagiarized whose work.
c. Acknowledgement helps the reader of your work to know the source from which you have taken help, which in turn helps him to better understand the whole idea of your work.

Hence the overall idea is: always reference the work from which you have taken help, so as to avoid ambiguity about the actual source of the work done. Do not paraphrase at any cost, as this will spoil your image and your work. Everyone should learn to write in their own way. "Stylometry is the application of the study of linguistic style, usually to written language, and is often used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications." Imitating someone's work is not productive, as understanding and learning are more important [2].

3.1 Lack of Exact Definition

There is no objective, quantifiable standard for what is and what is not plagiarism. Plagiarism lacks an exact definition, as we still do not have a precise boundary for the word. Many research groups have contributed to this question, but we are still struggling to achieve a precise definition. According to S.E. Van Bramer [5] of Widener University, plagiarism can be classified according to the study level concerned. For undergraduate students, plagiarism means straightforwardly copying the data available from various web sources, research papers, books and so on, while they are expected only to take conceptual help from these and write in their own way; this does not mean, however, that such cases should be taken for granted. Postgraduates are expected to put forward new ideas based on their research work, but a problem arises when independent creations of an idea or concept coincide, an option that cannot be ruled out. Various journals claim that they do not publish papers containing even a small percentage of plagiarism, but they fail to give this claim an accurate boundary. Some require that a newly submitted paper contain at least 40% new ideas.

3.2 Lack of Knowledge

Computers are not yet able to understand natural language and interpret its meaning; various research groups are still writing parser applications to make computers more intelligent, in an effort to make systems independent decision makers, but this is still a vision. Natural language processing has reached a new pinnacle, with various language parsers, efficient grammar checkers and efficient text-parsing algorithms, but we are still far away from the automatic language processor [1].
Figure 4: Stanford Copy Analysis Program Architecture

The Stanford Copy Analysis program is based on the principle of word chunking, where chunking is performed on the basis of word end points: on detecting white space, the parser assumes the preceding characters to be one word. After collecting all these words, the word chunks are arranged and stored in the database. Besides chunking, semantic analysis is performed to eliminate words which are not useful from the searching point of view. The figure above shows a document having a, b, c as its phrase or paragraph, which after chunking produces a, b, c; these are stored in the database, which in turn performs the redundancy check. After that the query optimizer takes charge and the same processing occurs as in COPS. The type of chunking performed largely affects the accuracy of the system.

5. PROPOSED MODEL

This model is based on a comprehensive approach to detecting plagiarism. As noted in Section 2.3, we analyzed, studied and intensively tested the available plagiarism detectors to understand where they lack and where they are ahead, and on the basis of that analysis we collected the requirements to be put into the form of an advanced detection system. The various requirements were:

1. The ideal system should detect possible phrases of plagiarism in a shorter time span.
2. The ideal system gives the user an option for the type of analysis he/she wants.
3. Besides performing intensive searching, it first gives the user an overview of the possible plagiarism percentage.
4. After performing various sets of analysis on present systems, it was revealed that we need a system that does not fail after a certain number of word lengths.

Our plagiarism detection mechanism takes the document architecture into account in order to develop efficient plagiarism detection. The success of any mechanism is based on the following factors:

1. Document Analysis Unit: the implementation of the document analysis unit, since if the document analysis unit is really efficient then the whole model will be productive [8].
2. Relevance Factor Calculation: the relevance of the source taken into consideration while drawing conclusions about a plagiarized document, since it often happens that the source chosen for analysis is itself a plagiarized source. For example, information regarding Yahoo Search API development is present on www.developer.yahoo.com, but the same information may be available on some web blog; in this case the web blog is a plagiarized source and www.developer.yahoo.com is the actual source.
3. Efficiency Factor: the efficiency and relevance of the report generation phase, as this section manipulates the entire set of results produced after the Internet search. The report generation algorithm helps to deduce the actual link from which the whole document has been prepared.

5.1 Document Analysis Unit

To develop any plagiarism detection mechanism you need to implement an effective document parsing system. Parsing system design is crucial from the performance point of view, as the results displayed for the searches you perform depend entirely on the effectiveness of the text extraction and text synthesis modules. The architectural model can be described as follows: input in the form of a text document is fed to the parser unit, which performs the parsing in three different forms, namely sentence chunking, word chunking and random chunking. Unitization may be defined as "the process of the configuration of smaller units of information into large coordinated units". Unitization may also be interpreted as chunking. Hence unitization is the process of converting the whole document under consideration into smaller chunks or units, where a chunk or unit represents a word or a sentence. Natural language processing is a field interrelated with artificial intelligence; it is of great importance here because detecting plagiarized phrases becomes easier after semantic analysis of the document. Various artificial intelligence principles make it easier to carry out text analysis and parser generation for the given document.

The detection is performed in two phases. First, the parser processes the input collection file by file and generates a set of parsed files. Second, the detection model checks the parsed files for similarity. The designer needs to perform various sets of experiments, including parsing, tokenizing and preprocessing. Such models have a major drawback: the parser destroys the initial word order in every sentence of the input text, and therefore the plagiarism detection system cannot precisely highlight similar text blocks in the original file pairs. There are two ways of representing plagiarism: either the system is programmed to highlight whole plagiarized sentences, with the parser generating some metadata about the parsed files, or it can represent the word chains constituting the copied data.

5.1.1 Simple Examination Parser

If no document base is given, a candidate document base has to be constructed using search interfaces for sources like the Web, digital libraries, homework archives, etc. Standard search interfaces are topic-driven, i.e.
they require the specification of keywords to deliver documents. Here keyword extraction algorithms come into play, since extracted keywords serve as input for a query generation algorithm. Most keyword extraction algorithms are designed to automatically annotate documents with characteristic terms. Extracted keywords shall not only summarize and categorize documents, but also discriminate them from other documents. Automatically identifying such terms is an ambitious goal, and several approaches have been developed, each of which fulfills one or more of these demands. Existing keyword extraction algorithms can be classified by the following properties: …

…unit which in turn performs the search operation. Finally the output is fed to the manipulation chamber.

5.1.1.2 Modified Sentence Unitization

The case we have discussed so far has certain flaws; for example, the sentence breaking process is entirely dependent upon the occurrences of conjunctions in the text. For optimum results the sentence length needs to lie in a threshold range, where the threshold value is the region marking a boundary. The worst cases that appear in this scenario are:
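The conjunction-driven splitting with a length threshold described above can be sketched as follows. This is only an illustration: the conjunction list, the threshold bounds and all names are our own assumptions, not the paper's implementation.

```python
# Hedged sketch of conjunction-based sentence unitization with a length threshold.
# CONJUNCTIONS and the MIN/MAX bounds are illustrative assumptions.

CONJUNCTIONS = {"and", "but", "or", "because", "while", "although"}
MIN_LEN, MAX_LEN = 4, 12   # assumed threshold range, in words

def sentence_units(text):
    """Split text at conjunctions; flag each unit as inside/outside the threshold."""
    units, current = [], []
    for word in text.split():
        if word.lower().strip(",;") in CONJUNCTIONS and current:
            units.append(current)      # conjunction closes the current unit
            current = []
        else:
            current.append(word)
    if current:
        units.append(current)
    # Units outside the threshold are the "worst cases": too short or too long to search well.
    return [(" ".join(u), MIN_LEN <= len(u) <= MAX_LEN) for u in units]

for unit, within_threshold in sentence_units(
        "plagiarism detection is hard because texts are paraphrased "
        "and detectors rely on exact overlap"):
    print(within_threshold, unit)
```

A unit flagged `False` would be a candidate worst case: it depends entirely on where conjunctions happen to fall.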
…in the paragraphs, rewording, paraphrasing etc. After all these operations, our analysis has shown that this process of hashing has given fruitful results.

Hashed breakpoint chunking is an application of hashing, but in a different way [14]. Here we prepare a list of hash values corresponding to each word. After this we need the words to be arranged in a particular way, and for this we define a parameter n that helps in determining the end of a phrase chunk. The parameter n is chosen such that in a sequence of 8-9 word chunks, the fourth chunk becomes completely divisible by n, since we need at least three chunks for performing the searching operation. Through experiments we found that the value of n depends upon the language features, the field of education to which the paper belongs, the size of the document, and so on. Figure 7 shows the behavior of both hashed and hashed breakpoint chunking with respect to the document size, the value of the parameter n selected, and the type of document chosen for experimentation.

5.1.2.4 Overlapped Hashed Breakpoint Unitization/Chunking

This is modified hashed breakpoint unitization in which we choose the value of the parameter n more than once, i.e. after deciding the value of n for the first operation we increase it by some factor, and this is repeated after every operation [13]. For example:

Step 1: Prepare a hash table for all the words in the document.
Step 2: Choose a suitable value for the parameter n.
Step 3: Start dividing the words by n.
Step 4: Find the words that are completely divisible.

Figure 7: Graph of percentage similarity versus document size for hashed chunking and hashed breakpoint chunking.
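The divisibility idea behind hashed breakpoint chunking can be sketched as follows. The hash function here is a deliberately simple stand-in chosen for illustration; the paper does not specify its hash function, and the value of n would in practice be tuned as described above.

```python
# Illustrative sketch of hashed breakpoint chunking: a word whose hash value is
# divisible by the parameter n closes the current chunk. word_hash is a toy
# stand-in, not the paper's hash function.

def word_hash(word):
    """Deterministic toy hash: sum of character codes of the lowercased word."""
    return sum(ord(c) for c in word.lower())

def hashed_breakpoint_chunks(text, n=7):
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if word_hash(word) % n == 0:      # breakpoint: hash divisible by n
            chunks.append(" ".join(current))
            current = []
    if current:                            # trailing words form the last chunk
        chunks.append(" ".join(current))
    return chunks

print(hashed_breakpoint_chunks("natural language text is easy to reuse", n=7))
```

The overlapped variant of Section 5.1.2.4 would rerun this with n increased by some factor after each pass, so that chunk boundaries from successive passes overlap.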
This is based on a special kind of chunking that is not exactly sentence or word chunking. Here we basically look for random chunks of sentences, which can be obtained in numerous ways, such as selecting the first and last lines of all the paragraphs and chunking them, or selecting only those sentences that have a length …

In order to assess the performance of the proposed approach, we have carried out an extensive set of experiments on test data obtained from a set of documents from different education fields. Search results differ from one search engine to another due to different crawlers and indexing algorithms; for example, Google search results for a query are
different from those of Yahoo Search. During the experiment sessions we realized that plagiarism detectors show abrupt changes when the size of the document increases. This section mainly focuses on our observations while designing the plagiarism detection mechanism. This was certainly one of the major postulates of our research work. We tried to obtain empirical results from all the experiments that we carried out. The experimental framework is largely based on the proposed methods of document analysis, and in turn relies on the comparison of those methods.

6.1 Observation 1

We used approximately 80 documents from various fields such as medicine, technical and research papers, university web pages, educational assignments, white papers on new technologies, work-specific assignments, etc. According to our empirical analysis, random analysis has the least accuracy, but it is quick. Advanced/detailed parsing has the highest average accuracy, as its worst and best cases are nearly the same. Simple parsing takes less time than advanced parsing but more than random analysis; simple parsing is the default analysis in our detection mechanism. Figure 9 represents the graph of word length versus the number of valid links returned when words are searched on the Internet. The graph has been plotted by repeating the experiment while varying the word document size. The study shows that as the size of the document increases, the performance factor reduces.

6.2 Observation 2

Figure 9: a) Graph of accuracy versus incremental value. b) Incremental values calculated for different fields.

The graph is plotted between the accuracy level and the incremental value. Our studies show that if the incremental value of a link is high, its accuracy will also be highest. These results were prepared after intensive testing on 100 different documents from approximately 15 different fields. We have also derived an empirical formula for calculating incremental values on the basis of the field taken into consideration.

Figure 10: Table displaying the number of test documents per field.

Educational Assignments: 21
University Departments: 12
Research Papers: 12
Advertisements: 7
Total: 100

6.3.1 Selection Chamber

This unit operates on the links returned by the searching chamber and stored in our database. It creates a binary search tree using all the links stored in the database. After creating the binary search tree, the selection unit calculates link repetition by removing duplicate links from the total set of links and increasing their counts. The link with the highest increment value will be the actual link from which the document has been prepared.

6.3.2 Percentage Plagiarism Calculator

This takes the output of the selection unit as input and prepares the final output that is shown to the user through our user interface section. The various techniques used for the plagiarism calculation can be …
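The two units just described can be sketched as follows. This is a sketch under our own assumptions: the paper builds a binary search tree over the links, whereas a plain counting map is used here as a simpler stand-in with the same effect, and all names are illustrative.

```python
# Rough sketch of the Selection Chamber and the Percentage Plagiarism Calculator.
# A Counter stands in for the paper's binary search tree; the effect (counting
# link repetitions and picking the most repeated link) is the same.

from collections import Counter

def select_source(links):
    """Return (most_repeated_link, increment_value) from the search results."""
    counts = Counter(links)                  # duplicates collapse into counts
    link, count = counts.most_common(1)[0]   # highest increment value wins
    return link, count

def percentage_plagiarism(matched_chunks, total_chunks):
    """Share of document chunks for which a matching source was found."""
    return 100.0 * matched_chunks / total_chunks if total_chunks else 0.0

links = ["a.com", "b.com", "a.com", "c.com", "a.com"]
print(select_source(links))        # the link with the highest repetition
print(percentage_plagiarism(3, 10))
```

The selected link would then be reported as the probable source, and the percentage passed on to the report generation module.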
…and percentage database size required. We found that simple sentence chunking has the highest speed, comparable to that of hashed breakpoint chunking. We also calculated the database size required, where we found that hashed breakpoint chunking requires the largest database. Figure 11 shows the relevance of our results.

Figure 11: Graph of percentage chunking speed and percentage database size required versus the various document analysis techniques.

Observation 5

Our foremost objective was to find the plagiarism in the test documents. We put forward various document analysis methods, and to decide the best among them we performed intensive testing, calculating the percentage reliability of finding plagiarism in large documents. Our results have shown that overlapping word chunking has the highest percentage reliability of finding plagiarism. We also compared the three types of document analysis, i.e. simple, detailed and random analysis. This comparison was
done to find the accuracy versus the number of links returned, taking specific documents under the test environment. Figure 12 shows the graphs of percentage reliability for finding plagiarism and the comparison of the various document analysis techniques. Our study is empirical, but helpful enough for studying the response of the plagiarism detection mechanism and of the functional units which constitute a plagiarism detector.

Figure 12: Graph of percentage reliability for finding plagiarism versus the various document analysis mechanisms, and percentage accuracy versus the various document analysis techniques.
…after submitting the document through the Internet, the web server performs the user access authorization and the document is imported with the desired import configuration settings. The third portion is the parser/search server, which consists of the parsing engine and the search engine. The parser engine performs the document analysis depending upon the analysis mechanism selected by the user under the configuration settings. After parsing, the chunks/sentences are stored in the database, where indexing is performed. The indexed chunks are fed to the search engine, where we already use the search APIs of prominent search engines such as Google Search and Yahoo Search through the medium of web services. Finally, the web links returned for the searched chunks are all fed to the manipulation chamber located in the web server, where the final plagiarism percentage is calculated; the results are then fed to the report generation module, which displays the final plagiarism percentage.

Figure 13: Block diagram of the business model for the plagiarism detection system.
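The flow through the block diagram (parse into chunks, search each chunk, compute the percentage in the manipulation chamber) might be sketched end to end as below. This is purely illustrative: `search_web` is a stub standing in for the real search APIs, the in-memory `corpus` replaces the Web, and the chunk size is chosen arbitrarily.

```python
# End-to-end sketch of the pipeline in the block diagram above.
# search_web is a stub; a real system would call a search API (e.g. Google,
# Yahoo) through web services, as described in the text.

def parse(document, size=5):
    """Word-chunk the document into fixed-size units (size is an assumption)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def search_web(chunk, index):
    """Stub search server: return links whose pages contain the chunk verbatim."""
    return [url for url, page in index.items() if chunk in page]

def manipulation_chamber(document, index):
    """Fraction of chunks with at least one matching link, as a percentage."""
    chunks = parse(document)
    hits = [c for c in chunks if search_web(c, index)]
    return 100.0 * len(hits) / len(chunks) if chunks else 0.0

corpus = {"src.example": "alpha beta gamma delta epsilon zeta eta theta iota kappa"}
doc = "alpha beta gamma delta epsilon something entirely new and different"
print(manipulation_chamber(doc, corpus))
```

The result of `manipulation_chamber` is what the report generation module would display as the final plagiarism percentage.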
…and copied as it is; hence no tool to date can detect it. The requirement of the time is to create much stronger Cross-Language Information Retrieval (CLIR) and multilingual copy detection systems. This is very difficult to attain because of the different orthographic structures of different languages.

Size of the text collection chosen: To date, no standard collection of text for plagiarism detection in natural languages exists, thereby making comparison between the various approaches impossible unless the same text is used. Many different areas of language analysis, such as document analysis, information retrieval, summarization and authorship analysis, have been used in the detection process, but researchers have still not been able to agree on a common size for the text that produces the most comparable results. This is important, as it would enable communities to compare different approaches and would help to stimulate research in automatic plagiarism detection.

Detecting plagiarism in a single text: Detecting plagiarism within a single text is one of the hardest problems in the area of language engineering. To date we prefer manual inspection, and the tools available in the market cannot replace manual checking in any way. Approaches such as authorship attribution and style analysis are not able to handle this problem, and many inconsistencies exist in detecting a possible source text from only one- or two-word overlaps.

9. CONCLUSION

We believe that our plagiarism detection mechanism is more efficient than the already available plagiarism detectors. The reason lies in our model: it supports three types of analysis, namely detailed, simple and advanced analysis. Our detector is a complete package in itself, as it has both sentence and word chunking implemented. Our system still needs improvements; for instance, in our analysis we did not take any noise ratio into consideration, and after handling the noise issue the system could be made more precise. Plagiarism is increasing by leaps and bounds, and we need more comprehensive study to combat this disease. We hope our effort will help the research community to develop tools even more advanced than our system.

10. REFERENCES

[1] Council of Writing Program Administrators: Defining and Avoiding Plagiarism: The WPA Statement on Best Practices. <http://www.wpacouncil.org/node/9>
[2] Virtual Salt: Anti-Plagiarism Strategies for Research Papers. <http://www.virtualsalt.com/antiplag.htm>
[3] N. Shivakumar and H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for Digital Documents", Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries (DL'95), Austin, Texas, June 1995. <http://stanford.edu/pub/papers/scam.ps>
[4] S. Brin, J. Davis and H. Garcia-Molina, "Copy Detection Mechanisms for Digital Documents". <http://dbpubs.stanford.edu/pub/1995-43>
[5] How to Avoid Plagiarism in a Research Paper, by eHow Education Editor. <http://www.ehow.com/how_9265_avoid-plagiarism-research.html>
[6] Plagiarism Detection and Document Chunking Methods. <http://www2003.org/cdrom/papers/poster/p186/p186-Pataki.html>
[7] A Decision Tree Approach to Sentence Chunking. <http://www.springerlink.com/default.mpx>
[8] Sentence Boundary Detection (Chunking), by Gary Cramblitt.
<http://lists.freedesktop.org/archives/accessibility/2004-December/000006.html>
[9] Natural Language Processing – IJCNLP: First International Joint Conference on Natural Language Processing, IJCNLP 2004, Hainan Island, China, March 2004. <http://www.springer.com/computer/artificial/book/978-3-540-24475-2>
[10] Plagiarism Finder: Tool for Detecting Plagiarism. <http://plagiarism-finder.mediaphor-ag.qarchive.org/>
[11] The Stanford Natural Language Processing Group. <http://nlp.stanford.edu/software/tagger.shtml>
[12] Digital Assessment Suite. <http://turnitin.com/static/home.html>
[13] Steven Bird, Ewan Klein and Edward Loper, "Chunk Parsing". <http://research.microsoft.com/india/nlpsummerschool/data/files/stevenbird%20-%20chunking.pdf>
[14] Yoshimasa Tsuruoka and Jun'ichi Tsujii, "Chunk Parsing Revisited". <http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/IWPT05-tsuruoka.pdf>
[15] Miriam Butt, "Chunk/Shallow Parsing". <http://ling.uni-konstanz.de/pages/home/butt/teaching/chunk.pdf>