CHAPTER 1
PREAMBLE
1.1 Introduction
Many approaches have been successfully developed to detect online spam. Fangtao Li
and colleagues [1] initially analyzed several attributes related to spam behavior, such as content,
sentiment, product, and metadata features, and exploited a two-view semi-supervised method to
identify spam reviews. Song Feng and colleagues [2] defined three types of reviewers (any-time,
multi-time, and single-time reviewers) and statistically characterized the distributional footprints
of deceptive reviews by using natural language processing (NLP) techniques. Geli Fei and
colleagues [3] proposed a model to detect spammed products or product groups by comparing the
differences in rating behaviors between suspicious and normal users.
All these models rely on content features, which spammers can easily evade by inserting special
characters; other features, such as temporal and network information, have been employed as
well. Qian Xu and colleagues [4] collected large-scale real-world datasets from telecommunication
service providers and combined temporal and user network information to classify Short
Message Service (SMS) spammers. Sihong Xie and colleagues [5] proposed a model that uses
only temporal features, with no semantic or rating behavior analysis, to detect abnormal bursts
as the number of reviews increases. Finally, Tyler Moore and colleagues [6] studied the problem of
temporal correlations between spam and phishing websites.
Intuitively, these works can also be used to uncover sophisticated spam strategies.
Amazon has sued more than 1,000 product review sellers who sell fake promotions on
Fiverr.com (one of the most famous being Spam Reviewer Cloud;
http://money.cnn.com/2015/10/18/technology/amazon-lawsuit-fake-reviews). On such user
cloud platforms, business owners can purchase anonymous comments generated by real users by
paying for them. This makes spam detection very challenging, as the advent of a massive number
of apparently genuine fake reviewers (which we refer to as “genuine fakes” in this article) makes
the fraud pattern much more nebulous to track.
To date, many third-party platforms have created various fake review markets for online
product sellers and fake review providers. In real-world business processes, massive numbers of
random but genuine fake review providers conduct real transactions and write positive
comments to claim a bonus (many e-commerce websites think they can reduce spam reviews by
allowing only real buyers to write them). Existing research ignores the latent connections in
product networks, which are difficult to discover, especially when these spam activities have
become a hyping and advertising investment that has gained increased popularity among
homogeneous competitors online. Thus, antispam rules can be easily avoided, which also
impairs the efficiency and effectiveness of detection. In this work, we coin a new
solution, collaborative marketing hyping detection, that aims to detect groups of online stores
that simultaneously adopt marketing hyping. [1]
A new solution aims to identify spam comments and detect products that adopt an evolving
spam strategy for promotion. Specifically, an unsupervised learning model combines
heterogeneous product review networks to discover collective hyping activities.
Traditional features such as semantic clues or user relations might no longer be suitable
for discovering fraud due to rapidly evolving spam strategies. Hence, we need to choose
dedicated features according to our specific scenario.
Disadvantages:
- Detection accuracy decreases when only the user name information is used.
- Stores usually purchase fake reviews periodically.
Advantages:
- Detects spam that has gained increased popularity among homogeneous competitors online.
- Improves the efficiency and effectiveness of detection.
- Aims to detect groups of online stores that simultaneously adopt marketing hyping.
CHAPTER 2
LITERATURE SURVEY
Learning to Identify Review Spam by Fangtao Li, Minlie Huang, Yi Yang and Xiaoyan
Zhu [1]
In this paper, we study the review spam identification task in our product review mining
system. We manually build a review spam collection based on our crawled reviews. We first
employ supervised learning methods and analyze the effect of different features on review spam
identification. We also observe that spammers consistently write spam. This provides us with
another view to identify review spam: we can check whether the author of a review is a spammer.
Based on this observation, we propose a two-view semi-supervised method to exploit the large
amount of unlabeled data. The experimental results show that the two-view co-training algorithm
achieves better results than the single-view algorithm. Our designed machine learning
methods achieve significant improvements as compared with the heuristic baselines.
Exploiting Burstiness in Reviews for Review Spammer Detection by Geli Fei, Arjun
Mukherjee & Bing Liu [3]
In this paper, we proposed to exploit bursts in detecting opinion spammers due to the
similar nature of reviewers in a burst. A graph propagation method for identifying spammers
was presented. A novel evaluation method based on supervised learning was also described to
deal with the difficult problem of evaluation without ground truth data, which classifies reviews
based on a different set of features from those used in identifying spammers. Our
experimental results using Amazon.com reviews from the software domain showed that the
proposed method is effective; its effectiveness was demonstrated both objectively, based
on supervised learning (classification), and subjectively, based on human expert
evaluation. The fact that the supervised learning/classification results are consistent with human
judgment also indicates that the proposed supervised-learning-based evaluation technique is
justified.
SMS Spam Detection Using Non-Content Features by Qian Xu, Evan Wei Xiang
and Qiang Yang [4]
In this paper, we have examined mobile-phone SMS message features from static,
network, and temporal views, and proposed an effective way to identify important features that
can be used to construct an anti-spam algorithm. We exploited temporal analysis to design
features that can detect SMS spammers with high performance, and incorporated these
features into an SVM classification algorithm. Our evaluation on a real SMS dataset showed that
the temporal features and network features can be effectively incorporated to build an SVM
classifier, with a gain of around 8% improvement in AUC as compared with classifiers
based only on conventional static features.
Temporal Correlations between Spam and Phishing Websites by Tyler Moore,
Richard Clayton & Henry Stern [6]
Empirical study of malicious online activity is hard. Attackers remain elusive,
compromises happen fast, and strategies change frequently; none of these factors
can be changed. In this paper, we have combined phishing website lifetimes with detailed
spam data, and consequently we have provided several new insights. First, we have
demonstrated the gravity of the threat posed by attackers using fast-flux techniques: they send
out 68% of spam while hosting only 3% of all phishing websites. They also transmit spam
effectively: the bulk is sent out early, it stops once the site is removed, and it continues
whenever websites are overlooked by the take-down companies. In this respect, we also
conclude that long-lived phishing websites continue to cause harm and should be taken down.
A Shapelet Transform for Time Series Classification by Jason Lines, Luke M. Davis &
Anthony Bagnall [8]
In this paper, we have proposed a shapelet transform for TSC that extracts the k best
shapelets from a dataset in a single pass. We implement this using a novel caching algorithm to
store shapelets, and apply a simple, parameter-free cross-validation approach for extracting the
most significant shapelets. We transform a total of 26 data sets with our filter and demonstrate
that a C4.5 decision tree classifier trained with transformed data is competitive with an
implementation of the original shapelet decision tree. We show that our filtered data can be
applied to further, non-tree based classifiers to achieve improved classification performance,
whilst still maintaining the interpretability of shapelets. We provide two implementations of the
filter using different quality measures for discriminating between shapelets; we use information
gain, as originally proposed, in the first, and introduce the application of the F-statistic as an
evaluation method for shapelets in the second. We show that classifiers trained using features
derived from an F-statistic filter are competitive with classifiers trained with the information
gain approach, whilst being easier to apply to multi-class classification problems. Finally, we
provide exploratory data analysis of the shapelets extracted by our filter on the Gun/NoGun
problem and compare them with the output of [20]. We show that the shapelets we find are
consistent with the discriminatory shapelet in the original work, and show that our approach can
lead to further insight into the problem by looking at a number of the top shapelets.
CHAPTER 3
SYSTEM DESIGN
The purpose of design is to plan a solution to the problem specified by the
requirements document. This phase is the first step in moving from the problem domain to the
solution domain. In other words, starting with what is required, design takes us toward how to
fulfill those needs. The design of the system is perhaps the most critical factor affecting the
quality of the software and has a major impact on the later phases, especially testing and
maintenance.
System design describes all the major data structures, file formats, outputs, and the major
modules in the system, and decides their specifications.
[Figure: Store owners pay fake reviewers a purchasing cost ($) through the user cloud.]
Use case diagram shows the various interactions of actors with a system. Use case is a
coherent piece of functionality that a system can provide by interacting with actors. Actors are
the external end users of the system.
[Use case diagram: the user registers, logs in, browses products, buys products, and gives
reviews; the system detects spam and displays spammers.]
[Data flow diagram: the user sends a search request to the website, which returns a list of
products and fetches product details; the user buys products and reviews them. The system
gets the total reviews, extracts a feature set from the reviews using NLP, performs spam
detection, and blocks spam users. The overall flow covers browsing products, buying products,
giving feedback, and processing reviews.]
CHAPTER 4
SYSTEM REQUIREMENT SPECIFICATION
A System Requirement Specification (SRS) is a central document that forms the foundation of
the software development process. It lists the requirements of a system and also contains a
description of its major features. An SRS is essentially an organization's understanding (in
writing) of a customer or potential client's system requirements and dependencies at a particular
point in time, usually before any actual design or development work. It is a two-way
insurance policy that assures that both the client and the organization understand
each other's requirements from that perspective at a given point in time.
Writing a software requirement specification reduces development effort, as careful
review of the document can reveal omissions, misunderstandings, and inconsistencies early
in the development cycle, when these problems are easier to correct. The SRS discusses
the product but not the project that develops it; consequently, the SRS serves as a basis
for later enhancement of the finished product.
The SRS may need to be changed, but it does provide a foundation for continued
product evaluation. In simple words, the software requirement specification is the
starting point of the software development activity. The SRS translates the ideas in
the minds of the clients (the input) into a formal document (the output of the requirements
phase). Hence, the output of this phase is a set of formally specified requirements, which
ideally are complete and consistent, while the input has none of these properties.
The most common set of requirements defined by any operating system or software
application is the physical computer resources, also known as hardware. A hardware
requirements list is often accompanied by a hardware compatibility list (HCL), especially
in the case of operating systems. The requirements are as follows:
Memory : 4GB
Database : MySQL
Architecturally, JSP may be viewed as a high-level abstraction of Java servlets. JSPs are
translated into servlets at runtime; each JSP's servlet is cached and re-used until the original JSP
is modified.
JSP allows Java code and certain pre-defined actions to be interleaved with static web
markup content, with the resulting page being compiled and executed on the server to deliver a
document. The compiled pages, as well as any dependent Java libraries, use Java bytecode rather
than a native software format. Like any other Java program, they must be executed within a Java
virtual machine (JVM) that integrates with the server's host operating system to provide an
abstract platform-neutral environment.
JSPs are usually used to deliver HTML and XML documents, but through the use of
an OutputStream, they can deliver other types of data as well. The web container creates JSP
implicit objects such as pageContext, application (the ServletContext), session, request, and response.
[Figure: JSP model architecture: the web browser sends requests to the server, which
instantiates JavaBeans (the model) that access data sources/databases.]
A JavaServer Pages compiler is a program that parses JSPs, and transforms them into
executable Java Servlets. A program of this type is usually embedded into the application
server and run automatically the first time a JSP is accessed, but pages may also be recompiled
for better performance, or compiled as a part of the build process to test for errors. Some JSP
containers support configuring how often the container checks JSP file timestamps to see
whether the page has changed. Typically, this timestamp would be set to a short interval
(perhaps seconds) during software development, and a longer interval (perhaps minutes, or even
never) for a deployed Web application.
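The timestamp check described above can be sketched in plain Java. This is an illustrative stand-in with hypothetical names, not the implementation any real container uses:

```java
// Sketch of a container's "should this JSP be retranslated?" decision, driven by
// the JSP file's modification timestamp and a configurable check interval.
public class JspRecompileCheck {
    private final long checkIntervalMillis; // how often to re-check timestamps
    private long lastCheckMillis = 0;
    private long lastSeenJspModified = 0;

    public JspRecompileCheck(long checkIntervalMillis) {
        this.checkIntervalMillis = checkIntervalMillis;
    }

    /** Returns true when the JSP source is newer than what was last compiled. */
    public boolean needsRecompile(long nowMillis, long jspModifiedMillis) {
        if (nowMillis - lastCheckMillis < checkIntervalMillis) {
            return false; // still within the interval: skip the filesystem check
        }
        lastCheckMillis = nowMillis;
        boolean changed = jspModifiedMillis > lastSeenJspModified;
        if (changed) {
            lastSeenJspModified = jspModifiedMillis;
        }
        return changed;
    }
}
```

With a short interval (development) the check runs on nearly every request; with a long interval (deployment) most requests skip it entirely, which is exactly the trade-off described above.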
Java Servlet
The servlet is a Java programming language class used to extend the capabilities of
a server. Although servlets can respond to any types of requests, they are commonly used to
extend the applications hosted by web servers, so they can be thought of as Java applets that run
on servers instead of in web browsers. These kinds of servlets are the Java counterpart to other
dynamic Web content technologies such as PHP and ASP.NET.
[Figure: a JSP page (.jsp) is translated by the JSP translator (Tomcat) into servlet source
code (Java) held in an in-memory text buffer, compiled by the embedded Java compiler into a
class (.class), and executed by the JRE inside the JSP container during the execution phase;
requests and responses flow between the system and the server.]
(a) Translation occurs at this point, if JSP has been changed or is new.
(b) If not, translation is skipped.
Fig. 4.2 Life of a JSP File
Department of CSE, TOCE 16
Collective Hyping Detection System To Identify Online Spam Activities Using AI 2018
Three methods are central to the life cycle of a servlet. These are init(), service(),
and destroy(). They are implemented by every servlet and are invoked at specific times by the
server.
During the initialization stage of the servlet life cycle, the web container initializes the
servlet instance by calling the init() method, passing an object implementing the
javax.servlet.ServletConfig interface. This configuration object allows the servlet to
access name-value initialization parameters from the web application.
After initialization, the servlet instance can service client requests. Each request is serviced
in its own separate thread. The web container calls the service() method of the servlet for
every request. The service() method determines the kind of request being made and
dispatches it to an appropriate method to handle the request. The developer of the servlet
must provide an implementation for these methods. If a request is made for a method that is
not implemented by the servlet, the method of the parent class is called, typically resulting in
an error being returned to the requester.
Finally, the web container calls the destroy() method that takes the servlet out of service.
The destroy() method, like init(), is called only once in the lifecycle of a servlet.
The servlet may also formulate an HTTP response for the client.
5. The servlet remains in the container's address space and is available to process any other
HTTP requests received from clients.
The service() method is called for each HTTP request.
6. The container may, at some point, decide to unload the servlet from its memory.
The algorithms by which this decision is made are specific to each container.
7. The container calls the servlet's destroy() method to relinquish any resources such as file
handles that are allocated for the servlet; important data may be saved to a persistent store.
8. The memory allocated for the servlet and its objects can then be garbage collected.
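The lifecycle above can be illustrated with a toy stand-in in plain Java. The class and method names below mimic, but are not, the real javax.servlet API; a real container manages threading, configuration objects, and unloading as described above:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of the servlet lifecycle: the "container" calls
// init() exactly once, service() once per request, and destroy() exactly once.
class ToyServlet {
    private final List<String> log = new ArrayList<>();

    public void init()              { log.add("init"); }
    public void service(String req) { log.add("service:" + req); }
    public void destroy()           { log.add("destroy"); }
    public List<String> getLog()    { return log; }
}

class ToyContainer {
    /** Drives one full lifecycle for a batch of requests. */
    public static List<String> run(ToyServlet servlet, List<String> requests) {
        servlet.init();                 // initialization stage: called once
        for (String req : requests) {
            servlet.service(req);       // one call per request (a real container
        }                               // services each request on its own thread)
        servlet.destroy();              // take the servlet out of service: called once
        return servlet.getLog();
    }
}
```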
MySQL
SQL was one of the first commercial languages for Edgar F. Codd's relational model, as
described in his influential 1970 paper, "A Relational Model of Data for Large Shared Data
Banks." Despite not entirely adhering to the relational model as described by Codd, it became
the most widely used database language.
NetBeans IDE
NetBeans IDE is the official IDE for Java 8. With its editors, code analyzers, and
converters, you can quickly and smoothly upgrade your applications to use new Java 8 language
constructs, such as lambdas, functional operations, and method references. Batch analyzers and
converters are provided to search through multiple applications at the same time, matching
patterns for conversion to new Java 8 language constructs. With its constantly improving Java
Editor, many rich features and an extensive range of tools, templates and samples, NetBeans
IDE sets the standard for developing with cutting edge technologies out of the box. An IDE is
much more than a text editor. The NetBeans editor indents lines, matches words and brackets,
and highlights source code syntactically and semantically. It also provides code templates, coding
tips, and refactoring tools. The editor supports many languages, from Java, C/C++, XML and
HTML, to PHP, Groovy, Javadoc, JavaScript and JSP. Because the editor is extensible, you can
plug in support for many other languages. Keeping a clear overview of large applications, with
thousands of folders and files, and millions of lines of code, is a daunting task. NetBeans IDE
provides different views of your data, from multiple project windows to helpful tools for setting
up your applications and managing them efficiently, letting you drill down into your data
quickly and easily, while giving you versioning tools via Subversion, Mercurial, and Git
integration out of the box. When new developers join your project, they can understand the
structure of your application because your code is well-organized.
Design GUIs for Java SE, HTML5, Java EE, PHP, C/C++, and Java ME applications
quickly and smoothly by using editors and drag-and-drop tools in the IDE. For Java SE
applications, the NetBeans GUI Builder automatically takes care of correct spacing and
alignment, while supporting in-place editing, as well. The GUI builder is so easy to use and
intuitive that it has been used to prototype GUIs live at customer presentations. The cost of
buggy code increases the longer it remains unfixed. NetBeans provides static analysis tools,
especially integration with the widely used FindBugs tool, for identifying and fixing common
problems in Java code. In addition, the NetBeans Debugger lets you place breakpoints in your
source code, add field watches, step through your code, and step into methods.
The NetBeans Profiler provides expert assistance for optimizing your application's speed and
memory usage, and makes it easier to build reliable and scalable Java SE, JavaFX and Java EE
applications. NetBeans IDE includes a visual debugger for Java SE applications, letting you
debug user interfaces without looking into source code. Take GUI snapshots of your applications
and click on user interface elements to jump back into the related source code.
Apache
The Apache HTTP Server is web server software notable for playing a key role in the
initial growth of the World Wide Web. In 2009 it became the first web server software to
surpass the 100 million web site milestone. Apache is developed and maintained by an open
community of developers under the auspices of the Apache Software Foundation. Since April
1996 Apache has been the most popular HTTP server software in use. As of November 2010
Apache served over 59.36% of all websites and over 66.56% of the first one million busiest
websites.
Navicat Premium
Navicat Premium is a multi-connection database administration tool that allows you to
connect to MySQL, MariaDB, SQL Server, SQLite, Oracle, and PostgreSQL databases
simultaneously within a single application, making administration of multiple kinds of
databases easy.
Navicat Premium combines the functions of other Navicat members and supports most of
the features in MySQL, MariaDB, SQL Server, SQLite, Oracle and PostgreSQL including
Stored Procedure, Event, Trigger, Function, View, etc.
Navicat Premium enables you to easily and quickly transfer data across various database
systems, or to a plain text file with the designated SQL format and encoding. Batch jobs for
different kinds of databases can also be scheduled and run at a specific time. Other features
include an Import/Export Wizard, Query Builder, Report Builder, Data Synchronization, Backup,
Job Scheduler, and more. Features in Navicat are sophisticated enough to provide professional
developers with everything they need, yet easy to learn for users who are new to database servers.
Establish a secure SSH session through SSH tunnelling in Navicat. You can enjoy
strong authentication and secure encrypted communications between two hosts. The
authentication method can use a password or a public/private key pair. Navicat also comes with
HTTP tunnelling for when your ISP does not allow direct connections to its database servers but
does allow HTTP connections. HTTP tunnelling is a method for connecting to a server
that uses the same protocol (http://) and the same port (port 80) as a web server does.
CHAPTER 5
IMPLEMENTATION
Java is a small, simple, safe, object-oriented, interpreted (or dynamically optimized),
byte-coded, architecture-neutral, garbage-collected, multithreaded programming language with
strongly typed exception handling for writing distributed and dynamically extensible programs.
With most programming languages, you either compile or interpret a program so you can
run it on your computer. The Java programming language is unusual in that a program is both
compiled and interpreted. The platform-independent byte codes are interpreted by the interpreter
on the Java platform. The interpreter parses and runs each Java byte code instruction on the
computer. Compilation happens just once; interpretation occurs each time the program is
executed. The following figure illustrates how this works. You can think of Java byte codes as
the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter,
whether it's a development tool or a web browser that can run applets, is an implementation of
the Java VM. We've already been introduced to the Java VM. It's the base for the Java platform
and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful
capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into
libraries of related classes and interfaces; these libraries are known as packages.
The figure depicts a program that’s running on the Java platform. As the figure shows, the Java
API and the virtual machine insulate the program from the hardware.
[Figure: a program (myProgram.java) runs on top of the Java API, which runs on top of the
Java Virtual Machine; together the API and the VM form the Java platform.]
Shapelet Learning Model: Shapelets are discriminative subsequences of time series that
best predict the target variable, while shapelet learning models are usually designed with
a classification purpose that aims to identify the similarity between two items [7].
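As an illustration of the shapelet primitive (a sketch, not the project's actual model), the distance between a shapelet and a series is commonly taken as the minimum Euclidean distance over all sliding windows; a shapelet whose distances are small for one class and large for the other is discriminative:

```java
// Core shapelet primitive: slide the shapelet along the series and return the
// smallest Euclidean distance over all windows. The learning step (choosing
// which shapelets to keep) is omitted here.
public class ShapeletDistance {
    public static double minDistance(double[] series, double[] shapelet) {
        double best = Double.POSITIVE_INFINITY;
        for (int start = 0; start + shapelet.length <= series.length; start++) {
            double sum = 0;
            for (int i = 0; i < shapelet.length; i++) {
                double d = series[start + i] - shapelet[i];
                sum += d * d; // squared Euclidean distance for this window
            }
            best = Math.min(best, Math.sqrt(sum));
        }
        return best;
    }
}
```

A series that contains the shapelet exactly yields distance zero; classifiers are then trained on these distances rather than on the raw series.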
CHAPTER 6
SYSTEM TESTING
Types of Testing:
Unit Testing
Individual components are tested to ensure that they operate correctly. Each
component is tested independently, without other system components. This system
was tested with a set of proper test data for each module, and the results were
checked against the expected output. Unit testing focuses verification effort on the
smallest unit of the software design, the module. This is also known as MODULE
TESTING. This testing is carried out in phases, and each module is found to be
working satisfactorily with regard to the expected output from the module.
Integration Testing
Integration testing is another aspect of testing that is generally done in order to
uncover errors associated with the flow of data across interfaces. The unit-tested
modules are grouped together and tested in small segments, which makes it easier to
isolate and correct errors. This approach is continued until all modules are integrated
to form the system as a whole.
System Testing
System testing is actually a series of different tests whose primary purpose is to fully
exercise the computer-based system. System testing ensures that the entire integrated
software system meets requirements. It tests a configuration to ensure known and
predictable results. An example of system testing is configuration-oriented
system integration testing. System testing is based on process descriptions and flows,
emphasizing pre-driven process links and integration points.
Performance Testing
Performance testing ensures that the output is produced within the time limits,
measuring the time taken for system compiling, responding to users, and serving
requests sent to the system in order to retrieve results.
Validation Testing
Validation testing can be defined in many ways, but a simple definition is that
validation succeeds when the software functions in a manner that can be reasonably
expected by the end user.
Acceptance Testing
This is the final stage of the testing process before the system is accepted for operational
use. The system is tested with data supplied by the system procurer rather
than simulated data.
CHAPTER 7
CONCLUSION
As previously discussed, the MSSD model identifies spam stores or products one by one by
detecting abnormal singleton reviewers appearing in an assigned time window. However, this
method misses the latent information that underlies evolving hyping activities. We pick two
representative cases, which were tagged as “spam” by the MSSD model but that our
model placed in a “clean” class. In both, there is a remarkable purchasing burst,
with 80 percent of buyers in the time window being singleton reviewers. In our
experiment, we define customers who have made fewer than five transactions online since their
registration as singleton reviewers. Because of the different customer level–segmentation
strategies and privacy policies in Taobao, this provides the best match with the definition of
singleton reviewers in the MSSD model.
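The singleton-reviewer rule used in the experiment can be sketched as follows. The names are hypothetical; the thresholds (fewer than five transactions defines a singleton reviewer, and roughly an 80 percent share marks a burst) come from the text above:

```java
// Sketch of the singleton-reviewer burst check described in the conclusion.
public class SingletonBurstCheck {
    /** "Singleton reviewers": buyers with fewer than five transactions since registration. */
    public static boolean isSingleton(int transactionCount) {
        return transactionCount < 5;
    }

    /**
     * Flags a time window as a suspicious purchasing burst when the share of
     * singleton reviewers among its buyers reaches the threshold (e.g. 0.8).
     */
    public static boolean isSuspiciousWindow(int[] buyerTransactionCounts, double threshold) {
        if (buyerTransactionCounts.length == 0) return false;
        int singletons = 0;
        for (int count : buyerTransactionCounts) {
            if (isSingleton(count)) singletons++;
        }
        return (double) singletons / buyerTransactionCounts.length >= threshold;
    }
}
```

As the two cases above show, this rule alone can misclassify legitimate bursts, which is precisely the latent information our model exploits.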
REFERENCES
[1] F. Li, M. Huang, Y. Yang, and X. Zhu, “Learning to Identify Review Spam,” Proc. Int’l
Joint Conf. Artificial Intelligence, 2011, pp. 2488–2493.
[2] S. Feng et al., “Distributional Footprints of Deceptive Product Reviews,” Proc. Int’l Conf.
Web and Social Media, 2012, pp. 98–105.
[3] G. Fei et al., “Exploiting Burstiness in Reviews for Review Spammer Detection,” Proc. Int’l
Conf. Web and Social Media, 2013, pp. 175–184.
[4] Q. Xu et al., “SMS Spam Detection Using Noncontent Features,” IEEE Intelligent Systems,
vol. 27, no. 6, 2012, pp. 44–51.
[5] S. Xie et al., “Review Spam Detection via Temporal Pattern Discovery,” Proc. ACM Int’l
Conf. Knowledge Discovery and Data Mining, 2012, pp. 823–831.
[6] T. Moore, R. Clayton, and H. Stern, “Temporal Correlations between Spam and Phishing
Websites,” Proc. 2nd Usenix Conf. Large-scale Exploits and Emergent Threats, 2009, p. 5.
[7] J. Grabocka et al., “Learning Time-Series Shapelets,” Proc. ACM Int’l Conf. Knowledge
Discovery and Data Mining, 2014, pp. 392–401.
[8] J. Lines et al., “A Shapelet Transform for Time Series Classification,” Proc. ACM Int’l Conf.
Knowledge Discovery and Data Mining, 2012, pp. 289–297.
[9] Q. Zhang et al., “Exploring Heterogeneous Product Networks for Discovering Collective
Marketing Hyping Behavior,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining,
2016, pp. 40–51.
Appendix A
Snapshots
Fig. A.4 IP Blocking
Appendix B
Conference Details
Presented and published a paper entitled “Collective Hyping Detection System To Identify
Online Spam Activities Using AI” in the proceedings of the National Conference on Science,
Engineering and Management (NCSEM-2018), held on 24-25 May 2018 at
The Oxford College of Engineering, Bengaluru.