
NEW HORIZON COLLEGE OF ENGINEERING

(Autonomous Institution Affiliated to VTU & Approved by AICTE)


Accredited by NAAC ‘A’, Accredited by NBA
Outer Ring Road, Panathur Post, Kadubisanahalli,
Bangalore – 560103
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Guide: Ms. Uma N, Assistant Professor, Dept. of CSE, NHCE, Bangalore
By: Bishal Kumar Sah
Contents
• Abstract
• Introduction
• Literature Survey
- Existing System-Advantages and Disadvantages
• Proposed System-Advantages and Disadvantages
• Detailed Description of Modules
• Project Design and Architecture
- High Level design
- Data Flow Diagram
• Details of the selected techniques/algorithm/methodology: why it was chosen, advantages and disadvantages
• Hardware Requirements and Software Requirements
• Expected Outcome
• Project Timeline – to be shown with a chart
• Conclusion

Dept of CSE,NHCE
ABSTRACT

With the growing popularity of social media, cyberbullying has emerged as a serious problem afflicting
children, adolescents and young adults. Machine learning techniques make automatic detection of
bullying messages in social media possible, and this could help to construct a healthy and safe
social media environment. In this research area, one critical issue is learning a robust and
discriminative numerical representation of text messages. In this project, we propose a
new representation learning method named Semantic-Enhanced Marginalized Denoising Auto-Encoder
(smSDA), which is developed via a semantic extension of the popular deep learning model, the
stacked denoising autoencoder, to tackle this problem. Our proposed method is able to exploit the
hidden feature structure of bullying information and learn a robust and discriminative
representation of text.

INTRODUCTION
Cyberbullying: aggressive, intentional actions performed by an individual or a group of people via digital
communication methods.
E.g.: sending messages and posting comments against a victim.
Characteristics:
It can take place anywhere at any time.
Bullies are free to hurt their peers’ feelings because they do not need to face anyone and can hide
behind the Internet.
Victims are easily exposed to harassment and mental stress, which can lead to depression, suicide and other
related issues.
Solution: Automatically detect and promptly report bullying messages so that proper measures can be taken to
prevent possible tragedies.
Implementation: Natural language processing (NLP) and machine learning are powerful tools to study
bullying. Cyberbullying detection can be formulated as a supervised learning problem. A classifier is first
trained on a cyberbullying corpus labeled by humans, and the learned classifier is then used to recognize a
bullying message. Three kinds of information, namely text, user demography, and social network features, are
often used in cyberbullying detection. Since the text content is the most reliable, our work here focuses on
text-based cyberbullying detection.

Literature Survey
Advantages of Existing System
• Previous works on computational studies of bullying have shown that natural
language processing and machine learning are powerful tools to study bullying.
• Cyberbullying detection can be formulated as a supervised learning problem. A
classifier is first trained on a cyberbullying corpus labeled by humans, and the
learned classifier is then used to recognize a bullying message.
• Yin et al. proposed to combine BoW features, sentiment features and contextual
features to train a support vector machine for online harassment detection.
• Dinakar et al. utilized label-specific features to extend the general features, where
the label-specific features are learned by Linear Discriminant Analysis. In
addition, common sense knowledge was also applied.
• Nahar et al. presented a weighted TF-IDF scheme that scales bullying-like features
by a factor of two.
• Besides content-based information, Maral et al. proposed to
apply users’ information, such as gender and message history, and context
information as extra features.
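The weighted TF-IDF idea mentioned above can be sketched in a few lines. This is only an illustration, not the cited authors' code: the function name `weighted_tfidf`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example. The only element taken from the text is the factor of two applied to bullying-like features.

```python
import math
from collections import Counter

def weighted_tfidf(docs, bullying_words, boost=2.0):
    """Return one {term: weight} dict per document.

    TF is the relative term frequency and IDF is log((1 + N) / (1 + df)) + 1
    (an assumed smoothing variant). Terms in bullying_words are scaled by boost.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scale = boost if term in bullying_words else 1.0
            vec[term] = scale * tf * idf
        vectors.append(vec)
    return vectors
```

Calling `weighted_tfidf(["you are a loser", "have a nice day"], {"loser"})` gives the term "loser" twice the weight it would receive with an empty bullying word set, while all other terms keep their plain TF-IDF weight.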

Disadvantages of Existing System
• Firstly, the critical first step, learning a numerical representation of text messages, remains a challenge.
• Secondly, cyberbullying is hard to describe and judge from a third-party view due to its intrinsic
ambiguities.
• Thirdly, to protect Internet users and their privacy, most bullying posts are deleted, so only a
small portion of such messages is left on the Internet.

Proposed System
• Three kinds of information including text, user demography, and social network features are often used in
cyberbullying detection.
• The stacked denoising autoencoder (SDA) is a deep learning method. SDA stacks several denoising
autoencoders and concatenates the output of each layer as the learned representation. Each denoising autoencoder in
SDA is trained to recover the input data from a corrupted version of it. The input is corrupted by randomly setting
some of the input to zero, which is called dropout noise. This denoising process helps the autoencoders to learn
a robust representation.
• In addition, each autoencoder layer is intended to learn an increasingly abstract representation of the input.
• Marginalized stacked denoising autoencoders (mSDA) adopt linear instead of nonlinear projection to
accelerate training and marginalize an infinite noise distribution in order to learn more robust representations.
• We utilize semantic information to expand mSDA and develop Semantic-enhanced Marginalized Stacked Denoising
Autoencoders (smSDA). The semantic information consists of bullying words. An automatic extraction of bullying
words based on word embeddings is proposed so that the human labor involved can be reduced. During training of
smSDA, we attempt to reconstruct bullying features from other normal words by discovering the latent structure,
i.e. the correlation, between bullying and normal words. The intuition behind this idea is that some bullying messages
do not contain bullying words. The correlation information discovered by smSDA helps to reconstruct bullying
features from normal words, and this in turn facilitates detection of bullying messages that do not contain bullying
words.
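The mSDA step described above (linear mapping, dropout noise marginalized in closed form, layer outputs concatenated) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the function names `mda_layer` and `msda`, the small ridge term `reg`, and the `tanh` squashing after the linear map follow the commonly published mSDA formulation rather than anything given in these slides. It covers plain mSDA only and does not include the semantic extensions of smSDA.

```python
import numpy as np

def mda_layer(X, p, reg=1e-5):
    """One marginalized denoising layer.

    X is a (features x samples) matrix and p is the dropout probability.
    Returns the mapping W of shape (d, d + 1) and the output tanh(W [X; 1]).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias feature
    S = Xb @ Xb.T                          # scatter matrix of the clean input
    q = np.full(d + 1, 1.0 - p)            # survival probability per feature
    q[-1] = 1.0                            # the bias feature is never dropped
    Q = S * np.outer(q, q)                 # expected scatter of the corrupted input
    np.fill_diagonal(Q, q * np.diag(S))    # diagonal entries survive with prob. q_i
    P = S[:d, :] * q                       # expected clean-by-corrupted correlation
    # W = P (Q + reg I)^-1; Q is symmetric, so solve the transposed system.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    return W, np.tanh(W @ Xb)

def msda(X, p, n_layers):
    """Stack layers and concatenate the input with every layer's output."""
    outputs = [X]
    for _ in range(n_layers):
        _, H = mda_layer(outputs[-1], p)
        outputs.append(H)
    return np.vstack(outputs)
```

Because the expectation over the dropout noise is computed analytically, no corrupted copies of the data are ever generated, which is where the training speed-up comes from.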

Advantages of Proposed System
• Our proposed Semantic-enhanced Marginalized Stacked Denoising Autoencoder is able to
learn robust features from the BoW representation in an efficient and effective way.
• Semantic information is incorporated into the reconstruction process by designing
semantic dropout noise and imposing sparsity constraints on the mapping matrix.
• These specialized modifications make the new feature space more discriminative, and
this in turn facilitates bullying detection.
• Comprehensive experiments on real data sets have verified the performance of our proposed
model.

Detailed Description of Modules
• Document Representation: The purpose of this module is to transform documents, which in this
case are comments on social media, into a representation that is suitable for the classifier to work
with. For a machine learning classifier the representation is typically a vector of some sort. Hence
the module takes a comment and outputs the vector representation of that comment.
• Classifier: The classifier has one clear role: to classify documents as potential cyberbullying or
non-cyberbullying. It takes as input a document that has been preprocessed by the document
representation module and outputs whether that comment is cyberbullying or not.
• Storage: The storage module is responsible for keeping information about users, which profiles
these users are monitoring and bullying comments connected to any monitored profile.
• User Interface: The user interface allows users to add and remove profiles for monitoring, look at
monitored profiles and any cyberbullying comments connected to the monitored profiles. The user
interface communicates with the storage module for displaying and storing information.
• Social Media Scanner: The task of the social media scanner is to scan monitored profiles for any
new activity and output it. For instance if three new comments have been posted on the monitored
profile since it was last checked, the scanner would return these three new comments.
• Coordinator: The coordinator ties it all together by using the other modules. It asks the storage
module for profiles that should be scanned. It then requests new activity for these profiles from the
social media scanner. New activity in the form of documents is then passed to the document
representation module for transformation. The transformed documents are classified by the
classifier module and passed back to the coordinator. The coordinator then updates the storage with
any new cyberbullying activity.
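The way the coordinator ties the modules together can be sketched as below. Everything here is hypothetical and chosen only for illustration: the class and function names, the in-memory scanner that stands in for a real social media API, the bag-of-words representation, and the keyword rule that stands in for the trained classifier.

```python
from dataclasses import dataclass, field

@dataclass
class Storage:
    """Keeps monitored profiles and the bullying comments found on them."""
    monitored: list = field(default_factory=list)
    flagged: dict = field(default_factory=dict)

    def profiles_to_scan(self):
        return list(self.monitored)

    def record(self, profile, comment):
        self.flagged.setdefault(profile, []).append(comment)

class InMemoryScanner:
    """Stand-in for the social media scanner: hands out each comment once."""
    def __init__(self, feed):
        self.feed = feed  # profile -> comments not yet returned

    def new_activity(self, profile):
        return self.feed.pop(profile, [])

def represent(comment, vocabulary):
    """Document representation: bag-of-words counts over a fixed vocabulary."""
    tokens = comment.lower().split()
    return [tokens.count(word) for word in vocabulary]

def classify(vector, bullying_positions):
    """Placeholder classifier: flags a vector if any bullying-word count is non-zero."""
    return any(vector[i] > 0 for i in bullying_positions)

class Coordinator:
    def __init__(self, storage, scanner, vocabulary, bullying_positions):
        self.storage = storage
        self.scanner = scanner
        self.vocabulary = vocabulary
        self.bullying_positions = bullying_positions

    def run_once(self):
        # Storage -> scanner -> representation -> classifier -> storage.
        for profile in self.storage.profiles_to_scan():
            for comment in self.scanner.new_activity(profile):
                vector = represent(comment, self.vocabulary)
                if classify(vector, self.bullying_positions):
                    self.storage.record(profile, comment)
```

A single `run_once()` call corresponds to one scan cycle. In a real system the placeholder `classify` would be replaced by the trained model, and the scanner would query the social network for activity since the last check.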

Project Design and Architecture

Data Flow Diagram

Data Flow Diagram Explanation
Cyberbullying detection is a supervised learning problem, where the system is
trained with data labelled by humans as cyberbullying. Later, the
trained system can be used to detect cyberbullying automatically. The
first step in cyberbullying detection is to learn a reliable numeric
representation for the text messages. This is achieved through a
word2vec model of insulting seeds and the use of cosine similarity to
extend the seeds to a larger set. The insulting seeds are the set of
bullying words and their variants (different
spellings, repetitions, etc.). The constructed bullying features are then
given to a stacked denoising autoencoder. The output of the
autoencoder provides robust and discriminative features, which are used
with a support vector machine (SVM) classifier. The classifier performs
a binary classification and detects whether the given message is bullying or
not. The dataset used in this project is openly available and is taken
from social networking websites such as MySpace or Twitter.
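The seed-expansion step described above can be illustrated with a small sketch. The toy two-dimensional vectors stand in for trained word2vec embeddings, and the function names and the similarity threshold of 0.8 are assumptions for the example. The slides do not state which threshold or expansion rule the project uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_seeds(embeddings, seeds, threshold=0.8):
    """Add every word whose vector is close enough to at least one seed."""
    expanded = set(seeds)
    for word, vector in embeddings.items():
        if word in expanded:
            continue
        if any(cosine(vector, embeddings[s]) >= threshold
               for s in seeds if s in embeddings):
            expanded.add(word)
    return expanded

# Toy 2-dimensional vectors standing in for trained word2vec embeddings.
toy_embeddings = {
    "stupid": [1.0, 0.0],
    "stupiddd": [0.9, 0.1],
    "hello": [0.0, 1.0],
}
print(expand_seeds(toy_embeddings, {"stupid"}))
```

Here the misspelled variant lies close to its seed and is added to the list, while the unrelated word is left out.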

Details of the selected techniques
Machine learning is a subfield of artificial intelligence (AI). The goal of
machine learning generally is to understand the structure of data and fit
that data into models that can be understood and utilized by people.
Although machine learning is a field within computer science, it differs
from traditional computational approaches. In traditional computing,
algorithms are sets of explicitly programmed instructions used by
computers to calculate or solve problems. Machine learning algorithms
instead allow computers to train on data inputs and use statistical
analysis to output values that fall within a specific range.
Because of this, machine learning helps computers build
models from sample data in order to automate decision-making
processes based on data inputs.

Details of the selected techniques
• Semantic Enhancement for mSDA
The advantage of corrupting the original input in mSDA can be explained by feature co-occurrence
statistics. The co-occurrence information makes it possible to derive a robust feature representation under an
unsupervised learning framework, and this also motivates other state-of-the-art text feature learning
methods such as Latent Semantic Analysis and topic models. A denoising autoencoder is trained to
reconstruct the removed feature values from the remaining uncorrupted ones. Thus, the learned
mapping matrix W is able to capture the correlation between the removed features and the other features.
It has been shown that the learned representation is robust and can be regarded as a high-level concept
feature, since the correlation information is invariant to domain-specific vocabularies. We next
describe how to extend mSDA for cyberbullying detection. The major modifications are semantic
dropout noise and sparse mapping constraints.
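One way to picture semantic dropout noise is the sketch below. It is an assumption-laden illustration rather than the method as implemented in the project: it assumes that bullying-word features are removed with a higher probability than ordinary features, so that the autoencoder is pushed to reconstruct them from the surrounding normal words. The function name `semantic_dropout` and the probabilities 0.8 and 0.5 are hypothetical, and the sparse mapping constraint is not shown.

```python
import random

def semantic_dropout(vector, bullying_positions, p_bully=0.8, p_normal=0.5, rng=None):
    """Zero each feature at random, using a separate rate for bullying features."""
    rng = rng or random.Random()
    corrupted = []
    for i, value in enumerate(vector):
        # Assumed rule: bullying features use p_bully, all others use p_normal.
        p = p_bully if i in bullying_positions else p_normal
        corrupted.append(0 if rng.random() < p else value)
    return corrupted
```

With `p_bully=1.0` and `p_normal=0.0` every bullying feature is removed and every other feature is kept, which is the extreme case of asking the model to predict bullying features purely from normal words.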

Methodology
• Cyberbullying detection is treated as a supervised learning problem: the system is trained with data
labelled by humans and is then used to detect cyberbullying automatically.
• The first step is to learn a reliable numeric representation for the text messages. This is achieved
through a word2vec model of insulting seeds and the use of cosine similarity to extend the seeds to a
larger set.
• The insulting seeds are the set of bullying words and their variants (different spellings,
repetitions, etc.).
• The constructed bullying features are then given to a stacked denoising autoencoder, whose output
provides robust and discriminative features.
• These features are used with a support vector machine (SVM) classifier, which performs a binary
classification and detects whether the given message is bullying or not.
• The dataset used in this project is openly available and is taken from social networking websites
such as MySpace or Twitter.

HARDWARE REQUIREMENTS

Processor : Any Processor above 500 MHz


RAM : 8GB
Hard Disk : 1 TB
Input device : Standard Keyboard and Mouse
Output device : VGA and High Resolution Monitor

SOFTWARE REQUIREMENTS

Operating system : Windows XP, 7, 8, 8.1, 10
Front End : Python
IDE : Jupyter
Libraries : sklearn, word2vec, pandas and TensorFlow
Expected Outcome
The major goal of this project is to understand the main factors involved in detecting
cyberbullying. Implementing the machine learning algorithms
SVM and smSDA makes the detection process easier to
understand. The graphical representation obtained at the end of
the analysis will show how well automatic cyberbullying detection is achieved using
an autoencoder network that learns semantic features capturing
information about cyberbullying.
Although there are many techniques for identifying cyberbullying,
none of them guarantees detection. The
proposed system aims to improve on this aspect. Hence, the
outcome can be used to achieve automatic
cyberbullying detection using an autoencoder network and to learn
semantic features that capture information about cyberbullying.

Project Timeline
(Timeline chart, November 2018 to April 2019)
• 15/11/2018: Title Name
• 24/11/2018: Project presentation
• 05/12/2018: Project initial report
• 21/01/2019: Code implementation
• 20/02/2019: Initial demo
• 18/03/2019: Final presentation
• 15/04/2019: Final demo with report
Phases shown: Phase 1 (15/11/2018 to 05/12/2018), Phase 2 (from 08/01/2019), Phase III (ending 15/04/2019)
Conclusion
• This project addresses the text-based cyberbullying detection problem, where
robust and discriminative representations of messages are critical for an effective
detection system. By designing semantic dropout noise and enforcing sparsity, we
have developed the semantic-enhanced marginalized denoising autoencoder as a
specialized representation learning model for cyberbullying detection. In addition,
word embeddings have been used to automatically expand and refine bullying word
lists that are initialized by domain knowledge. The performance of our approach
has been experimentally verified on two cyberbullying corpora from social
media: Twitter and MySpace. As a next step, we plan to further improve
the robustness of the learned representation by considering word order in messages.
