
Computer Science Review 29 (2018) 44–55

Contents lists available at ScienceDirect

Computer Science Review


journal homepage: www.elsevier.com/locate/cosrev

A recent review of conventional vs. automated cybersecurity anti-phishing techniques

Issa Qabajeh b, Fadi Thabtah a,*, Francisco Chiclana b

a Digital Technology Department, Manukau Institute of Technology, Auckland, New Zealand
b Centre for Computational Intelligence, De Montfort University, Leicester, UK

article info

Article history:
Received 12 September 2017
Received in revised form 23 May 2018
Accepted 28 May 2018

Keywords:
Classification
Computer security
Phishing
Machine learning
Web security
Security awareness

abstract

In the era of electronic and mobile commerce, massive numbers of financial transactions are conducted online on a daily basis, which has created opportunities for fraud. A common fraudulent activity, which involves creating a replica of a trusted website to deceive users and illegally obtain their credentials, is website phishing. Website phishing is a serious online fraud that costs banks, online users, governments, and other organisations severe financial damage. One conventional approach to combating phishing is to raise awareness and educate novice users on the different tactics used by phishers through periodic training or workshops. However, this approach has been criticised as not cost-effective, because phishing tactics change constantly and the training may carry high operational costs. Another anti-phishing approach is to legislate or amend existing cyber security laws to prosecute online fraudsters, although this has not reduced the severity of phishing. A more promising anti-phishing approach is to prevent phishing attacks using intelligent machine learning (ML) technology. With this technology, a classification system is integrated into the browser, where it detects phishing activities and communicates them to the end user. This paper reviews and critically analyses legal, training, educational and intelligent anti-phishing approaches. More importantly, it highlights ways to combat phishing by intelligent and conventional means, and reveals the differences and similarities between these approaches and their positive and negative aspects from the user and performance perspectives. Different stakeholders, such as computer security experts, researchers in web security and business owners, are likely to benefit from this review of website phishing.
© 2018 Elsevier Inc. All rights reserved.

Contents

1. Introduction
2. Phishing background
2.1. Phishing history
2.2. Phishing process
2.3. Phishing as a classification problem
3. Common traditional anti-phishing methods
3.1. Legal anti-phishing legislations
3.2. Simulated training
3.3. User experience: Anti-phishing online communities
3.4. Discussion non intelligent anti-phishing solutions
4. Computerised anti-phishing techniques
4.1. Databases (blacklist and whitelist)
4.2. Intelligent anti-phishing techniques based on ML
4.2.1. Decision trees and rule induction
4.2.2. Associative classification (AC)
4.2.3. Neural network (NN)
4.2.4. Support vector machine (SVM)
4.2.5. Fuzzy logic
4.2.6. CANTINA term frequency inverse document frequency approach

* Corresponding author.
E-mail addresses: P12047781@myemail.dmu.ac.uk (I. Qabajeh), fadi.fayez@manukau.ac.nz (F. Thabtah), chiclana@dmu.ac.uk (F. Chiclana).

https://doi.org/10.1016/j.cosrev.2018.05.003
1574-0137/© 2018 Elsevier Inc. All rights reserved.
I. Qabajeh et al. / Computer Science Review 29 (2018) 44–55 45

5. Conclusions
References

1. Introduction

With the rapid development of computer hardware, especially computer networks and cloud technology services, online and mobile commerce have increased significantly in the last few years [1]. Indeed, the number of customers who perform online purchase transactions has increased dramatically, and large monetary values are exchanged daily through electronic means, such as private payment gateways, that are usually secured by secure socket layer (SSL) [2]. Despite the convenience of online transactions from both the user and business perspectives, an online threat has emerged: phishing.

Phishing attacks are attempts to access online users' sensitive financial information using fake websites that are visually similar to authentic websites [3]. In phishing attacks, social engineering techniques are normally used to redirect users to the malicious website. Specifically, an email is sent to users from an apparently trustworthy source, urging them to update their login information by following a hyperlink [4]. Phishing techniques include spear phishing, a focused attack in which emails are sent to employees of a business in an attempt to access the company's computer system, and whaling, which targets senior corporate executives [5]. Unfortunately, the consequences of phishing are severe because affected legitimate users become vulnerable to identity theft and information breach and no longer trust online commerce and electronic banking [6]. For instance, Gartner Group [7] publishes periodic reports that reveal the financial damage caused by phishing attacks. In addition, to raise awareness about phishing, an international body that aims to minimise online threats including pharming, spoofing, phishing and malware, the Anti-Phishing Working Group (APWG), was created [8]. APWG periodically disseminates reports for the online community on recent cyber-attacks, with one report stating that the number of phishing websites rose rapidly to 17,000 in December 2014 alone [9]. A more recent APWG report revealed that there were approximately 1,220,523 phishing attacks in 2016.

It seems imperative that users, as well as businesses, adopt regularly updated anti-phishing tools or strategies to reduce phishing activities and protect themselves from their potential negative impacts. This is important because phishing attacks are constantly changing and new deceptions are emerging all the time. Anti-phishing solutions that adopt data mining (DM) or machine learning (ML) have been shown to be more practical and effective in combating phishing because they work automatically and are capable of revealing concealed knowledge that online users are not aware of, especially with respect to the relationship between website features and phishing activities. This hidden knowledge, when combined with human experience, can result in an effective shield for protecting users from phishing (add a reference).

In this paper, we investigate the phishing problem and define it in a classification ML context. We then discuss common traditional strategies in addition to computerised techniques developed to combat phishing. More importantly, the paper thoroughly investigates traditional and ML anti-phishing classification techniques and critically analyses their benefits and disadvantages theoretically. There have been a few earlier reviews on phishing, such as Suganya [10], Mohammad et al. [11,12], Sahu and Dubey [13], Almomani et al. [14] and Basnet et al. [15], among others. For instance, Almomani et al. [14] reviewed a number of filtering techniques to combat phishing. The authors focused only on technical solutions for detecting phishing emails by reviewing techniques related to Bag of Words, frequency analysis, blacklists, support vector machines and other artificial intelligence search methods. Little information concerning non-technical solutions was provided. Instead, the authors paid full attention to reviewing automated solutions that can be integrated within email systems to detect phishing attacks. Lastly, the authors reported published research results in table format to show the performance of various machine learning techniques against email phishing data. However, it is hard to generalise such performance because these results were derived from datasets with different characteristics. Overall, the survey was insightful and provides rich information to users in order to reduce the chance of falling for email phishing attacks.

Gupta et al. [16] reviewed different types of phishing attacks and then discussed a number of anti-phishing approaches, including social engineering ones. More importantly, the authors showed features related to phishing attacks that have been collected from previous research works including [14,17,18] and [32], among others. Lastly, the authors highlighted emergent trends in phishing and modern technologies such as the Internet of Things. Gupta et al. [19] highlighted recent challenges and new emergent trends in phishing attacks. The focus of the researchers was on the new technology of the Internet of Things and on spear phishing. The authors also discussed recent phishing datasets and their features.

Most phishing reviews have covered only part of one or more aspects of phishing. For instance, Suganya [10] and Sahu and Dubey [13] briefly reviewed phishing attacks without showing the ways to combat them or their pros and cons. Mohammad et al. [11,12] discussed common solutions to website phishing in general terms, without providing grounds for recommendations and without covering specific intelligent approaches. Almomani et al. [14] reviewed intelligent solutions to detect phishing emails. Lastly, Basnet et al. [15] compared only a few intelligent anti-phishing solutions without elaborating on the other computerised and classic approaches to anti-phishing. Therefore, this article not only comprehensively reviews phishing from a wider perspective but also critically analyses traditional and automated anti-phishing solutions.

This paper serves researchers, organisations' managers, computer security experts, lecturers, and students who are interested in understanding phishing and its corresponding intelligent solutions. A wide range of potential solutions is critically analysed and experimentally compared, alongside classic solutions including educational, legal, and software-based ones. This paper is structured as follows: Section 2 presents the phishing problem, its history, and its lifecycle. Section 3 critically analyses common classic methods of combating phishing. Section 4 is devoted to intelligent anti-phishing solutions that employ different strategies in deriving the anti-phishing models. Section 5 provides the conclusions.

2. Phishing background

2.1. Phishing history

Phishing comes from the word “fishing”, in which the phisher throws out bait and waits for potential users to take a bite. Phishing is not recent as an online risk, with its origin rooted in a social engineering method using telephones known as “phone phreaking” [20]. It was during the 1990s, when the internet community started to grow, that phishing was first observed as an online threat, especially in the United States [15].

Fig. 1. Example of an early phishing attack (Blogonlymyemail.com).

According to McFredies [21], the first phishing incident was noticed in the mid-1990s when phishers attempted to obtain registered online users' account information from the internet provider America Online (AOL). Phishers during this era frequently used instant messages (IMs) in AOL chat rooms or emails to persuade users to reveal their passwords (Fig. 1 illustrates an example of an early phishing attack), which the phishers subsequently used to take over the victims' accounts and begin emailing spam to other online users. Phishers then realised that they could further trick victims if the IMs and emails requested them to update their billing information. With this realisation, the attackers expanded their aim and, using the same electronic means (IMs and emails), attempted to access other financial information from victims such as social security numbers, addresses, credit card information, etc.

One of the most common beliefs that ordinary users have about phishing websites is that grammatical errors and typos are typical of these websites [20]. This has misled users into believing that a website without grammatical errors must be trustworthy, which is not necessarily true. Phishers in today's cyber world have become more innovative and work systematically in groups, sometimes even orchestrating phishing campaigns motivated by potential financial gains. In fact, phishers continuously change their spoofing methods in response to the security countermeasures taken by organisations, and recently they have directed their attention to the mobile commerce platform. This means that phishing has changed from an egocentric activity into a more organised and serious cybercrime that keeps evolving, which makes it hard to detect.

2.2. Phishing process

Phishing attacks are usually initiated through an email sent to potential users. Other ways a phisher may start an attack include instant messaging, online blogs and forums, short message services, peer-to-peer file sharing services, and social media websites [18]. We can summarise the phishing lifecycle as follows (see Fig. 2):

(1) A link is sent to potential victims using one of the aforementioned channels.
(2) When clicked, the link redirects potential victims to a malicious website.
(3) Users become vulnerable as they try to log in using their credentials on the malicious website.
(4) Login credentials are then transferred to a server, or a key logger is installed on the user's computing device.
(5) The phisher can then use the credentials to perform additional cybercrimes.

2.3. Phishing as a classification problem

Generally speaking, websites can be classified by hand-crafted methods based on certain features such as URL length, prefix_suffix, domain, sub_domain, etc. Initially, scholars in the area of online security [22,23] developed different knowledge bases using their experience and expertise to distinguish phishing from legitimate websites. Recently, there have been studies and proposals aiming at deriving intelligent rules to detect the fine line between legitimate and phishing websites using statistical analysis [18,24,25]. For instance, Aburrous et al. [26] and Mohammad et al. [25] defined a number of hand-crafted rules based on various website features using simple statistical analysis on websites (instances) collected from different sources including PhishTank and Yahoo directory [27]. More advanced decision rules were developed in [18], in which the authors applied further statistical analysis to a larger phishing dataset collected from varying sources.

Machine learning (ML) and data mining (DM) have proved to be powerful data analysis tools in many application domains such as medical diagnosis, market basket analysis, weather forecasting and events processing, to cite some [1]. This is because ML and DM techniques usually reveal concealed meaningful information from large datasets, which can then be used in management decisions related to development, planning, and risk management. Generally speaking, ML and DM can be seen as automated and intelligent tools embedded within management information systems to guide decision-making processes in both business and scientific domains. Common tasks or problems that ML and DM handle are clustering, association rule discovery, regression analysis, classification, pattern recognition, time series analysis, trends analysis, and multi-label learning [3,28].

One of the most frequent tasks in ML is forecasting the value of a target variable in a dataset based on the other available variables [28]. This forecasting process occurs in an automated manner using a classification model, normally named the classifier, that is derived from a labelled training dataset. The goal of the classifier is to “guess” the value of the target variable in unseen data, referred to as the test dataset, as accurately as possible. This task falls under the umbrella of supervised learning and is known as classification. Abdelhamid and Thabtah [1] defined classification as the ability to “accurately” predict class attributes for a test instance using a predictive model derived from a training dataset.

Since the problem of website phishing involves automatic categorisation of websites into a predefined set of class values (legitimate, suspicious, phishy) based on a number of available features (variables), it can be considered a classification problem. To be more specific, the training dataset consists of a set of predefined features plus the class attribute, and the instances are the websites' feature values. These instances can be extracted from different sources such as PhishTank and online directories. The aim is to build an anti-phishing classifier that can predict the type of a website based on hidden knowledge discovered from the training set features during the data processing phase. Usually the goodness of the classifier is measured using accuracy, which primarily relies on the correlations between the features and the class [29]. Fig. 3 shows phishing as a classification problem from the ML perspective. As discussed before, phishing websites are dynamic, so the problem can be modelled as a dynamic supervised learning classification problem. Therefore, an effective anti-phishing classifier should be adaptable to any new features observed in order to handle the dynamic nature of the problem.

In this review paper, we focus on content-based methods that fall under website phishing attacks, as shown in Fig. 4. We omit non-content-based methods such as domain popularity, restricted from filing, DNS-based features, watermarking, one-time passwords, layout similarity and crowdsourcing, among others. Email phishing techniques are also omitted from the figure since they are out of scope.
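
To make the classification view of Section 2.3 concrete, the following is a minimal sketch, not any of the reviewed authors' methods. It extracts three of the URL features named above (URL length, prefix_suffix, sub_domain), applies toy hand-crafted rules to assign one of the three class values, and measures accuracy on labelled instances. The thresholds, function names and example URLs are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical thresholds, assumed for illustration only (not from the paper).
LONG_URL = 75          # URLs longer than this count as one warning sign
MAX_DOTS_IN_HOST = 3   # more dots than this suggests stacked sub-domains


def extract_features(url: str) -> dict:
    """Map a URL to a few hand-crafted features named in the text."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "prefix_suffix": "-" in host,    # dash inside the host name
        "sub_domain": host.count("."),   # number of dots in the host name
    }


def classify(features: dict) -> str:
    """Toy rule-based classifier: count how many warning rules fire."""
    hits = sum([
        features["url_length"] > LONG_URL,
        features["prefix_suffix"],
        features["sub_domain"] > MAX_DOTS_IN_HOST,
    ])
    if hits == 0:
        return "legitimate"
    return "suspicious" if hits == 1 else "phishy"


def accuracy(labelled_urls: list) -> float:
    """Fraction of (url, label) pairs whose predicted class equals the label."""
    correct = sum(
        classify(extract_features(url)) == label for url, label in labelled_urls
    )
    return correct / len(labelled_urls)
```

In an ML setting the fixed rules in `classify` would be replaced by a model learned from a labelled training set, while the feature extraction and the accuracy measurement on unseen test instances would keep the same shape.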

Fig. 2. Phishing life cycle [18].

Fig. 3. Website phishing as a classification process.

Fig. 4. Taxonomy of website phishing approaches.

3. Common traditional anti-phishing methods

Since phishing seriously breaches the confidentiality of users, as well as of organisations including government agencies, different methods have been proposed to combat it. These approaches fall into three main categories:

• Education and legal
• Computerised using human-crafted methods
• Intelligent ML methods.

In this section, we examine the literature on phishing and critically analyse different techniques based on the above categories. The focus, however, will be on intelligent anti-phishing solutions, since they are believed to be the way forward in shielding the web from phishing threats and promising results have recently been derived by this category in [6,30–33], [2,4], among others.

3.1. Legal anti-phishing legislations

Governments have been slow to respond to phishing. California State in the US was the first to issue anti-phishing legislation in 2005 [34]. This legislation stated that it is unlawful to use any electronic means such as websites, emails, or any other methods to ask for or solicit information from online users by claiming to be a business without the authority of that business. Other US States such as Texas have also introduced new cybercrime legislation that includes phishing, and in 2005 the General Assembly of Virginia added phishing attacks to its list of computer crimes [35]. These new laws empowered companies such as America Online to file lawsuits in Virginia against phishers in 2006 [36]. However, most states in the US have not legislated specific laws incriminating phishing and usually prosecute phishers using other computing crime laws such as fraud.

At the Federal level in the US, lawmakers and congressional representatives have not passed anti-phishing legislation either. There were a few attempts between 2004 and 2006 following the Anti-Phishing Act of 2004 to pass specific bills incriminating phishing

and instigating tougher prison sentences, but these bills were stopped at the committee level in Congress. Nevertheless, federal law enforcement can incriminate phishers using other laws that are related to identity theft and fraud such as “18 U.S.C. section 1028” [36]. Businesses have also joined the Government in fighting phishing. For example, in 2005 Microsoft filed over 115 lawsuits in Washington's Western District Court accusing a single Internet user of utilising various deceptive methods to access some of the company's users' information (add reference). In mid-2006, the then president George W. Bush established a new cybercrime identity theft task force [37], with a single goal: reduce the risks of cybercrime, especially phishing.

The United Kingdom (UK) has followed the US by strengthening its legal system against severe cybercrimes, including fraud and identity theft. In 2006, the UK introduced the new Fraud Act, which increased prison sentences to up to ten years for online fraud offences [38]. This same act prohibits possession of a phishing website with the intent to deceive users and commit fraud. Further, Microsoft decided to collaborate with law enforcement agencies outside the US to bring phishers to justice. In doing so, the company signed an agreement with the Australian government to train law enforcement agents in preventing phishing [39]. Also, in 2010 Canada introduced an Anti-spam Act that incriminates cybercrime and aims to protect Canadian online consumers and businesses when trading globally [40].

3.2. Simulated training

One of the easy, yet helpful, policies to oppose cybercrimes is to educate users on the ways employed to access their information. When novice users are aware of the circumstances around phishing, they may be able to minimise this risk or stop it as early as possible. Unfortunately, ordinary web browsing users are unaware of how phishing attacks start or how to visually recognise an untrustworthy website and differentiate it from one that is trustworthy [11,12]. Moreover, basic security indicators and their anti-phishing software counterparts are still unclear to many online shoppers [24]. Consequently, these factors increase the pace of phishing and motivate phishers to launch further attacks. For instance, a security survey conducted by Julie et al. [41] revealed the lack of knowledge of cybercrimes, including phishing, among online users. In addition, some respondents in the survey showed security awareness yet were reluctant to use their financial information for payment purposes, even on trustworthy websites.

There have been a number of studies on educating people about the severity of phishing. For example, Arachchilage and Love [42] investigated whether mobile games can be a helpful method for raising awareness of phishing attacks. The authors evaluated the learning curves of users who played a mobile game about phishing developed by Arachchilage and Cole [43], and assessed whether an interactive mobile platform is effective in educating users in contrast to traditional security training. A comparison of user responsiveness to phishing has also been conducted using the developed mobile game, along with a website designed by APWG. The results showed that users who played the anti-phishing mobile game were able to spot non-genuine websites with a higher rate of accuracy than other users who only used the APWG website.

There are a number of organisations and research studies, such as Arachchilage et al. [44] and Ronald et al. [45], that have adopted a related form of training to warn users of phishing. This training involves sending participants simulated malicious emails from a genuine source to evaluate their exposure to phishing. At the end of the training, participants are given the training material and informed about their vulnerability to phishing.

Embedded training is another way to measure a user's vulnerability to phishing. This training often mimics primary daily business processes performed by employees through an experiment that measures a certain outcome (Arachchilage and Love [42]). In phishing, Arachchilage et al. [44] used the embedded training methodology to measure phishing awareness at a university. The authors sent malicious emails from the administrator to participants without informing them of the training material content. These emails urged users to click on a link that would redirect them to a malicious website where they would input their login credentials. The aim was to identify the number of users who would actually click on the link. During the experiment, users were interrupted immediately when they clicked the link and were then provided with the training material. The embedded training proposed by the authors was based on a preliminary pilot study that they conducted on a limited number of university students.

3.3. User experience: Anti-phishing online communities

One of the approaches to reducing the impact of phishing on online users and organisations is to build an anti-phishing community to monitor recent phishing activities and provide news to the different stakeholders. Users' experiences are practical and based on real cases related to different types of phishing. Such efforts by users and organisations have resulted in new proactive online communities and data repositories. These accumulated resources are of interest since they can be employed to study ways to make the Internet safer and free from phishing.

The Monitoring and Takedown (MaT) approach enables individuals who recognise phishing activity to report it via public anti-phishing communities including APWG, PhishTank, Millersmiles, and Symantec, among others [8,27,46,47]. These anti-phishing communities allow users to report phishing content and warn other users and organisations as well. Users can also report phishing content to the Federal Trade Commission's Complaint Department, becoming directly part of the campaign to combat phishing. Many reputable companies also have an internet fraud department that allows users to report any fraudulent or suspicious activity such as phishing. PhishTank was created in 2003 as a subsidiary of OpenDNS in order to provide the parent company, as well as the online community, with a phishing repository. This large collection of stored phishing websites has given computer security experts, users, researchers and business owners extensive information about phishing attacks and the features of their associated emails and websites. Another example of a good use of user experiences is Cloudmark, which is an alerting-based anti-phishing method with a user rating system [48]. When a user visiting a website experiences any kind of threat, they can rate that website to alert other online users. Finally, Web of Trust (WOT) is another example of an anti-phishing approach based on the user feedback rating model [49].

3.4. Discussion non intelligent anti-phishing solutions

Legislators in the US, UK and Canada, among other countries, have approved legislative bills that include serious jail sentences for convicted phishers. This has been made clear in several high-profile cases, especially in the US. Nevertheless, these legislative bills have not achieved a decrease in phishing attacks. On the contrary, phishing has now become more severe than ever, and businesses as well as individual users have suffered substantial financial losses as a result. One of the primary reasons that legal actions have not been as effective as expected in minimising phishing is that a phishing website often has a short life span (normally about two days), which helps the phisher to disappear quickly once the fraud has been committed, making law enforcement difficult.

As previously mentioned, raising awareness of phishing risks and educating users has shown promising initial results [45]. Computer security scholars have adopted different ways to disseminate the seriousness of the harm phishing may cause to society: some [8,27] use web-based material to teach novice users phishing fraud techniques, while others, such as Arachchilage et al. [44], have developed contextual and embedded training based on simulated phishing emails coming from genuine sources, or educational material on phishing based on mobile games in order to increase the motivation factor among users [42].

Even though educating users may positively affect the global efforts of combating phishing, this approach demands high costs and requires users to be equipped with computer security knowledge. Large organisations and governments are periodically investing in the development of anti-phishing materials in both hard and soft forms as well as websites and mobile applications. However, since phishing techniques keep evolving, small to medium enterprises might not have the resources that large organisations have to invest in their users' education. Therefore, a large portion of the online community realistically cannot afford the continuous additional costs of keeping current anti-phishing material updated. Furthermore, phishing techniques are becoming more sophisticated because of the group efforts of phishers who employ systematic attack strategies, which makes it harder for even security experts and specialised law enforcement agents to keep their skills updated. This makes ordinary users vulnerable, even if they are equipped with basic knowledge about phishing. Thus, more advanced, cheaper and intelligent approaches are needed, implemented alongside both educational and legislative solutions, to further reduce phishing attacks. We have seen thoughtful attempts that evolved from user experiences, user ratings, and users' social networking (such as Phishtank, Cloudmark, and APWG, among others) that help novice users and enterprises avoid falling prey to phishing. The effectiveness of these user-community-based approaches relies mainly on the following factors: (1) user experience; (2) user knowledge; (3) user honesty; and (4) accessibility and validity of the user community's website data. Unfortunately, these factors are difficult to measure and validate, so relying on user experience and knowledge alone demands considerable care. We hypothesise that considering “only” users' experience in judging a website's legitimacy is not enough to combat phishing, although it can be a supporting approach to a more advanced intelligent solution based on ML/DM.

4. Computerised anti-phishing techniques

Anti-spam software tools that can block suspicious emails have been developed; however, these programmes constantly block a large number of genuine emails and classify them as junk emails [11,12]. Emails misclassified as spam are simply false positive instances. Thus, one of the ultimate goals of a computerised anti-phishing tool is to reduce false positives and increase true positives so users can be confident of their mailbox's filter results without having to manually check their junk email folder.

4.1. Databases (blacklist and whitelist)

A database-driven approach to fight phishing, called blacklisting, was developed by several research projects [2,50,51]. This approach is based on using a predefined list containing domain names or URLs for websites that have been recognised as harmful. A blacklisted website may lose up to 95% of its usual traffic, which will hinder the website's revenue capacity and eventually its profit [23]. This is the primary reason that web masters and web administrators give great attention to the problem of blacklisting. According to Mohammad et al. [11,12], there are two types of blacklists in computer security:

• Domain/URL Based. These are real-time URL lists that contain malicious domain names and normally look for spam URLs within the body of emails.
• Internet Protocol Based. These are real-time URL or domain server blacklists that contain IP addresses whose status changes in real time. Often, mailbox providers, such as Yahoo, check domain server blacklists to evaluate whether the sending server (source) is run by someone who allows other users to send from their own source.

Users, businesses, or computer software enterprises can create blacklists. Whenever a website is about to be browsed, the browser checks the URL against the blacklist. If the URL exists in the blacklist, a certain action is taken to warn the user of the possibility of a security breach. Otherwise, no action is taken as the website's URL is not recognised as harmful. Currently, there are a few hundred publicly available blacklists, among which we can mention the ATLAS blacklist from Arbor Networks, BLADE Malicious URL Analysis, DGA list, CYMRU Bogon list, Scumware.org list, OpenPhish list, Google blacklist, and Microsoft blacklist [52]. Since any user or small to large organisation can create blacklists, the currently publicly available blacklists have different levels of security effectiveness, particularly with respect to two factors:

1. How often the blacklist gets updated and how consistently it is available.
2. The quality of its results with respect to accurate phishing detection rate.

Marketers, users, and businesses tend to prefer the Google and Microsoft blacklists over other publicly available blacklists because of their lower false positive rates. A study by [2] analysing blacklists concluded that they contain on average 47% to 83% phishing websites.

Blacklists are often stored on servers, but can also be available locally on a computer [25]. Thus, the process of checking whether a URL is part of the blacklist is executed whenever a website is about to be visited by the user, in which case the server or local machine uses a particular search method to perform the check and derive an action. The blacklist usually gets updated periodically. For example, the Microsoft blacklist is normally updated every nine hours to six days, whereas the Google blacklist gets updated every twenty hours to twelve days [11,12]. Hence, the time window needed to amend the blacklist by including new malicious URLs, or excluding possible false positive URLs, may allow phishers to launch and succeed in their phishing attacks. In other words, phishers have significant time to initiate a phishing attack before their websites get blocked. This is an obvious limitation of using the blacklist approach in tracking false websites [18]. Another study by APWG revealed that over 75% of phishing domains had been genuinely serving legitimate websites; blocking them implies that several trustworthy websites will be added to the blacklist, which causes a drastic reduction in those websites' revenue and damages their reputation [9].

After the creation of blacklists, many automated anti-phishing tools, normally used by software companies such as McAfee, Google, and Microsoft, were proposed. For instance, the Anti-Phishing Explorer 9, McAfee Site Advisor, and Google Safe Base are three common anti-phishing tools based on the blacklist approach. Moreover, companies such as VeriSign developed anti-phishing internet crawlers that gather massive numbers of websites to identify clones in order to assist in differentiating between legitimate and phishing websites.

There have been some attempts to look into creating whitelists, i.e. legitimate URL databases, in contrast to blacklists [53]. Unfortunately, since the majority of newly created websites are initially identified as “suspicious”, this creates a burden on the whitelist
approach. To overcome this issue, the websites expected to be visited by the user should exist in the whitelist. This is sometimes problematic in practice because of the large number of possible websites that a user might browse. The whitelist approach is simply impractical since the websites “known” in advance to be of interest to users might differ from those actually visited during the browsing process. Human decision making is a dynamic process, and users often change their minds and start browsing new websites that they initially never intended to visit.

One of the earliest whitelists was proposed by Chen and Guo [53], and was based on users' browsing of trusted websites. The whitelist monitors the user's login attempts and, if a repeated login is successfully executed, the method prompts the user to insert that website into the whitelist. One clear limitation of Chen and Guo's method is that it assumes that users are dealing with trustworthy websites, which unfortunately is not always the case.

PhishZoo is another whitelist technique, developed by Afroz and Greenstadt [5]. This technique constructs a website profile using a fuzzy hashing approach in which the website is represented by several criteria that differentiate one website from another, including images, HTML source code, URL, and SSL certificate. PhishZoo works as follows:

1. When the user browses a new website, PhishZoo makes a specific profile for that website.
2. The new website's profile is contrasted with existing profiles in the PhishZoo whitelist.
   • If a full match is found, the newly browsed website is marked trustworthy.
   • If the match is only partial, the website will not be added since it is suspicious.
   • If no match is found but the SSL certificate is matched, PhishZoo will instantly amend the existing profile in the whitelist.
   • If no match is found, then a new profile will be created for the website in the whitelist.

Recently, Lee et al. [31] investigated the personal security images whitelist approach and its impact on internet banking users' security. The authors utilised 482 users to conduct a pilot study on a simulated bank website. The results revealed that over 70% of the users during the simulated experiments had given their login credentials despite their personal security image test not being performed. Results also revealed that novice users do not pay high levels of attention to the use of personal images in ebanking, which can be seen as a possible shortcoming of this anti-phishing approach.

4.2. Intelligent anti-phishing techniques based on ML

Since phishing is a typical classification problem, ML and DM techniques seem appropriate for deriving knowledge from website features that can assist in minimising the problem. The key to success in developing automated anti-phishing classification systems is the website's features. Since there are a tremendous number of features linked with a website, a necessary step to enhance the predictive system's performance is to pre-process the set of features in order to pick the “most” effective ones. Feature effectiveness can be measured using different computational intelligence methods such as information gain, correlation analysis, and chi-square, among others [54,55].

Once an initial feature set is chosen, the intelligent algorithm can be applied to the selected features to come up with the predictive system. There are many ML and DM algorithms for classification that have been developed by scholars in the last two to three decades, as covered in Chapter 2. Most of these algorithms use one of the following major classification approaches in deriving their predictive systems:

(1) Decision trees (ID3, C4.5 and successors) [56].
(2) Probabilistic models (Naïve Bayes, Bayesian Network and successors) [57].
(3) Rule-based classification
    a. Associative classification (AC)
        i. Classification based Association (CBA and successors) [58].
        ii. Classification based on Multiple Association (CMAR and successors) [59].
        iii. Multiclass Classification-based Association (MCAR and successors) [60].
    b. Rule induction such as FOIL, RIPPER and successors [61].
    c. Covering or greedy, such as PRISM [62] and eDRI [29,30].
(4) Neural Networks (NN) methods and their successors [63].
(5) Support Vector Machine (SVM) [64,65].
(6) Fuzzy Logic (FL) [66].
(7) Boosting and bagging methods, and their successors [67].
(8) Search methods such as Genetic Algorithms (GA) [68].

The rest of this section critically analyses intelligent anti-phishing attempts based on ML. We show how these approaches derive a classification anti-phishing system along with their benefits and weaknesses.

4.2.1. Decision trees and rule induction

Fette et al. [69] explored email phishing utilising the C4.5 decision tree classifier among other methods including Random Forest, SVM and Naïve Bayes. As a result, a new Random Forest method called “Phishing Identification by Learning on Features of Email Received” (PILFER) was developed. Experiments on a set of 860 phishy and 695 ham emails were conducted. The features identified for distinguishing phishing emails included: IP URLs, time of space, HTML messages, number of connections inside the email, and JavaScript. The authors claim that PILFER can be improved for grouping messages by combining all ten features discovered in the classifier apart from “Spam filter output”.

Mohammad et al. [25] investigated a number of rule induction algorithms on the problem of website phishing classification. The authors compared RIPPER, C4.5 (Rules), CBA, and PRISM on a security dataset they collected containing 2500 instances and 16 features. A special hand-crafted rule to collect the data was developed by the authors based on simple statistical analysis performed on the initial dataset's features. Experiments with the four rule-based classification methods showed that there are eight effective features that can be employed by the classification algorithm in combating phishing: SSL and HTTPS, Domain-age, Site-traffic, Long-URL, Request-URL Sub-domain, Multi-sub-domain, Suffix–prefix, and IP-address.

Khadi and Shinde [5] studied the problem of email-based phishing and proposed a potential solution based on combining a RIPPER classifier with fuzzy logic. The role of fuzzy logic is to pick the main features of the email and rank them based on a probability score. Meanwhile, the role of RIPPER is to automatically use these features to classify emails as ham or phishy. Two components of the email were utilised by Khadi and Shinde: the email message (spelling errors, embedded link) and the URL (IP address, Length, Long URL, Suffix_Prefix, Crawler URL, Non matching URL). Moreover, very limited data consisting of just 100 instances from Phishtank was used in experiments involving the WEKA software tool. No comparison with other fuzzy logic or rule-based classifications was conducted by the authors. Results showed that there are
twelve rules generated by RIPPER from the dataset with an 85.4% prediction rate.

Aburrous et al. [26] investigated rule induction methods to assess their applicability for categorising websites based on phishing features. Website features were initially manually classified into six criteria as described in an earlier report on phishing by Aburrous et al. [22]. Using WEKA, a number of experiments with four classification algorithms (RIPPER, PART, PRISM, C4.5) were conducted against 1006 instances downloaded from Phishtank. The focus of the experiments was the classification accuracy of the classifiers produced. Results revealed that rule induction is a promising approach because it was able to detect, on average, 83% of phishing websites. The authors suggested that the results obtained could be further enhanced if careful feature selection were employed.

4.2.2. Associative classification (AC)

The two AC methods CBA and MCAR have been evaluated on a Phishtank dataset to assess their applicability in cracking phishing ([58,60], Aburrous et al., 2010b). Aburrous et al. (2010b) used a dataset consisting of over 1000 instances with 27 different features and applied CBA, MCAR, and four other rule-based classifiers using the WEKA DM tool. The aim was to assist security managers within organisations by building an intelligent anti-phishing tool within browsers that can detect phishing as accurately as possible. Experimental results of the six ML algorithms revealed that AC methods generated more rules than the rest of the algorithms, yet produced classifiers with higher predictive accuracy. More specifically, the AC systems produced showed high correlations among features linked with three major criteria: URL, Domain Identity, and Encryption. Nevertheless, the massive number of rules derived by MCAR and CBA may overwhelm end-users since they might not be able to control the anti-phishing system. Furthermore, the authors did not implement the AC rules within a browser to evaluate their real performance, which makes it difficult to measure the success or failure of their classification systems.

Recently, more domain-specific AC anti-phishing systems have been created [4,18]. These new models take into account not only the two class values of the phishing problem (legitimate, phishy) but also a harder case to detect: the “suspicious” class label. Instances that cannot be fully classed as either phishy or legitimate are very hard to detect by typical ML algorithms, thus increasing their false positive rates. Abdelhamid et al. [18] and [4] have therefore enhanced current intelligent classification systems by including two distinct advantages: (1) extending the phishing problem to include suspicious cases, making it more realistic; and (2) proposing a new multi-label learning phase that can discover disjunctive in addition to conjunctive rules. These additional disjunctive rules are tossed out by existing AC methods. This new multi-label phase enhances predictive power and provides more useful knowledge to the end-user. The authors used a dataset that has 16 features and over 1500 instances, comparing the performance of their classifiers with other rule-based classifiers with respect to the knowledge derived and its accuracy. The authors employed the chi-square testing method to measure the goodness of the features and discriminate among features with respect to their impact on phishing. Results on the processed data showed highly competitive performance of the new multi-label associative classifiers when compared with CBA, MCAR, rule induction, and decision trees.

4.2.3. Neural network (NN)

One of the common ways to train a NN is trial and error [32]. However, this methodology has been criticised because of the time spent tuning the parameters and the requirement of an available domain expert. Thabtah et al. [30] proposed a NN anti-phishing model based on self-structuring the classification system rather than using trial and error. The algorithm proposed by the authors updated several parameters, like the learning rate, in a dynamic way before adding a new neuron to the hidden layer. The process of updating these NN features is performed during the building of the classification model and is based on the network environment, the behaviour of the desired error rate, and the computed error rate at that point. The dynamic NN model was applied to detect phishing on a large dataset from UCI containing over 11 000 websites [12]. Experiments using different epoch sizes (100, 200, 500, 1000) have been conducted, and the results obtained exhibited better predictive systems when compared to Bayesian Network and Decision Trees.

The ANN Back Propagation algorithm [70] was investigated on a security dataset concerning website phishing by Mohammad et al. [71]. The authors collected a dataset with over 2000 instances from different legitimate and phishing sources. When processing the dataset, they tried to measure the correlation between the features and the target attribute using basic univariate statistical analysis (frequency of feature values and the target attribute values). Finally, they applied the Back Propagation ANN algorithm to derive anti-phishing models. The results of the study indicated that ANN is a promising approach for combating phishing, particularly since the results showed increased accuracy of the models generated from the Back Propagation algorithm when compared with decision trees and probabilistic approaches.

Mohammad et al. [32] have developed an anti-phishing NN model that relies on constantly improving the learned predictive model based on previous training experiences. Since phishers continuously update their deception methods, new features become apparent while others become insignificant. In order to cope with these changes, the authors proposed a self-structuring NN classification algorithm that deals with the vitality of phishing features. The algorithm employs validation data to track the performance of the constructed network model and make the appropriate decision based on results obtained against the validation dataset. For instance, when the achieved error of the network is lower than the minimum achieved error so far, the algorithm saves the network's weights and continues the training process. However, when the achieved error is larger than the minimum achieved error so far, the algorithm continues the training process without saving the weights. Other important network parameters are also updated when necessary during the building of the classification model without waiting until the model has been entirely built. Results obtained against a phishing dataset of thirty features and over 10 000 instances showed that the self-structuring NN model is able to generate anti-phishing models more accurately than traditional classification approaches such as C4.5 and probabilistic approaches.

Recently, a new machine learning technique based on Long Short Term Memory (LSTM) ANN was proposed to deal with spear phishing posts on social media [72]. The LSTM model was trained on different posts on social media that were represented as word vectors. The author enhanced the classification model by using clustering techniques. Experimental results revealed that the LSTM ANN classification model is more accurate than manual classification and other models obtained from former email attack campaigns.

Feed Forward NN (FFNN) was applied to an email phishing classification problem by Jameel and George [33]. A basic implementation of a multilayer FFNN based on Back Propagation was used to differentiate suspicious from legitimate emails. Eighteen binary features were extracted from the email (header and HTML body) and made available as the training dataset attributes. These features were given values based on human rules developed by security domain experts. To derive the NN models, 6000 emails were used. The results obtained showed that FFNN is able to categorise emails with high speed and with an error rate below
2%. However, the authors have not yet embedded their FFNN into browsers for live testing.

In 2007, an experimental study contrasting five ML algorithms on the problem of classifying emails as ham or suspicious was conducted by Abu-Nimeh et al. [73]. The authors chose Classification and Regression Trees (CART), NN, Random Forests (RF), Bayesian Additive Regression Trees (BART), and Logistic Regression (LR) to measure the most successful approaches in email phishing detection. A training dataset consisting of 2889 emails and 43 email features was used. To produce the results, the testing method employed was ten-fold cross validation and the evaluation measures used were precision, recall, and harmonic mean. Results revealed that RF achieved the lowest error rate while NN generated the highest error rate among the tested classifiers. Moreover, despite RF generating the most predictive classifiers, it derived the lowest false positive rate among all contrasted algorithms. The authors suggested, though, that more carefully chosen features may improve the performance of the anti-phishing email tool.

Table 1
Phishing features per category [22].

Criteria | N | Phishing indicators
URL | 1 | IP address
URL | 2 | Abnormal request URL
URL | 3 | Abnormal URL of anchor
URL | 4 | Abnormal DNS record
URL | 5 | Abnormal URL
Encryption | 1 | Using SSL certificate (Padlock Icon)
Encryption | 2 | Certificate authority
Encryption | 3 | Abnormal cookie
Encryption | 4 | Distinguished names certificate
Source code | 1 | Redirect pages
Source code | 2 | Straddling attack
Source code | 3 | Pharming attack
Source code | 4 | OnMouseOver to hide the Link
Source code | 5 | Server Form Handler (SFH)
Page style & contents | 1 | Spelling errors
Page style & contents | 2 | Copying website
Page style & contents | 3 | Using forms with Submit button
Page style & contents | 4 | Using pop-ups windows
Page style & contents | 5 | Disabling right-click
Web address | 1 | Long URL address
Web address | 2 | Replacing similar char for URL
Web address | 3 | Adding a prefix or suffix
Web address | 4 | Using the @ Symbol to confuse
Web address | 5 | Using hexadecimal char codes
Human | 1 | Emphasis on security
Human | 2 | Public generic salutation
Human | 3 | Buying time to access accounts

4.2.4. Support vector machine (SVM)

Proposed by Pan and Ding [74], the SVM classification method evaluates the discrepancy between a website's identity, its HTTP transactions, and structural features. The anti-phishing solution proposed contains two layers:

◦ Website Identity: The set of characters appearing inside the domain name.
◦ Structural Features Classifier: Features that are related to the website identity and HTTP transactions.
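
To make the classification layer more concrete, the following is a minimal sketch and not Pan and Ding's implementation: a linear soft-margin SVM trained by stochastic sub-gradient descent on the hinge loss, written in plain Python. The six binary indicator names, the toy training rows, and the hyper-parameters (lam, lr, epochs) are all assumptions made for illustration; the source does not state which kernel or solver was used.

```python
# Illustrative sketch only: a linear soft-margin SVM trained with stochastic
# sub-gradient descent on the hinge loss. Feature names, toy data and
# hyper-parameters are assumptions, not taken from Pan and Ding [74].

# Hypothetical binary indicators: 1 = abnormal, 0 = normal.
FEATURES = ["abnormal_url", "abnormal_anchor", "abnormal_sfh",
            "abnormal_certificate", "abnormal_dns", "abnormal_cookie"]

# Toy "historical" dataset: label +1 = phishy, -1 = legitimate.
X_TRAIN = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
Y_TRAIN = [1, 1, 1, -1, -1, -1]


def score(w, b, x):
    """Decision value of the linear model: w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=500):
    """Minimise lam/2 * ||w||^2 + hinge loss, one example at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * score(w, b, xi)
            # Gradient of the L2 regulariser shrinks the weights slightly.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Example is inside the margin or misclassified:
                # move the separating hyperplane away from it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (phishy) or -1 (legitimate)."""
    return 1 if score(w, b, x) >= 0 else -1
```

Calling train_linear_svm(X_TRAIN, Y_TRAIN) returns one weight per indicator plus a bias, and predict then maps a new six-value feature vector to +1 or -1. On real data a library implementation with cross-validated parameters would normally be preferable to this hand-rolled loop.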

Once a new website identity and its structural features are captured (Abnormal URL, Abnormal anchors, Server Form Handler, Abnormal certificate in SSL, Abnormal DNS, Abnormal cookies), an SVM algorithm is trained on a historical dataset consisting of the same features in order to derive the new website's type. Experimental results on six features using the proposed SVM indicated that the first layer helps towards increasing the detection rate since malicious websites are not correlated. Furthermore, the SVM model achieved a prediction rate of just over 83%, and therefore more investigation is needed into the feature selection phase by including other features that could improve the performance of the classifier.

4.2.5. Fuzzy logic

Phishing in electronic banking (Ebanking) applications has been investigated by Aburrous et al. [22] utilising Fuzzy Logic. A simulated phishing email was sent by the authors with the help of the security manager at Jordan Ahli Bank to measure security indicators of phishing among a sample of 120 employees after obtaining the necessary authorisation (www.ahlionline.com.jo). The email urged the chosen employees to reactivate their accounts by logging in because server maintenance conducted over the previous two days required account reactivations. Shocking results were obtained: 37% of the targeted employees submitted their credentials without investigation, of which 7% were Information Technology employees. The authors' goal with the simulated email was to determine the features that users may look for inside an email when they suspect phishing, to be used within a FL system to help in differentiating types of email.

FL has been used as an anti-phishing model to help classify websites into legitimate or phishy in [22]. The authors claimed that FL could be effective in identifying phishing activities because it provides a simple way of dealing with intervals rather than specific numeric values. Their proposed FL classification model was built manually to categorise websites using the six criteria listed in Table 1. Each of those criteria contains a number of phishing indicators as described in the same table. Each feature in the dataset was assigned three possible values by the authors: Phishy, Genuine, and Doubtful. Limited results indicated that there are two effective indicators to distinguish phishiness in websites: Domain Identity and URL.

Almomani et al. [17] proposed a promising solution to deal with a critical type of email phishing attack called zero-day. This type of email phishing attack involves the utilisation by attackers of hosts that do not appear inside the blacklists of phishing emails. The authors developed a detection system that they name the phishing dynamic evolving neural fuzzy framework (PDENF). This system was able to successfully flag phishing emails using classification rules learnt by semi-supervised learning techniques. In particular, the authors used clustering to ease the process of classification using a neural fuzzy technique.

A fuzzy-based ANN model was proposed in 2015 by Nguyen et al. [6] to classify websites based on a smaller set of phishing features related to the website's URL (PrimaryDomain, SubDomain, PathDomain) and its rank (PageRank, AlexaRank, AlexaReputation). The proposed fuzzy ANN model does not use any rule set; rather, it employs a computational function to split data instances (websites) into “genuine” and “non-genuine” categories. Their model was tested against 21 600 websites from legitimate and phishing sources such as Phishtank and DMOZ. They also compared the generated results with those of Aburrous et al. [26] and Zhang and Yuan (2008). It was discovered that their fuzzy NN model was able to slightly enhance the phishing detection rate.

4.2.6. CANTINA term frequency inverse document frequency approach

Carnegie Mellon Anti-phishing and Network Analysis Tool (CANTINA) is a content-based anti-phishing method that determines suspicious websites using the statistical measure of Term Frequency Inverse Document Frequency (TF–IDF). Term Frequency (TF) is a statistical formula that measures keyword significance in a document while Inverse Document Frequency (IDF) measures the importance of that keyword across a large collection of documents [28]. CANTINA evaluates the website content (links, anchor tags, forms tags, images, text, etc.) for TF–IDF to produce a lexical
I. Qabajeh et al. / Computer Science Review 29 (2018) 44–55 53

Table 2
Common anti-phishing methods based on ML.
Method name ML technique First Author Reference
Dynamic rule induction Rule induction learning Qabajeh Issa Qabajeh, et al. 2014
Enhanced dynamic rule induction Rule induction and covering Thabtah Fadi Thabtah et al. [29,30]
approaches
Classification based association AC Aburrous Maher Aburrous et al. [23,26]
Multi-label classifier based associative AC Abdelhamid Neda Abdelhamid et al. [18]
classification
Self-structuring neural network NN Mohammad Rami Mohammad et al. [25,32]
Neural network trained with NN Mohammad Rami Mohammad et al. [71]
back-propagation
Feed forward neural network NN Jameel Noor Ghazi Jameel and George [33]
Fuzzy DM Fuzzy logic Aburrous Maher Aburrous et al. [22]
Fuzzy DM Fuzzy logic Khadi Anindita Khadi and Shinde [4]
PILFER Decision tree Fette Ian Fette et al. [69]
Page classifier SVM Pan Ying Pan and Ding [74]
PDENF Fuzzy and clustering Almomani, Ammar Almomani et al. [14,17]
CANTINA Term frequency and inverse document Sanglerdsinlapachai Nuttapong Sanglerdsinlapachai and Rungsawang [75]
frequency
Biased SVM, LIBSVM, ANN, self-organising NN, SVM and other ML techniques Basnet Ram Basnet et al. [15]
map

signature of the website. This signature (the top-ranked TF–IDF keywords) is passed to a search engine, and the domain names in the ranked results are used to decide the type of the website. The CANTINA-based classification process is as follows:

1. Parse the webpage.
2. Compute the TF–IDF for the common terms of the website.
3. Select the top five terms according to their computed TF–IDF scores.
4. Add the top five terms to the URL to obtain the lexical signature.
5. Input the lexical signature into a search engine.
6. Check whether the domain name of the current website matches the domain names of the top N search results (often N = 30).
7. Return “Legitimate” when there is a match or “Phishy” when there is no match.

When the search returns an empty set, the current website is classified as “phishy”. To overcome this “no results” problem, the authors merged TF–IDF with other content features such as “IP Address”, “domain age”, “suspicious Images”, “suspicious Link”, and “suspicious URL”.

Sanglerdsinlapachai and Rungsawang [75] used CANTINA TF–IDF, added a few more features such as “Forms” and “Top pages” similarity linked with the domain, and removed features such as “domain age” and “known images”. A dataset consisting of 200 websites was used in the experiments, and three DM methods were applied to it. The results, despite being limited, revealed that the reduced feature set maintained a detection rate similar to that of the CANTINA feature set. Moreover, adding the new features slightly enhanced the detection rate for most of the learning methods considered in the experiments.

Table 2 shows a brief summary of the common anti-phishing approaches that are based on automated learning, along with the name of the method, the learning approach used, the first author, and their reference details.

5. Conclusions

Website phishing classification is a fundamental problem due to the very large number of online transactions performed by businesses, individuals and governments. While many users are vulnerable to phishing attacks, playing catch-up to the phishers’ evolving strategies is not an option. There have been different approaches to combat phishing, ranging from legal, educational, simulation, online community forums and blacklists to machine learning, among others. Unlike existing phishing reviews that were based only around intelligent techniques such as machine learning and data mining, this paper also focuses on raising awareness and educating users about phishing from training and legal perspectives. This will equip individuals with knowledge and skills that may prevent phishing in a wider context within the community. In this paper, we review conventional anti-phishing approaches such as law enforcement, user training, and education, and critically analyse their different methods. Attention is then directed to predictive ML methods, particularly rule-based methods, decision trees, associative classification, SVM, NN, and computational intelligence. We contrast the ways these methods detect phishing activities, their performance, and their advantages and disadvantages.

While many countries such as the USA have taken the lead in criminalising phishing activities and introducing more severe legislation, it is still hard to find attackers, mainly because phishing attacks have a short life span. Despite this limitation, it is still crucial that law enforcement agencies improve their information-sharing work as well as jurisdiction. Moreover, educating novice users using visual cues can partly improve their ability to detect phishing; however, many novice users still do not pay close attention to visual cues when browsing the internet, which makes them vulnerable to phishing attacks. Users need to be exposed to repeated training about phishing attacks since phishers continuously change their deception tactics.

Online phishing communities gather data that allow users to share information about phishing attacks, such as blacklisted URLs, which is a useful information centre for users. However, this approach necessitates good awareness of web security indicators; moreover, blacklisted URLs become outdated because updates are not performed in real time.

Finally, anti-phishing methods based around ML, especially AC and rule induction, are suitable to combat phishing due to their high detection rate and, more importantly, the easy-to-understand outcomes they offer (If-Then rules). These rules empower novice users as well as security experts to understand and manage security indicators. However, adding a visualisation layer to ML methods is advantageous to novice users as they may react quickly to visual cues.

In the near future we intend to design and implement a knowledge base using rule induction that can warn online users in real time of any possibility of phishing attacks.

References

[1] N. Abdelhamid, F. Thabtah, Associative classification approaches: Review and comparison, J. Inf. Knowl. Manage. (JIKM) 13 (3) (2014) 1450027.
[2] S. Sheng, M. Holbrook, N.A.G. Arachchilage, L. Cranor, J. Downs, Who falls for phish?: a demographic analysis of phishing susceptibility and effectiveness of interventions, in: CHI ’10 Proceedings of the 28th International Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 2010.
[3] N. Abdehamid, Multi-label rules for phishing classification, Appl. Comput. Inf. 11 (1) (2015) 29–46.
[4] A. Khadi, S. Shinde, Detection of phishing websites using data mining techniques, Int. J. Eng. Res. Technol. 2 (12) (2014).
[5] Afroz, R. Greenstadt, PhishZoo: Detecting phishing websites by looking at them, in: Fifth International Conference on Semantic Computing (September 18–September 21), IEEE, Palo Alto, California USA, 2011.
[6] L.A.T. Nguyen, B.L. To, H.K. Nguyen, An efficient approach for phishing detection using neuro-fuzzy model, J. Autom. Control Eng. 3 (6) (2015).
[7] McCall, Gartner, Inc. 2011. http://www.gartner.com/newsroom/id/565125 [Accessed 05.06.16].
[8] D. Jevans, Anti-Phishing Working Group (APWG): http://www.antiphishing.org/ [Accessed 20.06.16], 2003.
[9] G. Aaron, R. Manning, APWG Phishing Reports, 2014. http://docs.apwg.org/reports/apwg_trends_report_q4_2014.pdf [Accessed 20.03.16].
[10] V. Suganya, A review on phishing attacks and various anti phishing techniques, Int. J. Comput. Appl. (0975–8887) 139 (1) (2016) 20–23.
[11] R. Mohammad, F. Thabtah, L. McCluskey, Tutorial and critical analysis of phishing websites methods, Comput. Sci. Rev. J. 17 (2015) 1–24. Elsevier.
[12] R. Mohammad, F. Thabtah, L. McCluskey, Phishing websites dataset. 2015, Available: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites Accessed January 2016.
[13] K.R. Sahu, J. Dubey, A Survey on phishing attacks, Int. J. Comput. Appl. (0975–8887) 88 (10) (2014) 42–45.
[14] A. Almomani, B.B. Gupta, S. Atawneh, A. Meulenberg, E. Almomani, A survey of phishing email filtering techniques, IEEE Commun. Surv. Tutor. 15 (4) (2013) 2070–2090.
[15] R. Basnet, S. Mukkamala, A.H. Sung, Detection of phishing attacks: A machine learning approach, Soft Comput. Appl. Ind. (2008) 373–383.
[16] B.B. Gupta, N.A.G. Arachchilage, K.E. Psannis, Defending against phishing attacks: taxonomy of methods, current issues and future directions, Telecommun. Syst. 2017 (2017) 1–21.
[17] A. Almomani, B.B. Gupta, TC. Wan, A. Altaher, S. Manickam, Phishing dynamic evolving neural fuzzy framework for online detection zero-day phishing email. 2013. arXiv preprint arXiv:1302.0629 (2013).
[18] N. Abdelhamid, F. Thabtah, A. Ayesh, Phishing detection based associative classification data mining, Expert Syst. Appl. J. 41 (2014) 5948–5959.
[19] B.B. Gupta, A. Tewari, AK. Jain, D.P. Agrawal, Fighting against phishing attacks: state of the art and future challenges, 2017.
[20] M. Rader, S. Rahman, Exploring historical and emerging phishing techniques and mitigating the associated security risks, Int. J. Netw. Secur. Appl. (IJNSA) 5 (4) (2015). http://dx.doi.org/10.5121/ijnsa.2013.540223. July 2013.
[21] McFredies, P (n.d.) Phishing. 2016. http://www.wordspy.com/words/phishing.asp [Accessed 15.05.16].
[22] M. Aburrous, A. Hossain, K. Dahal, F. Thabtah, Intelligent quality performance assessment for E-Banking security using fuzzy logic, in: Proceedings of the 7th IEEE International Conference on Information Technology (ITNG 2008). Las Vegas, USA, 2008.
[23] M. Aburrous, M. Hossain, K.P. Dahal, F. Thabtah, Associative Classification techniques for predicting e-banking phishing websites, in: Proceedings of the 2010 International Conference on Information Technology, Las Vegas, Nevada, USA, 2010, pp. 176–181.
[24] I. Qabajeh, F. Thabtah, F. Chiclana, Dynamic classification rules data mining method, J. Manage. Anal. 2 (3) (2015) 233–253. Wiley.
[25] R. Mohammad, F. Thabtah, L. McCluskey, Intelligent rule based phishing websites classification, J. Inf. Secur. (ISSN: 17518709) (2) (2014) 1–17. IET.
[26] M. Aburrous, M. Hossain, K.P. Dahal, F. Thabtah, Experimental case studies for investigating e-banking phishing techniques and attack strategies, J. Cogn. Comput. 2 (3) (2010) 242–253. Springer Verlag.
[27] PhishTank, 2011. PhishTank. http://www.phishtank.com/ [Accessed 16.01.16].
[28] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2005.
[29] F. Thabtah, R. Mohammad, L. McCluskey, A dynamic self-structuring neural network model to combat phishing, in: The Proceedings of the 2016 IEEE World Congress on Computational Intelligence. Vancouver, Canada, 2016.
[30] F. Thabtah, I. Qabajeh, F. Chiclana, Constrained dynamic rule induction learning, Expert Syst. Appl. 63 (2016) 74–85.
[31] J. Lee, L. Bauer, L.M. Mazurek, The effectiveness of security images in internet banking, IEEE Internet Comput. 19 (1) (2015) 54–62.
[32] R. Mohammad, F. Thabtah, L. McCluskey, Predicting phishing websites based on self-structuring neural network, J. Neural Comput. Appl. (ISSN: 0941-0643) 25 (2) (2014) 443–458. Springer.
[33] N.Gh. Jameel, L. George, Detection of phishing emails using feed forward neural network, J. Comput. Appl. 77 (7) (2013) 10–15.
[34] Information Week (n.d.), 2016. http://www.informationweek.com/california-enacts-tough-anti-phishing-law-/d/d-id/1036636? [Accessed 17.03.16].
[35] General Assembly of Virginia, 2005. CHAPTER 827. http://leg1.state.va.us/cgi-bin/legp504.exe?051+ful+CHAP0827 [Accessed 01.04.16].
[36] G.H. Pike, Lost data: The legal challenges, Inf. Today 23 (10) (2006) 1–3.
[37] Executive-Order-13402, 2006. Executive Order 13402. http://www.gpo.gov/fdsys/pkg/FR-2006-05-15/pdf/06-4552.pdf [Accessed 19.05.16].
[38] BBC News, 2005. http://news.bbc.co.uk/2/hi/uk_news/england/lancashire/4396914.stm [Accessed 11.04.16].
[39] Government of Australia, Hackers, Fraudsters and Botnets: Tackling the Problem of Cyber Crime. Report on Inquiry into Cyber Crime, 2011.
[40] ClickDimensions, 2014. www.clickdimensions.com/sites/default/files/PDF/WhitePaper-CASL.pdf [Accessed 12.05.16].
[41] S.D. Julie, H. Mandy, L.F. Cranor, Behavioral response to phishing risk, in: The Anti-Phishing Working Groups, 2nd Annual eCrime Researchers Summit, eCrime ’07, ACM, New York, NY, USA, 2007.
[42] N.A.G. Arachchilage, S. Love, A game design framework for avoiding phishing attacks, Comput. Hum. Behav. 29 (3) (2013) 706–714.
[43] N.A.G. Arachchilage, M. Cole, Design a mobile game for home computer users to prevent from phishing attacks, in: 2011 International Conference on Information Society (i-Society), 2011, pp. 485–489.
[44] N.A.G. Arachchilage, Y. Rhee, S. Sheng, S.H. Hasan, A. Acquisti, L.F. Cranor, J. Hong, Getting users to pay attention to anti-phishing education: evaluation of retention and transfer, in: eCrime ’07 Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, ACM, Pittsburgh, PA, USA, 2007.
[45] D.J.C. Ronald, C. Curtis, F.J. Aaron, Phishing for user security awareness, Comput. Secur. 26 (1) (2007) 73–80.
[46] M. Bright, MillerSmiles. 2011, [Online] Available at: http://www.millersmiles.co.uk/ [Accessed 09.01.16].
[47] B. Nahorney, The MessageLabs Intelligence Annual Security Report: 2009 Security Year in Review. 2015. http://www.symantec.com/content/en/us/enterprise/other_resources/intelligence-report-06-2015.en-us.pdf [Accessed 09.06.16].
[48] Cloudmark Org. Cloudmark. 2002. http://www.cloudmark.com/en/home [Accessed 10.02.16].
[49] WOT, Web of Trust. 2006. http://www.mywot.com/ [Accessed 24.03.16].
[50] Google Safe-Browsing, 2010. Google Safe Browsing. http://code.google.com/p/google-safe-browsing/ [Accessed 10.04.16].
[51] McAfee SiteAdvisor, 2006. McAfee SiteAdvisor. http://www.siteadvisor.com/ [Accessed 19 February 2016].
[52] Return Path, 2016. https://blog.returnpath.com/blacklist-basics-the-top-email-blacklists-you-need-to-know-v2/ [Accessed 22.03.16].
[53] J. Chen, C. Guo, Online detection and prevention of phishing attacks (Invited Paper), in: First International Conference on Communications and Networking in China, ChinaCom ’06, IEEE, Beijing, 2006.
[54] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA data mining software: An update, SIGKDD Explor. 11 (1) (2009).
[55] H. Liu, R. Setiono, Chi2: Feature selection and discretization of numeric attribute, in: Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence, November 5–8, 1995, p. 388.
[56] J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[57] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.
[58] B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: Proceedings of the Knowledge Discovery and Data Mining Conference-KDD, 1998, pp. 80–86. New York.
[59] W. Li, J. Han, J. Pei, 2001 CMAR: Accurate and efficient classification based on multiple-class association rule, in: Proceedings of the IEEE International Conference on Data Mining-ICDM, pp. 369–376.
[60] F. Thabtah, P. Cowling, Y. Peng, MCAR: Multi-class classification based on association rule approach, in: Proceedings of the 3rd IEEE International Conference on Computer Systems and Applications, 2005, pp. 1–7.
[61] W.W. Cohen, Fast effective rule induction, in: Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, Tahoe City, California, 1995.
[62] J. Cendrowska, PRISM: An algorithm for inducing modular rules, Int. J. Man-Mach. Stud. 27 (4) (1987) 349–370.
[63] Grossberg, Nonlinear neural networks: Principles, mechanisms, and architectures, Neural Netw. 1 (1) (1988) 17–61.
[64] H. Joachims, Making Large-Scale Support Vector Machine Learning Practical, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[65] J. Platt, Fast training of SVM using sequential optimization, in: B. Scholkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods–Support Vector Learning, MIT Press, Cambridge, 1998, pp. 185–208.
[66] L.A. Zadeh, “Fuzzy Sets,” Information and Control 8 (3) (1965) 338–353. http://dx.doi.org/10.1016/S0019-9958(65)90241-X.
[67] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[68] E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, MA: Addison Wesley, 1989.
[69] I. Fette, N. Sadeh, A. Tomasic, Learning to detect phishing emails, in: Proceedings of the 16th international conference on World Wide Web. 2007, pp. 649–656.
[70] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, Learning representations by back-propagating errors, Nature 323 (6088) (1986) 533–536.
[71] R.M. Mohammad, F. Thabtah, L. McCluskey, Predicting phishing websites using neural network trained with back-propagation, in: World Congress in Computer Science, Computer Engineering, and Applied Computing, Las Vegas, 2013, pp. 682–686.
[72] J. Seymour, P. Tully, Generative Models for Spear Phishing Posts on Social Media. Technical report, 2018.
[73] S. Abu-Nimeh, D. Nappa, X. Wang, Nair, A comparison of machine learning techniques for phishing detection, in: The 2nd Annual Anti-Phishing Working Groups eCrime Researchers, eCrime ’07, ACM, New York, NY, USA, 2007.
[74] Y. Pan, X. Ding, Anomaly based web phishing page detection, in: The 22nd Annual Computer Security Applications Conference, (ACSAC), IEEE, Miami Beach, Florida, USA, 2006.
[75] N. Sanglerdsinlapachai, A. Rungsawang, Using domain top-page similarity feature in machine learning-based web, in: Third International Conference on Knowledge Discovery and Data Mining, IEEE, 2010.
