

CYBERCRIME AND CYBERSECURITY RESEARCH

Additional books in this series can be found on Nova’s website under the Series tab.

Additional e-books in this series can be found on Nova’s website under the eBook tab.
CYBERCRIME AND CYBERSECURITY RESEARCH

KNOWLEDGE DISCOVERY IN CYBERSPACE

STATISTICAL ANALYSIS AND PREDICTIVE MODELING

KRISTIJAN KUK AND DRAGAN RANĐELOVIĆ
EDITORS

New York
Copyright © 2017 by Nova Science Publishers, Inc.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted
in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying,
recording or otherwise without the written permission of the Publisher.

We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to
reuse content from this publication. Simply navigate to this publication’s page on Nova’s website and
locate the “Get Permission” button below the title description. This button is linked directly to the
title’s permission page on copyright.com. Alternatively, you can visit copyright.com and search by
title, ISBN, or ISSN.

For further questions about using the service on copyright.com, please contact:
Copyright Clearance Center
Phone: +1-(978) 750-8400 Fax: +1-(978) 750-4470 E-mail: info@copyright.com.

NOTICE TO THE READER


The Publisher has taken reasonable care in the preparation of this book, but makes no expressed or
implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is
assumed for incidental or consequential damages in connection with or arising out of information
contained in this book. The Publisher shall not be liable for any special, consequential, or exemplary
damages resulting, in whole or in part, from the readers’ use of, or reliance upon, this material. Any
parts of this book based on government reports are so indicated and copyright is claimed for those parts
to the extent applicable to compilations of such works.

Independent verification should be sought for any data, advice or recommendations contained in this
book. In addition, no responsibility is assumed by the publisher for any injury and/or damage to
persons or property arising from any methods, products, instructions, ideas or otherwise contained in
this publication.

This publication is designed to provide accurate and authoritative information with regard to the subject
matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in
rendering legal or any other professional services. If legal or any other expert assistance is required, the
services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS
JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A
COMMITTEE OF PUBLISHERS.

Additional color graphics may be available in the e-book version of this book.

Library of Congress Cataloging-in-Publication Data


ISBN:  (eBook)

Published by Nova Science Publishers, Inc. † New York


CONTENTS

Preface

Chapter 1. Computer-Based Data Analysis Techniques: The Potential Application to Crime Investigation in Cyber Space
D. Marinković and T. Civelek

Chapter 2. Spatial Data Visualization as a Tool for Analytical Support of Police Work
N. Milić, B. Popović, V. Ilijazi and E. Ilijazi

Chapter 3. Cybercrime Influence on Personal, National and International Security while Using the Internet
I. Cvetanoski, J. Achkoski, D. Rančić and R. Stainov

Chapter 4. Some Aspects of the Application of Benford’s Law in the Analysis of the Data Set Anomalies
D. Joksimović, G. Knežević, V. Pavlović, M. Ljubić and V. Surovy

Chapter 5. Behaviour and Attitudes vs. Privacy Concerns of Social Online Networks
G. Savić and M. Kuzmanović

Chapter 6. Information Retrieval and Development of Conceptual Schemas in E-Documents for Serbian Criminal Code
V. Nikolić, P. Đikanović and S. Nedeljković

Chapter 7. Development of the Android-Based Secure Communication Device
A. Jevremović, M. Veinović, G. Šimić, N. Savanović and D. Ranđelović

Author Contact Information

Index
PREFACE

This book on knowledge discovery in cyberspace consists of a current collection of research with contributions by authors from different nations in
different disciplines. After reading this book, the reader should be able to:
understand the fundamental nature of cyberspace; understand the role of
cyber-attacks; learn analytical techniques and the challenges of predicting
events; learn how languages and culture are influenced by cyberspace; and
learn techniques of the cyberspace public opinion detection and tracking
process. Understanding cyberspace is the key to defending against digital
attacks. The book takes a global perspective, examining the skills needed to
collect and analyze event information and perform threat or target analysis
duties in an effort to identify sources for signs of compromise, unauthorized
activity and poor security practices. The ability to understand and react to
events in cyberspace in a timely and appropriate manner will be key to future
success. Most of the contributions are research-based practices that have been developed over the years.
This book is a practical handbook of research on mathematical methods in crime prevention for special agents, and discusses the capabilities and benefits that stem from integrating statistical analysis and
predictive modeling. The authors hope that the presented work will be of great
use to police investigators and cyber special agents interested in predictive
analytics.

Synopsis of Book Chapters


The book comprises seven chapters. In Chapter 1, Marinković, D. and Civelek, T. present potential applications based on association rule algorithms. The goal of the algorithms used is to reveal tendencies that are often ‘hidden’, such as internet identity, political opinion, music preferences, religion, etc. The Web space, especially social media, is an invaluable source of data and often the first place crime investigators refer to in order to obtain relevant information. Automatic data search and matching is a powerful tool enabling fast and efficient searching of large databases for crime investigators. In their experiment, the authors tried to establish how efficient the aforementioned apriori algorithm is in blog analysis when association rules are utilized.
Chapter 2 by Milić, N. et al. presents some of the GIS technology capabilities in the function of analytical support of police work at all levels of police organization and management. In addition to the visualization of geospatial data, GIS technology provides analytical capacity, primarily in analyzing the geospatial distribution of crime incidents. Compared to textual crime reports (bulletins), crime maps inform law enforcement officers much faster and more easily about the spatial distribution of crime. The innovative functionalities of predictive analytic solutions are also briefly described by the authors.
The aim of Chapter 3 by Cvetanoski, I. et al. is to stress the danger of cybercrime activities in cyberspace and their impact on personal, national and international security. Today, modern technology gives great opportunity to use on-line tools for performing cybercrime activities, which means that anyone can create malicious software for criminal activities in cyberspace. The authors used simple linear regression in order to predict computer crime in the future. The results of their research show that a simple linear regression model can be used to make a prediction of computer crime a year or two into the future, but it is not a good model for making predictions very far into the future.
In Chapter 4, Joksimović et al. present the generally accepted theoretical analysis and assumptions regarding the implementation of Benford's Law. The implementation of this law in the analysis of anomalies in numerical data in various scientific disciplines is also part of this chapter. The authors show how the joint usage of Benford's Law and specific laws of mathematical statistics successfully detects potential irregularities in numerical data and guides the forensic analyst forward in the detection of potential fraud.
The next chapter (Chapter 5, by Savić, G. and Kuzmanović, M.) investigates the relationship between the concerns of Online Social Network (OSN) users and their behavior and attitudes towards privacy. The behavior is investigated through groups of questions concerning the habits of using OSNs, leaving real personal data, or connecting with unknown people. In this chapter, the authors focus on the relationship between privacy concerns and the actual behavior of online social network users in Serbia.
Text analysis and classification techniques can be used to improve the efficiency and effectiveness of e-government services, especially those provided by law enforcement agencies. Using techniques of automatic analysis of text reports, Nikolić, V. et al. (Chapter 6) propose concepts representing the documents via conceptual schemas. The authors present the data mining and Lucene library architecture, as well as the Lucene core, and then the possibilities of its application. Their case study deals with the possibilities of Lucene indexing and searching of data and documents within unstructured crime text documents in the Serbian language.
The final chapter, by Jevremović, A. et al., presents a model for the development of a custom system for secure communication based on custom encryption algorithms. The authors discuss key issues related to the development of mobile devices for secure communication based on the Android platform. This chapter also analyzes the choice between computationally secure encryption systems and absolutely secure encryption systems. The solution presented in the chapter proposes building custom algorithms in the form of Linux kernel modules.
We would like to express our special thanks to the eight reviewers who
participated in the peer review process:

• Alaa Hussein Al-hamami, Dean of Database Security, College of Computer Sciences and Informatics, Amman Arab University, Jordan
• Damir Delija, Digital Forensic Department, INsig2, Croatia
• Dimitar Bogatinov, Military academy "General Mihailo Apostolski" - Skopje, "Goce Delchev" University – Shtip, Macedonia
• Ejub Kajan, State University of Novi Pazar, Serbia
• Lepiokhin Alexander, Department of legal informatics, The Ministry of internal Affairs of the Republic of Belarus, Belarus
• Ranka Stanković, HLT Group, University of Belgrade, Serbia
• Wei Wang, Department of Public Security Fundamentals, National Police University of China, China
• Yuri Savva, Orel State University, Russian Federation
Kristijan Kuk,
Dragan Ranđelović
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 1

COMPUTER-BASED DATA
ANALYSIS TECHNIQUES:
THE POTENTIAL APPLICATION TO
CRIME INVESTIGATION IN CYBER SPACE

Darko Marinković1,*, PhD and Turhan Civelek2, PhD

1 Academy of Criminalistic and Police Studies, Department of Criminalistics, Belgrade, Serbia
2 Kirklareli University, Engineering Faculty, Software Engineering Department, Kırklareli, Turkey

* Corresponding author: D. Marinkovic, Email: darko.marinkovic@kpa.edu.rs.

ABSTRACT
Collecting the most versatile information about individuals and storing it in different databases represents the reality of contemporary society. The growth in the quantity of information has exceeded man’s power to process and analyze it in a traditional manner, making
computerized techniques and means, especially data mining techniques, a
necessity for these purposes. Although widely applied in the public
administration and economy domain, computerized data search and
comparison so far have not reached their full potential in the area of
crime investigation and forensics. Law enforcement agencies and forensic
laboratories collect large quantities of various data originating not only from a person’s criminal activities, but his/her social activities as well
(i.e., blogs, social media, etc.). The very success of a crime investigation
depends to a large extent on the availability of relevant data (referring to
persons, objects or events) and on finding the often hidden relationships among them. The Web space, especially social media, is an invaluable source of data and often the first place crime investigators refer to in order to obtain
relevant information.

Keywords: data analysis, data surveillance, data mining, cyber space, crime
investigation

INTRODUCTION
It is a general view today that the organization of modern human society inevitably relies on collecting and managing the most varied data about its members. The efficient functioning of the governmental as well as the non-governmental sector requires the existence of numerous information registers about different entities (individuals, organizations, etc.), covering all aspects of their activities. Utilization of computer-based information systems has
largely increased the possibility of collecting, processing and analyzing data
for different purposes including surveillance of individuals and their
behaviour. The essential importance of computer storing and processing of
information is not only in the speed of carrying out various operations, but
primarily in the possibility to access the integrated mutually linked data
coming from different sources. The state-of-the art information technology
makes it possible to get these data from different networked databases in split
seconds.
Collecting and storing (naturally in a legal and legitimate manner) different kinds of information about individuals represents the reality of modern society, as does the fact that the persons these data refer to cannot have absolute power over them. However, they have the right to feel secure from possible misuses of these data. This is why the issue of the protection of personal data is highlighted even more today, being particularly prominent in the functioning and activities of state administration institutions and the judiciary, including the police. Accordingly, the availability of personal data must be subject to certain limitations in the general interest, in the same or a similar way as other freedoms and rights are limited. The task of legal science, law-makers and legal practice is to define standard foundations for the collection and management of the most versatile data, i.e., the conditions under which they can be used for socially justified purposes. On the other hand, the actual (primarily technical) possibilities for more comprehensive, complex and sophisticated exploitation of personal data, including data on individual activities in all fields of life and work, are increasing from day to day. Among other things, the exploitation of such data can yield good results in fighting crime as well.
The explosive growth in the quantity of data and of the databases they are stored in has exceeded man’s power to process and analyze them by traditional means, requiring new and different (naturally computer-based) analytical techniques and tools. Regardless of the purpose they are used for, automatic data search
and comparison are based on the databases where certain data are stored, on
the one hand, and on the other hand, on the application of computers
(understood as hardware) and related programs (software) used for the search,
comparison and analysis of these data.

DATA SURVEILLANCE AS A SPECIAL FORM OF PERSONAL SURVEILLANCE
Surveillance may be defined as a systematic investigation or monitoring of
movements or communications of one or more persons in order to collect
information about them, their activities and connections. For a long time, surveillance was implemented by direct physical observation, supported by various devices, including telescopes, cameras, directional microphones, telephone bugs, etc. Conventional forms of surveillance are labour-intensive, costly and time-consuming [1].
During the 20th century the work of public administration has increasingly
included the intensive use of personal data. The expansion of network traffic and the flow of information has additionally contributed to making the huge amounts of interchanged data widely available. Personal surveillance through personal data has become easily achievable, and at the same time much less expensive and simpler than conventional techniques of physical or electronic surveillance. As a result, a new technology called data surveillance emerged, enabling and facilitating the surveillance of large numbers of people by comparing and pairing data (collected from various sources) referring to them. Ever since it started being applied, data surveillance has been a topic of numerous government publications, and its effects and influences have been discussed by many researchers, mostly sociologists and, to some extent, lawyers as well.

In the Anglo-Saxon literature this phenomenon is usually referred to as dataveillance, essentially representing the control, comparison and analysis of systematized personal data obtained while monitoring individuals’ activities during criminal investigations. According to Clarke [2], the two essential modalities of personal surveillance through data are:

1. personal dataveillance, such as checking or verifying the veracity of specific, extraordinary or additional operations and transactions that are contrary to the internal regulations of a certain service or organization, and
2. the surveillance of a large, usually unidentified number of persons (mass dataveillance), such as checking and verifying the veracity of all transactions that are contrary to the internal regulations of a certain organization.

In addition to the two above-mentioned models, there are also a number of facilitative and supporting techniques, such as techniques for the integration of data stored in various databases. In comparison to conventional forms of surveillance, dataveillance is automated, and therefore cheaper and more reliable. This is why its application has flourished during the last 30 years, at first in wealthy societies with developed and sophisticated information technologies, but recently also in developing countries, many of which have legislative problems due to insufficiently developed mechanisms of civil rights protection.

CRIME INVESTIGATION ASPECTS OF COMPUTER SEARCH AND COMPARISON OF DATA
Computer search, analysis and comparison of data for crime-investigation purposes may be versatile, with various expectations and results of application. In the same manner in which the large amounts of data stored in appropriate databases serve the efficient performance of public administration or banking, they can be very useful in crime investigation.
The factors helping in the evaluation of relevance of data mining
techniques application in crime suppression range from the activities from
which the databases result to their quality (the degree of uncertainty, precision
and completeness). Police agencies and forensic laboratories collect large
quantities of various data as a result of processing a variety of criminal
activities. Accordingly, one group of data obtained within the forensic crime
scene investigation consists of the information referring to collected material
of physical origin (for instance, biological traces, traces of tools, fingerprints,
shoeprints, illegal drugs seizures, etc.). This kind of data may be presented
numerically and may be subject to categorization. The features extracted from
these materials are often imprecise (in principle because of the instruments
used for analysis and measurements), incomplete (fragmentary) and uncertain.
The discovered and processed material samples are usually categorized into three groups: 1) useless samples (for instance, their content is obvious without any calculations, or they are irrelevant to the problem observed), 2) useful samples, which provide directly important information that can be worked with, and 3) samples whose pattern requires interpretation, which cannot be classified into the two previous categories and must therefore be examined by experts in the given field [3].
The researchers have developed various automated data mining techniques
which can be utilized for crime suppression both in the field of local police
work and at the national level. Thus the entity extraction technique identifies
the patterns from databases such as texts, images or audio materials. It is used
for the automatic identification of faces, addresses, vehicles or personal
characteristics from narrative police reports. This technique provides
important data for crime analysis [4], but its achievements depend to a large extent on the availability of a large quantity of clean input data.
Cluster techniques systematize data into groups with similar
characteristics in order to maximize or minimize the similarity of the data
within a certain group, for instance, for identification of the suspects who
commit crimes in the similar manner or to differentiate between criminal
groups belonging to different gangs.
Association rule discovery finds the groups of data that appear often in
one database and the patterns of their appearance are defined as regularities.
This technique is often used to trace computer network intruders so that the
certain rules of association could be deduced from the history of interaction
among the users. The researchers can also use this technique for intruder
profiling in order to try to detect (or prevent) possible attacks on the network.
There are also algorithms that can be used to detect social media users and track their behaviour, as well as to detect external network attacks and network intrusions.
Sequence pattern detection (or string pattern detection) finds sequences
that appear often in one set of transactions that occurred at various times.
Pointing to the hidden patterns is useful for crime analysis, but in order to
obtain meaningful results a rich and highly structured database is required.
Deviation detection uses certain measures for the study of data which
noticeably differ from other data. The researchers may use this technique to
detect frauds, hacking into network systems and for other crime analyses.
However, such activities may sometimes seem usual at first sight, which
makes identification of deviating data more difficult.
Classification finds common features among various criminal entities and
organizes them into previously defined classes. This technique is used for the
identification of so called spam e-mail messages, based on linguistic patterns
and structural features of the sender. Often used for prediction of crime trends,
classification may reduce time required for identification of criminal entities.

COMPUTER DATA SEARCH AND COMPARISON TECHNIQUE


There should be a terminological difference between the concepts of
(computer) data search and comparison. The search includes reviewing and
analysis of data contained in certain databases in order to find information
referring to a certain person, action or process which are not visible at first
sight. Defined in such a way, the computer data search is mostly contained in
data mining techniques. On the other hand, comparison implies having a certain amount of data or features in advance, which are then entered and compared with other data from a certain database in order to find common characteristics that connect them and make them similar or the same (pairing). The procedure of computer comparison is almost entirely equivalent to the procedure of computer matching [5].
In various fields of research (primarily statistics and artificial intelligence)
the automated analysis procedures have been developed that reveal hidden
contents within large datasets. The process used to achieve this is usually
called data mining. It refers to the automated analytic process shaped for the
effective and efficient exploration of large data sets in order to reveal and use
valuable, “hidden” information which refers to hitherto unknown facts and
relations [6]. In other words, data mining can be understood as finding the
previously unknown and potentially useful information or knowledge from
large data sets. The basic principle is to create computer programs which scan
such data sets and automatically search for certain, previously defined
patterns. The potential data mining technology depends much on the nature of
the available data sets and it is successfully applied in various professional
fields, for instance in the remote resource management, biometrics, speech
recognition or business and marketing [7]. The data mining procedure uses
algorithms in order to find the important hidden contents in large sets, the
interpretation and understanding of which enables better diagnostics of state of
affairs, better predictions and finally better decision-making.
The basic functions of data mining are: 1) classification, i.e., exploring of
entity features and their sorting into previously determined classes; 2)
clustering, i.e., segmenting of a heterogeneous set of entities into
homogeneous sub-groups, clusters; 3) evaluation, i.e., predicting of unknown
values of continuous variables; 4) detection of changes and deviations in data
from previously measured or standard values; 5) detecting associations, i.e., finding items in transactions which imply the presence of other items in the same transaction, etc. Some authors [9] classify data mining functions into two
sets - the first one is directed analysis, based on supervised learning, including
classification, evaluation and prediction, and the second one is undirected
analysis, based on unsupervised learning, including grouping, association
rules, description and visualization. The dominant view of the nature of data
mining is that it helps reveal only the hypotheses about complex facts and their
relations (see Figure 1).

Association Rule Mining

Association rule algorithms often generate many irrelevant rules that are subsequently rejected during the validation process. A domain expert has to specify constraints on the types of rules of interest before the rule discovery stage in order to reduce the number of discovered rules that are irrelevant.
Data mining algorithms are finite sets of steps that find patterns in large data sets. Patterns may be described by IF ... THEN rules, decision tables, neural networks, genetic algorithms, and linear and nonlinear models. There is no generally accepted data mining algorithm that performs well on all situations and decision problems [10].
This search should not be performed with only one algorithm; instead, various machine learning algorithms should be applied, each of which will give optimal results for a specific (different) data type. To determine the best learning algorithm, kappa values and F-measure values can be compared with the true learning ratio. However, knowing the success rates of learning algorithms does not always bring us to the end. In this case, the accuracy rate,
precision, recall and F-measure values should be considered to compare
success rates [11].

Data and Method of Analysis

The data mining tool WEKA [12] evaluates the data from the surveys by using machine learning algorithms. It also produces a confusion matrix, which is a digital output summarizing the predictions made.
The accuracy rate is used for model performance. It is the ratio of the number of correctly classified samples to the total number of samples.

Accuracy rate = (TP + TN) / (TP + FP + FN + TN)    (1)

The error rate is the complementary part of the accuracy rate. It is the ratio of the number of incorrectly classified samples to the total number of samples.

Error rate = (FP + FN) / (TP + FP + FN + TN)    (2)

Precision is the ratio of the number of true positive samples to the total number of samples classified as positive (true positives plus false positives).

Precision (P) = TP / (TP + FP)    (3)

Recall is the ratio of the number of true positive samples to the total number of actual positive samples (true positives plus false negatives).

Recall (R) = TP / (TP + FN)    (4)

F-measure is the harmonic average of the recall and precision values. If the value is closer to 1, it means that the learning is good.

F-measure (F) = 2 · P · R / (P + R)    (5)

The kappa value (K) determines the reliability of agreement between observers. The kappa value lies between -1 and 1; if the value is 1, there is exact agreement between the observers.

Kappa value (K) = (Po - Pc) / (1 - Pc)    (6)

where Po is the observed agreement and Pc is the agreement expected by chance.

Negative values of kappa (K < 0) are meaningless in terms of reliability. If the kappa value is between 0.41 and 0.60, it is acceptable; if it is between 0.61 and 1, then learning is successful. When the F-measure value of the current algorithm is closer to 1, it means that its machine learning is better than that of the other algorithms [13].
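To make Eqs. (1)-(6) concrete, the following minimal Java sketch (not taken from the chapter) computes all six quality measures from a single binary confusion matrix. The class name LearningMetrics and the four counts are hypothetical, and the kappa calculation assumes the standard two-class formulation in which Po is the observed agreement and Pc the agreement expected by chance.

public class LearningMetrics {

    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts (true/false positives and negatives)
        double tp = 40, fp = 10, fn = 5, tn = 45;
        double n = tp + fp + fn + tn;

        double accuracy = (tp + tn) / n;                                  // Eq. (1)
        double errorRate = (fp + fn) / n;                                 // Eq. (2)
        double precision = tp / (tp + fp);                                // Eq. (3)
        double recall = tp / (tp + fn);                                   // Eq. (4)
        double fMeasure = 2 * precision * recall / (precision + recall);  // Eq. (5)

        // Eq. (6): observed agreement Po versus agreement expected by chance Pc
        double po = (tp + tn) / n;
        double pc = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n);
        double kappa = (po - pc) / (1 - pc);

        System.out.printf("accuracy=%.3f error=%.3f precision=%.3f recall=%.3f F=%.3f kappa=%.3f%n",
                accuracy, errorRate, precision, recall, fMeasure, kappa);
    }
}

With the counts used above the sketch reports an F-measure of about 0.84 and a kappa of 0.70, which by the thresholds quoted from [13] would count as successful learning.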
Data suitable for association rules must be in transactional form. Basically, data mining systems can identify frequent sets in transactional databases and perform data analysis. Apriori is a classic algorithm for frequent itemset mining and association rule learning over transactional databases [14]. The data must be in transaction databases, where one table exists with the attributes Transaction ID (which uniquely identifies the record) and transaction items (items which belong to one transaction), as shown in Table 1. It is obvious that such a relation is not in the first normal form, and therefore a necessary preparation of data from transactional databases must be performed in order to implement the apriori algorithm.

Figure 1. The process of knowledge discovery in databases (KDD) [8].



Table 1. Data in the transaction database form

Transaction ID   Transaction Items
1                A, B, C
2                A, B, C, D, E
3                A, C, D
4                A, C, D, E
5                A, B, C, D

Figure 2. Flowchart of apriori algorithm.

The goal of the apriori algorithm is extracting rules in the form X => Y, where X and Y are itemsets. Two main parameters are considered as evaluation metrics: support [15] and confidence [16]:

• Support: The rule holds with support supp in T (the transaction data set) if supp % of the transactions contain X ∪ Y:

support(X => Y) = (number of transactions in T containing X ∪ Y) / (total number of transactions in T)    (7)

• Confidence: The rule holds with confidence conf in T if conf % of the transactions that contain X also contain Y:

confidence(X => Y) = support(X ∪ Y) / support(X)    (8)
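As a small numeric illustration of Eqs. (7) and (8), the sketch below (not part of the original text) evaluates one hypothetical candidate rule, {B} => {D}, against the five transactions of Table 1; the class name RuleMetrics is an assumption.

import java.util.*;

public class RuleMetrics {

    public static void main(String[] args) {
        // The five transactions from Table 1
        List<Set<String>> transactions = List.of(
                Set.of("A", "B", "C"),
                Set.of("A", "B", "C", "D", "E"),
                Set.of("A", "C", "D"),
                Set.of("A", "C", "D", "E"),
                Set.of("A", "B", "C", "D"));

        Set<String> x = Set.of("B");   // antecedent X
        Set<String> y = Set.of("D");   // consequent Y

        // Count transactions containing X, and transactions containing X union Y
        long countX = transactions.stream().filter(t -> t.containsAll(x)).count();
        Set<String> xy = new HashSet<>(x);
        xy.addAll(y);
        long countXY = transactions.stream().filter(t -> t.containsAll(xy)).count();

        double support = (double) countXY / transactions.size();   // Eq. (7)
        double confidence = (double) countXY / countX;             // Eq. (8)

        System.out.printf("rule %s => %s: support=%.2f confidence=%.2f%n", x, y, support, confidence);
    }
}

The itemset {B, D} occurs in 2 of the 5 transactions, so the support is 0.4, while 2 of the 3 transactions containing B also contain D, giving a confidence of about 0.67; such a rule would therefore be rejected under a 0.9 minimum-confidence threshold.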

Association rule mining using the apriori algorithm uses a bottom-up approach, breadth-first search and a hash tree structure to count the candidate itemsets efficiently. The two-step apriori algorithm is explained with the help of the flowchart shown in Figure 2 and described below; a minimal code sketch of the loop is given after the steps:

Step 1. Generate frequent itemsets of length 1.
Step 2. Repeat until no new frequent itemsets are identified:
a) Generate length (k + 1) candidate itemsets from length k frequent itemsets.
b) Prune candidate itemsets containing subsets of length k that are infrequent.
c) Count the support of each candidate by scanning the database.
d) Eliminate candidates that are infrequent, leaving only those that are frequent.
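The loop in Steps 1 and 2 can be sketched in a few dozen lines of plain Java. The code below is only an illustrative sketch (neither the authors' implementation nor WEKA's); it runs the level-wise procedure against the transactions of Table 1, and the class name AprioriSketch and the 0.6 minimum-support threshold are assumptions chosen to keep the example small.

import java.util.*;
import java.util.stream.Collectors;

public class AprioriSketch {

    static final double MIN_SUPPORT = 0.6;   // hypothetical threshold

    public static void main(String[] args) {
        // The five transactions from Table 1
        List<Set<String>> transactions = List.of(
                Set.of("A", "B", "C"),
                Set.of("A", "B", "C", "D", "E"),
                Set.of("A", "C", "D"),
                Set.of("A", "C", "D", "E"),
                Set.of("A", "B", "C", "D"));

        // Step 1: frequent itemsets of length 1
        Set<Set<String>> frequent = transactions.stream()
                .flatMap(Set::stream)
                .map(item -> Set.of(item))
                .filter(c -> support(c, transactions) >= MIN_SUPPORT)
                .collect(Collectors.toSet());

        // Step 2: repeat until no new frequent itemsets are identified
        while (!frequent.isEmpty()) {
            for (Set<String> f : frequent) {
                System.out.println(f + "  support=" + support(f, transactions));
            }
            Set<Set<String>> prev = frequent;

            // (a) generate length (k + 1) candidates by joining length-k frequent itemsets
            Set<Set<String>> candidates = new HashSet<>();
            for (Set<String> a : prev) {
                for (Set<String> b : prev) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1) {
                        candidates.add(union);
                    }
                }
            }

            // (b) prune candidates with an infrequent length-k subset,
            // (c) count support by scanning the database, (d) keep only frequent candidates
            frequent = candidates.stream()
                    .filter(c -> allSubsetsFrequent(c, prev))
                    .filter(c -> support(c, transactions) >= MIN_SUPPORT)
                    .collect(Collectors.toSet());
        }
    }

    // Fraction of transactions that contain the whole itemset
    static double support(Set<String> itemset, List<Set<String>> transactions) {
        long count = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) count / transactions.size();
    }

    // True if every subset obtained by removing one item is itself frequent
    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequentK) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequentK.contains(subset)) {
                return false;
            }
        }
        return true;
    }
}

With these data and this threshold, the sketch reports {A}, {B}, {C} and {D} as frequent singletons, then {A, B}, {A, C}, {A, D}, {B, C} and {C, D}, and finally {A, B, C} and {A, C, D}, after which no further frequent itemsets can be generated and the loop terminates.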

To select interesting rules from the set of all possible rules generated,
constraints on various measures of significance and interest can be used. The
best-known constraints are minimum thresholds on support and confidence.

Table 2. Blogger dataset attributes

Attributes   Data Type   Description
Degree       Nominal     Education (high, medium, low)
Caprice      Nominal     Political caprice (left, middle, right)
Topic        Nominal     Topics (impression, political, tourism, news, scientific)
LMT          Nominal     Local media turnover (yes, no)
LPSS         Nominal     Local, political and social space (yes, no)
PB           Nominal     Professional blogger (yes, no)

POSSIBILITIES OF DATA MINING UTILIZATION IN ANALYSING CYBER SPACE TENDENCIES
The apriori algorithm can be a very useful tool for pattern discovery in cyber space, which can be helpful in crime investigation. From the law enforcement agencies’ point of view, a blog is considered a form of public data providing most valuable information about its creator and his or her followers. A blog is a web site that contains online personal reflections, comments, and often hyperlinks provided by the writer, and represents a very popular form among various social media users (especially mature ones presenting political statements). Blogs are considered an exceptional way for Internet users to interconnect based on similar interests. Many researchers (mostly sociologists) consider blogs a social phenomenon in cyber space [17].
It is also known that a blog writer is often unaware that different methods of blog analysis can reveal personal information about him or her. Rosanna et al. have shown in their study that some personal features (e.g., a log of daily activities, information about friendships and relationships) can be efficiently extracted by analysing blogs and public behaviour on the web [18]. There are a lot of companies which support free blog facilities in order to collect data and perform different analyses (economic, social, political, etc.). Besides common features such as age, education, or blog topic, a lot of other (often ‘hidden’) tendencies can be revealed as well (internet identity, personal affinities such as political opinion, music preferences, religion, etc.).
In our experiment we tried to establish how efficient the above-mentioned apriori algorithm is in blog analysis when association rules are utilized. For that purpose we used a real and authentic dataset obtained from the UCI machine learning repository website [19]. The title of the dataset is ‘Blogger’, and it was prepared using real data from cyberspace users in Kohkiloye and Boyer Ahmad Province in Iran. This dataset contains a total of 6 attributes and 100 instances, covering the rate and importance of education, the role of political beliefs, interesting topics, the effect of state mass media, and political and social conditions related to the professional tendency field, etc. All data provided in this dataset are nominal. The data in each instance belong to different types of bloggers: professional bloggers and seasonal (temporary) bloggers. In this dataset, the variable “PB” defining professional bloggers is described as ‘those internet users in the survey who adopt blog as an effective digital media and are interested in digital writing in continuous time intervals. Seasonal (temporary) bloggers are not professional bloggers and follow blogging in discrete time periods.’ Finally, the following are considered the main data fields: education, political caprice, topics, local media turnover (LMT) and local, political and social space (LPSS), as shown in Table 2.
WEKA has been used to explore the behaviour of the apriori algorithm in extracting significant patterns for the detection of blog users’ professional tendency. The data are initially stored in an MS Excel sheet, then converted into the attribute-relation file format (ARFF file), which is the format accepted by the WEKA tool. The minimum support defined in the tool for the generated rules is 0.25 (25 instances) and the minimum confidence is 0.9. The association rules for the detection of personal features from analyzing public behaviour on the web are:

• Rule 1. topic = political pb = yes 28 ==> lmt = yes 28 < conf:(1) > lift:(1.16) lev:(0.04) [3] conv:(3.92)
• Rule 2. topic = political lpss = yes pb = yes 26 ==> lmt = yes 26 < conf:(1) > lift:(1.16) lev:(0.04) [3] conv:(3.64)
• Rule 3. degree = high lpss = yes 31 ==> lmt = yes 30 < conf:(0.97) > lift:(1.13) lev:(0.03) [3] conv:(2.17)
• Rule 4. degree = high topic = political 28 ==> lmt = yes 27 < conf:(0.96) > lift:(1.12) lev:(0.03) [2] conv:(1.96)
• Rule 5. lpss = yes pb = yes 48 ==> lmt = yes 46 < conf:(0.96) > lift:(1.11) lev:(0.05) [4] conv:(2.24)
• Rule 6. caprice = left lpss = yes 38 ==> lmt = yes 36 < conf:(0.95) > lift:(1.1) lev:(0.03) [3] conv:(1.77)
• Rule 7. topic = political 35 ==> lmt = yes 33 < conf:(0.94) > lift:(1.1) lev:(0.03) [2] conv:(1.63)
• Rule 8. topic = political lpss = yes 31 ==> lmt = yes 29 < conf:(0.94) > lift:(1.09) lev:(0.02) [2] conv:(1.45)
• Rule 9. degree = high pb = yes 30 ==> lmt = yes 28 < conf:(0.93) > lift:(1.09) lev:(0.02) [2] conv:(1.4)
• Rule 10. caprice = left lpss = yes pb = yes 30 ==> lmt = yes 28 < conf:(0.93) > lift:(1.09) lev:(0.02) [2] conv:(1.4)
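As an illustration only, the configuration described above (minimum support 0.25, minimum confidence 0.9, ten rules) could in principle also be reproduced programmatically through WEKA's Java API. The sketch below is not the authors' code; it assumes a WEKA 3 installation on the classpath and a hypothetical file blogger.arff holding the converted dataset.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BloggerApriori {

    public static void main(String[] args) throws Exception {
        // Load the ARFF version of the Blogger dataset (the file name is hypothetical)
        Instances data = DataSource.read("blogger.arff");

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.25);  // minimum support of 0.25 (25 instances)
        apriori.setMinMetric(0.9);              // minimum confidence of 0.9 (confidence is the default metric)
        apriori.setNumRules(10);                // report the ten best rules

        apriori.buildAssociations(data);
        System.out.println(apriori);            // prints the discovered association rules
    }
}

The textual output of such a run has the same shape as the rule list above, with each rule annotated by its confidence and by the lift, leverage and conviction measures.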

If the WEKA output is analysed, different rules can be noticed. One of them, regarding Rule 1 and Rule 2, suggests that for a professional blogger (pb) writing on a political topic, a high local media turnover is achieved. On the other side, if activity in the local, political and social space is involved, it does not contribute to additional local media turnover.
After a detailed analysis of the WEKA output, the most significant patterns generated using the apriori algorithm can be summarized as follows:

a) If blog topic = political,
b) If degree = high,
c) If caprice = left,
d) If lpss = yes, and
e) If pb = yes,

then local media turnover is high, and is very likely to be confirmed through real local media function. Gharehchopogh and Khaze performed a decision tree analysis on the same dataset in their work [20]. Their findings are as follows:

• Among the subjects of interest for blogging, politics is marked as the most significant. They identified that the major community of bloggers belongs to the political party of the so-called reformists-leftists, which has a great tendency towards professional blogging.
• Most bloggers with higher education (bachelor’s degree, M.Sc. and Ph.D.) constitute the group which has a professional approach to blogging.
• Professional bloggers do not have a definite opinion about the effects that local political and social conditions have on blogging. The same holds for those who believe in the local media function on the tendency toward blogging.

In our experiment using the same dataset, and based on the created association rules, our finding is as follows:

“If political issues are considered important in blogging, then local, political and social conditions must also be considered as a basic factor in recognizing a professional approach to blogging.”

We have to emphasise here that every piece of estimated information or every relationship, no matter how small or unimportant it may seem, can be crucial in an ongoing criminal investigation, where discovering ‘hidden’ connections is of the utmost importance.
Consequently, the best learning algorithm is chosen on the basis of the kappa and F-measure values: the higher the kappa and F-measure values, the better the learning. After the best algorithm has been identified according to these measures, that search should be used for the same data types.

CONCLUSION
The great challenge all police and intelligence agencies are facing is an
accurate and efficient analysis of crime data, the scope of which is constantly
increasing. For instance, complex criminal conspiracies are often hard to
reveal because the information on suspects may be geographically scattered
and may include a large number of people. Disclosing computer crimes can also be difficult because the extensive network traffic and frequent online transactions create a huge quantity of data, of which only a small portion refers to illegal actions. Police agencies and forensic laboratories collect large
quantities of various data, as a result of criminal activities processing. It can be
said that the automatic data searching and matching techniques have been
insufficiently used so far in this field, although they could contribute
significantly, particularly in discovering crimes which are difficult to
anticipate and prevent. A complicating circumstance in their application is, among other things, the huge diversity of data that should be processed and considered.
Those involved in criminal investigations who have years of experience
can often precisely analyze crime trends, but with the increased frequency and
complexity of criminal acts human errors also appear; consequently, the time
required for analysis increases as well, while the offenders have more time to
destroy evidence and avoid being arrested. Automatic data search and
matching is a powerful tool enabling fast and efficient searching of large
databases for crime investigators, who may not be skilled in analysis. In
addition to this, utilization of specific purpose (analysis) software (such as
WEKA, SPSS, RapidMiner, etc.) often costs less than hiring or training of the
staff. Data mining techniques are generally considered as less prone to errors
than people, emphasising the need for their application in different areas
including security related issues such as crime investigation. Special
understanding of the relationship between the possibilities of the analysis and
the characteristics of a certain type of crime can help investigators to apply
these techniques more efficiently in order to identify trends and patterns,
locate problem area(s), and even predict a crime.

REFERENCES
[1] Marinković, D. (2008). Tajni audio nadzor kao dokazna radnja - različiti
modaliteti i analiza rešenja u zakonodavstvu Srbije. Sprečavanje i
suzbijanje savremenih oblika kriminaliteta III (collected papers),
Beograd, pp. 228-256.
[2] Clarke, R. (1988). Information technology and dataveillance.
Communications of the ACM, 31(5), pp. 498-512.
[3] Terrettaz-Zufferey A. L. et al. (2006). Assessment of Data Mining
Methods for Forensic Case Data Analysis. Varstvoslovje, Fakulteta za
varnostne vede, Ljubljana, pp. 350-354.
[4] Kuk, K. (2015). Veštačka intelegencija u prikupljanju i analizi podataka
u policiji. Nauka, bezbednost, policija, 20(3), pp. 131-148.
[5] Clarke, R. (1994). Dataveillance by governments: The technique of
computer matching. Information Technology and People, 7(2), pp. 46-
85.
[6] Peng, Yi, et al. (2008). A descriptive framework for the field of data
mining and knowledge discovery. International Journal of Information
Technology and Decision Making, 7(4), 639-682.
[7] Witten, Ian H., Eibe F. (2005). Data Mining: Practical machine learning
tools and techniques. Morgan Kaufmann.
[8] Fayyad, U. M. et al. (1996). From Data Mining to Knowledge
Discovery: An Overview. Advances in Knowledge Discovery and Data
Mining, Cambridge, pp. 1-34.
[9] Berry, M., Linoff G. (2000). Mastering Data Mining, New York.
[10] Kuk, K., Mehic, A., Kartunov, S. (2015). The importance of data mining
technologies and the role of intelligent agents in cybercrime. Archibald
Reiss Days, Thematic Conference Proceedings of International
Significance, Volume III, Academy of Criminalistic and Police Studies,
Belgrade, pp. 223-232.
[11] Aydogan, E. K., Gencer, C., Akbulut, S. (2008). Churn Analysis and
Customer Segmentation of a Cosmetics Brand Using Data Mining
Techniques, Journal of Engineering and Natural Sciences, 26(1), pp. 42-
56.
[12] Garner, S. R. (1995). Weka: The waikato environment for knowledge
analysis. Proc New Zealand Computer Science Research Students
Conference, University of Waikato, Hamilton, New Zealand, pp. 57-64.
[13] Han, J., Kamber, M., (2006). Data Mining Concepts and Techniques.
San Francisco, CA: Morgan Kaufmann, Elsevier Inc.

[14] Agrawal, R., Srikant, R. (1994). Fast Algorithms for Mining Association
Rules in Large Databases. Proceedings of the 20th International
Conference on Very Large Data Bases, VLDB, Santiago de Chile, 12-15
September 1994, pp. 487-499.
[15] Agrawal, R., Imielinski, T., Swami, A. (1993). Mining Association Rules
between Sets of Items in Large Databases. Proceedings of the 1993
ACM SIGMOD International Conference on Management of Data,
Washington DC, 26-28 May 1993, pp. 207-216.
[16] Hipp, J., Güntzer, U., Nakhaeizadeh, G. (2000). Algorithms for
Association Rule Mining - A General Survey and Comparison. ACM
SIGKDD Explorations Newsletter, 2(1), pp. 58-64.
[17] Wyld, D., (2007). The Blogging Revolution: Government in the Age of
Web 2.0, IBM Center for the Business of Government, Washington, DC.
[18] Rosanna, E., Cassie, A., Bradley, E., Okdie, M. (2010). Personal Blogging: Individual Differences and Motivations. IGI Global, pp. 292-301.
[19] UCI Machine Learning Repository (2013). Available from: http://archive.ics.uci.edu/ml/datasets.htm.
[20] Gharehchopogh, F.S., Khaze, S.R. (2012). Data Mining Application for
Cyber Space Users Tendency in Blog Writing: A Case Study.
International Journal of Computer Applications, 47(18), pp. 40-46.
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 2

SPATIAL DATA VISUALIZATION AS A TOOL FOR ANALYTICAL SUPPORT OF POLICE WORK

Nenad Milić1, PhD, Brankica Popović2,*, PhD, Venezija Ilijazi3 and Erzen Ilijazi4

1 Academy of Criminalistic and Police Studies, Department of Criminalistics, Belgrade, Serbia
2 Academy of Criminalistic and Police Studies, Department of Informatics and Computer Sciences, Belgrade, Serbia
3 Ministry of Interior of the Republic of Serbia, Sector for Analytics and ICT, Belgrade, Serbia
4 Office of Information and Communications Technology, United Nations, Department of Management, New York, NY, US

* Corresponding author: B. Popovic, Email: brankica.popovic@kpa.edu.rs.

ABSTRACT
The development of information technology during the 1980s significantly improved the way data is collected, stored and processed. As a consequence, the police function has become more data-driven than ever before. Crime analysis units became a new element in the structures of police organizations worldwide, and analytical information became an important prerequisite for effective policing. Having in mind that virtually everything police do is related to an address or location,
connecting locations with people (offenders, victims, community members) and their activities is becoming a powerful tool (almost a necessity) for law enforcement agencies in understanding and managing community security problems. For that reason, visualization of spatial data receives a significant place in crime analysis. A number of sophisticated specialized software tools aiming to help police conduct effective crime analysis exist, but the accessibility of geospatial data sources makes it feasible for police to use Geographic Information Systems (GIS) and crime mapping as essential tools. In addition to the visualization of geospatial data, GIS technology provides analytical capacity, primarily in analyzing the geospatial distribution of crime incidents. Therefore, it can be utilized to discover different factors contributing to crime in order to give timely and relevant information to all levels of police management, help them make better decisions, target resources and formulate strategies, and enhance police proactive actions against crime.
The aim of this chapter is to present some of the GIS technology
capabilities in the function of analytical support of police work at all
levels of police organization and management. The following cases are
discussed in particular: optimization of police resources utilization
(resource location based on network analysis), hotspot policing, crime
data dissemination through Internet (Internet crime maps), offender
targeting through geographic profiling algorithm and Crime Information
Warehouse (CIW) Solutions. Lastly, innovative functionalities of
predictive analytic solutions are briefly described.

Keywords: spatial data, visualization, GIS, crime mapping, crime analysis, police

INTRODUCTION
Historically, it can be noticed that police agencies’ tasks and responsibilities have remained substantially the same. It is the amount and complexity of police duties that have been magnified in modern society, but permanent readiness and high expertise in solving the most complex security problems have always been required of the police [32].
Police efficiency depends on many parameters, but we could say that essentially it depends on data management (collecting, storing, processing, and using). In order to solve a security problem, a police officer must have reliable information about its origin, its forms of manifestation and the consequences it causes [14]. For making a good decision, it is of great importance that such information is reliable, reflecting the true state of the related object, phenomenon or process at the given time and geographic location. It is imperative that information flows, their quantity and their quality, are managed in a systematic manner (through computer-based information systems) and in compliance with the specific requirements of law enforcement officers.
Since the modernization of society is accompanied by a dramatically increased volume of information, police work has become unsustainable without the utilization of modern Information and Communication Technology (ICT) as well as advanced methods and techniques of problem solving and decision-making [28]. Therefore, for the purpose of identifying, monitoring, analyzing and researching complex security phenomena, data visualization techniques are taking a prominent role.
With rapid development of mobile and geospatial technologies and
recognizing that almost 80% of the information has a spatial reference,1 police
agencies have intensively employed Geographic Information Systems (GIS)
and other technologies for mapping and analyzing crime data. They can be
utilized to discover factors contributing to crime in order to give timely and
relevant crime information to police executives helping them to make better
decisions, to devise better problem solutions, and target resources2 in a more
efficient and effective way, especially enhancing police proactive actions
against crime.
We can say that, in a way, the utilization of modern technology in the police gave a strong impetus to the shift from the classical reactive model (wait-and-respond) to a proactive one (predict and prevent), playing an important role in the analytical support of police work.
The aim of this chapter is to present different ways in which GIS technology can improve police work at all levels of police organization and management. The rest of the chapter is organized as follows. In the first section the process of crime analysis is briefly presented. In the second section the importance of modern analytical tools in crime analysis is described, especially GIS and crime mapping. In the next section specific implementations in police work are discussed, in particular the optimization of police resource utilization (resource location), hotspot policing, crime data dissemination through the Internet (Internet crime maps), offender targeting through the geographic profiling algorithm and Crime Information Warehouse (CIW) Solutions. After a short glance at the innovative functionalities of predictive analytic solutions, we conclude the chapter.

1 Read more in: Dempsey, C. 2012. “Where is the phrase ‘80% of data is geographic’ from?” GIS Lounge, https://www.gislounge.com/80-percent-data-is-geographic/
2 E.g., cellular phone geoposition utilization for surveillance, or crowd movement monitoring in order to coordinate police resources during large public gathering events, etc.

CRIME ANALYSIS
The rapid development of ICT has significantly improved the technology of data collection and processing, whose end result - analytical information - has become an important factor in effective policing and a solid base for the new discipline of crime analysis [1]. Crime analysis is required to enable police officers to better identify problems, find solutions and the resources necessary to address the problems, and assess the achieved results [15].
According to Boba crime analysis is defined as ‘systematic study of crime
and disorder problems as well as other police–related issues - including
sociodemographic, spatial, and temporal factors - to assist the police in
criminal apprehension, crime and disorder reduction, crime prevention, and
evaluation’ [1]. Crime analysis involves the application of social science data
collection procedures, analytical methods, and statistical techniques,
employing both qualitative and quantitative techniques to analyze data
valuable to police agencies and their communities. Even though this discipline
is called crime analysis, it actually includes much more than just the
examination of crime incidents. As suggested by the International Association
of Crime Analysts (IACA), it includes ‘the analysis of crime and criminals,
crime victims, disorder, quality of life issues, traffic issues, and internal police
operations, and its results support criminal investigation and prosecution,
patrol activities, crime prevention and reduction strategies, problem solving,
and the evaluation of police efforts’ [19].
In order to avoid inconsistency and disagreement in both the definitions and the typology of crime analysis, the IACA proposed professional standards and definitions of analytical methodologies, technologies, and core concepts relevant to the profession of crime analysis. According to the IACA there are four major categories of crime analysis; criminal investigative analysis, which is also sometimes called “profiling,” is almost always part of the tactical crime analysis process and should therefore not be considered a separate type of crime analysis. The categories, ordered from specific to general, are:
1. “Crime intelligence analysis - the analysis of data about people involved in crimes, particularly repeat offenders, repeat victims, and criminal organizations and networks.
2. Tactical crime analysis - the analysis of police data directed towards the short-term development of patrol and investigative priorities and deployment of resources. Its subject areas include the analysis of space, time, offender, victim, and modus operandi for individual high-profile crimes, repeat incidents, and crime patterns, with a specific focus on crime series.
3. Strategic crime analysis - the analysis of data directed towards development and valuation of long-term strategies, policies, and prevention techniques. Its subjects include long-term statistical trends, hotspots, and problems.
4. Administrative crime analysis - is analysis directed towards the administrative needs of the police agency, its government, and its community.” [19]

Different processes and techniques deployed for each type of crime analysis are shown in Table 1.

Table 1. Processes and techniques deployed for each type of crime analysis (according to [19])

Processes and techniques include (but are not limited to):

Crime intelligence analysis: repeat offender and victim analysis; criminal history analysis; link analysis; commodity flow analysis; communication analysis; social media analysis.

Tactical crime analysis: repeat incident analysis; crime pattern analysis; linking known offenders to past crimes.

Strategic crime analysis: trend analysis; hotspot analysis; problem analysis.

Administrative crime analysis: districting and re-districting analysis; patrol staffing analysis; cost-benefit analysis; resource deployment for special events.

While crime intelligence analysis and tactical crime analysis products are
usually internal and kept confidential,3 the products of strategic and
administrative crime analysis are more likely to be distributed externally to
inform audiences outside the police agency. Tactical crime analysis and
administrative crime analysis can be performed largely from the data that
comes from internal sources (police databases and computer-aided dispatch).
In contrast, although it often starts with the data from police databases, both
crime intelligence analysis and strategic crime analysis depend on the
deliberate collection of additional data from a variety of other sources in order
to obtain a broader context of analysed phenomenon [19]. Crime analysts
review all available data, both from police records and from other sources,
with the goal of identifying patterns as they emerge. Analyses of these patterns
and trends can provide the information about the nature of crime (who, what,
when, where, how and why), helping in the development of effective tactics
and strategies in preventing victimization and reducing crime.
The three most important kinds of information that crime analysts use are
sociodemographic,4 temporal and spatial. Sociodemographic information can
be used for establishing an identity of crime suspects, or on a broader level, to
determine the characteristics of groups and how they relate to crime. Temporal
analysis is conducted for examination of short-term and mid-term patterns
(such as patterns by day of the week, time of day or time between incidents
within a particular crime series), as well as examination of long-term patterns
in crime (such as patterns by month, the seasonal nature of crime and trends
over several years) [1, 16]. Nevertheless, it is the spatial nature of crime and
other police-related issues that are central for understanding the nature of a
problem, facilitating a larger role for spatial analysis in crime analysis.
There are a number of sophisticated, specialized software tools aiming to help
the police conduct effective crime analysis.5 Typical crime analysis tools
include Statistical Analysis, Link Analysis, and Data Visualization and Crime
Mapping software.6 Police agencies are adding a new tool, Predictive
software, to assist their efforts [17]. Predictive analytics solutions apply
sophisticated statistical data exploration and machine-learning techniques to
historical information to help agencies uncover hidden patterns and trends -
even in large, complex datasets. In the context of its support to Predictive
policing, it provides information that creates the needed situational awareness
among officers and staff. Even more, it can help to find not only where a crime
will most likely occur, but also when and who the suspect or victim is likely to
be. Emerging possibilities of predictive analytics will be briefly described
later.

3 In order to avoid compromising an investigative strategy.
4 Personal characteristics of individuals and groups, such as sex, race, income, age, education, etc.
5 More on: http://www.iaca.net/resources.asp?Cat=Software.
6 More on: http://www.it.ojp.gov/documents/analyst_toolbox.pdf.
Analytical information is valuable only if it is accessible and easily
understood by the people who can act on it. Visualization solutions such as
graphing and mapping along with statistical modeling are gaining in popularity
since they can deliver results clearly and cost effectively, often in real time.

Role of Place in Assessing Crime

As was previously noticed, one of the most important aspects of the crime
is the location [3]. Practically everything the police are doing is related to an
address or location. Each call for police intervention and going to the scene
has the appropriate geographical coordinates. In addition, considering
crime as a product of human behavior, it is understandable why the
geographical distribution of crime is not random [26].
Today, the concentrated nature of crime is accepted as a fact, allowing
policing strategies to shift from traditional reactive approaches towards
cutting-edge proactive (and/or location-based) policing approaches such as
hotspot policing, problem-oriented policing, intelligence-led policing, community-
oriented policing and Compstat management strategies [9]. All of them are
centered on directing crime prevention and crime reduction responses based on
crime analysis results [35].
With that in mind, crime mapping takes a significant place in the context
of crime analysis, in order to facilitate understanding of the characteristics of
the spatial (geographical) distribution of crime and other events of importance
to police work at a given time [23].

CRIME MAPPING AND GEOGRAPHICAL INFORMATION
SYSTEMS IN CRIME ANALYSIS

Crime mapping is an essential factor in police agencies' efficiency. It
represents a powerful tool for police analysts, providing valuable assistance to
police officers in problem identification and analysis (assessing the situation),
finding an appropriate strategy for problem solving (decision-making) and
analysis of the effects of its utilization (problem solution evaluation) [23].
From a research and policy perspective, crime mapping is used to understand
patterns of incarceration and recidivism, help target resources and programs,
evaluate crime prevention or crime reduction programs and further
understanding of causes of crime [8]. Importance of crime mapping is
reflected in the fact that:

 It facilitates visual and statistical analysis of spatial distribution of
crime and other events;
 It allows the analyst to relate different data sources to a common
denominator - geographical space;
 It facilitates the presentation of the results of analytical work [7].

Although crime mapping and spatial crime analysis are not new concepts,
it is the emerging GIS technology that significantly contributes to their wide
utilization in crime analysis. Three important roles of GIS and crime mapping
that are generally accepted are:

 Database management,
 Spatial analysis, and
 Data visualization.

Having in mind the rule that “A picture is worth a thousand words” the
advantage of GIS utilization in crime mapping is evident. GIS has the unique
capacity to overlay different data sources (thematic layers) in digital map
layers in order to visualize them and use for further analysis (as shown in
Figure 1a). In other words, GIS ingests all available data such as historical
crime rates, police reports, department vehicle travel routes, traffic patterns,
camera footage, details of officer deployments, locations of critical
infrastructure or gang territories and other variables and displays them on
maps [23]. For example, through a hyperlink, an analyst can access and
visualize documents relevant to the crime event (e.g., official reports, photos
from the crime scene, etc.) (see Figure 1b).

Figure 1. a) Layer organization enhances relationship determination between different
types of data. Simple selection of the wanted thematic layer results in its visualization. b)
By activating a hyperlink, an analyst accesses the criminal report and photos from the
investigative documentation relating to the committed offense.

However, besides the visualization of spatial data, a major benefit of GIS
technology lies in its analytical capability. In the context of law enforcement
and GIS, the analysis involves the interaction of statistical data in a geographic
setting (see Figure 2a). The analytical capabilities of GIS software are
sophisticated and accurate. Today police analysts/officers have a number of
tools available to them. The most common are those which facilitate:

 Hotspot identification - Hotspot displays typically use a statistical
technique called cluster analysis, which separates data into logical
groupings.
 Correlation analysis - correlations determine how closely two separate
factors are related (see Figure 2b). Although correlation alone does not prove
causality, it shows when there appear to be connections between
factors.
 Regression analysis - regression analysis helps find the natural
relationships among the characteristics someone is studying. If
hotspots show where something is happening and correlation shows
apparent relationships among factors, regression helps show why, by
demonstrating how the factors interact [13]. A small illustrative sketch
of correlation and regression is given after this list.
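As a rough illustration of the correlation and regression steps just described (not tied to any specific GIS product), the following Python sketch relates two hypothetical factors - the number of late-night venues and the number of reported assaults per district. All figures are invented for illustration.

```python
import numpy as np

# Hypothetical data for ten patrol districts: count of late-night venues
# (bars, clubs) and count of reported assaults in the same period.
venues   = np.array([ 2,  5,  8,  3, 12,  7,  1,  9,  6, 10])
assaults = np.array([ 4, 11, 15,  6, 25, 14,  3, 18, 12, 21])

# Correlation: how closely the two factors move together (no causality implied)
r = np.corrcoef(venues, assaults)[0, 1]

# Simple linear regression: how many additional assaults are associated,
# on average, with one additional venue
slope, intercept = np.polyfit(venues, assaults, deg=1)

print(f"Pearson r = {r:.2f}")
print(f"assaults ~ {slope:.2f} * venues + {intercept:.2f}")
```

In a real analysis the same computation would be run on district-level counts exported from the GIS, and the fitted relationship would only suggest, never prove, a causal link.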


Figure 2. Map showing: a) the spatial distribution of robberies (banks, post office,
pharmacy stores, currency exchange stores, casino and gambling facilities) in Belgrade
municipality Cukarica in 2008-2010 period. b) Result of the nearest neighbor
hierarchical spatial clustering technique (CrimeStat III), in order to find events closer
to each other than expected from the random distribution.
To summarize, GIS allows crime analysts to map, visualize, and analyze
crime incident patterns and to identify crime hotspots along with other trends
and patterns. In that way it can help law enforcement management to
formulate strategies, perform better tactical analysis (e.g., crime forecasting,
geographic profiling) and make better decisions. Consequently, crime mapping
and GIS can be used for deciding the proper place for new police facilities
according to anticipated future crime problems. Mapping also facilitates the
identification of some side effects of police actions, such as ‘displacement’7
[29] and ‘diffusion of benefits’8 [6]. Additionally, mapping can provide
specific information on crime and criminal behavior to the public, enhancing
the connection of law enforcement with local community in order to prevent
crime. Emerging web-based GIS (and other internet) technologies are opening
new opportunities for crime mapping utilization in crime analysis and
prevention. Some of these applications will be further described in the next
section.

SPECIFIC IMPLEMENTATIONS OF GIS
TECHNOLOGY IN POLICE WORK
As was previously emphasized, there are many models of GIS technology
application in police agencies. We have chosen to briefly describe the most popular
ones, which are well known and established (resource allocation, hotspot
identification), as well as the emerging ones (geographic profiling, internet
crime maps, crime warehouse solutions).

Resource Location Allocation

Police resources (human, material and technical) which can be used in a
specific geographic (national) space are limited. With an increasing number of
requests to act and fewer available resources, one of the main problems that
police management faces today is the optimization of resource utilization [25].

7 Displacement is said to occur if crime reductions in the target area lead to crime increases
elsewhere (in neighboring areas, or in the same area but at different times).
8 Opposite to crime displacement, diffusion of benefits entails the reduction of crime (or other
improvements) in the areas that are related to the targeted crime prevention efforts, but not
targeted by the response itself.
Location problems are related to the determination of the place or position
of object(s) in space in order to find optimal ways of their use. From a
technology standpoint, the best way to locate resources in a policing
organization is to move from a reactive stance to a more predictive posture - to
get officers where they need to be, when they need to be there, and with the
right information for them to act quickly and decisively. Advances in ICT,
especially GIS technology, along with different spatial techniques, have made
the necessary progress to help assume this predictive posture.
Determination of resource location can be performed through maximization
or minimization of an objective function, using either a single criterion or several
criteria ranked according to importance (i.e., single-criteria and multi-criteria
problems, respectively) [25]. There are several types of models used for location problems
for single facility locations (center of Gravity, Grid, Centroid) and multi
facility location (Multiple gravity, Mixed integer programming, Simulation,
Heuristics). The most popular are Proximity-Based Models, whose aim is to
minimize impedance (time, distance) between two objects, and Maximize
Coverage Models (MCM) whose aim is to locate facilities in a way that as
many demand points as possible are allocated to solution facilities within the
impedance (distance) cutoff.9 Both can be performed as continuous location
models (models in the plane) or as network location models [25].
According to Klose and Drexl, continuous location models are
characterized by two essential attributes:

 Continuous solution space - facilities can be located at every point in
the plane, and
 Distance - measured with a suitable metric (typically Manhattan10 and
Euclidean11 distance) [20].

After determination of the coordinates (x, y) for each facility, the objective of
the continuous location model is to minimize the sum of distances between the
facilities and k given demand points. When we look at a single facility,12 the
corresponding optimization problem of finding a facility location (x, y) in such a
way that the sum of the (weighted) distances w_k d_k(x, y) to given demand
points k \in K located at (a_k, b_k) is minimized:

v(\mathrm{SWP}) = \min_{(x, y)} \sum_{k \in K} w_k d_k(x, y),
\quad \text{where } d_k(x, y) = \sqrt{(x - a_k)^2 + (y - b_k)^2}    (1)

can be solved efficiently by means of an iterative procedure [20]. An
extended version of the problem, which requires locating p, 1 < p < |K|, facilities
and allocating demand to the chosen facilities, is denoted as the multi-source
Weber problem (MWP).

9 More on: http://desktop.arcgis.com/en/arcmap/latest/extensions/network-analyst/location-allocation.htm.
10 Grid or right-angle distance metric.
11 Straight-line distance metric.
12 The so-called Weber problem (Klose and Drexl, 2005).
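The iterative procedure referred to above is commonly Weiszfeld's algorithm. The following is a minimal Python sketch (not the authors' implementation); the demand points and weights are hypothetical.

```python
import numpy as np

def weiszfeld(points, weights, iters=200, eps=1e-9):
    """Iteratively estimate the weighted geometric median, i.e., the
    single-facility Weber location minimizing sum_k w_k * d_k(x, y)."""
    pts = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    xy = np.average(pts, axis=0, weights=w)      # start at the weighted centroid
    for _ in range(iters):
        d = np.linalg.norm(pts - xy, axis=1)
        d = np.maximum(d, eps)                   # avoid division by zero
        coef = w / d
        new_xy = (pts * coef[:, None]).sum(axis=0) / coef.sum()
        if np.linalg.norm(new_xy - xy) < eps:
            break
        xy = new_xy
    return xy

# Hypothetical demand points (a_k, b_k) and weights w_k (e.g., call volumes)
demand = [(0, 0), (4, 0), (2, 5), (6, 3)]
calls = [10, 3, 6, 2]
print(weiszfeld(demand, calls))   # estimated optimal facility location (x, y)
```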
In network location models, space is viewed as a graph in which nodes
represent demand points and potential facility sites, and where distances are
estimated as the shortest paths in a graph. The network location model
corresponding to the single facility location is called 1-median problem, and
for continuous multi-source Weber model it is called p-median problem [20].
Network analysis can be performed by standard commercial GIS software
packages, such as ESRI’s ArcGIS®, whose extension Network Analyst13
contains tools necessary for the preparation and performance of network
analysis (Figure 3).

Figure 3. Functionalities of ESRI’s ArcGIS® Network Analyst extension.

A few examples of GIS utilization for solving police location problems
are briefly described.

13 More on: http://desktop.arcgis.com/en/arcmap/latest/extensions/network-analyst/location-allocation.htm.

Assessment of Police Patrol Area Coverage Capacity

Taking into account given constraints regarding time and distance, as well
as the network model attributes for specific area, GIS can estimate and
visualize police patrol coverage area (Figure 4).
That information might be vital for estimating an optimal coverage model,
which should ensure that available police resources (vehicles, officers) arrive
at the destination (e.g., a crime scene) as soon as possible.

Figure 4. Possible ways for visualization of area coverage in accordance with the given
constraints (distance/time).

Optimization of Police Resources Employment

The determination of optimal positions of police patrols in order to have
maximal coverage of a given area is shown in Figure 5. On condition that the
maximal (Manhattan) distance between an object and a patrol is 3000 m, from
ten proposed locations (Figure 5a) maximal coverage can be obtained with six
locations (numbered and marked with a triangle symbol in Figure 5b).
Benefits of ‘location’ analysis in a GIS environment are reflected in the fact
that GIS data visualization capabilities make it possible to spot coverage model
shortcomings, correct them, and then simply and rapidly assess whether the
corrected model is justified. For example, one can examine the impact of
eliminating sites 1 and 4 on the overall coverage capacity, and the extent to
which this loss can be compensated by repositioning locations 2 and 3, i.e.,
whether they can take over the requests served by locations 1 and 4 without
impairing the coverage capacity. It is observed that after the elimination of
locations 1 and 4, the total coverage capacity is not significantly disrupted
(Figure 6a). With additional correction of location 3, all the events will be
covered with four locations (including those which previously remained
uncovered) (Figure 6b). In other words, corrections performed on the situation
shown in Figure 5b further optimize the coverage model in a way that the
number of sites (patrols) decreases, while the total coverage capacity remains
almost unchanged [25].

Figure 5. The determination of the minimum number of locations from which it is
possible to achieve maximum coverage.

Figure 6. Total coverage capacity a) after location 1 and 4 elimination, b) after
additional correction of site 3 location.
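To make the maximal coverage logic behind Figures 5 and 6 concrete, here is a minimal, hypothetical Python sketch of a greedy heuristic: it repeatedly picks the candidate patrol location that covers the most still-uncovered incidents within a 3000 m Manhattan cutoff. Coordinates, candidate sites and incident counts are all invented, and production GIS tools use more sophisticated location-allocation solvers for this task.

```python
import numpy as np

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def greedy_max_coverage(candidates, demand, cutoff=3000, max_sites=6):
    """Greedy heuristic for the maximal coverage location problem: pick the
    candidate site covering the most uncovered demand points within cutoff."""
    uncovered = set(range(len(demand)))
    chosen = []
    while uncovered and len(chosen) < max_sites:
        best_site, best_cover = None, set()
        for i, c in enumerate(candidates):
            if i in chosen:
                continue
            cover = {j for j in uncovered if manhattan(c, demand[j]) <= cutoff}
            if len(cover) > len(best_cover):
                best_site, best_cover = i, cover
        if best_site is None or not best_cover:
            break                      # no remaining site covers anything new
        chosen.append(best_site)
        uncovered -= best_cover
    return chosen, uncovered

# Hypothetical coordinates in meters: 10 candidate patrol sites, 20 incidents
rng = np.random.default_rng(0)
candidates = [tuple(p) for p in rng.integers(0, 10000, size=(10, 2))]
incidents = [tuple(p) for p in rng.integers(0, 10000, size=(20, 2))]
sites, left = greedy_max_coverage(candidates, incidents)
print("chosen candidate indices:", sites, "| uncovered incidents:", len(left))
```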
Optimal Route Estimation

When we talk about optimal territory coverage (police stations, patrol
area, etc.), the basic requirement that arises is that in case of need the police
officers arrive at the site of intervention as fast as possible. It should be noted
that the shortest route is not always the fastest one, and in order to make a
valid choice one must know the different characteristics (attributes) of the road
infrastructure. This means that all kinds of obstacles (roadblocks, a street closed
due to maintenance work, ramp locations, etc.) must be taken into
consideration while searching for the optimal route. In addition, the true
position of police patrols (in real time) should be available through the
integration of GPS and TETRA14 (TErrestrial Trunked RAdio) devices.
optimal route is then visualized, whereby a user can have at their disposal the
additional information about street name, segment length or the time it takes to
pass it (Figure 7a).
In a situation where a police vehicle needs to visit multiple locations, a GIS
application can find the optimal route that connects the objects in a user-
specified order, or find another, more optimal sequence of visiting the
facilities. This situation becomes more complex if time constraints are
introduced (e.g., prisoners transport to different courts and police stations with
different arrival/departure times at each of them), when it might not be rational
to go first to the nearest facility but to the other further away, and return later
to the closer one (Figure 7b). In this and similar cases, GIS applications find
the optimal route in a short time.
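As a rough illustration of time-weighted routing (not the ArcGIS Network Analyst workflow itself), the sketch below builds a tiny hypothetical street graph with the networkx library and finds the fastest route between a station and an intervention site; node names, travel times and the closed street are all invented.

```python
import networkx as nx

# Hypothetical street network: nodes are intersections, edge weights are
# travel times in minutes (not distances), so the "fastest" path is chosen.
G = nx.Graph()
G.add_weighted_edges_from([
    ("station", "A", 4.0), ("A", "B", 3.5), ("B", "scene", 2.0),
    ("station", "C", 2.5), ("C", "scene", 7.0),   # shorter but slower road
], weight="time")

# Model an obstacle (e.g., a street closed for maintenance) by removing its edge
G.remove_edge("A", "B")

route = nx.shortest_path(G, "station", "scene", weight="time")
minutes = nx.shortest_path_length(G, "station", "scene", weight="time")
print(route, round(minutes, 1))   # ['station', 'C', 'scene'] 9.5
```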

Hotspot Policing

Crimes tend to concentrate at particular geographic locations where
favorable opportunities exist [2, 34]. These concentrations or clusters of crime
are commonly referred to as hotspots [31]. The proliferation of GIS software
contributes to the fast and easy creation of hotspot maps, making them a central
part of crime analysis and hotspot policing. Therefore, hotspot policing refers
to the concentration of police resources in a small discrete area “that has a
greater than average number of criminal events, or an area where people have
a higher than average risk of victimization” [10]. Inaccuracy in hotspot
identification may affect police effectiveness, as well as the citizens’ quality
of life,15 or even their rights.

14 TETRA is a digital trunked mobile radio standard developed to meet the needs of traditional
Professional Mobile Radio (PMR) user organisations such as Public Safety, Transportation,
Government, Military, etc. More on: http://www.etsi.org/technologies-clusters/technologies/tetra.

Figure 7. Network analysis application in the function of optimal route estimation
for a) single and b) multiple locations.

15 For example, placing the label “high crime area” on a safe area may cause a stigmatizing effect,
which may hinder economic development of the particular neighborhood.
Hotspot identification techniques are often evaluated by their ability to
predict future crimes based on historic crime data. For that purpose, different
predictive measures can be utilized. The three most frequently used are:

 Hit rate (HR), defined as the proportion of new crimes that occur
within the areas where crimes were predicted to occur:

HR = \frac{n}{N}    (2)

where n is the number of crimes in the areas where crimes are predicted to
occur (hotspots) and N is the number of crimes in the whole study area;

 Predictive accuracy index (PAI), described as the ratio of the hit rate
to the proportion of the study area that consists of hotspots,

PAI = \frac{HR}{\text{proportion of hotspot area}} = \frac{n/N}{a/A}    (3)

where a is the total area occupied by hotspots, and A is the size of entire
study area;

 Recapture rate index (RRI), defined as a ratio of predicted and
historic hotspot densities, standardized for changes of the total area
density in each year:

RRI = \frac{\text{hotspot crime ratio}}{\text{total crime ratio}} = \frac{n_2/n_1}{N_2/N_1}    (4)
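The three measures can be computed directly from the counts defined in equations (2)-(4). The following Python sketch uses hypothetical numbers purely for illustration.

```python
def hit_rate(n, N):
    """HR: share of new crimes that fall inside predicted hotspot areas."""
    return n / N

def pai(n, N, a, A):
    """PAI: hit rate divided by the share of the study area flagged as hotspots."""
    return (n / N) / (a / A)

def rri(n1, n2, N1, N2):
    """RRI: change in hotspot crime counts relative to the change in total counts."""
    return (n2 / n1) / (N2 / N1)

# Hypothetical example: 60 of 200 new crimes fall into hotspots that cover
# 2 km^2 of a 50 km^2 study area; last year the hotspots held 80 of 220 crimes.
print(hit_rate(60, 200))        # 0.30
print(pai(60, 200, 2.0, 50.0))  # 7.5  (hotspots capture crime 7.5x their area share)
print(rri(80, 60, 220, 200))    # ~0.83
```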

Today a variety of spatial statistics techniques is available for hotspot
identification and analysis. They are classified in two broad groups: those
applied on point data and those applied on aggregated data [22], but still there
is no agreement among researchers which one of them is the best in terms of
accuracy in predicting future crimes. In order to facilitate hotspot
identification, various software applications were developed. Although much
of the spatial analysis can be done in a GIS environment (e.g., ArcGIS Spatial
Analyst provides a range of spatial modeling and analysis tools),16 different
software applications (such as the CrimeStat software)17 have the ability to
perform many of these analyses. The main goal of this analysis is to assess
whether crime locations are randomly scattered across space, or instead show
systematic patterns in the form of clusters (more points are systematically
closer together than they would be in a purely random case) or dispersion
(more points are systematically further away from each other than under
randomness). Well-known hotspot analysis techniques include grid mapping,
covering ellipses, kernel density and heuristics [27].
An example of covering ellipse methodology is the nearest neighbor
hierarchical clustering (Nnh) which identifies groups of incidents that are
spatially close. It is a hierarchical clustering routine that groups points together
based on a given criterion. The CrimeStat Nnh routine defines a threshold
distance and compares the threshold to the distances for all pairs of points.
Only points that are closer to one or more other points than the threshold
distance are selected for clustering (see Figure 2b).
One of the most popular techniques among both academics and crime
analyst professionals is Kernel Density Estimation (KDE). The idea is to
spread out each crime’s expected contribution to the future crime risk over a
certain area using a mathematical function called a kernel. KDE is a statistical
analysis approach used to interpolate a continuous surface of crime data based
on initial crime data points from different locations. This is created by
‘overlaying a grid (with n equally sized cells) on top of the study area and
calculating a density estimate based on the center points of each grid cell.
Each distance between an incident and the center of a grid cell is then
weighted based on a specific method of interpolation (the kernel function) and
the bandwidth (search radius) [10].’ The approach produces a contour map, a
heat map, or a surface view map with the more heavily weighted areas of high
crime visually represented. The hotspots can then be defined as the areas
above a certain threshold on each map. KDE produces hotspot maps with the
highest PAI, which becomes stronger with a longer time period used as the
prediction base [5]. An example of a kernel density estimation map obtained in
ESRI’s ArcGIS® Spatial Analyst is shown in Figure 8.

16 ArcGIS Spatial Analyst allows the user to create, query, map, and analyze cell-based raster
data, perform integrated raster/vector analysis, derive new information from existing data,
query information across multiple data layers and fully integrate cell-based raster data with
traditional vector data sources. More information is available at www.esri.com.
17 The purpose of CrimeStat is to provide supplemental statistical tools to aid law enforcement
agencies and criminal justice researchers in their crime mapping efforts. CrimeStat is
Windows-based and interfaces with most desktop GIS programs. It calculates various spatial
statistics and writes graphical objects to ArcGIS, MapInfo, Surfer for Windows and other GIS
packages. More information about CrimeStat is available at
http://nij.gov/topics/technology/maps/pages/crimestat.aspx.

Figure 8. Kernel density estimation map of robberies in the urban part of the Cukarica
municipality (Belgrade).
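As a minimal sketch of the KDE idea (assuming NumPy and SciPy are available; this is not the ArcGIS Spatial Analyst or CrimeStat implementation, which typically use other kernels such as the quartic kernel), the code below fits a Gaussian kernel density to hypothetical crime coordinates, evaluates it on a grid and flags the top 5% of cells as hotspots.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical crime incident coordinates (projected meters)
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(1000, 150, 80), rng.normal(2500, 200, 40)])
y = np.concatenate([rng.normal(1000, 150, 80), rng.normal(1800, 200, 40)])

# Fit a Gaussian KDE to the point pattern
kde = gaussian_kde(np.vstack([x, y]))

# Evaluate the density on a regular grid laid over the study area
gx, gy = np.meshgrid(np.linspace(0, 3000, 100), np.linspace(0, 3000, 100))
density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

# Define hotspots as grid cells above a chosen density threshold (top 5%)
threshold = np.quantile(density, 0.95)
hotspot_cells = density >= threshold
print("hotspot cells:", int(hotspot_cells.sum()), "of", density.size)
```

The resulting density surface is what a contour, heat or surface map visualizes; the threshold choice determines how large the mapped hotspots appear.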

Geographical Profiling

Recent developments in criminological theory have encouraged crime
analysts to focus on geographic patterns of crime, by examining situations in
which victims and offenders come together in time and space. This process is
referred to as geographic profiling [16].
Geographic profiling, introduced in the early 1990s, represents a
geospatial crime analysis technique that attempts to determine where a serial
offender most likely resides. The predictions are based on the locations of
these crimes, other geographic information about the case and the suspect, and
certain assumptions about the distance offenders will travel to commit crimes
[30]. Different algorithms could be used to calculate the area boundaries. It
must be emphasized that this technique should not be used to pinpoint a
particular location or suspect, since being a statistical technique, it gives
results in terms of probability, not certainty.
Geographic profiling involves application of advanced spatial analysis
techniques for crime distribution under the auspices of the criminology
theoretical framework, and above all routine activities theory, rational choice
theory and crime pattern theory [16].
The three most popular models for geographical profiling of unknown
offenders are Rossmo’s model (the so-called “criminal geographic targeting”
algorithm), Canter’s model and Levine’s model (journey-to-crime analysis).
All three models are implemented in the appropriate software solutions:
Rigel18 (Rossmo), Dragnet19 (Canter) and CrimeStat20 (Levine) [30]. Levine’s
model differs from the other two in that it is not a geographic profiling
model in the true sense of the word, as pointed out by its creator; rather, it is a
model which estimates the crime trip (i.e., the route which the offender uses in
committing the crime) [22].
It is clearly convenient to display the output of geographic profiling
software on a Geographic Information System that also shows streets,
landmarks, political boundaries, and other geographic features of the areas
around the crimes. Output is in the form of color shadings (two-dimensional
map) and the height of the surface (in the case of three-dimensional diagram),
representing the offender’s likely base of operations.
Despite the fact that geographic profiling is a relatively new discipline, it
is gaining in popularity after successful implementation in resolving several
serial crimes in the United States and Canada. Today, there are a number of
software packages that can be freely downloaded from the Internet, while the
price for commercial products (with more advanced capabilities) is in constant
decline.

18 More on: http://ecricanada.com/products/rigel-analyst/.
19 More on: http://www.i-psy.com/publications/publications_dragnet.php.
20 More on: http://www.icpsr.umich.edu/CrimeStat/.

Figure 9. A screenshot of the Rigel geographic profiling software from the Environmental
Criminology Research Inc. website.21

Internet Crime Maps

Wide utilization of internet technology has enabled police
organizations to post cartographic representations of crime (crime maps) on
their internet sites, ensuring constant public access to up-to-date crime
information. In a simple and fast way, they enable citizens to obtain
information on crime trends in a given geographic area and make their own
judgment about their safety. In that way, pressure on the police organization is
reduced, and significant police resources are released to be used for other
purposes [24]. Particularly convenient are the so-called ‘interactive’ crime
maps, which can perform queries (by place, time of execution, type of offense,
etc.) defined by citizens interested in monitoring local criminal activity, giving
an immediate response regarding the subject of their interest (Figure 10).
Today web-based crime mapping is a common practice in modern police
organizations, mostly focused on supporting community policing [8]. Emerging
software packages have additional analytical functionality such as pattern
analysis and crime prediction.

21 More on: http://ecricanada.com/technologies/.

There are some controversial issues in the field of GIS and crime
mapping. One is so-called ‘spatial labeling’, where labeling an area as
dangerous might produce serious consequences for the community of that area
from an economic, sociological, and criminological perspective [9].
Another controversial issue is the privacy of the people (especially crime
victims), where GIS and crime mapping may cause transgression of privacy
and confidentiality of people’s lives [33]. That is especially true for the
Internet crime maps [21]. Some detailed information such as gender, time,
place, ethnicity and age of criminals or victims might be used for creation of
web-based maps, where overlaying specific crimes with them may
inadvertently reveal the identity of a victim.22 In a public information system,
mechanisms must be devised to ensure privacy protection in order to balance
the public right to information and privacy of the crime victim [9]. Therefore,
the most important thing is to provide that individual identification is either
confidential or impossible (anonymous).

Figure 10. Example of an interactive crime map from the CrimeReports™ website.23

22 The victim is often later stigmatized on that basis (e.g., a rape victim). People therefore often hide
information about their victimization from law enforcement agencies, knowing that others could
recognize them with little effort on the Internet.
23 Source: https://preview.crimereports.com/#!/ accessed on 05/31/2016.

Crime Information Warehouse (CIW) Solution

Crime analysis is data-intensive, and it is difficult to effectively coordinate
the volume of information from multiple systems. Police agencies have
numerous (mostly computer-based, but paper-based ones still exist)
information systems to collect data for the Compstat24 process and other crime
analyses [4]. Since data and information are scattered among different
subjects, they must be collected and integrated before analysis in order to
assemble the big picture. In response, as a natural evolution of Compstat,
IBM and Cognos proposed the Crime Information Warehouse (CIW) Solution.25
IBM’s CIW data model is the heart of the solution; reporting and analysis
are delivered by business intelligence software from Cognos, with GIS
mapping enabled by Esri (ArcGIS) (Figure 11). Esri and IBM have developed
several innovative integrated solutions for better decision making through
proven analytics and optimized management of business information.26 Esri
Maps for IBM Cognos is integrated with the Esri cloud.27
CIW represents a consolidated repository for all crime data for reporting,
analysis, and deployment. The solution incorporates advanced Web-enabled
tools like geographic imaging software and live video feeds for a detailed view
of an enforcement area and provides user access with Cognos for crime
analysis, mapping, reports, and statistics. Users can access any information
stored within the warehouse to report, analyze and understand crime statistics
based on any number of different factors [18]. Departments can use this instant
information to redeploy and reconfigure resources in response to crime trends.
With techniques of real-time information gathering (data and video) and
predictive analytics, CIW technologies are moving to the next level towards
smarter cities with more integrated law enforcement agencies [18]. These
solutions apply data-mining techniques in order to help agencies uncover
hidden patterns, associations, correlations and trends, even in large complex
datasets of structured and unstructured data (emails, videos, cell phone calls,
chat room interactions, etc.). It is expected that utilization of data warehouse,
deep analytics and data visualization will lead to better hotspot and crime
trend prediction, which will allow more efficient deployment of police
resources.

24 Compstat is a performance management system that is used to reduce crime and achieve other
police department goals. Compstat emphasizes information-sharing, responsibility and
accountability, and improves effectiveness.
25 The IBM-Cognos Crime Information Warehouse, more on: ftp://public.dhe.ibm.com/software/
data/sw-library/cognos/demos/bp_od_blueprints/resources/br_ibm_crime_warehouse.pdf.
26 More on: http://www.esri.com/partners/partners-alliance/ibm/solutions.
27 More on: http://www.esri.com/~/media/Files/Pdfs/library/fliers/pdfs/esri-maps-ibm-cognos.pdf.

Figure 11. Interactive mapping abilities of Crime Information Warehouse.28

Predictive Policing

Predictive policing has become one of the hottest emerging areas in law
enforcement. It can be described as ‘the application of analytical techniques -
particularly quantitative techniques - to identify likely targets for police
intervention and prevent crime or solve past crimes by making statistical
predictions’ [27]. All types of data can be analyzed, both structured and
‘unstructured’ (such as emails, text messages, audio and video files, health
records, journals, etc.). Both the volume and the quality of these data will
determine the usefulness of any approach. Obtained information can serve for

better anticipation of what types of intervention will be needed and where, in
order to plan and make the best use of available resources.

28 Source: ftp://public.dhe.ibm.com/software/data/sw-library/cognos/demos/bp_od_blueprints/
resources/br_ibm_crime_warehouse.pdf.
In the context of predictive policing, according to Perry et al., predictive
methods can be divided into four broad categories [27]:

1. Methods for predicting crimes: approaches used to forecast places and


times with an increased risk of crime.
2. Methods for predicting offenders: approaches used to identify
individuals at risk of offending in the future.
3. Methods for predicting perpetrators’ identities: techniques are used to
create profiles that accurately match likely offenders with specific
past crimes.
4. Methods for predicting victims of crimes: approaches used to identify
groups or, in some cases, individuals who are likely to become
victims of crime.

We have already emphasized that these analytical techniques produce
estimates; consequently, the results - ‘predictions’ - are probabilistic, not
certain. There are a number of different predictive techniques that can be used
to predict crime risk, whose categorization is shown in Table 2.
A short explanation of hotspot techniques was already given in the section
‘Hotspot Policing’. In contrast with hotspot mapping, which relies only on past
crime data, regression methods use a wide range of data in estimating future
crime risk. Widely used data mining methods aim to: a) predict a category
(commonly referred to as classes) for an outcome (Classification); b)
subdivide data into groups (clusters) with similar attributes (Clustering).
Spatial clustering algorithms are widely used for estimation of statistically
significant hotspots. Data mining also includes some of the most complex
methods, including the neural network and support vector machine families to
make predictions. Near-repeat methods are based on the assumption that future
crimes will occur very near to the current crimes in both time and place.
Spatiotemporal analyses include various environmental and temporal features
of the crime location, which are used for analyzing both short-term series and
long-term problems or hotspots. Risk terrain analyses attempt to: (a) identify
geographic features that contribute to crime risk, and (b) make predictions
about crime risk based on how close given locations are to these risk-inducing
features. Although, from the user perspective, both risk terrain model and
hotspot method produce qualitatively the same output, they are very different
methods. Hotspot methods are fundamentally clustering techniques, while
Spatial Data Visualization as a Tool for Analytical Support 45

Risk terrain modeling is a classification approach. One major advantage of risk


terrain approaches is that it can predict new hotspots, even in the area with no
recent crimes, on the base of similarity to other hotspots [27].
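As a rough illustration of the clustering family of data mining methods mentioned above (not one of the specific algorithms catalogued in [27]), the sketch below groups hypothetical crime coordinates into dense spatial clusters with DBSCAN from scikit-learn; points labeled -1 are treated as noise rather than as part of a hotspot.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical crime coordinates (projected meters): two dense areas plus noise
rng = np.random.default_rng(1)
cluster_a = rng.normal([1000, 1000], 80, size=(60, 2))
cluster_b = rng.normal([2600, 1900], 120, size=(40, 2))
noise = rng.uniform(0, 3000, size=(30, 2))
points = np.vstack([cluster_a, cluster_b, noise])

# Group points with at least min_samples points within eps (200 m);
# sparse points get the label -1 and are treated as noise
labels = DBSCAN(eps=200, min_samples=10).fit(points).labels_
for cluster_id in sorted(set(labels)):
    name = "noise" if cluster_id == -1 else f"cluster {cluster_id}"
    print(name, (labels == cluster_id).sum(), "incidents")
```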

Table 2. Categorization of different predictive techniques
that can be used to predict crime risk (according to [27])

Category: hotspot analysis
Techniques: grid mapping; covering ellipses; kernel density; heuristics
Data: using crime data only

Category: regression methods
Techniques: linear; stepwise; splines; leading indicators
Data: using a range of data

Category: data mining techniques
Techniques: clustering; classification
Data: using a range of data

Category: near-repeat methods
Techniques: self-exciting point process; ProMap; heuristic
Data: over next few days, using crime data only

Category: spatiotemporal analysis
Techniques: heat maps; additive model; seasonality
Data: using crime and temporal data

Category: risk terrain analysis
Techniques: geospatial predictive analysis; risk terrain modeling
Data: using geography associated with risk

Despite public expectations, and even some misunderstanding among
users, predictive methods do not predict where and when the next crime will
be committed. They predict only the relative level of risk that a crime will be
associated with a particular time, place and person(s).
It is reported that utilization of predictive software can significantly
reduce serious crime. For example, IBM’s Blue CRUSH (Criminal Reduction
Utilizing Statistical History) software enabled the Memphis Police to evaluate
incident patterns throughout the city and forecast criminal ‘hotspots’ so they
could allocate resources, deploy personnel and increase public safety, reducing
crime by more than 30 percent in four years [18]. The Los Angeles Police Department
is one of the few departments in the USA using PredPol29 software, which automatically
generates maps for the police of where and when crimes may occur [12]. PredPol
is a cloud-based software-as-a-service (SaaS) whose unique crime
prediction methodology combines available crime data with advanced
mathematics, cloud computing and machine learning techniques (including the
indispensable experience of veteran police officers). By analyzing input data - the
type, place, and time of crime - an output in the form of a red box (500 by 500
feet) is shown, pointing to the district areas with the highest risk of criminal
activity for that shift (Figure 12). The result is more accurate and more
actionable recommendations for when and where crime is most likely to occur,
thus allowing police to show up before crime happens.30

Figure 12. Predictive Policing Screenshot: PredPol™, a cloud-based SaaS for crime
prediction (adapted from [11]).

29 More on: http://www.predpol.com/.
30 PredPol's Innovative Predictive Policing Software Results in Dramatic Crime Reduction,
available at http://www.prnewswire.com/news-releases/predpols-innovative-predictive-
policing-software-results-in-dramatic-crime-reduction-227802601.html.

CONCLUSION
Police officers are engaged on a daily basis in the collection of data
necessary for fulfilling their responsibilities. The collected data are analyzed
and disseminated to their users in the form of different analytical products.
Having in mind that most of the activities undertaken by police officers
have a spatial component (X and Y coordinates), cartographic visualization of
these data takes a significant place in the crime analysis process. The
importance of spatial data visualization in police practice was recognized more
than a century ago, when the first crime maps appeared on police stations'
walls [16]. Compared to textual crime reports (bulletins), crime maps
inform law enforcement officers much faster and more easily about the spatial
distribution of crime.
Intensive development of information and communication technologies
has extended the existing analytical methods and added some new ones, thus
forming new approaches to solving crime problems. Crime mapping and GIS
technology are such examples. Helping police officers to be better informed
about crime and other events of importance for their activities, crime maps
enable more effective identification of problems and their causes, which is the
prerequisite for efficient work aimed at their elimination. In this way, spatial
visualization becomes an important decision making support tool at all levels
of police organization - from a street police officer, to the top management of
the police organization.
Nowadays, when limited resources should be effectively deployed in
order to combat crime and to respond to growing citizens' demands, focusing
them on the place and at the time when they are most needed becomes an
essential prerequisite for the effective performance of the police functions. In
this regard, the identification of crime hotspots becomes a part of everyday
police analysts’ activities. Although the human eye and brain can be a good
‘tool’ for geospatial data processing, the visual method alone may not be
sufficient for drawing correct conclusions. This is particularly evident in cases where
complex spatial distribution of crimes is analyzed. In this context GIS tools
have an important role. They can enable analysts to ‘see’ what is invisible to
the human eye. Timely recognition of problematic locations (e.g., hotspots)
and placing them into the focus of police attention, could lead to opportunity
reduction and yield clear crime prevention benefits.
In order to facilitate access to crime maps, police organizations use the
benefits of the Internet technology. Ensuring constant access to the current
crime data (24/7), the Internet crime maps enable citizens to get the data about
crime distribution and crime trends in a fast and easily accessible way. The
most popular are interactive crime maps that allow users to perform their own
queries (by the type, place or time of crime, etc.), and get answers to the
questions they are interested in. Specialized internet-based software will
provide up-to-date information that creates the needed situational awareness
among officers, helping them to react in a timely manner.
Emerging predictive analytics capability which combines real-time
information gathering with data mining techniques helps law enforcement
officers to uncover hidden patterns, associations, correlations and trends in
large complex datasets of structured and unstructured data. Even more, it can
help to find not only where a crime will most likely occur, but also when and
who the suspect or victim is likely to be, helping police officers to react before
crime is committed (prevent it). With the development of machine learning
techniques, data mining solutions will produce results almost in real time and
it will not be long before we start talking about ‘smart police’ as a part of the
future ‘smart cities’ in the ‘smart world’.
In the end, we would like to emphasize that the wide utilization of the
aforementioned techniques for crime analysis is to a large extent the result of
the development of spatial visualization techniques, which make it possible for
end users (police officers) to understand and exploit the products of complicated
analytical methods when they are presented in the form of visual data (maps).

REFERENCES
[1] Boba, Rachel. 2005. Crime analysis and crime mapping. SAGE
Publications.
[2] Braga, Anthony A., Papachristos, Andrew V., Hureau, David M. 2010.
“The concentration and stability of gun violence at micro places in
Boston, 1980–2008.” Journal of Quantitative Criminology 26:33–53.
[3] Braga, Anthony A., Weisburd, David L. 2010. Policing Problem Places:
Crime Hot Spots and Effective Prevention (Studies in Crime and Public
Policy) 1st Edition, Oxford University Press.
[4] Bureau of Justice Assistance. 2013. “Compstat: Its origins, evolution,
and future in law enforcement agencies.” Bureau of Justice
Assistance & Police Executive Research Forum. Washington DC.
https://www.ncjrs.gov/App/Publications/abstract.aspx?ID=265292.

[5] Chainey, Spencer; Tompson, Lisa and Uhlig Sebastian. 2008. “The
utility of hotspot mapping for predicting spatial patterns of crime.”
Security Journal, 21(1):4-28.
[6] Clarke, Ronald V., Weisburd, David. 1994. “Diffusion of crime control
benefits: Observations on the reverse of displacement.” In Crime
prevention studies, 2:165-184, edited by Ronald V. Clarke, Monsey,
NY: Criminal Justice Press.
[7] Cohen, Jacqueline and Wilpen L. Gorr. 2006. “Development of Crime
Forecasting and Mapping Systems for Use by Police in Pittsburgh,
Pennsylvania, and Rochester, New York, 1990-2001.” ICPSR04545-v1.
Ann Arbor, MI: Inter-university Consortium for Political and Social
Research, 2006-08-31. http://doi.org/10.3886/ICPSR04545.v1.
[8] Crime Tech Solutions. 2015. “What is Geospatial Crime Mapping?”
Crime Technology Weekly, October 20. Accessed January 28, 2016.
https://fightfinancialcrimes.com/2015/10/20/what-is-geospatial-crime-
mapping/.
[9] Daglar, Murat and Argun, Ugur. 2016. “Crime Mapping and
Geographical Information Systems in Crime Analysis.” International
Journal of Human Sciences, 13(1):2208-2221. doi:10.14687/ijhs.
v13i1.3736.
[10] Eck, John E., Chainey, Spencer; Cameron, James G., Leitner, Michael
and Wilson, Ronald E. 2005. Mapping crime: Understanding hotspots.
Washington DC: National Institute of Justice. https://www.ncjrs.gov/
pdffiles1/nij/209393.pdf.
[11] Friend, Zach. 2013. “Predictive Policing: Using Technology to Reduce
Crime.” FBI Law Enforcement Bulletin, April, 2013. Accessed April 12,
2016. https://leb.fbi.gov/2013/april/predictive-policing-using-technology
-to-reduce-crime.
[12] GCN Staff. 2014. “Seattle police deploy SeaStat crime mapping tech.”
GCN, September 23. Accessed April 5, 2016. https://gcn.com/articles/
2014/09/23/seastat-seattle-crime-mapping.aspx.
[13] GIS for Crime Analysis, Law Enforcement, and Public Safety. 2014.
American Sentinel University. Accessed February 20, 2016.
http://www.americansentinel.edu/blog/wp-content/uploads/2014/06/
AS_GIS-Crime-eBook-Final.pdf.
[14] Goldstein, Herman. 1990. Problem-oriented policing, McGraw Hill,
New York. Available at http://www.popcenter.org/library/reading/
pdfs/goldstein_book.pdf.

[15] Gottlieb, Steven; Arenberg, Sheldon and Singh, Raj. 1994. Crime
analysis: From first report to final arrest, CA: Alpha Publishing.
[16] Harries, Keith D. 1999. Mapping crime: Principle and practice, U.S.
Dept. of Justice, Office of Justice Programs, National Institute of Justice,
Washington DC. https://www.ncjrs.gov/pdffiles1/nij/178919.pdf.
[17] Hubler, David. 2013. “Predictive analysis grows as crime-prevention
tool.” GCN. January 15. Accessed March 10, 2016. https://gcn.
com/articles/2013/01/15/predictive-analysis-crime-prevention-tool.aspx.
[18] IBM. 2011. “Predictive Crime Fighting.” IBM’s 100 Icons of Progress,
March 17. Accessed March 7, 2016. http://www-03.ibm.com/ibm/
history/ibm100/us/en/icons/crimefighting/.
[19] International Association of Crime Analysts. 2014. “Definition and types
of crime analysis.” Standards, Methods&Technology White Paper 2014-
02, Overland Park, KS. http://www.iaca.net/Publications/Whitepapers/
iacawp_2014_02_definition_types_crime_analysis.pdf.
[20] Klose, Andreas and Drexl, Andreas. 2005. “Facility location models for
distribution system design.” European Journal of Operational Research,
162(1):4-29. http://dx.doi.org/10.1016/j.ejor.2003.10.031.
[21] Kounadi, Ourania; Bowers, Kate and Leitner, Michael. 2015. “Crime
mapping on-line: Public perception of privacy issues.” European journal
on criminal policy and research, 21(1):167-190.
[22] Levine, Ned. 2015. CrimeStat: A Spatial Statistics Program for the
Analysis of Crime Incident Locations (v 4.02). Ned Levine and
Associates, Houston, TX and the National Institute of Justice,
Washington, DC.
[23] Milic, Nenad. 2012. “Crime mapping in a function of problem oriented
policing (in Serbian).” NBP - Journal of Criminalistics and Law,
Belgrade, 1:123-140.
[24] Milic, Nenad. 2012a. “Crime mapping in a function of improving
partnership between the police and the local community (in Serbian).”
Bezbednost, 3:138-159.
[25] Milic, Nenad and Subosic Dane. 2013. “Location problems solving in
the function of police resources engagement optimization (In Serbian).”
In thematic proceeding Structure and function of police organization -
tradition, status, perspective - II, Academy of criminalistic and police
studies, Belgrade, Serbia, pp. 239-251.
[26] Paulsen, Derek J., Robinson, Matthew B. 2004. Spatial aspects of crime:
Theory and Practice, Pearson Education.

[27] Perry, Walter L., McInnis, Brian; Price, Carter C., Smith, Susan C.,
Hollywood, John S. 2013. Predictive Policing: The Role of Crime
Forecasting in Law Enforcement Operations, Santa Monica, CA: RAND
Corporation, 2013. http://www.rand.org/pubs/research_reports/RR233.
html.
[28] Popovic, Brankica. 2013. “Role of ICT in modern police organization
(In Serbian).” In thematic proceeding Structure and function of police
organization - tradition, status, perspective - II, Academy of
criminalistic and police studies, Belgrade, Serbia, pp. 251-270.
[29] Reppetto, Thomas A. 1976. “Crime prevention and the displacement
phenomenon.” Crime and Delinquency, 22(2):166-177.
[30] Rich, Tom and Shively, Michael. 2004. A Methodology for Evaluating
Geographic Profiling Software, National Institute of Justice’s Document
No.: 208993, Washington DC. https://www.ncjrs.gov/pdffiles1/nij/
grants/208993.pdf.
[31] Sherman, Lawrence W., Gartin, Patrick R., Buerger, Michael E. 1989.
“Hot spots of predatory crime: Routine activities and the criminology of
place.” Criminology, 27(1):27−56. doi: 10.1111/j.1745-9125.1989.
tb00862.x.
[32] Walker, Samuel and Katz, Charles. 2012. The police in America: An
introduction, 8th edition, McGraw-Hill Education.
[33] Wartell, Julie and McEwen, Thomas. 2001. Privacy in the Information
Age: A Guide for Sharing Crime Maps and Spatial Data, US
Department of Justice, Washington DC. https://www.it.ojp.gov/
documents/d/188739.pdf.
[34] Weisburd, David; Bushway, Shawn; Lum, Cynthia and Yang, Sue-Ming.
2004. “Trajectories of crime at places: a longitudinal study of street
segments in the city of Seattle.” Criminology, 42(2):283–322. doi:
10.1111/j.1745-9125.2004.tb00521.x.
[35] Wood, Tyler. 2015. “What is Crime Analysis?” Crime Technology
Weekly, December 11. Accessed March 28, 2016. https://fightfinancial
crimes.com/tag/data-visualization/.
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 3

CYBERCRIME INFLUENCE ON PERSONAL,
NATIONAL AND INTERNATIONAL SECURITY
WHILE USING THE INTERNET

I. Cvetanoski1,*, MD, J. Achkoski, PhD,
D. Rančić2, PhD and R. Stainov3, PhD

1 Military Academy “General Mihailo Apostolski” - Skopje,
associate member of “Goce Delchev” University, Shtip, Macedonia
2 Faculty of Electronic Engineering, Niš, Serbia
3 University of Applied Science Fulda, Fulda, Germany

ABSTRACT
The aim of this chapter is to stress the danger of cybercrime activities
in cyberspace and its impact on personal, national and international
security in the 21st century. Treating this phenomenon as insignificant
may lead to unpredictable consequences, even for state security.
The new millennium brought the growth of the information society, which
enabled nations to be linked in the global cyberspace, leading to fast
data transfer throughout the world. Globalization of cyberspace has
caused new risks and threats which are invisible to the eye and inaudible
to the ear. Cyber-criminals act conspiratorially through cyberspace;
they penetrate system privacy and conduct crime in such a manner
that we are not even aware of being victimized. Cybercrime
starts as a personal threat, but it ends as an international security threat.

* Corresponding author: I. Cvetanoski, Email: igorcvetanoski@yahoo.com.
During the research we will stress the motives which encourage
cyber-criminals to commit cybercrimes against individuals, private
sector/business companies or state institutions. Furthermore, we will
define categories and types of cybercrime. Also, the methods of
cybercrime will be presented, such as hacking, social engineering,
phishing, pharming, denial-of-service attacks, distributed denial-of-
service attacks, malicious software usage, adware, steganography, etc.
Some examples of cybercrimes that have occurred around the world will
be presented in order to show that no state is immune to this threat in the
21st century. Finally, some examples of cybercrime noticed in the past
few years in Macedonia will be shown, accompanied by some statistics.
We will also present simple linear regression as a model for short-range
predictions (a year or two into the future).
In the future, cybercrime will have an increasing rate and it will cause
more significant damage due to the development of the information-
technology society. Today, modern technology gives great opportunities to
use on-line tools for performing cybercrime activities, which means that
anyone can create malicious software for criminal activities in cyberspace.

Keywords: cyberspace, cyber-criminals, malicious software, malware and
virus prevention, simple linear regression

INTRODUCTION
The modern IT society enables global connection of people through
cyberspace. Communication through cyberspace enables rapid transfer of
information, but it also increases the risk of being compromised. Cyberspace, as a
new battlespace, creates new threats, new warriors and new challenges in the
21st century.
The cyberspace consists of many interconnected computers, servers,
routers, switches and fiber optic cables. Proper use of cyberspace is the basis
for the economy and national security. Securing cyberspace is an extensive
undertaking that requires coordinated action and commitment from all
stakeholders of society: governments, states, local governments, private sector
and citizens [4].
Nowadays, modern societies depend on cyberspace for normal
functioning. The threat of cyber war and its alleged effects are a source of
great concern for governments and armed forces around the world. The fact that
several serious cyber attacks are being carried out at this very moment, while
the exact definition of cyber war is still being debated, can serve as an
illustration of what can be expected if a real cyber war occurs in the future. It
is genuinely difficult to identify the perpetrators of a cyber attack, so they have
plenty of time to conceal their real identity [2].
Perhaps the movie “Matrix” starring Keanu Reeves is one of the many
stories about the future and the evolutionary process of cyberspace, about the
evolution of the war, about change of the perception of the man to the
machine, about the technological development and the development of
artificial intelligence, about switching roles between the humans and the
machines, about the world in which the machines manage the people, about
virtual world created by the progress of machinery using the possibilities of
cyberspace and smooth mutual communication through the established
network connections.
The scope of this chapter is cybercrime, and the basis for the definition of
cybercrime is that this type of crime includes any criminal act relating to
computers, computer networks and computer systems. The Convention on
Cybercrime 2001 of the Council of Europe in its preamble defines cybercrime
as "activities that are directed against the integrity, confidentiality and
availability of computer systems and data networks, as well as any misuse of
these system networks and computer data" [10].
Malicious hackers are responsible for cyber attacks. Their basic
objective is to penetrate a computer, data network or computer system
through cyberspace, with the ultimate goal of disrupting the stability of the
system, taking over control of the system (a so-called zombie system),
carrying out denial of service attacks, stealing personal data, stealing
monetary funds from victims' accounts, propaganda, spying, changing data,
abusing critical infrastructure and many other criminal activities with the help
of malicious software (viruses, worms, etc.).
It is difficult to understand the motives for committing cybercrime;
however, the following grounds are very common:

 Political/religious,
 Financial benefit,
 Idealistic (activities carried out only to prove one's capabilities, without
expectation of a reward or financial benefit),
 Curiosity and adventure (beginners who have not yet entered criminal
circles, but do it for fame, without much knowledge or skill) [13].

One of the characteristics of cyber attacks is that it is difficult to identify
the perpetrators of the attacks, and even the countries that committed them.

CYBERCRIME METHODS OF ACTION


Cybercrime evolves continuously and changes its forms at high speed.
The types of cybercrime depend only on the aspirations of the malicious
hackers. Cybercrime, as a crime, has the following main elements:

 Story – what happened,
 Circumstances – how it happened,
 Mental state of the perpetrators, which is necessary for crime classification
and cyber-criminal profiling.

Cybercrime can be grouped based on the role of the computer in the
execution of the crime, where the computer can be:

 the apparent target (unauthorized entry into the computer, data theft);
 the means of attack (credit card fraud, sending spam and pictures);
 connected to everyday crime (trafficking in drugs and people,
child pornography, etc.);
 the source of evidence of the committed cybercrime.

Recent studies have shown that crime associated with computers has an
increasing rate, which primarily refers to violations of intellectual property
(unauthorized copying and theft of copyrighted works) and software piracy.
There are many types of cybercrime, some of which are:

 theft of computer services;
 software piracy;
 disclosure, theft and alteration of computer data and information;
 extortion using a computer;
 misuse of stolen passwords;
 child pornography;
 transmission of destructive viruses;
 industrial and political espionage [18].

The most frequently used methods for cyber-attacks by malicious hackers are:

 Hacking, the activity performed by malicious hackers (cyber-criminals).
The main goal of malicious hackers is to enter a system without
authorization and bypass its authorization and identification procedures, to
disrupt the proper functioning of the system, to steal data and
information from the system, etc. Malicious hackers can be hired by various
companies, work as spies for some governments, or have
connections with organized crime and terrorist groups. The reasons for
these actions differ: material gain, revenge, entertainment, etc. [12].
 Social engineering, a method that uses people as the weakest line of
defense of any organization. A common term for this activity is people
hacking [1].
 Malicious software, which involves the use of viruses,
worms, trojans, spyware, etc.
 Denial of Service (DoS) attacks, which are used to block the targeted
system by flooding it with a huge volume of service requests per unit of
time. This prevents the system from answering legitimate requests and
therefore causes a complete blockage of that system.
 Bank card fraud.
 Phishing attacks, which enable malicious hackers to use fake
e-mail messages and fraudulent websites of financial institutions to try
to mislead consumers into disclosing confidential personal data [5].

SECURITY RISKS AND THREATS WHILE USING THE INTERNET
With the emergence and use of the Internet, the term security of computers
and computer networks has gained a wider meaning, which includes the protection
of data and computer networks from malicious hackers. The trend of globalization
has a further impact on security in the Information Technology (IT) sector. The
undefined legal provisions on security risks and threats in the Internet space at
the global level create safe zones for offenders and malicious hackers. The large
number of Internet users makes it difficult, and sometimes impossible, to search
for violators of the laws relating to abuse of the Internet. The most common ways
of endangering computers over the Internet are:

 Downloading malicious software attached to e-mail;
 Open ports on the personal computer (PC), mostly due to already
installed malicious software, through which control of the computer can
be taken, as in Distributed Denial of Service (DDoS) attacks, Denial of
Service (DoS) attacks, ARP spoofing (Address Resolution Protocol
fraud), SYN flooding and the like;
 Visiting suspicious websites (usually placed on free servers) which,
mostly through Java or ActiveX, insert malicious software onto the
computer;
 Installing and launching suspicious programs that have been
infected with malicious software;
 Security vulnerabilities in programs used on the computer
(the operating system, browser, e-mail clients and so on) that require
daily downloads of security updates;
 Using so-called cracks (patches) that allow the illegal use of software;
 Network identity theft (phishing), which involves collecting personal
data (usernames, passwords, credit card numbers, telephone numbers,
etc.) without the user even being aware of being victimized;
 Redirection to fake websites (pharming) through modification of the
local DNS (Domain Name System) entries on a computer that has
previously been infected with malicious software;
 Reckless use of social network services: Facebook, Twitter,
LinkedIn, Myspace [20]…

When accessing the Internet, the computer sends a message to the
appropriate web page and in return expects information allowing access to
it. Often, this feedback brings unwanted malicious software created by
malicious hackers, and the firewall cannot recognize what is inside the packet.
This software starts to install itself on the computer, and it can be anything from
a slight discomfort to a serious threat to the computer system, personal data and
sensitive financial information. Usually inconveniences are visible and easy to
detect, while dangerous threats are invisible, silent and difficult to detect [33].
Malicious software (malware), under the definition of NIST (the National
Institute of Standards and Technology), refers to a "program that is often secretly
deposited in the system in order to compromise the confidentiality, integrity or
availability of data, applications and the operating system, or that tries in another
way to harass the victim." In other words, malicious software can be used to delete
or destroy valuable information, to slow down the performance of the computer
to a complete standstill, or to spy on and steal important personal data from the
victim's computer [30]. Malware includes all malicious programs, such as
computer viruses, worms, Trojan horses (Trojans), rootkits, backdoors,
spyware, adware, phishing, pharming and others.
Viruses are malicious programs that infect the user's computer in order to
cause damage (deletion and destruction of data, programs and the operating
system). A computer virus is program code that is placed into individual files of
application or system software. Viruses usually consist of two parts: self-modifying
code that allows propagation of the virus, and the main code (payload) whose
contents can be harmful. A computer can be infected with viruses through the
Internet or through portable devices. Polymorphic viruses are those which change
their code every time they multiply in order to avoid detection by antivirus
programs.
Worms are malicious programs that are written into the working memory of
the computer, where they remain active. Worms spread by placing identical
copies of themselves on other computers; in that way they can infect a large
number of computers in a short time. The range of possible damage goes from
damage to the operating system (OS), shutting down the PC, slowing down work
with network resources, and opening and closing the CD/DVD-ROM tray, to
displacing the characters typed on the keyboard, etc.
Trojan horses, or Trojans, are malicious programs that pose as useful
programs while essentially causing great damage to the computer. They appear
to originate from legitimate sources, but they disable the anti-virus program and
firewall, thus allowing access to the user's computer [8]. Unlike other
malicious software, Trojans must be activated by hackers over the Internet. The most
common operations that a hacker can perform on a computer that is infected
with a Trojan horse are:

 Collection and theft of confidential information – passwords, bank
accounts, etc.;
 Installing software (including other types of malicious software);
 Downloading, installing, deleting, creating and modifying files;
 Viewing the user's desktop;
 Adding the computer to a botnet (for DDoS attacks);
 Collecting and recording the text entered through the
keyboard (keylogging); and
 Consuming the resources of the computer system and slowing it down [20].

Spy Programs (Spyware)

According to the definition of Trend Micro Inc., spyware consists of:
"Malicious programs that send user data to a third party without the knowledge
or consent of the user."
There are two types of spyware:

 Legal spyware programs and
 Commercial spyware programs – illegal malicious programs.

Legal spyware programs are those installed on computers by the owners of
a company so that network administrators can monitor the activities of employees.
These programs are used for the protection of intellectual property, data and
computer networks, and for the parental supervision of children and juveniles
(at the request of a parent). Other legal cases for using these programs are
state authorities monitoring terrorists, criminals and other law-breakers.
Commercial spyware programs are programs created by companies to
collect information about users' habits when viewing Internet content. These
programs are illegal, and they collect users' information easily. The marketing
industry benefits the most from spyware. According to their purpose, spyware
programs can be divided into the following categories:

 Internet URL Loggers,
 Screen Recorders,
 E-mail Recorders,
 Chat Loggers,
 Keyloggers,
 Password Recorders,
 Tracking Cookies,
 Browser Hijackers,
 Modem Hijackers and
 PC Hijackers [8].

Rootkit Programs

A rootkit (from root – administrator, and kit – equipment, tools) is malicious
software that can be composed of several programs whose main task is to
conceal the signs that the system is compromised; it is designed to let other
malware (e.g., a keylogger program) surreptitiously take control of the operating
system. Using a rootkit does not have to be malicious, but the term rootkit is
increasingly associated with undesirable behavior of the operating system and
malicious programs. Contrary to what its name may imply, a rootkit does not
assign administrative privileges to the user; rather, it provides access to, moves
and modifies system files and processes.

Backdoor Programs

A backdoor is a program installed by viruses, worms or Trojan
horses (without the user's knowledge) which is used to bypass authentication
(the process of verifying the user's personal data at the moment of logging in
or connecting to the operating system), with the ultimate goal of enabling
smooth and unauthorized access to the operating system. A backdoor exploits
the flaws and weaknesses of the operating system. Backdoor Trojans open a
"side entrance" to the compromised computer and allow unauthorized use of
the hardware and software resources of the compromised operating system.

Adware Programs

Adware (from ad – advertising, advertisement, and ware – programming package)
is any software package, including malicious software, which starts automatically
and displays or downloads advertisements on the computer during use or after
installation. Such programs have the ability to collect large amounts of personal
data and transfer them to third parties, including companies and advertising
networks. The on-line advertising industry is a big and competitive business,
powered by the buying and selling of personal data, such as Internet browsing
behavior and the nature of the users. There are many ways to collect such
information. One way is contracting with social networking websites [20].
Another way is to set so-called cookies in our web browser to monitor our
behavior and interests on the Internet, or to make an
agreement with the makers of applications for smart phones, which can
even use GPS (the Global Positioning System) to find our location [30].
Cookies, pop-ups and adware are tools for monitoring our behavior
when we are on-line and are used to promote various products.
Many cookies are safe tools whose sole purpose is monitoring and collecting
information from the Internet. In most cases adware programs consist of pop-up
ads that cause nothing more than unwanted nuisance. The main problem with
these tools is that malicious hackers and on-line criminals largely use them to
access our computers and collect our personal information without our being
aware of it. Some of the data of a user who visits a website are recorded in log
files. These files register the data of interest to the creator of the website [33].

Phishing – Catching Personal Data

Phishing attacks include activities in which unauthorized users, using false
messages in e-mails and fake websites of financial institutions (e.g.,
"Ebay", "Paypal", etc.), try to mislead the user into revealing personal data.
This primarily refers to data such as credit card numbers, usernames,
passwords, PIN codes, etc. [5]. Phishing attacks are carried out in several
stages: gathering intelligence, designing and preparing the attack, and carrying
out the attack. The fake e-mails and web pages used for these attacks look very
similar to the original. These attacks can be recognized through the URL
(uniform resource locator) address of the fake website. For example, if we
visit "eBay", then the final part of the domain in the URL address should end
with "ebay.com". Accordingly, websites with the URLs
http://www.ebay.com and http://cgi3.ebay.com are valid web pages, while
http://www.ebay.vaidate-info.com and http://www.ebay.login123.com are fake
web pages that can be used by malicious hackers. If the URL contains an IP
(Internet Protocol) address, like 12.30.229.107, instead of the domain name,
it is more than likely that someone wants to capture (phish) personal data
from the computer [20]. Table 1 shows the categories of websites infected
with phishing.
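As an illustration of the URL check just described, the following Python sketch flags a URL as suspicious when its host is neither the expected domain nor one of its subdomains, or when it is a raw IP address. The helper name and the expected_domain parameter are our own, introduced only for this example; the test URLs are the ones quoted above.

import ipaddress
from urllib.parse import urlparse

def looks_like_phishing(url, expected_domain="ebay.com"):
    """Return True if the URL does not belong to the expected domain."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)   # a bare IP address instead of a domain name
        return True
    except ValueError:
        pass
    # Legitimate hosts are the domain itself or a subdomain ending in ".<domain>"
    return not (host == expected_domain or host.endswith("." + expected_domain))

for u in ("http://www.ebay.com", "http://cgi3.ebay.com",
          "http://www.ebay.vaidate-info.com",
          "http://www.ebay.login123.com", "http://12.30.229.107"):
    print(u, "->", "suspicious" if looks_like_phishing(u) else "looks valid")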
"The New New Internet", a web page devoted to cyber security, has recently
given special emphasis to malicious hackers who are increasingly active on
social networking sites and launch phishing attacks through instant messaging,
Facebook, Twitter and other social networks [30].

Table 1. Website categories infected with phishing [15]

Rank Category Rank Category
1 Free web pages 6 Travel
2 Education 7 Shopping
3 Sports 8 Health and Medicine
4 Business 9 Real Estate
5 Computers and Technology 10 Fashion and Beauty

Pharming Programs

Pharming, unlike phishing, directs users to fake websites without the user
being aware of it. Web pages are usually addressed by a domain name, while
their exact location is determined by an IP address. When the user types the
domain name into the web browser and presses enter, the domain name is
converted into an IP address through DNS (the Domain Name System). The web
browser then connects to the server with that IP address and takes data from the
website. Once the user visits the website, the DNS entry for that site is usually
remembered in the DNS cache of the user's computer, so the computer does not
have to contact the DNS server every time the user wants to access the website.
One of the ways of pharming is an e-mail carrying virus code that infects the
user's local DNS cache. For example, the IP address 17.254.3.183, which
essentially is the address of www.apple.com, can be changed by hackers to point
to another website. Pharmers can also infect DNS servers themselves, which
means that any user who uses that server will be redirected to the wrong website.
Most DNS servers have protection measures against these attacks; however, this
does not mean that they are 100% immune to attacks by malicious hackers. These
attacks can affect multiple users at once when a large DNS server is modified [6].
Pharming and phishing are the best-known methods of theft of identity and other
personal data of the user. Categories of websites that were probably
compromised with malware in 2013 are shown in Table 2 [15].
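The name-to-address step that pharming subverts can be sketched in a few lines of Python (standard library only; running it requires network access). An honest resolver returns the site's real address, while a poisoned local cache or DNS entry would silently return an attacker-controlled address for the same call, which is exactly what redirects the victim without any visible change in the address bar. The domain below is the one used in the example above, and the printed address may differ from the 17.254.3.183 quoted in the text.

import socket

domain = "www.apple.com"              # example domain from the text above
ip = socket.gethostbyname(domain)     # asks the resolver: cache, hosts file, DNS server
print(f"{domain} resolves to {ip}")
# With a pharmed resolver the same call would return a different, attacker-chosen
# IP address, and the browser would then fetch the fake page from that address.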

Table 2. Website categories infected with malware [15]

Rank Category Rank Category
1 Travel 6 Education
2 Transportation 7 Search Engines and Portals
3 Business 8 Arts
4 Sports 9 Restaurants and Dining
5 Leisure and Recreation 10 Real Estate

Scareware Programs

The term scareware covers several classes of fraudulent programs, often of
little or no benefit, that are sold to consumers through unethical marketing.
These programs are designed to cause shock or a perception of threat among
users. The most frequently used tactic is convincing the user that the computer
is infected with a virus and recommending the download of an antivirus
program to remove it; the recommended antivirus is mostly commercial and
users must pay for its use. The term is also often used to describe a product
that performs the desired operation but also produces many warnings urging
the use of a commercial firewall or registry cleaning programs (registry
cleaner software). These classes of programs often display continuous warning
messages to users. Moreover, some websites display pop-up windows or banner
advertisements with text telling the user that the computer is infected with
malware and therefore suggesting a scan of the computer by clicking on the
offered window. These programs are not linked to any actually installed
malicious programs; they give false warnings and are made to look as if they
come from the operating system. The user can infect his or her computer even
by clicking the window to cancel or close the message. Some types of programs
that steal users' data are also ranked among scareware because they change the
appearance of the computer's background, install icons (on the Windows
operating system), and continuously inform the user that the computer is
infected with some form of malicious software. One example of this type of
fraud is SpySheriff, a program for stealing user data that poses as a program for
removing such malicious programs.

Another form of scareware is the so-called joke program (prank software),
which is intended to intimidate the user with unexpected images, sounds or
video messages. The first distributed program of this type, called NightMare,
appeared on the Amiga computer in 1991. This virus did not act at the moment
the operating system booted, but at randomly selected periods of time it altered
the entire background and played horrible sounds. Such programs have more
recently been designed to display a window claiming that, no matter what is
typed, all data on the computer will be erased, regardless of whether that action
is actually taken. However, the actual effect is that these malicious programs
only intimidate the user and never delete data from the computer.

Ransomware Programs

These malicious programs are defined as programs that exploit the
vulnerability of a personal computer in order to break into its operating system
and encrypt its files. Once this happens, the attacker keeps the files locked until
the victim is willing to pay a certain amount of money. If the operating system
has previously been attacked by a worm or a Trojan horse, an attacker can easily
penetrate a poorly configured operating system. In most cases, false messages
are used during the attack in order to find the vulnerabilities of the antivirus
program in use and to insert the malware into the operating system through the
most vulnerable port of the system. The next step is contacting the user: the
attacker sends an e-mail to the victim, or a window appears on the victim's
screen with a message stating that an encryption key is required to unlock the
files. Very often, instructions on how to recover the data are given. When the
attacker, using ransomware tools, takes control of the data, he or she encrypts
it with a sophisticated algorithm. The decryption password is given to the victim
only once the demanded amount is paid. The attacker informs the victim with an
instruction message on the steps to be taken to recover the data, which is located
in the same directory as the encrypted data [7]. At the end of 2013, users of the
Windows operating system faced such a threat, known as "CryptoLocker",
which, once it swept a computer, encrypted all personal files and folders so that
users could not access them [21].

Steganography

Steganography involves concealing secret messages in multimedia
files (images, video files, audio files, etc.). The process of steganography
usually involves inserting a secret message into a transmission medium that is
used as a carrier, whose primary role is the permanent concealment of the secret
message. The carrier should be content that does not draw any attention to
itself (an image, text, audio or video, etc.). The package composed of a secret
message and the holder where the message is located is called the
steganographic medium, or stego. Besides the use of steganography for good
causes (protection of intellectual property rights, confidentiality in
communications, etc.), it is very often used for illegal activities, such as the
transmission of malicious software, and it can also be used for computer
assault [7]. Today, there are many examples of the use of steganography
in secret communications between terrorist organizations.
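To make the idea of a carrier concrete, the following purely illustrative Python sketch shows the classic least-significant-bit (LSB) embedding technique: each byte of the carrier (standing in for raw image pixel data) gives up its lowest bit to hold one bit of the hidden message, a change that is imperceptible in the rendered content. The function names and the toy carrier are our own assumptions for this example and are not taken from the chapter.

def embed(carrier: bytes, message: bytes) -> bytes:
    # Message bits, most significant bit of each byte first.
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("carrier too small for the message")
    out = bytearray(carrier)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit       # overwrite only the lowest bit
    return bytes(out)

def extract(stego: bytes, length: int) -> bytes:
    bits = [b & 1 for b in stego[:length * 8]]
    return bytes(sum(bit << (7 - i) for i, bit in enumerate(bits[k * 8:(k + 1) * 8]))
                 for k in range(length))

carrier = bytes(range(256)) * 4              # stand-in for raw pixel data
stego = embed(carrier, b"secret")
print(extract(stego, 6))                     # prints b'secret'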
Other malicious programs that are not covered in this chapter, but should be
mentioned, are: dialers (number dialing programs), DDoS (Distributed Denial of
Service) attack tools, botnet networks, exploits, keyloggers, boot viruses, hoaxes,
scams, macro viruses, malware droppers and others [20].
Disgruntled insiders are employees of public institutions or private
companies whose objective is to damage the system or to steal sensitive data
from the company in which they are employed. According to the Federal Bureau
of Investigation (FBI) in the United States, insider attacks are twice as likely as
attacks from third parties [34]. In this context, social engineering is increasingly
being applied. This method exploits the weakest "line of defense" of any
organization – people. As a new trend, in foreign literature this is known as
people hacking, where the trust of people is abused for personal gain [1].

SOME EXAMPLES OF CYBERCRIME NOTICED IN THE PAST FEW YEARS

The security risks and threats mentioned above permanently exist in the
Internet space. For these reasons, everyone who uses this space should be aware
of the risks and threats that constantly lurk there. Below are a few examples of
malicious programs that were prominent in 2013.

For example, on September 6, 2013, distributors of malicious software
invented false news aiming to exploit public attention about the possibility of
US air strikes against Syria. For this purpose, the messages used the title "The
United States Began Bombing" and were made to look like legitimate news from
the US broadcasting station "CNN" ("The Cable News Network"). The trend in
these campaigns is that they are launched ever faster, in near real time: the time
needed to create a malware attack steadily declined from the examples before
Syria to the Syrian case.
In March 2013, when the new Pope was elected, the first attack with
malicious software started exactly 55 hours after the election. In April 2013,
after the bombing of the Boston Marathon in the US, the first malicious software
attack designed to exploit public attention was launched after only 27 hours.
In the case of Syria the attackers did not wait at all, so the attacks were faster
than the events on the ground. All this confirms that the Internet has great power
to attract public opinion. The past several military campaigns have confirmed
the motto "the one who attracts the world's public opinion to his side will win
the war" [15].
A campaign from the end of July 2013 referred to the arrival of the royal
baby – Prince George – in the UK. This malware campaign was initiated to
exploit the great interest in the news. Unlike the previous one, this news item
had all the characteristics of web malware: it led to a page with three hidden
links to pages infected with malicious software. The script "<turncoat.js>" in
the background activated the "Blackhole Exploit Kit" without being noticed by
the user. The only visible content on the page was the message "Connecting to
server". The "Blackhole Exploit Kit" is one of the favorite tools commonly used
by cyber criminals nowadays. It scans the target system and then downloads the
most appropriate malware depending on the operating system, browser type,
PDF format, etc.
As an illustration, in May 2006 the main news in US newspapers was the
theft of veterans' insurance data. The chronology of the case is as follows: an
employee of the insurance company, in order to prepare the documents necessary
for insuring veterans without missing the given deadlines, and contrary to the
prescribed norms and safety rules, put the relevant data on his notebook (laptop)
and took the data home to work on. In the meantime, the portable computer was
stolen from his home with the data on it. The company estimated that if these
data ended up in the wrong hands, it could cause serious consequences for the
US and could lead to a complete collapse of the pension fund. Because of the
seriousness of the
problem, a large expert team was created, which decided to publish this
information through the media in order to point out to the thief the possible
consequences (and so convince the thief to destroy the laptop). On the other
side, representatives of the IT sector of the insurance company undertook all
necessary measures to protect the data from possible abuse [31]. The outcome
was that, luckily, there were no consequences for the pension fund and the
state after all, but because of negligence and a breach of security procedures by
one person, contingency funds, time and extra work for data protection were
spent.
The 2011 report "The high-tech crime" by Norton, a company for software
security solutions, estimated that consumers lost about 114 billion US dollars to
cybercrime. Newspapers made a comparison and found that the profits from
cybercrime are equivalent to the profits from the global drug trade.
In June 2012 the FBI carried out operation "Card Shop", in which 24
people from thirteen countries on four continents were arrested for stealing
and selling credit card data. The FBI succeeded in capturing them because it had
secretly set up an online carding forum called "Carder Profit", which worked on
an invitation-only basis and was constantly monitored by members of the FBI.
Stolen data were returned to the banks, more than 400,000 victims of cybercrime
were protected, and losses of 205 million US dollars were avoided [14].
FBI Assistant Director Janice K. Fedarcyk said:
"From New York to Norway and Japan to Australia, Operation 'Card Shop'
was directed against sophisticated, highly organized cyber criminals involved
in buying and selling stolen identities, used credit cards, forged documents and
sophisticated hacking tools. The two-year secret FBI investigation conducted
on four continents is proof of the commitment to eradicate rampant criminal
behavior on the Internet" [32]. This action also involved the computer crime
unit of the Ministry of Interior of Macedonia. According to the FBI, in
Macedonia only orders for the search and questioning of two persons, for whom
there were grounds for suspicion of involvement in cybercrime, were carried out.
However, in this action coordinated by the FBI, no suspects from Macedonia
were arrested [27].
In 2012, the Computer Crime Unit of the Ministry of Interior (MI) of
Macedonia detected a cybercrime attack which caused damage through
unauthorized entry into the computer system used for the public procurement
of one hundred and fifty police vehicles. Namely, on September 3, 2012, the
Bureau for Public Safety (BPS) filed an application concerning issues of a
technical nature in the operation of the electronic procurement system of
Macedonia during the MI electronic auction for the purchase of motor vehicles.

Furthermore, analysis of the log files for access to the site established that
increased traffic was coming from 119 different IP (Internet Protocol) addresses
from various countries of the world. Obviously, the goal was a denial-of-access
attack on the electronic system of the BPS, which was bombarded with
simultaneous requests from 119 different IP addresses that blocked
the electronic system [29].
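As an illustrative sketch of this kind of log analysis, the following Python snippet counts how many distinct client IP addresses appear in a web server access log. The file name "access.log" and the assumption that the client IP is the first whitespace-separated field (as in the common log format) are ours, introduced only for the example and not taken from the incident described above.

from collections import Counter

def distinct_clients(log_path):
    """Count requests per client IP in a common-log-format access log."""
    ips = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if fields:                      # first field = client IP address
                ips[fields[0]] += 1
    return ips

counts = distinct_clients("access.log")     # hypothetical file name
print(len(counts), "distinct IP addresses")
for ip, hits in counts.most_common(10):     # the heaviest sources of traffic
    print(ip, hits)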
In February 2014 the Dutch police arrested four Dutch citizens and one
German, and closed down trading on the so-called dark web market "Utopia".
These people were suspected of being involved in trading illicit drugs, stolen
credit cards, weapons, etc. Two of those arrested were suspected of having
established another dark web site known as the "Black Market Reloaded".
In the operation the following items were found and seized: personal
computers, hard drives, USB sticks and 900 so-called Bitcoin with a
value of between 400,000 and 600,000 euros [26].
The number of criminal activities in the area of cybercrime that have occurred
worldwide and at the domestic level is significantly higher than those
mentioned above. The aim here is to show that no country is immune to this
modern threat, which constantly changes in shape and capacity. Cybercrime, like
any other crime, knows no borders, nations or individuals, and its well-known
environment is cyberspace.

SOME STATISTICS FOR CYBERCRIME IN MACEDONIA, NOTICED IN THE PAST FEW YEARS
Furthermore, in this chapter we are going to use simple linear regression
for data analysis. Simple linear regression is a statistical method that
allows us to summarize and study the relationship between two continuous
(quantitative) variables. Below we introduce the concept and basic
procedure of simple linear regression. The goal is to find the equation of the
straight line [23]:

Y = α + βX + ε (1)

X is the independent variable, Y is the dependent variable, β is the slope
of the fitted line, α is the intercept of the fitted line, and ε is the error term.
When there is only one predictor variable, the prediction method is called
simple regression. In simple linear regression, the predictions of Y, when
plotted as a function of X, form a straight line. Linear regression consists of
finding the "best" fitting straight line through the points. The "best" fitting line
is called a regression line. The black diagonal line in Figure 1 is the
regression line and consists of the predicted score on Y for each possible value
of X.

Table 3. Example data

X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25

Figure 1. A scatter plot of the regression line for example data in Table 3 (The black
line consists of the predictions, the points are the actual data, and the vertical lines
between the points and the black line represent errors of prediction) [9].

The error of prediction (Y − Y′) for a point is the value of the point minus
the predicted value (Y′, the value on the line). By far the most common
criterion used for the "best" fitting line is the line that minimizes the sum of the
squared errors of prediction; that criterion was used to find the line in Figure 1.
Even though the regression line is nowadays computed with statistical software,
the calculations are relatively easy and are given further in the text. MX is the
mean of X, MY is the mean of Y, SX is the standard deviation of X, SY is the
standard deviation of Y, and r is the correlation between X and Y.

The slope (β) can be calculated as follows:

β = r (SY / SX) (2)

and the intercept (α) can be calculated as:

α = MY − βMX (3)

The regression equation is simpler if the variables are standardized, so that
their means are equal to 0 and their standard deviations are equal to 1, for then
β = r and α = 0. This makes the regression line:

ZY′ = (r)(ZX) (4)

where ZY′ is the predicted standard score for Y, r is the correlation, and ZX is
the standardized score for X. Note that the slope of the regression equation for
standardized variables is r [9].
Further, we are going to present and analyse computer crime in
Macedonia in the period from 2012 to 2015. In 2012, there was 1 (one)
reported case of production and distribution of child pornography via a
computer system [35]. In 2013, there were 91 incidents: 74 cases of
unauthorized penetration into a computer system, 4 cases of computer fraud,
and 13 cases of credit card fraud. In 2014, there were 103 incidents of
computer crime: 76 cases of unauthorized penetration into a computer system,
4 cases of computer fraud, 4 cases of abuse of credit cards, 18 cases of
production and use of a fraudulent credit card, and 1 case of computer forgery
[36]. In 2015, there were 48 incidents of computer crime [37]. These data are
shown in Figure 2 and are used further for the data analysis.
In this chapter we use linear regression to produce a scatter plot together
with a trendline – the regression line (Figure 3). Our data set is thus the state of
computer crime in Macedonia over the years (the data in Figure 2). The first
column of Table 4 shows the years and the third column shows the number of
computer crime incidents during those years. In order to simplify Table 4,
alongside the years in the first column we create a second column with the
values 0, 1, 2, 3, respectively, counting the years since 2012.

Figure 2. Detailed data on computer crime in Macedonia from 2012 to 2015.

The equation of the regression line on the chart is y = 15.3x + 37.8 (as shown in
Figure 3). The R-squared value (0.182) shows little or no correlation between
the years and computer crime in Macedonia. The value 15.3 (the slope
coefficient) means that, on average, computer crime goes up by about 15.3
incidents each year. The value of 37.8 (the intercept coefficient) is the model's
prediction of computer crime in year 0 (2012); notice that the line does not hit
that dot. We can use this regression line to make a prediction for computer
crime a year or two into the future, but it is not a good model for making
predictions very far into the future. The standard error, which is the same as the
residual standard deviation, is 51.295 in this case. The p-value for the slope
coefficient is quite high, which means there is little or no evidence of a
relationship between computer crime and the years.

Table 4. Computer crime over the years in Macedonia

Year Years since 2012 Computer crime
2012 0 1
2013 1 91
2014 2 103
2015 3 48
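As a worked illustration of equations (1)–(3), the following Python sketch (standard library only) fits the least-squares line to the data in Table 4 and reproduces the slope, intercept, R-squared value and standard error quoted above; the two short-range predictions printed at the end are outputs of this simple model, not reported figures.

from math import sqrt

x = [0, 1, 2, 3]               # years since 2012
y = [1, 91, 103, 48]           # reported computer crime incidents (Table 4)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)

beta = sxy / sxx               # slope
alpha = my - beta * mx         # intercept

pred = [alpha + beta * xi for xi in x]
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
sst = sum((yi - my) ** 2 for yi in y)
r2 = 1 - sse / sst
resid_se = sqrt(sse / (n - 2))                      # residual standard deviation

print(f"y = {beta:.1f}x + {alpha:.1f}")             # y = 15.3x + 37.8
print(f"R^2 = {r2:.3f}, standard error = {resid_se:.3f}")   # 0.182 and 51.295
for year in (4, 5):                                 # 2016 and 2017
    print(2012 + year, round(alpha + beta * year, 1))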

Figure 3. Linear regression chart for computer crime over the years in Macedonia.

Figure 4. Share of Internet users who caught a virus or other computer infection [11].

Figure 4 shows the share of Internet users who caught a virus or other
computer infection in Macedonia, Turkey, Greece, Bulgaria, Slovenia and
Croatia. Compared with 2010, the share of Internet users who caught a virus or
other computer infection resulting in a loss of information or time had dropped
in all countries by 2015, except in Macedonia and Croatia. As shown in Figure 4,
the most remarkable fall was detected in Bulgaria (from 58% in 2010 to 28%
in 2015, a decrease of 30 percentage points), followed by Slovenia (−21 pp),
and Turkey and Greece (both −9 pp). Contrary to these countries, Croatia (+8 pp)
and Macedonia (+3 pp) recorded growth in the loss of information or time due to
catching a virus or other computer infection through the Internet. Figure 4 also
shows that Macedonians were the most exposed to hackers' attacks in
comparison to the other countries shown in the figure (78% of Internet users in
Macedonia caught a virus or other computer infection, compared to 41% in
Croatia) [11].
Some of the reasons for the growing percentage of hackers' attacks in
Macedonia are:
 No national strategy and policy for information security;
 No appropriate legal framework dealing with information security
that is in accordance with existing international conventions and
agreements;
 No established Centers for Incident Registration and Support in case
of a breach of information security;
 The growing rate of Internet users in Macedonia (over 60%);
 The emergence of broadband Internet and new features such as DSL or
cable Internet (which enables a decline in prices for consumers and
greater penetration into Macedonian households);
 Low cyber security awareness among the population;
 Ignorance, negligence and violation of safety rules and procedures.

Figure 5. Selected on-line activities not done because of security concerns, 2015 (% of
Internet users) [11].

In addition, Figure 5 shows a significant share of Internet users who
did not use the Internet in 2015 for certain personal activities because of their
concerns about cybersecurity. In particular, more than 1 Internet user out of 5
did not buy or order goods or services on-line for private use in Macedonia
(38%) and Slovenia (24%). These security concerns discouraged 19% of
Internet users from e-shopping in Bulgaria, 15% in Greece, 14% in Croatia
and 11% in Turkey. Security concerns kept more than 1 Internet user out
of 5 from e-banking activities only in Greece (22%). In Macedonia, this was
the case for 19% of Internet users in 2015; for the rest of the countries, the
details are shown in Figure 5. Using the Internet with a mobile device via a
wireless connection from places other than home was limited or avoided due
to security concerns by 6% of Internet users in Macedonia. The highest
percentage for this criterion had Bulgaria (8%) and the lowest Greece, with
only 2% of Internet users. The other details for this criterion for the other
states are shown in Figure 5 [11].

SECURITY MEASURES FOR PROTECTION AGAINST SECURITY RISKS AND THREATS ON THE INTERNET
Security recommendations suggest using a well-known and reputable
antivirus program which includes tools against spyware/malware.
It is necessary to install software patches and security updates on a daily basis.
A firewall provides protection from external attackers by shielding our
computer or network from malicious or unnecessary Internet traffic. This type
of protection is especially important for users who are constantly connected to
the Internet via cable or digital modem connectivity [30].
Computer security solutions:

1. Set a password in accordance with the standards;
2. Installation, maintenance and updating of an anti-virus program;
3. Activation of automatic updates for the antivirus;
4. Setting personal security settings in the web browser;
5. Controlling the Internet connection;
6. Protection of the wireless network;
7. Connecting the computer to a dedicated switch port (if another computer
connects, it will receive no service);
8. Intrusion Detection System (IDS).

The basic steps for staying smart on the Internet are:

1. Protection against malware and reducing spam:
1.1 Never open links in an e-mail message from an unknown
source;
1.2 Do not open an attachment from an e-mail if we do not expect it
or do not know the sender;
1.3 Scan e-mail attachments with an antivirus prior to opening them;
1.4 Always delete e-mail in the spam folder without opening it;
1.5 Do not give our e-mail address to people who do not
know us.

2. Personal protection from fraud while being active in the Internet
space:
2.1 Always check whether we are visiting a secure page;
2.2 Use a secure channel for e-banking;
2.3 Never send information about our financial status via e-mail;
2.4 Never respond to e-mails offering easy earnings;
2.5 Never transfer money or give credit card or bank account
information to any unknown people on the Internet.
3. Protecting identity and privacy:
3.1 Never share personal information via e-mail, SMS or the pages
of social networks with unknown people;
3.2 Avoid using public computers or Wi-Fi hotspots for entering
personal data.

Security on social networks:

1. Adjust the privacy settings of the profile;
2. Protect the username with a strong password;
3. Use discretion in accepting friends;
4. Never click on suspicious links – even when they come from friends;
5. Do not post information that may be sensitive for the family, such as
birthdays, addresses and the like;
6. Do not post inappropriate or personal pictures of family or friends,
or those which our friends have asked not to be publicly published [24].

We should not install too many types of security software on our
computer. Too many of these programs can affect the performance of the
computer and the effectiveness of the software. Finally, we need to protect
against unwanted e-mail messages or pop-up ads that claim to contain an anti-
virus program. These messages are usually Trojan horses waiting to infect our
computer.
It is necessary to check the privacy and security settings of the web browsers
on our computers or mobile smart phones, which are often bought with web
browsers already installed (Safari, Firefox, Chrome, Internet Explorer or others).
Browsers often come with default settings that provide a balance between the
computer's security and the functionality of web pages. The settings limit the
extent to which the computer will enable Internet applications – such as cookies,
ActiveX and Java – that help websites perform important functions. If our
browser allows unlimited interaction with cookies or other applications that
monitor Internet activity, we can easily be targeted; by contrast, if we completely
block these applications, then websites will not function effectively. It is
therefore necessary to find a balance, so for more detailed information it is
best to visit the website of the producer of the relevant browser, where we can
inform ourselves about the personal and security settings [30].
As regards the necessary measures and actions that should be taken to
increase the security of information systems in state institutions, the following
steps should be considered:

 To develop and adopt the necessary legal framework in order to
improve information security, in accordance with existing
international conventions and agreements;
 Development of a national strategy and policy for information security;
 Modification of existing laws in important areas that are sensitive to
threats to information security (e.g., e-government, e-infrastructure,
e-business, e-health, e-education, e-citizens and e-documents);
 Designation of the person/department/sector responsible for information
security (CISO – Chief Information Security Officer) in every
state/public institution;
 Implementation of activities aimed at raising awareness of
the risks, threats and challenges in cyberspace, the need to protect
information, and quick recovery from a possible cyber incident/attack
(these activities would address the following subjects: employees and
citizens in general, the non-governmental sector (NGOs), the economic
sector, government/public institutions and enterprises, and local
government);
 Appointment of a state organizational infrastructure to deal with these
incidents (Centers for Incident Registration and Support in case of a
breach of information security);
 Involvement in international activities to increase cooperation,
development projects and other activities related to combating
information security incidents [19].

CONCLUSION
The methods and forms of cybercrime are in constant evolution. They
require continuous monitoring and study by the authorities. Their ability to
adapt to new environments shows the necessity of preventive measures in
order to protect cyberspace. The new millennium, the Internet revolution, new
ways of warfare, new enemies, new tactics and techniques of warfare, new
leaders, a new world order and a new world security map only confirm the role
of security and intelligence services in the modern world's fight against cybercrime.
Everyone who is connected to the Internet is constantly exposed to
security risks and threats from malicious software. Malicious software resides
in the Internet space, where it is created, updated, upgraded, modified
and distributed to target groups by malicious hackers. The motives for creating
malicious software are of different natures (espionage, crime, entertainment,
etc.). One of the biggest dangers is disgruntled insiders within each
organization. Searching the Internet with an open IP address is an additional
security risk, as is open access to a web page while all our activities are
recorded in the browser history, the cookie store and so on.
Cybercrime is increasingly appearing in more complex forms that are difficult
to detect and prevent. Malicious software, as one of the methods of
cybercrime, is readily accessible in cyberspace. Nowadays, it is not necessary to
be a great specialist in computer equipment or a good programmer in order to
create malicious software for criminal activities, because much malicious
software already exists in the Internet space. Codes of malicious software are
built and placed on web sites or forums for malicious hackers, waiting to be
used by any person who wants to launch a cyber-attack. Social engineering has
always been a good tool for criminals to access personal information about a
potential target in order to carry out activities in the area of cybercrime.
Information gathered through social engineering is in many cases obtained
through negligence and accident.

Cyber criminals usually use phishing because they know that there are
people who have the resources (computers) but lack knowledge, who are
reckless and curious, and who therefore often become victims of this method
of cybercrime.
Protection against malicious software on the Internet consists of constantly
updating the antivirus program, installing and enabling a firewall, checking the
privacy and security settings of the browser, password protection, raising
awareness when using external memory devices (USB, CD, etc.), working
with a hidden IP address, using concealed (incognito) browsing and other
measures for computer protection.
In this chapter, simple linear regression was used in order to predict future
computer crime from the previous years. The results of this research showed
that a simple linear regression model can be used to make a prediction of
computer crime a year or two into the future, but it is not a good model for
making predictions very far into the future.
The possibilities for action by cyber criminals are huge and they use all
their methods. However, the biggest threats from the operation of cyber
criminals in cyberspace will occur due to low awareness of employees about
the threats and risks in this area, ignorance, negligence and violation of safety
rules and procedures. This means that most cybercrime will be enabled by
people as a security risk.
Prevention of the threat of cybercrime requires the establishment of a
separate institution/team to deal with the threats and challenges in cyberspace,
globally known as a Computer Emergency Response/Readiness Team (CERT).
Such a team has not yet been established in Macedonia, although its formation
was announced during 2013, with the task of protecting and providing
recommendations for the protection of the IT systems of government institutions
and the private sector. Its establishment was announced once again in "The
Program of the Government of the Republic of Macedonia 2014 – 2018". The
deadline for its constitution according to this Government Program was
June 2015, but it did not happen, probably due to the political crisis in
Macedonia. Debates on the rationale for establishing these teams were also led
on the forum of the Internet portal "IT" on the theme "Developing a CERT/CIRT
team in Macedonia". To the question "Should we create a CERT/CIRT team in
Macedonia?", 78% of the surveyed IT members voted "yes" for the establishment
of these teams, which is a high percentage in favor of establishing them.
The expectation for the future is that cybercrime is going to grow and
become more complex, more serious and more covert, and to cause major damage due
to the development of the information-technology society and the greater
opportunity to use the tools for cybercrime available on-line.

REFERENCES
[1] Kevin, Beaver. 2010. Hacking For Dummies, 3rd Edition. Wiley
Publishing, Inc. 111 River Street Hoboken, NJ, 386. Accessed
December 15, 2011. http://www.dummies.com/cheatsheet/hacking.
[2] Benjamin, S. Buckland, Fred, Schreier, and Theodor, H. Winkler.
Democratic governance challenges of cyber security. Accessed
December 15, 2013. http://www.fbd.org.rs/akcije/POJEDINACNE/
CYBER%20ZA%20WEBSITE.pdf.
[3] Barry, G. Buzan (1983). People, states and fear. Skopje, 2010:
Academic press, 112.
[4] Dejan, Vuletic. Cyber warfare as a form of information warfare.
Accessed December 15, 2013. http://www.itvestak.org.rs/ziteh_04/
radovi/ziteh-32.pdf.
[5] CARNet Croatian Academic and Research Network. Phishing attacks.
CCERT-PUBDOC-2005-01-106. CARNetCERT in association with
LS&S. Accessed December 16, 2013. http://www.cert.hr/sites/default/
files/CCERT-PUBDOC-2005-01-106.pdf.
[6] CARNet Croatian Academic and Research Network. Online extortion.
CCERT-PUBDOC-2009-06-268. CARNetCERT in association with
LS&S. Accessed December 16, 2013. http://www.cert.hr/sites/default/
files/CCERT-PUBDOC-2009-06-268.pdf.
[7] CARNet Croatian Academic and Research Network. Steganography.
CCERT-PUBDOC-2006-04-154. CARNetCERT in association with
LS&S. Accessed December 16, 2013. http://www.cert.hr.
[8] CARNet Croatian Academic and Research Network. Spyware
programs. CCERT-PUBDOC-2009-10-280. CARNetCERT in
association with LS&S. Accessed December 16, 2013. http://www.
cert.hr/sites/default/files/CCERT-PUBDOC-2009-10-280.pdf.
[9] David, M.Lane. Introduction to Linear Regression. Accessed March 26,
2016. http://onlinestatbook.com/2/regression/intro.html.
[10] ETS 185 – Convention on Cybercrime, 23.XI.2001. Council of Europe.
Accessed December 15, 2013. http://www.europarl.europa.eu/
meetdocs/2014_2019/documents/libe/dv/7_conv_budapest_/7_conv_bu
dapest_en.pdf.

[11] Eurostat Newsrelease. 1 out of 4 internet users in the EU experienced
security related problems in 2015. Accessed June 26, 2016.
http://ec.europa.eu/eurostat/documents/2995521/7151118/4-08022016-
AP-EN.pdf/902a4c42-eec6-48ca-97c3-c32d8a6131ef.
[12] Kimberly, Graves. 2010. Certified Ethical Hacker Study Guide. Wiley
Publishing, Inc., Indianapolis, Indiana, 392. Accessed January 8, 2014.
http://files.laitec.ir/wp-content/uploads/2013/06/CEH-Study-Guide.pdf.
[13] Novak, Djordjijevic. 2011. Defending Cyberspace: International Law
must address Internet – based security threats. Per Concordiam. Journal
of European Security and Defence Issues, 2 (2), 21 – 27.
[14] Sue, Halpern. 2013. Are hackers heroes? Forum for security and
democracy. Edition views and signposts, number 3 March 2013.
Accessed December 15, 2013. http://www.fbd.org.rs/akcije/
POJEDINACNE/VIP3.pdf.
[15] Internet Threats Trend Report – October 2013. Commtouch. Accessed
August 17, 2014. https://www.pallas.com/fileadmin/img/content/
publikationen/2013-Q3_Commtouch-Internet-Threats-Trend-
Report.pdf.
[16] Industrial Control Systems Cyber Emergency Response Team (ICS –
CERT). Cyber Threat Source Descriptions. Accessed February 7, 2014.
http://ics-cert.us-cert.gov/content/cyber-threat-source-descriptions.
[17] Macedonians were the most exposed nation on hackers attacks in
comparison to other EU countries. 09 February 2016. Accessed
February 25, 2016. http://www.dnevnik.mk/default.asp?ItemID=
48586E6D0DBAAF45AC36A06A9671AEC0
[18] Milan, Milosavljevic, and Gojko, Grubor. 2009. Computer crime
investigation- Methodological technological base. Singidunum
University – Serbia, 291. Accessed December 15, 2013. http:// www.
seminarski-diplomski.rs/biblioteka/Istraga%20kompjuterskog%
20kriminala.pdf.
[19] Metamorphosis guidebook on Information and Communication
Technologies (ICT), No. 4 (2007). Information security and why should
we protect? Accessed February 5, 2014. http://www.metamorphosis.
org.mk/
[20] Nikola, Banjari. 2013. Security risks on the Internet and solutions for
data protection. Singidunum University – Serbia, 2013. Accessed
December 15, 2013.

[21] New virus locks the data and required to pay $ 300. BusinessInsider.
Accessed January 4, 2014. http://brkajrabota.mk/tehnologija/internet/
30378-nov-virus-gi-zakluccua-komjuterite-i-bara-da-platite-300-dolari.
[22] Organized crime. Seminar work. Accessed February 13, 2014.
http://www.maturskiradovi.net/forum/attachment.php?aid=2015.
[23] PennState, Eberly College of Science. STAT 501. Lesson 1: Simple
Linear Regression. Accessed March 25, 2016. https://onlinecourses.
science.psu.edu/stat501/node/257.
[24] Protecting Yourself Online. What Everyone Needs to Know.
Commonwealth оf Australia 2010. Australian Government. Copyright
Administration, Attorney General’s Department, National Circuit,
Barton ACT.
[25] 2600. Accessed May 21, 2014. http://www.ag.gov.au/cca.
[26] Dejan, Sokolovski. 2014. “Dutch authorities turned off the online black
market Utopia”. Internet portal IT. Accessed February 3, 2014.
http://it.com.mk/holandskite-vlasti-go-izgasija-tsrniot-onlajn-pazar-
utopia/
[27] Dejan, Sokolovski. 2012. “MI and the FBI together against cyber crime,
24 persons from 13 countries were arrested”. Internet portal IT. June 27.
Accessed February 15, 2014. http://it.com.mk/mvr-i-fbi-zaedno-protiv-
kompjuterski-kriminal-24-uapseni-od-13-drzhavi/
[28] StATS: What is a correlation? (Pearson correlation). Accessed
February 18, 2016. http://www.pmean.com/definitions/correlation.htm.
[29] “There was computer crime in bidding for police vehicles”. 2012.
Internet portal МКД. September 19. Accessed Feruary 15, 2014.
http://www.mkd.mk/59923/crna-hronika/sepak-imalo-kompjuterski-
kriminal-pri-naddavanjeto-za-policiskite-vozila.
[30] Тhe Cyber security handbook. A cyber security guide. Accessed August
6, 2014. www.NJConsumerAffairs.gov.
[31] USAID/Project eGovernment. Ministry of Information. Metamorphosis.
2010. Fundamentals and development of e - government. Accessed
February 13, 2014. http://www.mio.gov.mk/files/pdf/Osnovi%20i%
20razvoj%20na%20e-Vlada%202010%20-%20mk.pdf.
[32] U.S. Attorney’s Office. Southern District of New York. 2012.
“Manhattan U.S. Attorney and FBI Assistant Director in Charge
Announce 24 Arrests in Eight Countries as Part of International Cyber
Crime Takedown”. Тhe Federal Bureau of Investigation (FBI). June 26.
Accessed February 17, 2014. http://www.fbi.gov/newyork/press-
releases/2012/manhattan-u.s.-attorney-and-fbi-assistant-director-in-
Cybercrime Influence … 83

charge-announce-24-arrests-in-eight-countries-as-part-of-international-
cyber-crime-takedown.
[33] Understanding Internet Security. What you need to protect yourself
online. 2004 Big Planet, Inc. All Rights Reserved. Big Planet is a
registered trademark. Accessed August 26, 2014. http://www.
bigplanetusa.com/library/bp/pdf/bpis_understanding_security.pdf.
[34] United States Government Accountability Office, Information Security:
Cyber Threats and Vulnerabilities Place Federal Systems at Risk
(Washington DC: US GAO, 2009); William A. Wulf and Anita K.
Jones, “Reflections on Cybersecurity,” Science 326 (13 November
2009): 943-4; See Martin Charles Golumbic, Fighting Terror Online:
The Convergence of Security, Technology, and the Law (New York:
Springer, 2007).
[35] United States Department of State. OSAC, Bureau of Diplomatic
Security. Macedonia 2014 Crime and Safety Report. Accessed March
25, 2016. https://www.osac.gov/pages/ContentReportDetails.aspx?cid=
15074.
[36] United States Department of State. OSAC, Bureau of Diplomatic
Security. Macedonia 2015 Crime and Safety Report. Accessed March
25, 2016. https://www.osac.gov/pages/ContentReportDetails.aspx?cid=
17677.
[37] United States Department of State. OSAC, Bureau of Diplomatic
Security. Macedonia 2016 Crime and Safety Report. Accessed March
25, 2016. https://www.osac.gov/pages/ContentReportDetails.aspx?cid=
18939.
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 4

SOME ASPECTS OF THE APPLICATION OF BENFORD'S LAW IN THE ANALYSIS OF THE DATA SET ANOMALIES

D. Joksimović1,*, PhD, G. Knežević2, PhD, V. Pavlović3, PhD, M. Ljubić4, PhD and V. Surový5, PhD
1 Full time professor, The Academy of Criminalistic and Police Studies, Belgrade, Serbia
2 Associate professor, Singidunum University, Belgrade, Serbia
3 Full professor, University of Pristina, Faculty of Economics (Kosovska Mitrovica), Serbia
4 Associate professor, John Naisbitt University, Belgrade, Serbia
5 Faculty of Economic Informatics, University of Economics, Bratislava, Slovakia
* Corresponding author: D. Joksimovic, Email: dusan.joksimovic@kpa.edu.rs.

ABSTRACT
In this chapter, the contemporary, generally accepted theoretical analysis and assumptions regarding the implementation of Benford's Law are presented. It is interesting to introduce the perspective on Benford's Law as a consequence of the universal law of nature stating that nature strives for maximum entropy, or disorder, as well as the perspective in which Benford's Law aspires to find its place in the contemporary theory of everything in nature.
The implementation of this law in the analysis of anomalies in numerical data in various scientific disciplines is also part of this chapter. Incorrect numerical data describing a specific occurrence can be the consequence of an unintentional error in the formation of the data (bad design of an experiment, imperfections in the detection of the numerical data, a badly set up model of the process that generates the data, etc.), but also the consequence of intentional abuse.
Using Monte Carlo simulation, we determine the average value and standard deviation of the relative frequency of the type I error for the Mean Absolute Deviation test and for the Pearson χ2 test of Benford's Law, for the first digit and for the first two digits, and we determine the acceptable length of data series for the application of these tests in the context of acceptable type I errors.
We make the practical implementation of Benford’s Law by testing it
on the data gathered from the International Monetary Fund, World
Economic Outlook Database. We use two groups of data. The first group
is the data regarding the Gross domestic product and the other group
comprises the Current Account Balances for the 184 countries in the
period from 1980 to 2016.
The specific perspective provided in this chapter regards the implementation of this law in the forensic analysis of fraud, especially in the analysis of numerical data that describe various sociological, econometric and financial irregularities. We show how the combined use of Benford's Law and specific laws of mathematical statistics successfully detects potential irregularities in numerical data and advances the forensic analyst's fraud detection in this area.

Keywords: Benford's Law, Benford's distributions, analysis of numerical data, forensics

INTRODUCTION - THE HISTORY OF BENFORD’S LAW


In the second half of the 19th century, the astronomer Simon Newcomb remarked that the first pages of logarithm tables were used more heavily than the last pages. At that time, common logarithm tables were used for the multiplication and division of large numbers. He concluded that numbers starting with smaller digits were used more often than numbers starting with larger digits. In 1881 he published a scientific article [1] describing his findings, without getting into the assumptions and theoretical analysis of the phenomenon. This article went unnoticed and was easily forgotten.
In 1938 the physicist Frank Benford came to the same conclusion as Simon Newcomb. Benford tested this hypothesis using 20,229 observations of large numbers from 20 different sources and 2,968 observations of small numbers from 10 different sources. The sources were natural and social occurrences, such as numbers appearing in published journals, the lengths of rivers, the values of physical constants, mortality rates, baseball statistics, etc. Unlike Newcomb, Benford in his work [2] determined the mathematical law of the frequency distribution of leading digits in numbers, which became known as Benford's Law.
Starting from the second half of the 1980s, this law has been used increasingly often in the analysis of the consistency of numerical data describing various social and natural phenomena. Nowadays, the theoretical analysis of this law, in the sense of finding a better mathematical basis for it and for its implementation, is still a current issue in the contemporary scientific community [3-4].

DEFINITIONS OF BENFORD'S LAW AND THEIR MUTUAL EQUIVALENCY
It is commonly understood that for any base B > 1 any positive real number (x > 0, x ∈ R) can be expressed as

x = M_B(x) · B^k

where k ∈ Z and M_B(x) ∈ [1, B). The number M_B(x) is called the mantissa of the number x. It can be shown that

M_B(x) = x / B^k = x / B^{[log_B x]} = B^{log_B x} / B^{[log_B x]} = B^{log_B x − [log_B x]}

where [log_B x] denotes the whole part of the number log_B x, that is, the greatest integer less than or equal to log_B x.
In scientific practice we can find two mutually equivalent definitions of Benford's Law: Benford's Law as a function of the probability distribution of the mantissa of numerical data, and Benford's Law for the joint probability distribution of the first k significant digits of numerical data.
Definition 2.1. (Benford's Law for the probability distribution function of the mantissa)
A random variable X, whose realizations are only positive values, satisfies Benford's Law in the base B > 1 if and only if the probability distribution function of the mantissa of X, M_B(X), in that base obeys the following logarithmic law

P(M_B(X) ≤ m) = log_B m

where m ∈ [1, B).
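For illustration, in base 10 the mantissa of 2,370 is 2.37 and Definition 2.1 gives P(M_10(X) ≤ 2.37) = log_10 2.37 ≈ 0.375. A minimal sketch of these two quantities in Python (the computations in this chapter were done in MATLAB, so this is only an illustrative reimplementation; function names are ours):

```python
import math

def mantissa(x, base=10):
    """M_B(x) in [1, B) such that x = M_B(x) * base**k for an integer k."""
    k = math.floor(math.log(x, base))
    m = x / base ** k
    # guard against floating-point rounding at exact powers of the base
    if m >= base:
        m /= base
    elif m < 1:
        m *= base
    return m

def benford_mantissa_cdf(m, base=10):
    """P(M_B(X) <= m) under Benford's Law (Definition 2.1), for m in [1, B)."""
    return math.log(m, base)

print(mantissa(2370.0))              # 2.37
print(benford_mantissa_cdf(2.37))    # about 0.3747
```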


Definition 2.2. (Benford's Law for the joint probability distribution of the leading k digits)
A random variable X, whose realizations are only positive values, follows Benford's Law in the base B > 1 if and only if the joint probability distribution of the first k significant digits of its realizations, (C_j)_{j=1,2,…,k}, (k ∈ N*), satisfies the following law

P(C_1 = c_1, C_2 = c_2, …, C_k = c_k) = log_B (1 + 1 / Σ_{i=1}^{k} B^{k−i} c_i)

where c_1 ∈ {1, 2, …, B − 1} and c_{j>1} ∈ {0, 1, 2, …, B − 1}.


At first glance it may seem that Definition 2.2 can be used only for discrete random variables, but this is not correct, because k belongs to the unbounded set of positive integers, going to infinity (k ∈ N*).
Also, we can notice that, because of continuity, the following expression is valid:

P(M_B(X) ≤ m) = P(M_B(X) < m) + P(M_B(X) = m) = P(M_B(X) < m),

since P(M_B(X) = m) = 0.


Mutual equivalence of these definitions can be proven.
SOME CHARACTERISTICS OF BENFORD’S LAW


From Definition 2.2 we can easily show that for a random variable X that satisfies Benford's Law in the base B > 1 the following holds:

1. The Probability of Showing the First Significant Digit


P(C_1 = c_1) = log_B (1 + 1/c_1),   c_1 ∈ {1, 2, …, B − 1}.

This characteristic, in the base B = 10, was already observed in the first works of Newcomb [1] and Benford [2].

2. The Probability of Showing the k-th Significant Digit, k ≥ 2 (k ∈ N*)

For k ≥ 2 (k ∈ N*):

P(C_k = c_k) = Σ_{i=B^{k−2}}^{B^{k−1}−1} log_B (1 + 1/(i·B + c_k)),   c_k ∈ {0, 1, 2, …, B − 1}.

In particular, for the first two significant digits,

P(C_1 = c_1, C_2 = c_2) = log_B (1 + 1/(c_1·B + c_2)),   c_1 ∈ {1, 2, …, B − 1}, c_2 ∈ {0, 1, 2, …, B − 1}.

A numerical and graphical analysis of characteristics 1 and 2 is given below (only the distributions for the bases B = 10, 9 and 6 are shown). It shows that, from the fourth significant digit onward (digits C_{k≥4}), the distribution is almost uniform in all of these bases, and nearly the same can be said for the distribution of the third digit. That is why the implementation of Benford's Law in practice is usually restricted to the analysis of the probability distribution of the first two significant digits, and only occasionally to the analysis of the distribution of the third significant digit in a set of numerical data.
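The probabilities behind characteristics 1 and 2 (and the distributions plotted in Figures 1-4) follow directly from Definition 2.2. A minimal Python sketch of their computation (function names are ours, chosen only for illustration):

```python
import math

def p_first_digit(c1, base=10):
    """P(C1 = c1) under Benford's Law."""
    return math.log(1 + 1 / c1, base)

def p_first_two_digits(c1, c2, base=10):
    """P(C1 = c1, C2 = c2) under Benford's Law."""
    return math.log(1 + 1 / (c1 * base + c2), base)

def p_kth_digit(k, ck, base=10):
    """P(Ck = ck) for k >= 2, summing over all possible leading digit blocks."""
    return sum(math.log(1 + 1 / (i * base + ck), base)
               for i in range(base ** (k - 2), base ** (k - 1)))

# first-digit distribution in base 10: 0.3010, 0.1761, 0.1249, ...
print([round(p_first_digit(d, 10), 4) for d in range(1, 10)])
# the fourth digit is already close to uniform (all values near 0.1)
print([round(p_kth_digit(4, d, 10), 4) for d in range(10)])
```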
Figure 1. The probability distributions of the first, second and third significant digits in the base B = 10 (probability versus digit value).

Figure 2. The probability distributions of the first, second and third significant digits in the base B = 9.
Figure 3. The probability distributions of the first, second and third significant digits in the base B = 6.
Figure 4. The probability distributions of the fourth significant digit in the bases B = 10, 9 and 6.
3. The Significant Digits in Realizations of Events That Satisfy Benford's Law Are Not Mutually Independent

From Definition 2.2 it is easy to see that for k ≥ 2 (k ∈ N*)

P(C_1 = c_1, C_2 = c_2, …, C_k = c_k) ≠ Π_{i=1}^{k} P(C_i = c_i)

This property shows that the appearances of the significant digits within the realizations of events that satisfy Benford's Law are not mutually independent. This fact opens many questions in the area of investigating these distributions and of finding some universal law that lies at the essence of events satisfying Benford's Law. On the other hand, such events come from a large number of scientific disciplines and areas, and at first glance they seem completely unrelated; some scientific and academic researchers therefore consider Benford's Law to be one of the indicators of the existence of a Theory of Everything, which leading scientists have tried to approach in recent decades.

Figure 5. The probability distribution of the first two significant digits in the base B = 10.
4. Theoretical Possibility of Generating a Random Variable for Benford's Law

Taking into consideration that, by Definition 2.1,

P(M_B(X) ≤ m) = log_B m, where m ∈ [1, B),

we can generate a random variable with the mantissa distribution of Benford's Law in the base B > 1 as M_B = B^U, where the random variable U is uniformly distributed in the interval (0,1), i.e., U ~ 𝒰(0,1).
One of the consequences of this characteristic is that a set of one-digit numerical data satisfying Benford's Law can be generated using [B^U], where [ ] denotes the whole part of the number and U is uniformly distributed within the interval (0,1), i.e., U ~ 𝒰(0,1).
Therefore, taking into consideration that M_B(x) = B^{log_B x − [log_B x]}, we can conclude that a random variable X satisfies Benford's Law if and only if the random variable log_B X − [log_B X] is uniformly distributed in the interval (0,1), i.e., log_B X − [log_B X] ~ 𝒰(0,1), or log_B X mod 1 ~ 𝒰(0,1). Note that log_B X − [log_B X] is the fractional part of the number log_B X.
So, the following is valid:

log_B X mod 1 ~ 𝒰(0,1)  ⇔  X satisfies Benford's Law   (2)

Based on the Kronecker-Weyl theorem, which states that for each irrational number α ∈ R\Q the sequence z_n = n·α, (n ∈ N*), is uniformly distributed modulo 1, i.e., n·α mod 1 ~ 𝒰(0,1), and on the identity (2), it follows that for every irrational number α ∈ R\Q the sequence B^{nα}, (n ∈ N*), satisfies Benford's Law in the base B > 1.
Consequently, a geometric sequence {a^n} satisfies Benford's Law if and only if log_B a is an irrational number.
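A minimal sketch of this generation recipe (in Python with NumPy; an illustrative analogue of the MATLAB code used for this chapter): mantissas are drawn as B^U with U ~ 𝒰(0,1) and the resulting first-digit frequencies are compared with the Benford probabilities.

```python
import numpy as np

rng = np.random.default_rng(42)

def benford_mantissas(n, base=10):
    """Mantissas M = base**U with U ~ Uniform(0, 1) satisfy Benford's Law in the given base."""
    return base ** rng.uniform(0.0, 1.0, n)

m = benford_mantissas(100_000)
first_digit = np.floor(m).astype(int)          # first significant digit = integer part of the mantissa
empirical = np.array([(first_digit == d).mean() for d in range(1, 10)])
theoretical = np.log10(1 + 1 / np.arange(1, 10))

print(np.round(empirical, 4))    # close to the theoretical values below
print(np.round(theoretical, 4))  # 0.301, 0.1761, 0.1249, ...
```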
5. The Hypothesis of the Scale and Base Invariance for the Data
That Satisfy Benford’s Law

The event space ℳ_B on which Definition 2.1 is defined, in the base B > 1, is given by

ℳ_B = { ⋃_{k=−∞}^{+∞} S·B^k, for all Borel S ⊆ [1, B) }

The event space ℳ_B is called the mantissa algebra; it is a σ-algebra, a sub-σ-field of the Borel σ-field.
So, for any set E the following holds:

E ∈ ℳ_B  ⇔  E = ⋃_{k=−∞}^{+∞} S·B^k,  S ∈ ℬ([1, B))

where ℬ([1, B)) denotes the Borel σ-field on [1, B).


This σ-algebra ℳ_B has the following properties:

a) each non-empty set in ℳ_B is infinite, with accumulation points at 0 and +∞;
b) ℳ_B is closed under scalar multiplication (e > 0, E ∈ ℳ_B ⇒ e·E ∈ ℳ_B);
c) ℳ_B is closed under integral roots, but not under integral powers (m ∈ N, E ∈ ℳ_B ⇒ E^{1/m} ∈ ℳ_B);
d) ℳ_B is self-similar in the sense that E ∈ ℳ_B ⇒ B^m·E ∈ ℳ_B for all m ∈ Z.

Property b) underlies the scale invariance hypothesis, while properties c) and d) underlie the base invariance hypothesis for occurrences whose realizations satisfy Benford's Law and for numerical data that exhibit the properties of Benford's Law.
The hypotheses of scale and base invariance for numerical data can be formulated, though not strictly mathematically, in the following way:
Scale invariance hypothesis: if some random variable X satisfies Benford's Law, then the random variable a·X, where a > 0, a ∈ R, also satisfies this law; that is, numerical data with Benford's Law properties remain coherent with those properties after multiplication by a positive scalar.
This hypothesis has been tested both theoretically and practically and it has passed all such tests. In 1995 Hill showed that, in the base B = 10, a probability measure on the field (R+, ℳ_10) is scale invariant if and only if it satisfies Benford's Law [5].
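The scale invariance hypothesis is easy to check empirically: rescaling Benford-distributed data by a positive constant leaves the first-digit distribution essentially unchanged. A small illustrative sketch (Python/NumPy; the multiplier π is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(7)

def first_digit_frequencies(x):
    # leading digit read from the scientific-notation representation of each value
    d = np.array([int(f"{abs(v):.8e}"[0]) for v in x])
    return np.array([(d == k).mean() for k in range(1, 10)])

# Benford-distributed positive data: log10(X) mod 1 is uniform on (0, 1)
x = 10 ** rng.uniform(0.0, 4.0, 50_000)

print(np.round(first_digit_frequencies(x), 3))            # close to Benford
print(np.round(first_digit_frequencies(np.pi * x), 3))    # rescaled data: still close to Benford
```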
The base invariance hypothesis is much more delicate. It tries to answer the question of whether the Benford's Law property of some random occurrence, or of a set of numerical data, detected in a base B > 1 remains valid when the data are converted into another base. It was shown in [5] that the Benford's Law property transfers into another base for Borel sets, while for point sets this cannot be guaranteed; in that case a combination of probabilities that satisfy Benford's Law and Dirac probability measures with constant one helps to preserve the property under a change of base.
It has also been shown that if some random variable satisfies the scale invariance hypothesis then it also satisfies the base invariance hypothesis, but the converse does not hold.

6. Hypothesis of the Sum Invariance

A random variable is sum invariant in the sense that, for any natural number n ∈ N, the expected sum of the mantissas of all entries starting with a fixed n-tuple of significant digits is the same as for any other n-tuple. It has been shown that a random variable is sum invariant only if it satisfies Benford's Law.
The sum invariance of Benford's Law data is proven in the sense of expected sums; in real Benford's Law data the sums are not exactly equal, as some variance exists. The analysis of this variance leads to results that enable the practical use of this property on data that satisfy Benford's Law.
7. Benford's Law as a Consequence of Closed System Aspiration to the Maximum Entropy

One of the universal laws in the universe is the law of maximum entropy, which states that all isolated systems in the universe strive towards maximum entropy, or disorder. This is the state in which all possibilities are equally probable. It has been shown [6-8] that Benford's Law is a consequence of that universal law. That is why, even in the most contemporary Theory of Everything, Benford's Law is analyzed as one of the potential paths to that comprehensive solution. Moreover, the fact that Benford's Law is derived from the maximum entropy of the system in which it arises explains the successful implementation of this law across a large spectrum of natural and social events.

TESTING OF BENFORD'S LAW USING A SET OF NUMERICAL DATA

Let us take a sample set comprising N numerical data, X̄ = {x̄_i | i = 1, 2, …, N}. A number of tests have been developed to investigate whether the significant digits of such a set deviate from Benford's Law. All of these tests are constructed within the framework of statistical hypothesis testing, with the null hypothesis

H0: the significant digits of the sample satisfy Benford's Law

and the alternative hypothesis, which is accepted or rejected depending on the outcome of the test,

H1: the significant digits of the sample do not satisfy Benford's Law.

The test significance level is α = 0.05, which means that if the alternative hypothesis is accepted we can say that the significant digits do not satisfy Benford's Law. Over a large number of iterations this indicates an anomaly in the observed data and the possibility of an error (intentional or unintentional), given that we are considering data assumed to be realizations of events that satisfy Benford's Law. If we do not accept the alternative hypothesis, this does not prove that the data set satisfies Benford's Law; it may simply be a consequence of not having an adequate level of evidence at the chosen significance.
Many tests have been developed and used for testing Benford's Law on numerical data (the Mean Absolute Deviation test, the Pearson χ2 test, the Kuiper test, the Z-test, the test of sum invariance, the test of distortion factors, the second-order test, the test of doubling the digits, the test of the last two digits, ...).
In this chapter we describe the Mean Absolute Deviation test and the Pearson χ2 test.

 Test of the Mean Absolute Deviation (MAD)

The Mean Absolute Deviation (MAD) is calculated for the first digit, the second digit and the first two digits as

MAD(C_1) = (1/9) Σ_{c_i=1}^{9} |P̄(C_1 = c_i) − P(C_1 = c_i)|

MAD(C_2) = (1/10) Σ_{c_i=0}^{9} |P̄(C_2 = c_i) − P(C_2 = c_i)|

MAD(C_1C_2) = (1/90) Σ_{c_i=1}^{9} Σ_{c_j=0}^{9} |P̄(C_1 = c_i, C_2 = c_j) − P(C_1 = c_i, C_2 = c_j)|

where P̄(C_1 = c_i), P̄(C_2 = c_i) and P̄(C_1 = c_i, C_2 = c_j) denote the empirical relative frequencies of the first, the second and the first two significant digits in the sample X̄ = {x̄_i | i = 1, 2, …, N}, while P(C_1 = c_i), P(C_2 = c_i) and P(C_1 = c_i, C_2 = c_j) denote the corresponding Benford's Law probabilities according to Definition 2.2.
For this test there are no exactly mathematically derived critical values; instead, critical values obtained from experience with practical tests are used, and they are given in Table 1 [9].
Table 1. Critical values for the Mean Absolute Deviation test

Decision                First digit     Second digit    First two digits
Close agreement         < 0.004         < 0.008         < 0.0006
Accepted agreement      0.004-0.008     0.008-0.012     0.0006-0.0012
Marginal agreement      0.008-0.012     0.012-0.016     0.0012-0.0018
No agreement at all     > 0.012         > 0.016         > 0.0018
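A minimal sketch of the MAD(C1) computation and of its interpretation according to Table 1 (Python; the threshold values come from Table 1, everything else, including function names and the test data, is an illustrative assumption):

```python
import numpy as np

BENFORD_C1 = np.log10(1 + 1 / np.arange(1, 10))   # P(C1 = 1..9) in base 10

def first_digit(values):
    """Leading significant digit of each (non-zero) value, read from scientific notation."""
    return np.array([int(f"{abs(float(v)):.8e}"[0]) for v in values])

def mad_c1(values):
    d = first_digit(values)
    empirical = np.array([(d == k).mean() for k in range(1, 10)])
    return float(np.mean(np.abs(empirical - BENFORD_C1)))

def mad_c1_decision(mad):
    """Decision for the first digit according to the critical values in Table 1."""
    if mad < 0.004:
        return "close agreement"
    if mad < 0.008:
        return "accepted agreement"
    if mad < 0.012:
        return "marginal agreement"
    return "no agreement at all"

sample = 10 ** np.random.default_rng(1).uniform(0, 3, 5000)   # Benford-conforming test data
m = mad_c1(sample)
print(m, mad_c1_decision(m))
```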

We tested the relative frequency of occurrence of the type I error in the base B = 10 as follows: we generated, 100 times, Benford's Law data series of different lengths (from 100 to 10,000 values) and, using the "No agreement at all" thresholds from Table 1, determined the relative frequency of the type I error. Such testing was then repeated an additional 100 times, and by the usual methods we obtained the average value and the standard deviation of the relative frequency of the type I error. The testing was done using the software MATLAB, and the results are shown in the following figures.
From Figures 6 and 7 we can conclude that the relative frequency of the type I error for the Mean Absolute Deviation test of the first digit is acceptable for data series longer than 1,000 values, and for the first two digits for data sets longer than 3,000 values. For shorter data series, the relative frequency of the type I error shows that these tests are not applicable.
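The Monte Carlo experiment described above can be sketched as follows (Python/NumPy; the chapter's own code was written in MATLAB, and the whole 100-replication experiment was repeated another 100 times to obtain the mean and standard deviation of the estimated type I error rate):

```python
import numpy as np

rng = np.random.default_rng(0)
BENFORD_C1 = np.log10(1 + 1 / np.arange(1, 10))

def mad_c1(values):
    d = np.array([int(f"{v:.8e}"[0]) for v in values])      # first significant digit
    emp = np.array([(d == k).mean() for k in range(1, 10)])
    return np.mean(np.abs(emp - BENFORD_C1))

def type1_error_rate(n, reps=100, threshold=0.012):
    """Fraction of Benford-conforming samples of size n flagged as 'no agreement at all'."""
    rejections = 0
    for _ in range(reps):
        sample = 10 ** rng.uniform(0.0, 1.0, n)              # exact Benford data (mantissas B**U)
        if mad_c1(sample) > threshold:
            rejections += 1
    return rejections / reps

for n in (100, 500, 1000, 5000, 10000):
    print(n, type1_error_rate(n))
```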

 Pearson χ2 Test

The Pearson χ2 test is based on the statistic Σ_{i=1}^{k} (O_i − E_i)^2 / E_i ~ χ^2_{k−1}, where O_i denotes the observed (sample) frequency and E_i the expected frequency of the i-th of the k classes into which the observations are classified. Here the number of classes equals the number of digit values under analysis (k = 9 for the analysis of the first digit, k = 10 for the second digit, k = 90 for the first two digits), so the test statistics are:

N · Σ_{c_i=1}^{9} (P̄(C_1 = c_i) − P(C_1 = c_i))^2 / P(C_1 = c_i) ~ χ^2_8

N · Σ_{c_i=0}^{9} (P̄(C_2 = c_i) − P(C_2 = c_i))^2 / P(C_2 = c_i) ~ χ^2_9

N · Σ_{c_i=1}^{9} Σ_{c_j=0}^{9} (P̄(C_1 = c_i, C_2 = c_j) − P(C_1 = c_i, C_2 = c_j))^2 / P(C_1 = c_i, C_2 = c_j) ~ χ^2_{89}
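For the first digit, the statistic above is equivalent to the usual Σ(O − E)²/E form with observed and expected counts. A minimal Python sketch (using SciPy for the χ² distribution; function names are ours):

```python
import numpy as np
from scipy.stats import chi2

def chi_square_first_digit(values):
    """Pearson chi-square statistic (8 degrees of freedom) of the first digits against Benford's Law."""
    d = np.array([int(f"{abs(float(v)):.8e}"[0]) for v in values])
    n = d.size
    observed = np.array([(d == k).sum() for k in range(1, 10)])
    expected = n * np.log10(1 + 1 / np.arange(1, 10))
    statistic = float(((observed - expected) ** 2 / expected).sum())
    p_value = float(chi2.sf(statistic, df=8))   # compare with the critical value for alpha = 0.05
    return statistic, p_value

data = 10 ** np.random.default_rng(3).uniform(0, 2, 6000)   # illustrative Benford-conforming data
print(chi_square_first_digit(data))
```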

Figure 6. Average relative frequency and standard deviation of the relative frequency of the type I error for the MAD(C1) test in the base B = 10, as a function of the number of data in the set.
Figure 7. Average relative frequency and standard deviation of the relative frequency of the type I error for the MAD(C1C2) test in the base B = 10, as a function of the number of data in the set.
Some Aspects of the Application of Benford’s Law … 105

Critical values are taken from the χ^2_n tables for the significance level α = 0.05, or for some other chosen level of significance. This test is very sensitive to deviations from Benford's Law, and it is also sensitive to the enlargement of the sample size N.
In the same way as for the relative frequency of the type I error of the MAD tests, we tested the relative frequency of the type I error for the χ2 test, for the first digit and for the first two digits, with critical values taken from the χ^2_n tables for the significance level α = 0.05.
Results are shown in the following Figures:

Figure 8. Average relative frequency and standard deviation of the relative frequency of the type I error for the χ2(C1) test in the base B = 10, as a function of the number of data in the set.
Figure 9. Average relative frequency and standard deviation of the relative frequency of the type I error for the χ2(C1C2) test in the base B = 10, as a function of the number of data in the set.
Figure 10. Probability distributions of the first, the second and the first two significant digits - Gross domestic product, current prices.

Figures 8 and 9 show that the relative frequency of the type I error is acceptable for all observed lengths of the data series.
We undertake the practical implementation of Benford's Law by testing it on data gathered from the International Monetary Fund, World Economic Outlook Database. We use two groups of data: the first group contains data on the Gross domestic product and the other group comprises the Current account balances for 184 countries in the period from 1980 to 2016. The first group contains 6,266 useful data points and the other group 6,243. We use program code that we created in the software package MATLAB. The results are shown in Table 2, together with the graphs of the probability distributions of the first, second and first two digits for both groups of data.
The tests show that, for the Gross domestic product data, neither Benford's MAD nor the χ2 test for the first, the second or the first two digits indicates any irregularity of the data.
Regarding the Current account balance data, the MAD and χ2 tests for the first digit do not show irregularities and the MAD test for the second digit shows acceptable differences, while the χ2 test of the second digit and the MAD and χ2 tests of the first two digits show, with near certainty, that the data do not comply with Benford's Law. This result can be a consequence of the way in which the Current Account Balances are calculated, since they are the difference of two groups of data that are both Benford's Law groups. But it could also be a consequence of possibly irregular Current Account Balances reported by some countries.

Table 2. Benford's MAD and χ2 tests for the first, the second and the first two digits

                    Gross domestic product, current prices    Current account balance
MAD (C1)            0.0024                                    0.003
MAD (C2)            0.0034                                    0.0084
MAD (C1C2)          0.000954                                  0.0019
χ2_8 (C1)           7.9118                                    6.0233
χ2_9 (C2)           12.5121                                   79.3184
χ2_89 (C1C2)        75.758                                    299.1518
P-value (C1)        0.5755                                    0.7598
P-value (C2)        0.2598                                    4.344×10^-24
P-value (C1C2)      0.8582                                    3.52×10^-35

Figure 11. Probability distributions of the first, the second and the first two significant digits - Current account balance.
SOME EXAMPLES OF BENFORD’S LAW IMPLEMENTATION


The general conditions that a set of numerical data should meet in order to satisfy Benford's Law have been the subject of analysis in many papers. Some of these conditions are as follows:

 the data must describe a similar phenomenon, i.e., have the same nature or the same set of sources that generate them (financial transactions, the results of various measurements of length, volume, etc.);
 there should be no imposed boundaries on the minimum or maximum values;
 the data must be of an incidental nature, rather than previously generated according to a pattern; examples of such patterned data are serial numbers, phone numbers, personal identification numbers, social security numbers, tax numbers, car registrations, account numbers, etc.;
 the data should comprise more small than large numbers, and the mean should be greater than the median (positive asymmetry); the higher the ratio of the mean to the median, the more suitable the data are for this analysis;
 the data should be reported in the same units of measurement;
 the data should span at least two orders of magnitude (a quick check of the last two conditions is sketched below).
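A rough illustrative pre-check of a data set's suitability along these lines might look as follows (Python/NumPy; the function name and thresholds are our own reading of the rules of thumb listed above, not a prescription from the chapter):

```python
import numpy as np

def benford_suitability(values):
    """Rough pre-checks of a data set's suitability for Benford analysis (positive values only)."""
    x = np.asarray(values, dtype=float)
    x = x[x > 0]
    return {
        "sample_size": int(x.size),
        "mean_over_median": float(np.mean(x) / np.median(x)),        # > 1 indicates positive asymmetry
        "orders_of_magnitude": float(np.log10(x.max() / x.min())),   # should be at least about 2
    }

print(benford_suitability([120, 45, 3.2, 870, 15000, 9.8, 230, 61]))
```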

Examples of events that generate data satisfying Benford's Law are: the prices of securities on the stock exchange, financial transactions, bank card transactions, some processes in telecommunication and computer systems, processes described by recurrent sequences (the Fibonacci sequence, fractals), natural demographic and population growth processes of plants and animals, and many others. The frequencies of categorical data can also be subject to Benford's Law analysis.
Some examples of the usage of this law are:

 In mathematics [10-20], many dynamical systems appear to generate data that obey Benford's Law. Many recurrence relations, as well as the solutions of certain classes of difference equations, are also associated with Benford's Law, as are Markov chains. The first digits of prime numbers also follow Benford's Law, although in this case the observed sample must be very large (n > 10^6 prime numbers). The Newton iterative procedure generates data that obey this law as well, and so on.
 In the area of economics [21-31] we can find some famous examples of the application of Benford's Law. Mark J. Nigrini analyzed tax returns, which marked the beginning of the use of this law for the detection of fraud. The basic assumption is that a frequency of significant digits that does not follow Benford's Law suggests a possible irregularity in the transactions. This method was quickly adopted by some supervisory authorities and recognized as a valid audit procedure, so there are now standard software packages that use Benford's Law; it has a highly regarded role in detecting fraud, and fraud detection is the most widely used area of application of this law. The method was soon applied to the detection of fraud with credit cards and other forms of electronic transaction fraud. Detection of fraud is not its only application: creative accounting causes many material misstatements in financial reports, which is another reason why Benford's Law is needed. It is applied in the analysis of structural deficiencies in macroeconomic data, the analysis of investment programs, accounting reports, traffic reports, forensic accounting, etc.
 In informatics and computer science it has been shown, based on Benford's Law, that a computer design that minimizes storage space is based on the base B = 8. The law is also used for the analysis of file sizes in folders and of the duration of various processes in multi-user environments, as well as in cyber security, neural networks and digital forensics [32-42], etc.
 In cryptology Benford's Law is used in steganography, in stylometry (the analysis of the linguistic styles and writing habits of individuals) and in image forensics [43-53].

This law is also applied in a number of other problems in various fields, in


almost all natural and social events and random processes such as: biology,
psychology, medicine, demographics, physics, astronomy, geology, time series
analysis [54-69], etc. ...
Benford's Law cannot be used in the following cases:

 data analysis where the data have been normalized by some procedure (min-max normalization, standardization using standard deviations, etc.);
 if the sample has several dimensions and in some dimension data are missing;
 if only a part of the elements of the sample has been multiplied by a number; the data must therefore be presented in a single measuring system.

CONCLUSION
Benford's Law determines the probability of occurrence of significant digits in the realizations of random variables and in the analysis of numerical data, over a very wide range of events. Under certain conditions this law has a universal character: it is valid in many different systems and its application is found in almost all natural and social phenomena, as well as in the analysis of events that have some system of measurement.
So far it has been used the most in forensics, in the analysis of intentional fraud, especially in various financial statements, as well as in interpolation process analysis (Newton's iterative process), in the optimization of computer systems, in accelerating algorithms, in deciphering hidden messages, and in many other analyses. As yet it has not been exactly mathematically proven why numerical data under certain conditions follow the Benford's Law distribution; for some specific events such proofs exist, but not in general. In a modeled problem of the existence of the Benford's Law distribution, it is shown to be a consequence of the universal aspiration of closed systems to move to the state of maximum entropy, which is present in all natural and social phenomena that can, in a given analysis, be considered closed. The mathematical implications of this law are defined on a specific Borel σ-field, called the mantissa algebra.
Our Monte Carlo simulations of the relative frequency of the type I error point to the inapplicability of the Mean Absolute Deviation test of the first digit for data series whose length does not exceed 1,000 values, as well as the inapplicability of the Mean Absolute Deviation test of the first two digits for series whose length does not exceed 3,000 values. They also confirm that the Pearson χ2 test of the first digit and of the first two digits is applicable for all lengths of data sets, where applicable means that the type I error is acceptable.
We carry out the practical implementation of Benford's Law by testing it on two groups of data: the first group contains data on the Gross Domestic Product and the other comprises the Current Account Balances for 184 countries in the period from 1980 to 2016. The tests show that, for the Gross Domestic Product data, neither Benford's MAD nor the χ2 test for the first, the second or the first two digits indicates any irregularity of the data. Regarding the Current Account Balance data, the MAD and χ2 tests for the first digit do not show irregularities and the MAD test for the second digit shows acceptable differences, while the χ2 test of the second digit and the MAD and χ2 tests of the first two digits show, with near certainty, that the data do not comply with Benford's Law. This result can be a consequence of the way in which the Current Account Balances are calculated, since they are the difference of two groups of data that are both Benford's Law groups, but it could also be a consequence of possibly irregular Current Account Balances reported by some countries.
We believe that the implementation of this law will be more prominent,
and it will be used in practice, and also in the scientific analysis of a multitude
of phenomena in many different areas.

REFERENCES
[1] Newcomb, S. 1881. “Note on the frequency of use of the different digits
in natural numbers.” American Journal of Mathematics 4 (1): 39-40.
[2] Frank Benford. 1938. The Law of Anomalous Numbers, Proceedings of
the American Philosophical Society, Vol. 78, No. 4, p. 551-572.
[3] Arno Berger, Theodore P. Hill. 2011. “A Basic Theory of Benford’s
Law,” Probability Surveys, Vol. 8, 1-126.
[4] Drien Jamain. 2011. “Benford’s Law,” Dissertation Report, Department
of Mathematics, Imperial College, London.
[5] Theodore P. Hill. 1995. “A Statistical Derivation of the Significant Digit
Law,” Statistical Science, Vol. 10, No 4, 354-363.
[6] Michaele Ciofalo. 2009. “Entropy, Benford’s first Digit Law and The
Disribution of Everything,” Palermo, Italy: Dipartamento di Ingenieria
Nucleare Universita degli Studi di Palermo.
[7] Oded Kafri. 2009. "Entropy Princip in Direct Derivation of Benford's Law," Varicom Communications.
[8] Sofia B. Villas-Boas, Qiuzi Fu, George Judge, 2015. “Is Benford’s Law
a Univerzal Behavioral Theory?,” Econometrics, 3, 698-708.
[9] Philip D. Drake, Mark J. Nigrini. 2000. “Computer Assisted Analytical
Procedures Using Benford’s Law,” Journal of Accounting Education 18,
127-146.
[10] Berger, A and Eshun, G. 2014. A characterization of Benford’s Law in
discrete-time linear systems. Journal of Dynamics and Differential
Equations, Springer; published online 15 September 2014. ISSN/ISBN:
1040-7294. DOI: 10.1007/s10884-014-9393-y.
[11] Berger, A and Eshun, G. 2014. Benford solutions of linear difference
equations. Theory and Applications of Difference Equations and
Discrete Dynamical Systems, Springer Proceedings in Mathematics and
Statistics Volume 102, pp. 23-60. ISSN/ISBN: 978-3-662-44139-8.
DOI: 10.1007/978-3-662-44140-4_2.
[12] Berger Arno and Hill Theodore P. (2015). An Introduction to Benford’s
Law. Princeton University Press: Princeton, NJ. ISSN/ISBN:
9780691163062.
[13] Berger, A, Hill, TP, Kaynar, B and Ridder, A. 2011. Finite-state Markov
Chains Obey Benford’s Law. SIAM Journal of Matrix Analysis and
Applications 32(3), 665-684.
[14] Dickinson, JR. 2002. A universal mathematical law criterion for
algorithmic validity. Developments in Business Simulation and
Experiential Learning 29, 26-33.
[15] Eliahou, S, Massé, B and Schneider, D. 2013. On the mantissa
distribution of powers of natural and prime numbers. Acta Mathematica
Hungarica, Volume 139, Issue 1 (2013), pp. 49-63, DOI: 10.1007/s1047
4-012-0244-1. ISSN/ISBN:0236-5294.
[16] Massé, B and Schneider, D. 2014. The mantissa distribution of the
primorial numbers. Acta Arithmetica 163, pp. 45-58. ISSN/ISBN: 0065-
1036. DOI: 10.4064/aa163-1-4.
[17] Massé, B and Schneider, D. 2015. Fast growing sequences of numbers
and the first digit phenomenon. International Journal of Number Theory
11:705, pp. 705-719. DOI:10.1142/S1793042115500384.
[18] Miller Steven J (ed.) (2015). Benford’s Law: Theory and Applications.
Princeton University Press: Princeton and Oxford. ISSN/ISBN: 978-0-
691-14761-1.
[19] Wojcik, MR. 2014. A characterization of Benford's law through generalized scale-invariance. Mathematical Social Sciences, Volume 71,
September 2014, pp. 1-5. DOI:10.1016/j.mathsocsci.2014.03.006.
[20] Berger, A. and Eshun, G. 2016. A Characterization of Benford’s Law in
Discrete-Time Linear Systems, Journal of Dynamics and Differential
Equations, June 2016, Vol. 28, Issue 2, pp. 431-469, doi:10.1007/s10884
-014-9393-y.
[21] Gava, AM and Vitiello, L. 2014. Inflation, Quarterly Balance Sheets and
the Possibility of Fraud: Benford’s Law and the Brazilian case. Journal
of Accounting, Business and Management Vol. 21 Issue 1, pp. 43-52.
ISSN/ISBN: 0216-423X.
[22] Geyer, D and Drechsler, C. 2014. Detecting Cosmetic Debt Management
Using Benford’s Law. The Journal of Applied Business Research, 2014,
30 (5), pp. 1485-1492.
[23] Lin, F and Wu, S-F. 2014. Comparison of cosmetic earnings
management for the developed markets and emerging markets: Some
empirical evidence from the United States and Taiwan. Economic
Modelling, Vol. 36, pp. 466-473. DOI:10.1016/j.econmod.2013.10.002.
[24] Mir, TA, Ausloos, M and Cerqueti, R. 2014. Benford’s law predicted
digit distribution of aggregated income taxes: the surprising conformity
of Italian cities and regions. Eur. Phys. J. B (2014) 87: 261.
ISSN/ISBN:1434-6028. DOI:10.1140/epjb/e2014-50525-2.
[25] Nigrini Mark J. 2012. Benford’s Law: Applications for Forensic
Accounting, Auditing, and Fraud Detection. John Wiley and Sons,
Hoboken, New Jersey. ISSN/ISBN:978-1-118-15285-0.
[26] Nigrini Mark J. 2011. Forensic Analytics: Methods and Techniques for
Forensic Accounting Investigations. John Wiley and Sons, Hoboken,
New Jersey. ISSN/ISBN:978-0-470-89046-2.
[27] Özer, G and Babacan, B. 2013. Benford’s Law and Digital Analysis:
Application on Turkish Banking Sector. Business and Economics
Research Journal Vol. 4 Issue 1, pp. 29-41. ISSN/ISBN: 1309-2448.
[28] Tödter Karl-Heinz. 2015. Benford’s Law and Fraud in Economic
Research. In: Steven J. Miller (ed.), Benford’s Law: Theory and
Applications, Princeton University Press, Princeton and Oxford, pp. 244-
256. ISSN/ISBN:978-0-691-14761-1.
[29] Marijana Ljubic. 2011. “Stress Testing As An Instrument of Risk
Control In Banks,” Megatrend Review, 2011. Vol. 8(1), 303-323.
[30] G. Knežević, Mizdraković, V. Arežina, N. 2012. "Management as a cause and effect of creative accounting suppresion," Management, FON,
(17), 62, 5-11.
[31] S. Muminovic, V. Pavlovic, 2013. “Application Benford’s Law in
Forensic Accounting,” Revizor, No 61, 59-69.
[32] Arshadi, L and Jahangir, AH. 2014. Benford’s Law behavior of Internet
traffic. Journal of Network and Computer Applications, Volume 40,
April 2014, pp. 194-205. ISSN/ISBN: 1084-8045. DOI:10.1016/j.jnca.
2013.09.007.
[33] Goldoni, E, Savazzi, P and Gamba, P. 2012. A novel source coding
technique for wireless sensor networks based on Benford’s Law. 2012
IEEE Workshop on Environmental Energy and Structural Monitoring
Systems (EESMS), 26-28 Sept. 2012, pp 32-34. ISSN/ISBN: 978-1-
4673-2739-8.
[34] Koskivaara, E. 2004. Artificial neural networks in analytical review
procedures. Managerial Auditing Journal 19(2), pp. 191-223.
[35] Krakar, Z and Žgela, M. 2009. Application of Benford’s Law in
information systems auditing. Journal of Information and
Organizational Sciences, 33(1), pp. 39-51.
[36] Odueke, A and Weir, GRS. 2012. Triage in Forensic Accounting using
Zipf’s Law. Issues in Cybercrime, Security and Digital Forensics.
Edited by G. R. S. Weir and A. Al-Nemrat. Glasgow. University of
Strathclyde Publishing. 2012. pp. 33-43.
[37] Ravikumar, B. 2008. The Benford-Newcomb Distribution and
Unambiguous Context-Free Languages. International Journal of
Foundations of Computer Science 19(3), 717-727. ISSN/ISBN: 0129-
0541.
[38] Shi, YQ. 2009. First digit law and its application to digital forensics. In:
H. J. Kim, S. Katzenbeisser, and A.T.S. Ho (Eds.), Digital
watermarking: Seventh International Workshop (IWDW ’08) (pp. 444-
456). Berlin: Springer-Verlag. ISSN/ISBN:978-3-642-04437-3. DOI:10.
1007/978-3-642-04438-0_37.
[39] Winter, C, Schneider, M and Yannikos, Y. 2011. Detecting Fraud Using
Modified Benford Analysis. Advances in Digital Forensics VII, 7th IFIP
WG 11.9 International Conference on Digital Forensics, Orlando, FL,
US, January 31 - February 2, 2011, Revised Selected Papers. Gilbert
Peterson and Sujeet Shenoi (Editors). IFIP Advances in Information and
Co. ISSN/ISBN: 1868-4238. DOI: 10.1007/978-3-642-24212-0_10.
[40] Winter, C, Schneider, M and Yannikos, Y. 2012. Model-Based Digit Analysis for Fraud Detection overcomes Limitations of Benford
Analysis. Availability, Reliability and Security (ARES 2012), Seventh
International Conference, August 20-24, 2012, Prague, Czech Republic.
IEEE CS volume E4775, pages 255-261. IEEE Computer Society.
ISSN/ISBN: 978-1-4673-2244-7. DOI:10.1109/ARES.2012.37.
[41] Xu, B, Wang, J, Liu, G and Dai, Y. 2011. Photorealistic computer
graphics forensics based on leading digit law. Journal of Electronics
(China) 28(1) pp. 95-100. DOI: 10.1007/s11767-011-0474-3.
[42] Raftopoulou, P, Koubarakis, M, Stergiou, K and Triantafillou, P. 2005.
Fair resource allocation in a simple multi-agent setting: search
algorithms and experimental evaluation. International Journal on
Artificial Intelligence Tools 14(6), 887-899.
[43] Andriotis, P, Oikonomou, G and Tryfonas, T. 2013. JPEG
steganography detection with Benford’s law. Digital Investigation 9(3-
4), pp. 246-57. DOI:10.1016/j.diin.2013.01.005.
[44] Birajdar, GK and Mankar, VH. 2013. Digital image forgery detection
using passive techniques: A survey. Digital Investigation, Vol. 10, No.
3, pp. 226-245. DOI:10.1016/j.diin.2013.04.007.
[45] Zenkov, AV. 2015. Deviation from Benford’s law and identification of
author peculiarities in texts. Computer Research and Modeling, vol. 7,
no. 1, pp. 197-201 (in Russian). ISSN/ISBN: 2076-7633.
[46] Chen, W and Shi, YQ. 2009. Detection of Double MPEG Compression
Based on First Digit Statistics. In: H. J. Kim, S. Katzenbeisser, and
A.T.S. Ho (Eds.), Digital watermarking: Seventh International
Workshop (IWDW’08) (pp. 16-30). Berlin: Springer-Verlag.
[47] Fu, D, Shi, YQ and Su, W. 2007. A generalized Benford’s law for JPEG
coefficients and its applications in image forensics. Proceedings of
SPIE, Volume 6505, Security, Steganography and Watermarking of
Multimedia Contents IX, San Jose, California, January 28 - February 1,
2007, pp. 65051L-65051L-11. DOI:10.1117/12.704723.
[48] Li, XH, Zhao, YQ, Liao, M, Shih, FY and Shi, YQ. 2012. Detection of
tampered region for JPEG images by using mode-based first digit
features. EURASIP Journal on Advances in Signal Processing, 2012:
190. DOI: 10.1186/1687-6180-2012-190.
[49] Pérez-González, F, Heileman, G and Abdallah, CT. 2007. A
generalization of Benford’s Law and its application to images. European
Control Conference’2007, Kos, Greece, July 2007, pp. 3613-3619.
ISSN/ISBN: 9783952417386.
[50] Perez-Gonzalez, F, Heileman, GL and Abdallah, CT. 2007. Benford's Law in Image Processing. Image Processing, pp I-405-I-408. ICIP 2007.
IEEE International Conference. ISSN/ISBN: 1522-4880.
[51] Piva, A. 2013. An Overview on Image Forensics. ISRN Signal
Processing, vol. 2013, Article ID 496701, 22 pages, 2013. doi:10.1155/
2013/496701. ISSN/ISBN:2090-5041.
[52] Qadir, G, Zhao, X and Ho, ATS. 2010. Estimating JPEG2000
Compression for Image Forensics Using the Benford’s Law. Proc. of
SPIE Vol. 7723, pp. 77230J1-77230J10. DOI:10.1117/12.855085.
[53] Tong, S, Zhang, Z, Xie, Y and Wu, X. 2013. Image Splicing Detection
Based on Statistical Properties of Benford Model. Proceedings of the 2nd
International Conference on Computer Science and Electronics
Engineering (ICCSEE 2013), pp. 0792-0795. ISSN/ISBN: 978-90-
78677-61-1. DOI:10.2991/iccsee.2013.200.
[54] Campos, L, Salvo, AE and Flores-Moya, A. 2016. Natural taxonomic
categories of angiosperms obey Benford’s law, but artificial ones do not.
Systematics and Biodiversity, in press. ISSN/ISBN: 1477-2000 (Print)/1.
DOI:10.1080/14772000.2016.1181683.
[55] Cournane, S, Sheehy, N and Cooke, J. 2014. The novel application of
Benford’s second order analysis for monitoring radiation output in
interventional radiology. Physica Medica 30(4), pp. 413-418. DOI:10.
1016/j.ejmp.2013.11.004.
[56] De Vries, P and Murk, AJ. 2013. Compliance of LC50 and NOEC data
with Benford’s Law: An indication of reliability?. Ecotoxicology and
Environmental Safety 98 (2013) 171-178. DOI:10.1016/j.ecoenv.2013.
09.002.
[57] Kreuzer, M, Jordan, D, Antkowiak, B, Drexler, B, Kochs, EF and
Schneider, G. 2014. Brain electrical activity obeys Benford’s law.
Anesth. Analg. 118(1), pp. 183-91. DOI:10.1213/ANE.00000000000000
15.
[58] Alexopoulos, T and Leontsinis, S. 2014. Benford’s Law in Astronomy.
Journal of Astrophysics and Astronomy, 35(4), pp. 639-648.
ISSN/ISBN: 0250-6335. DOI: 10.1007/s12036-014-9303-z.
[59] Biau, D. 2015. The first-digit frequencies in data of turbulent flows.
Physica A: Statistical Mechanics and its Applications Volume 440, pp.
147-154. DOI:10.1016/j.physa.2015.08.016.
[60] Eliazar, II. 2013. Benford’s Law: A Poisson Perspective. Physica A 392
(16) pp. 3360-3373. DOI:10.1016/j.physa.2013.03.057.
[61] Geyer, A and Martì, J. 2012. Applying Benford's law to volcanology. Geology 40(4), pp. 327-330.
[62] Hill, TP and Fox, RF. 2016. Hubble’s Law Implies Benford’s Law for
Distances to Galaxies. Journal of Astrophysics and Astronomy 37(1), pp.
1-8. ISSN/ISBN: 0973-7758. DOI: 10.1007/s12036-016-9373-1.
[63] Nigrini, MJ and Miller, SJ. 2007. Benford’s Law Applied to Hydrology
Data—Results and Relevance to Other Geophysical Data. Mathematical
Geology 39(5), 469-490. ISSN/ISBN: 0882-8121.
[64] Pain, J-C. 2013. Regularities and symmetries in atomic structure and
spectra. High Energy Density Physics, Vol. 9, No. 3, pp. 392-401. DOI:
10.1016/j.hedp.2013.04.007.
[65] Sottili, G, Palladino, DM, Giaccio, B and Messina, P. 2012. Benford’s
Law in Time Series Analysis of Seismic Clusters. Mathematical
Geosciences Volume 44, Number 5 (2012), pp. 619-634. DOI:10.1007/s
11004-012-9398-1.
[66] Roukema, BF. 2014. A first-digit anomaly in the 2009 Iranian
presidential election. Journal of Applied Statistics 41(1), pp. 164-199.
DOI:10.1080/02664763.2013.838664.
[67] Pericchi, L and Torres, DA. 2011. Quick anomaly detection by the
Newcomb-Benford Law, with applications to electoral processes data
from the US, Puerto Rico and Venezuela. Statistical Science 26(4), pp.
502-16. DOI:10.1214/09-STS296.
[68] Günnel, S and Tödter, KH. 2009. Does Benford’s Law hold in economic
research and forecasting?. Empirica 36, 273-292.
[69] Breunig, C and Goerres, A. 2011. Searching for Electoral Irregularities
in an Established Democracy: Applying Benford’s Law Tests to
Bundestag Elections in Unified Germany. Electoral Studies 30(3)
September 2011, pp. 534-545.
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 5

BEHAVIOUR AND ATTITUDES VS. PRIVACY CONCERNS OF SOCIAL ONLINE NETWORKS

Gordana Savić*, PhD and Marija Kuzmanović, PhD
University of Belgrade, Faculty of Organizational Sciences, Serbia
* Corresponding author: G. Savic, Email: gordana.savic@fon.bg.ac.rs.

ABSTRACT
Online Social Networks (OSNs) are widely used for commercial and personal purposes, including entertainment. The main value of OSNs
is in their ability to facilitate and ease communication between end users.
They also enable users who are physically remote to maintain
relationships and stay up to date with current events.
Therefore, the popularity and market potential of different OSNs has
been growing over time along with the growth of user engagement and
they still continue to increase. The number of network users worldwide
reached 2.2 billion as of 2016. The leading social network is Facebook
with above 1.5 billion mainly active users worldwide, as of 2016.
According to Internet World Stat, 73.5% of the European Union citizens
use the Internet on a regular basis, while 51.24% of them use Facebook.
In Serbia, 66.20% of the population uses the Internet, but 72.51% of them
also use the most popular OSN – Facebook. This data motivated us to
conduct a survey which would determine if the OSN users in Serbia have
concerns about their privacy.
We intend to investigate the relationship between the concerns of OSN users and their behaviour and attitudes towards privacy. For that purpose, an online survey was conducted with 641 respondents in Serbia. Behaviour is investigated through groups of questions concerning the habits of using OSNs, leaving real personal data, or connecting with unknown people. Attitudes are investigated through questions about the possibility of personal data being used by the provider or a third party, photos being posted by friends, data transparency, and the possibility of customizing the network. Privacy concerns are investigated with the Westin Privacy Segmentation Index, which divides users into the groups of fundamentalists, pragmatists and unconcerned. We found that the largest group of OSN users are Privacy Pragmatists (335), followed by Privacy Fundamentalists (283), while the smallest group are the Privacy Unconcerned (23). The fundamentalists belong to a group of older OSN users who are mainly concerned for their privacy. Most of them (90.5%) have been, or are not sure whether they have been, exposed to private data harassment. In contrast, the unconcerned users are the most flexible group: they spend a lot of time on OSNs, and 69.6% of them consider that their data have not been exposed.

* Corresponding author: G. Savic, Email: gordana.savic@fon.bg.ac.rs.

Keywords: OSNs privacy, online survey, behaviour and attitudes, concerns, privacy segmentation index

INTRODUCTION
Online Social Networks (OSNs) are widely used for commercial and
personal purposes, including entertainment. The main value of OSNs is in
their ability to facilitate and ease communication between end users. They also
enable users who are physically remote to maintain relationships and stay up
to date with current events. This helps to create social capital [33].
Therefore, the popularity and market potential of different OSNs has been
growing over time along with the growth of user engagement and they still
continue to increase. As of the 4th quarter of 2014, global Internet users spent an average of 101.4 minutes per day on social networks, while the number of network users worldwide reached 2.2 billion as of 2016. The leading social network is Facebook, with almost 1.5 billion monthly active users worldwide as of 2015, followed by communication networks such as WhatsApp and Facebook Messenger and the photo-sharing social network Instagram. Recently, social networking has
demonstrated a clear shift towards mobile platforms. As of the first quarter of
2015, some 580 million Facebook users accessed the social network
exclusively via a mobile device, up from 341 million users in the
corresponding quarter of the previous year. Furthermore, according to industry
experts, in a few years most interactions on Facebook will be conducted from
a mobile device. These data are taken from Statistics and facts about Social Networks [32].
According to the Internet World Stat [38], 46.4% of the world population
uses the Internet. European countries are more developed in this respect, and their populations make extensive use of the Internet in business and private life.
Therefore, 73.5% of the EU citizens use the Internet on a regular basis, while
51.24% of them also use Facebook. Even though Serbia belongs to the group
of developing non-EU member countries, it is positioned somewhat below the European average of Internet use, with 66.20% of its population using the Internet.
However, 72.51% of them are also users of the most popular OSN – Facebook.
These figures might suggest that there are more unemployed, flexible, younger OSN users who are not concerned about their privacy in Serbia than in other countries.
Generally, OSN users are unaware of the risks and threats that exist on these networks, ranging from privacy violations and identity theft to sexual harassment, which is especially dangerous for adolescents [10]. As a
consequence, OSN users are willing to share their private data and data about
their friends. These data can include demographic data (name, age, birthplace, place of living), education, vocation, current location, photos, and private and business habits. All of these data can be harvested by operators or
third-parties and exploited for different purposes, which can jeopardize the
user’s privacy.
For example, private and personal data can be exposed, with harmful consequences, to third parties (private persons or companies), and even the government [25]. The
companies can use this data to create personalized commercials or for direct
marketing. The data can be obtained without direct access to an individual’s
online profile; personal details can be exposed without the Internet users’
knowledge. All these findings should lead to the increase in privacy concerns
of OSN users.
We intend to investigate the relationship between OSN users’ concerns,
their behaviour when using the Internet and attitudes towards privacy. The
research was done through a survey conducted with 641 respondents from Serbia in 2016. On the one hand, behaviour and attitudes are investigated through
groups of questions about the habits of using OSNs, revealing actual personal
data or data about friends, or connecting with unknown people. On the other
hand, the opinion about the possibility of using their personal data by the
provider or third party, photos posted by a friend, data transparency or the
possibility of customizing networks are investigated. Privacy concerns are
investigated using the Westin Privacy Segmentation Index [20], which divides
users into groups of fundamentalists, pragmatists and unconcerned. We
assumed that fundamentalists are older, mainly concerned about their privacy,
probably having experience with private data harassment and consequently
spending less time on the OSNs in comparison to the group of pragmatists and
unconcerned users. As opposed to that, the unconcerned users should be the
most flexible group, spending significant amounts of time on OSNs and
leaving actual data on all networks used.
This chapter is structured as follows. A comprehensive literature review
on general customer and information privacy concerns of OSNs users is
presented in the next section. The results of the survey conducted in Serbia are
given in Section 3. Finally, a conclusion and references are given at the end of
the chapter.

RELATED WORKS

Consumer Information Privacy Concerns

Extensive evidence suggests that worldwide consumers recognise the
problem of a lack of information privacy and control over personal
information [9]. The literature on privacy offers several theories on the
relationship between consumers’ information privacy concerns and their
willingness to share personal information [27]. The first studies on general
privacy concerns emerged in the late 1960s [41]. Westin’s main focus was on
individual privacy rights and control over information collected by
organizations based on individuals’ perceptions of distrust and fears of abuse
with information technology [40].
The Westin-Harris consumer privacy surveys from the late 1970s until
2004 have focused on several privacy areas, from general privacy concerns to
more focused areas in marketing, medical, and online commerce [20]. The
research resulted in a methodology known as Privacy Segmentation Index
(PSI) that divides respondents into three privacy-sensitive segments (privacy
fundamentalist, pragmatist and unconcerned) based on their level of privacy
concern. The PSI captures general privacy attitudes about consumer control,
business, and laws and regulations.
Westin studied how PSI has changed over time. He showed that during
1994-2003 the percentage of Fundamentalists in the public remained almost
the same – around 25%, but the number of Unconcerned decreased from 42%
in 1993 to 12% in 2000 and reduced further to 10% in 2003. However, the
Pragmatist group varied between 30% and 64% in 2003 [20]. Westin mentions
that a steady decrease of unconcerned consumers might be due to the fact that
people learned more about technology and also became aware of the various
means of protecting their privacy.
In their study, Krasnova, Hildebrand, and Guenther [19] show that 17.3%,
72.6% and 10.1% belonged to privacy fundamentalists, pragmatists and
unconcerned groups, respectively. Recent research suggests that
approximately 49% of individuals are Fundamentalists, 40% are Pragmatists,
and 10% are Unconcerned [42].
Wang and Petrison [39] demonstrated that certain consumers (particularly
in the older age groups) are more negative about potential threats to their
privacy than others. Sheehan and Hoy [30] showed that consumers who
believe they do not have control over their personal information are more
concerned about privacy. Malhotra et al. [24] have identified consumer
concern factors related to privacy practices and developed survey instruments
to measure user information privacy concerns such as data collection
procedures, control, and awareness of privacy practices.
Numerous other studies have analyzed privacy concern, and applied
diverse instruments for measuring it [29]. Recent studies have focused on
privacy in an online environment.
The findings indicate that consumers are willing to provide both online
and offline companies with basic information, but are more protective of
personal information and are less comfortable sharing more sensitive
information [15, 9]. Graeff and Harmon [11] pointed out that a vast majority
of consumers believed that the Internet has made it easier for someone to
obtain personal information about them. In his study, Ha [13] showed that
online users want highly visible privacy policies telling them precisely how a
company will use their personal information. Chen and Rea [5] reported more
specific protective behaviour suggesting that users will cease web site access if
too much personal information is requested when registering on the site.
Berendt et al. [3] argue that while many users have strong opinions on
privacy and state privacy preferences, they are unable to act accordingly. Once
they are in an online interaction, they often do not monitor and control their
actions sufficiently. They also state that online privacy statements seem to
have no impact on behaviour [9].

Information Privacy Concerns of OSN Users

Boyd and Ellison [4] define OSNs as web-based services that allow
individuals to create a profile and connect to friends within a bounded system
[19]. Debatin [7] emphasizes that the main purpose of participating in social
networks is the exchange of information, most of it highly personal, and the
maintenance and expansion of one’s social relationships. Thus, privacy
protection in online social networks seems to be an oxymoron.
Hugl [17] highlights the necessity to focus on multidimensional and
multidisciplinary frameworks of privacy, considering a so-called “privacy
calculus paradigm” and rethinking “fair information practices” from an
increasingly ubiquitous environment of OSNs.
Barnes [2] refers to public versus private boundaries and a so-called
“paradoxical world of privacy”: while adults are concerned about potential
privacy threats, teenagers make personal data public.
Gross and Acquisti [12] analyzed the online behaviour of students at
Carnegie Mellon University who have used a popular OSN and highlight
diverse potential attacks on various facets of privacy. Authors stated that the
informal character of online social networking and the possibility to
communicate casually enables users to manage a large number of contacts
with relatively little effort [7]. In addition, OSNs enable users to control the
impression they make on others by allowing them to decide how much they
are willing to self-disclose as well as by offering privacy settings to
strategically manage access to personal information. This is additional
motivation for users to post frequently and to voluntarily share large amounts
of personal information. Gross and Acquisti [12] conclude that only a minimal
percentage of users change the permissive default of high privacy preferences,
and personal data therefore is generously shared. On the other hand, OSNs
pose many privacy risks for their users, ranging from unauthorized use of their
information to harmful activities by other users, such as cyber-stalking,
harassment, and reputation damage [6, 16, 26].
Young and Quan-Haase [43] also draw attention to Facebook use by
undergraduate students. They found that 99.35% of the respondents use their
actual first and last name in their profile; nearly two-thirds present their sexual
orientation and interests; 83.1% provide their e-mail address; 92.2% their date
of birth, 80.5% their current town of residence, 97.7% present an image of
themselves, and 96.1% photos of friends [17]. Similarly, Tufekci [37] found
that 94.9% of Facebook users reported using their actual names, while 75.6%
disclose their relationship status.

Staddon et al. [31] consider both overall social network privacy concern
and aspects of concern related to transparency and control, with special
attention on comprehension of information sharing in the network, control
over information sharing in the network, and sharing practices of users in
relation to their friends in the network. They found that each aspect of privacy
concern is strongly associated with self-reported engagement across several
measures; users who report higher concern are less engaged while those who
report more control and comprehension over sharing of their information in
the network are more engaged.
Pew Internet Report studies demonstrate that 58% of OSN users have
restricted access to their entire OSN profile or to parts of their profiles [21,
22].
Studies on online privacy behaviour have shown that OSN users tend to be
rather careless with their personal data [7]. Although most users have a general
awareness of possible privacy risks, they do not always act accordingly. For
instance, most Facebook users have hundreds of friends, and statistically,
about one-third of users will accept complete strangers as friends [8].
A broad range of privacy paradox (attitude-behaviour dichotomy) research
finds OSN users’ actual behaviour during privacy transactions to be in
contradiction with their concerns on privacy risks when sharing personal
information [28]. Namely, OSN users showing high-levels of general privacy
or information sharing concerns are still willing to share higher amounts of
personal information [27].
Acquisti and Gross [1] demonstrated a gap between the information
participants said they cared about protecting online, and what they were
showing publicly on Facebook. Madejski, Johnson, and Bellovin [23]
measured privacy attitudes and intentions and compared these against the
privacy settings on Facebook. They also found that there are inconsistencies
between users’ sharing intentions and their privacy settings.
Haddidi and Hui [14] compared individuals’ behaviour with regard to
friendship requests by using 40 fake identities of well-known film stars and
ordinary people on Facebook. They show that usually users do not accept
random friendship requests.
Keith et al. [18] compared the intent of OSN users to share actual information with that of those who do not share; they found no significant link between privacy concerns and actual information sharing, but did find a weak relationship between sharing intentions and actual behaviour, suggesting that disclosure intentions are better predictors of actual behaviour [27].

Sutanto et al. [34] found users with privacy concerns were more than
willing to share their information for personalization benefits on a privacy safe
mobile platform. Taddicken [35] finds that social situations of ‘quid pro quo’
have a much higher impact than privacy concerns on willingness to share
personal information in social networks.

EMPIRICAL STUDY

Study Goals

Our study was broadly designed to explore respondent behaviour in using
online social networks and to determine whether there is a connection between
their attitudes (using the Westin Privacy Segmentation Index), habits and
demographics. Therefore, in this chapter, we focus on the relationship between
privacy concerns and actual behaviour of the online social network users in
Serbia.

Measurement Instrument

In April and May of 2016, an online survey was conducted in Serbia to
determine privacy concerns and behaviour. The research was conducted as an
online questionnaire since we opted for an empirical investigation of privacy
concerns and privacy-related behaviour of OSN users in Serbia.
The questionnaire was administered as a Google document that can be
easily shared online. It consisted of four sections (see Table 1). Section A
comprises the socio-demographic questions regarding gender, age, level of
education, employment status and marital status. Section B contains questions
related to general behaviour of OSN users; while Section C contains questions
related to behaviour with regard to personal and friend’s privacy. Section D
comprises questions regarding privacy concerns with OSN use. This section
also comprises three questions from the Westin Privacy Segmentation Index.
In addition, a special question relates to the experience of privacy violation
and the misuse of data on OSN.

The Westin Privacy Segmentation Index

Privacy Segmentation Index (PSI) has been widely used to measure
privacy attitudes and categorize individuals into three privacy groups:
fundamentalists, pragmatists, and unconcerned [20, 42]. It was developed by
Harris Interactive in cooperation with Alan Westin [41]. In recent years the
PSI is based on the following three statements [15, 20]:

Q1: OSN users have lost all control over how personal information is
collected and used by companies (providers).
Q2: Most companies (providers) handle the personal information they
collect about OSN users in a proper and confidential way.
Q3: Existing laws and organizational practices provide a reasonable level
of protection for OSN user privacy today.

Respondents have to rate their position regarding each statement on a
four-point scale: 1-Strongly Disagree, 2-Somewhat Disagree, 3-Somewhat
Agree, 4-Strongly Agree. Based on their responses to these three questions,
Westin used the following procedure for dividing respondents into three
categories [20].
First, responses to the individual questions are classified as follows:

 For Q1, responses of “Strongly Agree” or “Somewhat Agree” are considered privacy-concerned.
 For Q2 and Q3, responses of “Strongly Disagree” or “Somewhat
Disagree” are considered privacy-concerned.

Next, participants are categorized according to the following rules [42]:

1. Privacy fundamentalists are respondents who give privacy-concerned
responses to all questions. These respondents are the most protective
of their privacy. They feel companies should not be able to acquire
personal information for their organizational needs and think that
individuals should be proactive in refusing to provide information.
2. Privacy unconcerned are respondents who give responses that are not
privacy-concerned to all questions. These consumers are the least
protective of their privacy – they feel that the benefits they may
receive from companies after providing information far outweigh the
potential abuses of this information.

Table 1. Survey definition

Section A: Demographic data.

Section B: Actual behaviour in using OSNs (general):
 For what purposes do you use OSNs? (Business only / Private only / Both private and business)
 Frequency of performing different activities: establishing new friendships, chatting, general information sharing, informing about social events, etc. (Frequently / Occasionally / Never)
 What network is your primary choice?
 Frequency of use of particular OSNs: Facebook, Twitter, Instagram, … (5-point scale: Every day, ..., Never)
 Total daily time spent on OSNs on average (in hours)

Section C: Actual behaviour (privacy issues):
 Personal data shared at networks: actual name, profile picture, e-mail, phone number, date of birth, occupation, place of residence, relationship status (Always / Sometimes, depending on the social network / Never)
 Sharing intentions: who has access to a variety of personal information and activities? (Public / Friends / Friends of friends / Selected friends)
 Privacy-related behaviour: check in during a visit to certain places; accessing applications that are trying to collect my personal data; accessing my profile from public computers; I accept friend requests from strangers; I tag friends in photos without asking them; it bothers me when friends tag me in photos without asking me (Always / Sometimes / Never)

Section D: Privacy concern:
 Privacy Segmentation Index (three questions) (4-point scale*)
 I’m concerned about the privacy of my personal data; I’m concerned about the privacy of friends’ personal data; I’m worried that parents (superiors or colleagues) have access to my information (5-point scale: Strongly agree … Strongly disagree)

* Details are explained in the section “The Westin Privacy Segmentation Index.”

3. Privacy pragmatists are all other respondents, i.e., participants who
give a mix of privacy-concerned and not privacy-concerned
responses. These respondents weigh the potential pros and cons of
sharing information, and evaluate the protections that are in place and
their trust in the company or organization. After this, they decide
whether it makes sense for them to share their personal information.
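
To make the classification procedure concrete, the following minimal sketch (in Java; the class and method names are illustrative and not part of the original survey tooling) maps the three PSI answers, coded 1-4 on the scale described above, to the three Westin categories defined by the rules.

// WestinPsi.java - a minimal, self-contained sketch of the PSI classification
// rules described above. Answers are coded 1 = Strongly Disagree,
// 2 = Somewhat Disagree, 3 = Somewhat Agree, 4 = Strongly Agree.
public class WestinPsi {

    enum Category { FUNDAMENTALIST, PRAGMATIST, UNCONCERNED }

    // For Q1, agreement (3 or 4) counts as a privacy-concerned response.
    static boolean concernedQ1(int answer) {
        return answer >= 3;
    }

    // For Q2 and Q3, disagreement (1 or 2) counts as a privacy-concerned response.
    static boolean concernedQ2Q3(int answer) {
        return answer <= 2;
    }

    // Concerned on all three questions -> fundamentalist; on none -> unconcerned;
    // any mix of concerned and not-concerned responses -> pragmatist.
    static Category classify(int q1, int q2, int q3) {
        int concerned = (concernedQ1(q1) ? 1 : 0)
                + (concernedQ2Q3(q2) ? 1 : 0)
                + (concernedQ2Q3(q3) ? 1 : 0);
        if (concerned == 3) return Category.FUNDAMENTALIST;
        if (concerned == 0) return Category.UNCONCERNED;
        return Category.PRAGMATIST;
    }

    public static void main(String[] args) {
        // Hypothetical respondent: agrees with Q1 and Q2, disagrees with Q3 -> mixed.
        System.out.println(classify(4, 3, 2)); // prints PRAGMATIST
    }
}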

RESULTS
Invitations to participate in the study were distributed via OSNs and the responses were collected from April to May 2016. The overall sample consists of OSN users of all ages, genders and professional statuses. The final sample comprises 641 respondents who fully completed the survey. The analyses of their demographic structure and of their responses regarding privacy behaviour and concerns are given in this section.

Demography

The statistics regarding demographic data are shown in Table 2. The sample mainly consisted of women (69%). The overall sample average age is 27.4 (SD = 9.612), with respondents aged between 16 and 65. The majority of them completed high school (51.2%) or hold a university degree (39.9% in total), and they are either students (8.9% in high school and 48.2% at university) or employed (36.3%).

General Behaviour in Using OSNs

Section B of the survey has shown that a negligible number of respondents use OSNs solely for business purposes (only 8), while the largest number of respondents, 324 out of 641 (50.5%), use OSNs exclusively for private purposes. The remaining 309 respondents (48.3%) use OSNs for both private and business purposes.
Mainly students and the employed use OSNs for business purposes, while
the majority of users who use OSNs exclusively for private purposes are
mainly students, the unemployed and pensioners. This proved to be a
statistically significant difference at the level of p<0.01. The differences
related to gender and other demographic characteristics and the purpose of
using OSNs are not statistically significant. Figure 1 shows the frequency of
performing different activities at OSNs.

Table 2. Demographic data

Demographic Category Percent


Gender
Male 31%
Female 69%
Age
16-24 57.10%
25-34 17.47%
35-44 17.00%
>45 8.43%
Level of education
Primary school 8.9%
High school 51.2%
Undergraduate 18.4%
Master degree 19.0%
PhD degree 2.5%
Employment status
Students (high school) 8.9%
Students (university) 48.2%
Unemployed 6.1%
Employed 36.3%
Retired 0.5%
Relationship status
Single 44.3%
In relationship 37.6%
Married 18.1%

Another statistically significant (p<0.01) result indicates that men use
OSNs for exploring more often (55.3%) than women (38.1%). The difference
in behaviour regarding relationship factors (p<0.01) is also statistically
significant. More specifically, there is a difference between the respondents
who are married, in a relationship or single. The majority of respondents who
are married (73.3%) have never used an OSN for making new relationships.
Similarly, 62.2% of the respondents who are in a relationship have never used
an OSN for this purpose. However, 45.1% of single respondents claimed that
they have also never used an OSN for making a new relationship. As for the
Behaviour and Attitudes vs. Privacy Concerns … 133

employment status, the employed respondents use OSNs for making new
relationships in the lowest percent (31.3%), while students are leaders in this
segment with as much as 57.9% of the whole segment.
When it comes to chatting, there is a significant difference between those who are married, in a relationship and single, as well as across employment status (p<0.01). Only 2.5% of singles claimed that they do not chat, while the percentage of those who are married and do not chat is 15.5%. All high school students (100%) and most university students (97.1%) chat, while the employed chat the least (12.5% never chat) in comparison to the other groups such as the unemployed, students, etc.

Figure 1. Frequency of performing OSN activities.

The majority of the respondents use an OSN for sharing information –
only 2.7% have never shared information. The students are also leaders in this
segment. Actually, 78.6% of university students often used OSNs for sharing
information, while high school students mainly do not use them for this
purpose. This result is in full compliance with the statistics of the responses to
the question about what activities OSNs are used for. Namely, the students in
high school use OSNs for being informed about social events the least (26.3%,
p<0.01), but they often use OSNs for sharing photos and videos (more than
50%, p<0.05). In addition, about 50% of women share photos and videos, while just 4.8% of women and 11% of men do not share them at all (p<0.01). The
question about behaviour regarding the use of different applications, such as games and quizzes, reveals that only 17.8% of users in the whole sample use them, with high school students being the most frequent users (24.7%, p<0.01).
Furthermore, the majority of respondents use Facebook as their primary
choice (69.6%). This result is expected since it is in line with Internet World
Stat findings [38]. Details on the use of other OSNs are given in Figure 2.
Facebook is the most popular OSN among users aged 35-44 (80.7%). In
the youngest group of respondents (16-24), in addition to Facebook (63.7%),
the top choices are Instagram (14.8%) and WhatsApp (14.6%). Viber is the most frequently cited (about 11%) among the other specified OSNs, with significance level p<0.01. In terms of the frequency of use of particular networks, the statistics are as follows: three OSNs, Facebook, Instagram and YouTube, stand out because a large number of respondents use them every day (Figure 3). Only 1.6% of respondents have never used Facebook, while it is used on a daily basis by 76.4% of all respondents. YouTube has
never been used by only 1.1%, but 61% of the respondents use it on a daily
basis. Instagram is used on a daily basis by 30.9% of the respondents (mostly
younger respondents aged up to 34), but the share of those who never use it is
quite large (41.3%). LinkedIn is mostly used by the employed aged 35-44, but
a large proportion of respondents (64.3%) never used it. And finally, Pinterest
has never been used by 76.4% of the respondents and only 1.9% of them use it
on a daily basis. The average time per day spent on all social networks is 3.98
hours with standard deviation of 2.985. No statistically significant correlation
between the time spent on OSNs and age categories appeared.

Figure 2. OSNs primary choice (in %).


Figure 3. User distribution per most frequently used OSNs.

Behaviour in Using OSN Regarding Privacy

The first set of questions regarding privacy is related to personal data and
their public placement and online availability. For the majority of items,
respondents answered that it depended on the OSN. But there were those who
never reveal their actual personal information (such as name, birthday, etc.)
regardless of the OSN. However, there are those who always leave their actual
personal information. The detailed results are shown in Figure 4.
As for the sharing intentions, the majority of the respondents share
personal data, photos, videos and posts with all friends (Figure 5).
In the next set of questions, tagging behaviour is the most interesting. Those who are married tag friends without asking considerably less often than others (6%), while 18.75% of those who are in a relationship and 20.8% of those who are single tag friends without permission.

Figure 4. Personal data sharing (in %).

Figure 5. Distribution of sharing intentions.

It was also interesting to further explore and crosstab issues related to
tagging behaviour. The observed correlations are statistically significant at
level p<0.01:
 93 (14.5%) of the respondents who never tag others would always
mind if someone else tagged them.
 99 (15.4%) of the respondents who never tag others would be
bothered sometimes if someone else tagged them.
 48 (7.5%) of the respondents who never tag others are never bothered
to have someone else tag them.
 70 (10.9%) of the respondents who always tag others are never
bothered to have someone else tag them.
 31 (4.8%) of the respondents who always tag others would sometimes
be bothered to have someone else tag them.
 Only 10 (1.6%) of the respondents who always tag others would
always be bothered if somebody else tagged them.
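
As a rough illustration of how such a crosstab could be assembled from raw questionnaire answers, the sketch below counts responses into a 3×3 tagging table of the kind summarized above; the Response record, the 0/1/2 answer coding and the sample data are assumptions made only for this example, not the study's actual data pipeline.

// TaggingCrosstab.java - an illustrative sketch of building a tagging crosstab.
// Coding assumed for this sketch: 0 = always, 1 = sometimes, 2 = never.
import java.util.List;

public class TaggingCrosstab {

    record Response(int tagsOthers, int botheredWhenTagged) {}

    // Rows: how often the respondent tags others; columns: how often they are
    // bothered when someone else tags them.
    static int[][] crosstab(List<Response> responses) {
        int[][] table = new int[3][3];
        for (Response r : responses) {
            table[r.tagsOthers()][r.botheredWhenTagged()]++;
        }
        return table;
    }

    public static void main(String[] args) {
        List<Response> sample = List.of(
                new Response(2, 0),  // never tags, always bothered
                new Response(0, 2),  // always tags, never bothered
                new Response(1, 1)); // sometimes tags, sometimes bothered
        int[][] t = crosstab(sample);
        System.out.println("never tags & always bothered: " + t[2][0]);
    }
}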

Data Privacy and Safety Concerns

Data privacy and safety concerns are expressed through Westin Privacy
Segmentation Index [41].
Figure 6 shows the distribution of responses to the three PSI questions. In
Q1, agreement means privacy-concerned, while in Q2 and Q3, disagreement
means privacy-concerned. Therefore, the majority of respondents think that
OSN users may lose control over private information distribution (50.39%
strongly agree and 33.95% somewhat agree). Only 2.03% of the respondents
disagree with this. Such responses also indicate privacy concern.
The distributions of the responses to the other two PSI statements (Q2 and Q3) are very similar. The majority of the respondents (around 45%) somewhat disagree with the statement that
providers behave properly (Q2) and that laws protect users (Q3). On the other
hand, only 6.24% agree with statement Q2 and 4.84% agree with statement
Q3. These statistics also indicate that a high level of privacy concern exists
among OSN users in Serbia. The obtained results are in line with the concerns
level in other countries, e.g., 92% of Americans and Brits are concerned to
some extent for their privacy online [36] which is 42% more than in the
previous year. Interestingly, the average share (around 8%) of the unconcerned
online users in those countries is still slightly higher than in Serbia.
Figure 7 shows the distribution of the Westin categories. Approximately
52% of the respondents are pragmatists, 44% are fundamentalists, and nearly
4% are unconcerned.
We examined whether demographic variables predicted respondents’ Westin categories or their responses to the individual Westin questions, using the chi-square test. The results indicate that the majority of women (56.1%) are pragmatists, while most men are fundamentalists (53.8%), at statistical significance level p<0.01. Young people aged 16-24 are mostly pragmatists (59.8% of them), while users aged 25-34 are almost equally split between pragmatists and fundamentalists (49.1% and 48.2%). Respondents aged 35-44
(as high as 60.6%) and above 45 (as high as 56.6%) are mostly
fundamentalists. The highest percent of the unconcerned are young people
aged up to 24 – 69.6% at statistical significance level p<0.01. The pensioners
and the employed are mostly fundamentalists (66.7% and 58.8%), while the
students are mostly pragmatists (68.4% and 59.9%). As high as 10.3% of all
unemployed are unconcerned, and 8.8% of all students are also unconcerned.
Moreover, 56% of those who are single, and 52.7% of those in a relationship
are pragmatists, while the majority of those married are fundamentalist (56%).
These results are statistically significant at level of p<0.1.
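
For readers who wish to reproduce this kind of analysis, the sketch below runs a chi-square test of independence on a gender-by-Westin-category contingency table using the Apache Commons Math library (assumed to be on the classpath); the counts are hypothetical values roughly reconstructed from the percentages reported above, not the study's raw data.

// WestinChiSquare.java - a minimal chi-square independence test sketch.
import org.apache.commons.math3.stat.inference.ChiSquareTest;

public class WestinChiSquare {
    public static void main(String[] args) {
        // Rows: male, female; columns: fundamentalist, pragmatist, unconcerned.
        // Hypothetical counts, roughly consistent with the shares in the text.
        long[][] counts = {
            {107, 82, 10},
            {176, 253, 13}
        };
        ChiSquareTest test = new ChiSquareTest();
        double statistic = test.chiSquare(counts);   // chi-square statistic
        double pValue = test.chiSquareTest(counts);  // p-value of the test
        System.out.printf("chi2 = %.3f, p = %.4f%n", statistic, pValue);
        System.out.println("significant at 0.01: " + (pValue < 0.01));
    }
}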

Figure 6. Distribution of three Westin questions responses.


Figure 7. Distribution of the Westin categories.

We also explored the habits in using OSNs and concluded that fundamentalists use Instagram much less than pragmatists and the unconcerned. Only 22.6% of them use it on a daily basis, while this share is 39.1% among the unconcerned and 37.3% among the pragmatists (p<0.01).

Figure 8. Experience of privacy violations.


In addition, we explored the impact of abuse and violation of privacy on
OSNs (see Figure 8).
A correlation between privacy abuse experience and PSI is significant at
level p<0.1. The majority of those who have had such experiences are among
the fundamentalists (51.9%), while the majority of those who never had such
experiences are among the pragmatists (57%). Those who do not know if their privacy has been violated are almost evenly distributed between the fundamentalists (49.8%) and pragmatists (48.4%), with only a small share among the unconcerned (1.8%). Meanwhile, just 5.1% of those who have not had a bad
experience belong to the group of unconcerned.
There is no significant correlation between PSI and revealing the actual
name, phone number, email, date of birth, residence or employment status, but
there is a statistically significant correlation between posting an actual photo
and PSI. Namely, 73.9% of the privacy unconcerned always post actual photos, while the remaining 26.1% claim that it depends on the network.
The crosstab of PSI versus location check-ins is statistically significant, with the conclusion that more than half of the fundamentalists (54.8%) have never checked in at a location, while this share is much smaller (34.8%) among the unconcerned. Fundamentalists are also the most cautious: the largest share of them do not tag friends without asking first, and only 10% always tag friends (see Figure 9). Surprisingly, a greater share of pragmatists tag friends without asking first (21.5%) than of the unconcerned (13%).

Figure 9. Tagging friends within Westin categories.


Figure 10. Disagreement with tagging within Westin categories.

Figure 11. Concern about data privacy.

On the other hand, fundamentalists are much more bothered if someone tags them without asking: 25.1% are always bothered and 45.6% are bothered occasionally (Figure 10). Once again, the share of the unconcerned who do not want to be tagged without asking is unexpectedly large (17.4%). The lowest
percent of respondents who complain about tagging without asking (16.1%) is
in the group of pragmatists. Meanwhile, this is still less than the shares of
pragmatists who always tag others (21.5%).
The last section of the questionnaire addressed concerns about the privacy of personal data and photos of the users and their friends, and concerns about whether parents (superiors or colleagues) could have access to their information. Most respondents showed moderate concern (Figure 11). Respondents are most concerned about their personal data privacy (mean score 3.05, standard deviation 1.144). They are less concerned about the
privacy of information on friends (mean score 2.63, standard deviation 1.054),
and just 5.1% of them are extremely concerned. The respondents are least
concerned that parents or a superior could have insight into their activities and
posts on an OSN (mean score 2.07, standard deviation 1.245). As high as
47.7% of the respondents were assigned grade 1 (not at all concerned) and
only 5.8% had grade 5 (extremely concerned) as a response to the statement
“I’m worried that parents (superiors or colleagues) have access to my
information.”
With regard to respondents’ age, the youngest respondents (up to 24) are
mostly concerned about their personal data with an average score of 3.1, as
well as those aged 45-54 (3.2). Women are more concerned about their
personal information than men. When it comes to information about friends,
respondents aged 35-44 are the most concerned, and concern declines with age. On this issue, too, women show greater concern than men.

Figure 12. Concern about data privacy within Westin segments.

With the third question from this group (“I’m worried that parents
(superiors or colleagues) have access to my information”), the situation is
similar. The youngest respondents are the most concerned, although even they could be described as not particularly worried, because their average score is only 2.11, which indicates that most of them replied that they are only slightly concerned.
A statistically significant correlation (p<0.01) is observed between PSI
and the level of concerns regarding personal information privacy (Figure 12).
None of the unconcerned assigned score 5 (extremely concerned), while a very
small number of them are very concerned (8.7%). The average score for this
group of respondents is 2.34. The distribution of responses for fundamentalists
and pragmatists is observed to be almost normal. The average score for
pragmatists is 2.96 and 3.21 for the fundamentalists. Only 6% of fundamentalists are not worried at all, and as many as 16.6% are extremely worried.
There is also a correlation (p<0.01) between the PSI and the level of
concern regarding friends’ privacy. Again, none of the unconcerned assigned
score 5, i.e., no one is extremely concerned, while a very small number of
them are very concerned (8.7%), while as many as 30.4% of them are not concerned at all. The average score over this segment of respondents was 2.13. On this issue, the pragmatists have shown less concern for the privacy of friends’ data than for the privacy of their own personal data (average score 2.51): only 3.6% are extremely concerned, and as many as 19.7% are absolutely unconcerned. Fundamentalists were again the most concerned, with an average grade of 2.82. Among fundamentalists, only 6% are not at all worried and as many as 16.6% are extremely worried.
When it comes to concerns that parents or superiors could have access to
data, images and the information shared on OSNs, there is no statistically
significant difference between the members of the three PSI segments.

CONCLUSION
The study focuses on understanding the relationship between the privacy concerns of OSN users, their attitudes, and their actual behaviour. The results reveal that the general behaviour of OSN users in Serbia mainly depends on marital and employment status, i.e., on their general occupation and interests. Namely, singles use OSNs for establishing new relationships and chatting, while university students most often use OSNs for sharing information and photos, informing themselves about social events, playing games
and solving quizzes. Furthermore, the majority of the respondents (72.51%)
use Facebook, and for a large share of 69.6% it is their primary choice. The
results are in line with Internet World Stat findings [38], although the percentage in Serbia is considerably higher. According to that source, 46.4% of Internet users worldwide prefer and use Facebook. This means that the share of OSN users in Serbia is above the world average, especially regarding Facebook. Having
that in mind, the main issue should be their online privacy protection and
concerns.
Meanwhile, as with other research in the literature [28, 27], our survey
discloses a privacy paradox (concerns-behaviour dichotomy). According to the
Privacy Segmentation Index [20, 42], Serbian OSN users are almost evenly
distributed into groups of pragmatists and fundamentalists, and only nearly 4%
are unconcerned. The results from Serbia are in line with the recent results
which suggested that approximately 49% of individuals are fundamentalists,
40% are pragmatists and around 10% are unconcerned according to [42].
There are more fundamentalists among older users (above 35 years of age) and
married people than in the other groups, as expected. Apparently, a lower
percent of Serbian OSN users are unconcerned. Therefore, we expected that
users in Serbia would be more cautious regarding data privacy and behaviour
than in other parts of the world. However, the actual online behaviour is in
contradiction with user concerns when sharing personal information. There is
no significant correlation between PSI categories and revealing the actual
name, phone number, e-mail, date of birth, residence or employment status.
Namely, actual behaviour of OSN users revealed that most respondents share
actual personal data, photos, videos and posts, mainly with all friends.
The results of our study indicate that most users in Serbia either always share actual personal data or their sharing behaviour depends on the OSN currently used. More precisely, 98% of the respondents use their actual first and last name and 98% post an actual photo in their profiles. These results are in line with the findings of Young and Quan-Haase [43] and Tufekci [37]. Unlike Tufekci [37], who reports that as many as 75.6% of respondents reveal their relationship status, our survey showed that only 20% of respondents always do so, while as many as 40% of OSN users have never revealed their relationship status.
However, the survey revealed that respondents were moderately concerned for their own data privacy (mainly the youngest respondents), slightly less concerned for their friends’ data privacy, and least concerned that parents or a superior could have insight into their activities and posts on an OSN. Their
experience with the misuse of data was in line with the Westin category to
which they belong. The majority of those who have had such experiences were
among the fundamentalists, but the majority of those who never had such
experiences fall into the group of the pragmatists.
Limitations of our study are reflected in the predictive power of Westin’s
categories and the assumptions underlying his Privacy Segmentation Index.
Namely, our findings have failed to establish a significant correlation between
the Westin categories and actual privacy-related behaviours on OSNs. This is because the Westin index captures broad, generic privacy attitudes. Moreover, the instrument was created in 1995 for the American market and has not been
significantly updated since then. Thus, it has to be adapted to be more in tune
with today’s Internet-focused and technology-based society.
The robustness of the results could be checked on a larger sample or a
sample of a different structure. The data collected come mainly from Facebook, which could affect users’ answers. The majority of the sample in our survey comprises young people (up to 24). Although this category is
the most active on OSNs, the analysis should be carried out on a larger sample
of users older than 24 years of age, since it is known that there are a number of
such users, and their number constantly increases.
Future research should be directed towards the improvement of the Westin
Privacy Segmentation Index or the development of a new approach for
segmentation, which is based on the preferences of the users, but which would
take into account their current behaviour and concrete characteristics of OSNs.
A tool that could be useful for this purpose is conjoint analysis. The method
was originally developed to measure consumer preferences, but proved to be
very useful and applicable in many other areas. The characteristics of this
approach are that the preferences are measured at the individual level, and it
allows post hoc segmentation based on the results. Perhaps such results would show a smaller gap between attitudes and behaviour than this study found using the Westin Privacy Segmentation Index.

REFERENCES
[1] Acquisti, A., and R. Gross. “Imagined communities: Awareness,
information sharing, and privacy on the Facebook.” In Privacy
Enhancing Technologies. Springer, 2006.
[2] Barnes, S.B. “A privacy paradox: Social networking in the United
States.” First Monday 11, no. 9 (2006).
[3] Berendt, B., O. Gunther, and S. Spiekermann. “Privacy in e-commerce:
stated preferences vs. actual behavior.” Communications of the ACM 48,
no. 4 (2005): 101-106.
[4] Boyd, D., and N. Ellison. “Social Network Sites: Definition. History.
and Scholarship.” Journal of Computer Mediated Communication 13,
no. 1 (2007).
[5] Chen, K., and A.I. Rea Jr. “Protecting personal information online: A
survey of user privacy concerns and control techniques.” The Journal of
Computer Information Systems 44, no. 4 (2004): 85.
[6] Clark, L.A., and S.J. Roberts. “Employer’s use of social networking
sites: a socially irresponsible practice.” Journal of Bussines Ethics 95
(2010): 507-525.
[7] Debatin, B. “Ethics, privacy, and self-restraint in social networking.” In
Privacy online, 47-60. Springer Berlin Heidelberg, 2011.
[8] Debatin, B., J.P. Lovejoy, A.K. Horn, and B.N. Hughes. “Facebook and
online privacy: Attitudes, behaviors, and unintended consequences.”
Journal of Computer-Mediated Communication 15, no. 1 (2009): 83-
108.
[9] Dolnicar, S., and Y Jordaan. “Protecting consumer privacy in the
company's best interest.” Australasian Marketing Journal (AMJ) 14, no.
1 (2006): 39-61.
[10] Fire, M, R Goldschmidt, and Y Elovici. “Online Social Networks:
Threats and Solutions.” IEEE COMMUNICATION SURVEYS and
TUTORIALS 16, no. 4 (2014): 2019-2036.
[11] Graeff, T.R., and S. Harmon. “Collecting and using personal data:
consumers’ awareness and concerns.” Journal of Consumer Behaviour
19, no. 4 (2002): 302-318.
[12] Gross, R., and A. Acquisti. “Information revelation and privacy in
online social networks.” Proceedings of the 2005 ACM workshop on
Privacy in the electronic society. 2005. 71-80.
[13] Ha, H. “Factors influencing consumer perceptions of brand trust
online.” Journal of Product and Brand Management 13, no. 5 (2004):
329-342.
[14] Haddidi, H., and P. Hui. “To add or not to add: privacy and social
honeypots.” Proceedings of the ICC 2010: IEEE International
Conference on Communications. Capetown, South Africa: IEEE, 2010.
[15] Harris Interactive. A survey of Consumer privacy attitudes and
behaviours. PLI/Harris, 2001.
[16] Hoy, M.G., and G. Milne. “Gender differences in privacy-related
measures for young adult facebook users.” Journal of Interactive
Advertising 10, no. 2 (2010): 28-45.
[17] Hugl, U. “Reviewing person's value of privacy of online social
networking.” Internet Research 21, no. 4 (2011): 384-407.
[18] Keith, M.J., S.C. Thompson, J. Hale, P.B. Lowry, and C. Greer.
“Information disclosure on mobile devices: Re-examining privacy
calculus with actual user behavior.” International Journal of Human-
Computer Studies 71, no. 12 (2013): 1163-1173.
[19] Krasnova, H., T. Hildebrand, and O. Guenther. “Investigating the value
of privacy in online social networks: conjoint analysis.” ICIS 2009
Proceedings. 2009. 173.
[20] Kumaraguru, P, and L. F. Cranor. “Privacy indexes: a survey of
Westin's studies.” 2005.
[21] Lenhart, A. Adults and social network websites. Pew Internet Report,
2012.
[22] Madden, M. Privacy management on social media sites. Pew Internet
Report, 2012.
[23] Madejski, M, M.L. Johnson, and S.M. Bellovin. “The failure of online
social network privacy settings.” Columbia University Computer
Science, 2011.
[24] Malhotra, N.K., S.S. Kim, and J. Agarwal. “Internet Users’ Information
Privacy Concerns (IUIPC): The Construct, the Scale, and a Causal
Model.” Information Systems Research 15, no. 4 (2004): 336-355.
[25] Miller, C. C. “Tech Companies Concede to Surveillance Program.” The
New York Times. May 2013. http://www.nytimes.com/2013/06/08/
technology (accessed May 2016).
[26] Mishna, F., A. McLuckie, and M. Saini. “Real-world dangers in an
online reality: a qualitative study examining online relationships and
cyber abuse.” Social Work Research 33, no. 2 (2009): 107-118.
[27] Motiwalla, L.F., Li Xiaobai, and X Liu. “Privacy Paradox: Does Stated
Privacy Concerns Translate into the Valuation of Personal
Information?” PACIS, 2014: 281.
[28] Norberg, P.A., D.R. Horne, and D.A. Horne. “The Privacy Paradox:
Personal Information Disclosure Intentions Versus Behaviors.” The
Journal of Consumer Affairs 41, no. 1 (2007): 100-126.
[29] Preibusch, S. “Guide to measuring privacy concern: Review of survey
and observational instruments.” International Journal of Human-
Computer Studies 71, no. 12 (2013): 1133-1143.
[30] Sheehan, K.B., and M.G. Hoy. “Dimensions of privacy concern among
online consumers.” Journal of Public Policy and Marketing 19, no. 1
(2000): 62-74.
[31] Staddon, J., D. Huffaker, L. Brown, and A. Sedley. “Are privacy
concerns a turn-off?: engagement and privacy in social networks.”
Proceedings of the eighth symposium on usable privacy and security.
ACM, 2012. 10.
[32] Statistics and facts about Social Networks. March 2016.
http://www.statista.com/statistics/272014/global-social-networks-
ranked-by-number-of-users/ (accessed May 15, 2016).
[33] Steinfield, C, N.B Ellison, and C Lampe. “Social capital, self-esteem,
and use of online social network sites.” Journal of Applied
Developmental Psychology 29 (2008): 434–445.
[34] Sutanto, J., E. Palme, C.H. Tan, and C.W. Phang. “Addressing the
Personalization-Privacy Paradox: An Empirical Assessment from a
Field Experiment on Smartphone Users.” MIS Quarterly 37, no. 4
(2013): 1141-1164.
[35] Taddicken, M. “The ‘Privacy Paradox’ in the Social Web: The Impact
of Privacy Concerns, Individual Characteristics, and the Perceived
Social Relevance on Different Forms of Self‐Disclosure.” Journal of
Computer‐Mediated Communication 19, no. 2 (2014): 248-273.
[36] Truste Privacy Blog. 2015. http://www.truste.com/blog/2015/01/28/
data-privacy-concern-consumers/ (accessed May 2016).
[37] Tufekci, Z. “Can you see me now? Audience and disclosure regulation
in online social network sites.” Bulletin of Science, Technology and
Society 28, no. 1 (2008): 20-36.
[38] Usage and Population Statistics. April 2016. http://www.
internetworldstats.com/stats4.htm (accessed May 2016).
[39] Wang, P., and L.A. Petrison. “Direct marketing activities and personal
privacy.” Journal of Direct Marketing 7, no. 1 (1993): 7-19.
[40] Westin, A. “Intrusions: Privacy Tradeoffs in a free society, Public
Perspective.” Public Perspective, 2000: 8-11.
[41] Westin, A. Privacy and Freedom. New York: Atheneum, 1967.
[42] Woodruff, A, V Pihur, S Consolvo, L Schmidt, L Brandimarte, and A
Acquisti. “Would a Privacy Fundamentalist Sell Their DNA for $1000...
If Nothing Bad Happened as a Result? The Westin Categories,
Behavioral Intentions, and Consequence.” SOUPS, 2014: 1-18.
[43] Young, A.L., and A. Quan-Haase. “Information revelation and internet
privacy concerns on social network sites: a case study of Facebook.”
Proceedings of the 4th International Conference on Communities and
Technologies (C&T’09). Pennsylvania: ACM, 2009.
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 6

INFORMATION RETRIEVAL AND DEVELOPMENT OF CONCEPTUAL SCHEMAS IN E-DOCUMENTS FOR SERBIAN CRIMINAL CODE

Vojkan Nikolić*, PhD, Predrag Đikanović, PhD and Slobodan Nedeljković
Ministry of Interior, SATIT, Belgrade, Serbia

ABSTRACT
In the process of developing e-Government, the Serbian government has implemented many e-Government services which produce a large amount of data and text documents, and citizens use these services more and more in their everyday lives. The text documents are in the Serbian language and are commonly in HTML, PDF and Microsoft Word format. Considering the increased amount of text documents, Serbian e-Government has indicated the need to extract certain data and information from the variety of existing text documents, which are usually in a format prepared for print.
In order to offer a technical solution for this case, the authors have developed a dedicated application that includes the Lucene library. Lucene is a specialized library for the implementation of indexing and searching over large amounts of data. Quick search within unstructured text documents in the Serbian language leads to more efficient detection and processing of criminal offenses and contributes to an increased level of security in the Republic of Serbia. In this paper, the authors deal with the possibilities of Lucene indexing and searching of data and documents within unstructured crime-related text documents in the Serbian language, aiming to find elements of crime in cyberspace.

* Corresponding author: V. Nikolic, Email: vojkan.nikolic@mup.gov.rs.

Keywords: text mining, natural language processing, unstructured data, apache lucene

INTRODUCTION
The rapid expansion of the Internet as the main medium for sharing information, together with its wide availability, has encouraged more and more people to create and share data, information and knowledge. The Government of the Republic of Serbia (RS) [1] has implemented [2, 3] a large number of e-Government services during the process of e-Government development, and the daily use of these services produces a large amount of data and documents, mostly in the form of text documents. These services need data and information that should be extracted from the variety of existing text documents, which are usually in a format prepared for print [4, 5]. Bearing in mind the amount of documents, no one has enough time to read all of them and to “extract” the important information contained in them. It is obvious that there is a need to select and separate the documents.
One approach to this problem is so-called text mining [6]. The aim of in-depth text analysis is to find interesting and nontrivial information, as well as knowledge, in unstructured text documents, and then to cluster and classify them. The natural language common in such documents produces unstructured text that is not suitable for analytical processing. To process these documents on a computer, they must first be adapted and prepared for computer processing, which involves a whole series of activities and procedures.
The concepts and application of Natural Language Processing (NLP) represent a set of techniques and methods for the automatic processing and generation of text in a natural language. This concept is applicable to, and supports, many languages.
Trends in the development of e-Government indicate the necessity of NLP
application within e-Government services of the Republic of Serbia [7, 8].
To provide a technical solution for indexing and subsequently searching unstructured data of the RS e-Government in the Serbian language, the Lucene library was used [9]. Lucene can index any text document realized in a free form, from virtually any data source, as long as the textual data can be extracted. Lucene is used for indexing and searching data stored in HTML documents, Microsoft Word documents, PDF files, and so on. The greatest amount of unstructured documents in RS e-Government is precisely in these formats.
This paper considers ways in which text analysis and classification
techniques might be used to improve the efficiency and effectiveness of e-
government services, especially those provided by law enforcement
agencies, by using techniques of automatic analysis of text reports and
making information available to decision makers. The increasing number of
anonymous reports submitted electronically by citizens in relation to various
areas of crime aggravates the process of analyzing these reports and drawing
analytical conclusions. The problem is made more complex because the
information obtained in this way is not filtered or guided as in a detective-led
interview, and therefore often contains irrelevant information. The authors of
the paper are in the process of developing a Decision Support System (DSS)
based on a combination of natural language processing techniques, similarity
measures, and machine learning, i.e., a Naive Bayes classifier, which
determines the similarities and differences between crime reports. The paper
presents an algorithm essential to the DSS and its evaluation.
In this paper, the authors elaborate the possibilities of Lucene full-text
indexing of unstructured text documents and searching for the data related to
crime in general in Serbian language.

1. STATE OF THE ART IN THE DOMAIN OF RS CRIMINAL LAW


The set of all text documents which represent the laws of the Republic of
Serbia is a huge repository of data. Representatives of the Serbian
Government produce and use, on a daily basis, a large amount of text
documents connected with the law. A special set of laws are those relating
to criminal law, one of them being the Criminal Code. The applicable Criminal
Code has been published and amended in the following official gazettes ("RS
Official Gazette", no. 85/2005, 88/2005, 107/2005, 72/2009, 111/2009,
121/2012, 104/2013 and 108/2014) [10].
The Serbian Criminal Code, in its 36 chapters, regulates the matter of guilt and
sanction provisions in relation to criminal offenses. One segment of the
Criminal Code regulates the matter of bodily injury. These criminal offenses
are treated in the Criminal Code from many aspects. For the purpose of this
research, we focused on the offenses defined in the Criminal Code as bodily
injury and consequently selected three articles of the Code:

1. Serious Bodily Injury (Article 121)
2. Light Bodily Injury (Article 122)
3. Coercion (Article 135).

Criminal Offenses against People

This group of crimes includes criminal offenses where someone has been
killed, exposed to mortal danger, has suffered bodily injury, or where
someone's health has been seriously impaired. Violence, abuse and neglect
are categorized by levels in the Procedures for Handling Cases of
Violence, Abuse and Neglect as follows:

• First level: head bumps, pushing, pinching, scratching, hitting,
pulling, biting, tripping, kicking, fouling, and destroying things.
• Second level: slapping, bumping, kicking, tearing someone's clothes
with force, plunder and destruction of property, slipping the chair,
pulling ears and hair.
• Third level: beating, strangling, throwing, causing burns, deprivation
of food and sleep, exposure to cold temperatures, weapon attack.

Physical abuse is the use of physical force against someone in a way
that physically endangers the person, with consequences immediately visible
in the form of bruises or scars. Since a physical assault means every
action against a body resulting in a violation of bodily integrity,
bodily injuries caused by a physical attack or kicking are also defined as a
crime.
The Criminal Code covers serious and light injuries. The line between them is
not always easy to draw and largely depends on the opinion given by medical
experts, but a serious bodily injury is one that endangers someone's life or
damages someone's health.
Forensic medicine classifies bodily injuries into two levels:

1. Light bodily injury (slight and temporary damage to health);
2. Serious bodily injury, divided into three sub-levels:
(a) a serious bodily injury (with no other attributes, a so-called
ordinary bodily injury);
(b) an aggravated bodily injury, defined as an especially dangerous,
serious bodily injury resulting in the weakening of a vital function of
someone's body or an organ, or permanent serious health
impairment or disfigurement;
(c) a serious bodily injury resulting in death.

There is a group of criminal offenses where a perpetrator does not directly
endanger a victim, such as (a) leaving someone in a condition or circumstances
dangerous to life or health (exposure to danger), (b) a perpetrator causes
certain circumstances and does not provide help to someone who is in mortal
danger, although able to provide help without risk to others, (c) abandoning
a helpless person in circumstances dangerous to life and health. In
addition, beatings, the use of dangerous tools in fights or quarrels, and illegal
abortions are also considered crimes against a person.

EuroVoc Thesaurus

EuroVoc is a multilingual thesaurus specifically developed for processing
the documentary information of the EU institutions. EuroVoc is a controlled
vocabulary which contains exclusively selected and recommended
terms/names with the aim of organizing knowledge. Its content is a
hierarchical list of terms (descriptors) representing the social and
political life of the EU, classified under the relevant terms. Each term
automatically includes all equivalents of the recommended descriptors in all
languages. The use of EuroVoc is a great advantage in text mining because it
allows standardization of terminology for subject indexing, which enables
greater precision when searching documents. EuroVoc is available on the
website [11].
The first and second versions of EuroVoc provided parallel terms in nine
languages and successfully translated thematic concepts from one
language to another, while the third, multilingual version included other
languages such as Slovenian, Croatian, and Serbian.

1.1. Semantic Resources in Serbian Language

There are four types of semantic indexing resources (also called controlled
indexing languages): Controlled Vocabulary, Taxonomy, Thesaurus and
Ontology [12]. The thesaurus, as one kind of semantic resource [13], is a
network of controlled vocabularies. It is a higher level compared to
taxonomies: a data representation including associative relationships in
addition to hierarchical relationships. The structure of EuroVoc depends on
semantic relationships (at the specific level of descriptors and non-
descriptors): scope note, micro-thesaurus relationship, equivalence
relationship, hierarchical relationship and associative relationship [14]. The
thesaurus has equivalence (USE/UF), broader term (BT), narrower term (NT)
and related term (RT) relationships. Relationships formed in this way define
the structure and scope of the thesaurus.
For instance, the fact that a broader term for "physical assault" is
"criminal law" and that its narrower terms are "criminal offense against a
person" and "criminal offense" determines the scope of the set of data relating
to these terms. In order to realize more granular and consistent indexing, an
expanded set of links based on semantic relationships can be used. This
provides a very efficient searching process from the user's perspective.
The approach developed in [15] for semantic annotation of texts is to move
beyond the bag-of-words representation, using atoms of lexical knowledge to
represent the elementary word meanings (senses), and converting the text into
a graph linking senses rather than words. WordNet synsets are well suited for
that purpose, grouping words into sets of synonyms related to word
definitions, providing sense identifiers and recording semantic relations
between synsets. The sense clustering methods are referenced in WordNet
and EuroWordNet and ensure relations between the sets of synonyms
(synsets). Synsets represent the senses of the words, which are grouped into
clusters. Text annotated at a higher abstraction level can be clustered in a
better way because the similarities between texts are clearer.
Figure 1. EuroVoc - The term of a criminal offense in the dictionary of criminal law.

The clustering methods use different types of information
[16]: "information concerning the similarity of the words found in the synsets
describing different senses; information on the similarity of the relations
between themselves and other synsets of the network; probabilistic
information extracted from corpora; syntactic criteria concerning alternations
with similar sub categorization frames; semantic criteria concerning the
semantic class of arguments, the subject domain and the underlying predicate
argument structures" (Resnik, 1995; Jiang and Conrath, 1997; Mihalcea and
Moldovan, 2001; Palmer et al., 2006).
Serbian WordNet (SrpWN) represents a lexical semantic network,
containing synsets with glosses and various semantic relations, such as
antonyms, metonymy, causation, category domain, etc. [17]. Through
interlingual relations, it is connected to English WordNet (versions 2.0 and
3.0) and wordnets of many other languages.
In order to avoid the repetition of words in texts that have been
translated, translators commonly use synonyms and near-synonyms, i.e.,
translation equivalents (TEs) of a polysemous word. These
words are semantically similar in the target language (TL).

Figure 2. Serbian WordNet [18].

2. KNOWLEDGE EXTRACTION FROM UNSTRUCTURED
DOCUMENTS VIA TEXT MINING TECHNIQUES
Today, text mining technologies and data provide a new generation of
tools for the analysis and visualization of both structured data and text [19].
Text mining techniques have played a vital role in the last few years in the
extraction of knowledge from unstructured documents, especially in crime
detection and prevention [20]. The main objective of this study is to propose a
text mining approach to expand upon the special analytical capabilities of the
investigative effort carried out by law enforcement personnel [21]. A text
document can be unstructured, structured (e.g., business reports) or
semi-structured. One of the basic text mining tasks is Information Retrieval of
documents in response to a "query document"; a special case is when the query
document consists of only a few keywords (Figure 3).

Figure 3. Lucene document matcher.

To address this problem the Information Retrieval system indexes the
documents and queries. Indexing is the process that extracts descriptor terms
from a document or a query.
The term frequency is an important factor in the indexing phase. Generally,
terms that are very frequent in a corpus are not representative of the contents
of particular documents; they are not carriers of information. Weighting
techniques are used to assign a high weight to rare words. The best known
weighting techniques, according to [22], rely on the TF-IDF measure.
TF-IDF doesn’t measure how often a keyword appears, but offers a
measurement of importance by comparing how often a keyword appears
compared to expectations gathered from a larger set of documents.
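As a minimal sketch (not from the original text), assuming a toy in-memory
corpus of tokenized documents, the classic TF-IDF weight can be computed as
follows:

import java.util.*;

public class TfIdfExample {
    // Computes tf-idf for a term in one document against a small corpus.
    // tf = raw count of the term in the document,
    // idf = log(N / df), where df is the number of documents containing the term.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(w -> w.equals(term)).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        double idf = Math.log((double) corpus.size() / df);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("teska", "telesna", "povreda"),
                Arrays.asList("laka", "telesna", "povreda"),
                Arrays.asList("prinuda", "sila", "pretnja"));
        // "povreda" appears in 2 of 3 documents, so its weight is moderate;
        // "teska" appears in only 1 document, so it gets a higher idf.
        System.out.println(tfIdf("povreda", corpus.get(0), corpus));
        System.out.println(tfIdf("teska", corpus.get(0), corpus));
    }
}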

Figure 4. TF-IDF measure.


2.1. Query and Document Models

Translating the query and the document from raw strings into something we
can compute with is the first hurdle in computing a similarity score. To do so,
we use "query models" and "document models." The "models" here are just a
fancy way of saying that the queries and documents are represented as vectors
in a way that makes computation possible.
The similarity between two documents is a function of the angle between
their vectors in the term vector space. The similarity between query (q) and
documents (d) is expressed by the cosine of the angle between two vectors (q
and d) in the next formula:

sim(q, d) = cos θ = (q · d) / (|q| |d|)    (1)
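A minimal sketch (not part of the original text), assuming both the query and
the document are already represented as TF-IDF vectors over the same
vocabulary, the cosine similarity of Equation (1) can be computed as:

public class CosineSimilarity {
    // Cosine of the angle between two term-weight vectors q and d
    // (e.g., TF-IDF vectors over the same vocabulary).
    static double cosine(double[] q, double[] d) {
        double dot = 0.0, normQ = 0.0, normD = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            normQ += q[i] * q[i];
            normD += d[i] * d[i];
        }
        if (normQ == 0 || normD == 0) return 0.0;
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }

    public static void main(String[] args) {
        // Hypothetical weights for the terms {"fizicki", "napad", "povreda"}.
        double[] query = {1.2, 1.2, 0.0};
        double[] doc = {0.4, 0.9, 1.5};
        System.out.println(cosine(query, doc)); // value in [0, 1] for non-negative weights
    }
}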

Figure 5 illustrates this process for the query "fizički napad" (en.
"physical assault") and the document of the Crime Court of the Republic of
Serbia, in accordance with the query and document models in [23].
The final step in computing the similarity score combines the query and
document representations using a scoring function.

(2)

Figure 5. The process for the query "fizički napad" and the Crime Court of the
Republic of Serbia.
The vector space model is unable to discriminate between different
meanings of the same word.

(3)

No associations between words are made in the vector space
representation.

(4)

There is no connection between topics and words. If a search engine
determines that a particular query is time sensitive, it will return news results,
and if it thinks the query intent is transactional, it will display results by topic.
Topic modelling provides methods for automatically organizing,
understanding, searching, and summarizing large electronic archives. The
types of query and document models commonly used in search are VSM
(Vector Space Model), LSA (Latent Semantic Analysis), pLSA (probabilistic
Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation).
More abstract representations of the query are entity extraction or
latent topic representations (LDA) [23]. Indeed, Google knows that
"fizički napad" appears as news in the Serbian web-magazine Telegraf.
To extract the semantic relationships between words, we do not address
specific data, i.e., "the term", but the overall meaning contained in
these terms, which we treat as "concepts". Concepts are in some ways
containers of terms; each concept contains a set of terms that have
semantically the same meaning [24]. Semantic resources can retrieve concepts
from terms. The most used semantic resources are controlled vocabularies,
taxonomies, thesauri and ontologies.
The use of terminological resources improves the retrieval process. It
allows us to find documents that exactly meet the need expressed by a query,
but also documents that partially meet this need. However, for a query
containing two terms T1 and T2, a document D1 containing these two terms
once and a document D2 containing the two terms 30 times, in some cases
semantic indexing cannot decide which of the two documents responds better
to the query. Usually the document D2 is more relevant, since it contains more
information about the two terms. It would therefore be wise to return
document D2 to the user before D1.
Figure 6. Vector TF-IDF – terms and paragraphs.

As document structure often builds a hierarchy of sections, paragraphs
and sentences, we can decompose each document into the vector space
representation of non-overlapping and sequential extracts that correspond to
them, and store such decompositions in a term-by-extract matrix (tem) (also
called a "term-location" matrix in [25]). The choice of the hierarchy level that
the decomposition relies on can result in a document representation using
term-by-section (tsm), term-by-paragraph (tpm) or term-by-sentence (tsm)
matrices.
The extraction is done by projecting terms on a semantic resource (such as
a thesaurus). The result is a set of related concepts represented by a graph. It is
a conceptual graph that describes the content of documents.
Furthermore, the fine granularity of the semantic descriptions found in
existing resources, combined with the great divergences in their structure and
content, hinders their compatibility. Usually, classic indexing
according to [26] links the descriptors of a query to the indexes of a
document only if the intersection of the two sets is not empty. For this reason,
relations of synonymy, equivalence, etc. are ignored. When searching for the
term physical assault, the user would potentially be interested in
documents containing the term bodily injury, because "bodily injury"
is only a derived form of the general meaning of "physical assault". The use of
a terminological (semantic) resource such as a thesaurus or an ontology can
treat descriptors as concepts rather than simple terms. It allows passing from
simple indexing to semantic indexing.
Figure 7. Graph of concepts representing the documents.

According to the semantic indexing process and concept weighting
described in [12], there are three main steps for semantic indexing and
weighting (a minimal sketch of these steps follows the list):

• Terms extraction.
• Concepts extraction.
• Concepts weighting.
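As a rough illustration (not from the original text), assuming a hypothetical
thesaurus lookup table that maps terms to concepts, the three steps above
might be sketched as follows; the map contents and weights are invented for
the example:

import java.util.*;

public class SemanticIndexingSketch {
    public static void main(String[] args) {
        // Step 1: terms extraction (here simply whitespace tokenization).
        String document = "teska telesna povreda fizicki napad povreda";
        String[] terms = document.toLowerCase().split("\\s+");

        // Step 2: concepts extraction via a hypothetical thesaurus projection
        // (term -> concept); in practice this would come from EuroVoc or WordNet.
        Map<String, String> thesaurus = new HashMap<>();
        thesaurus.put("povreda", "BODILY_INJURY");
        thesaurus.put("napad", "PHYSICAL_ASSAULT");
        thesaurus.put("fizicki", "PHYSICAL_ASSAULT");

        // Step 3: concepts weighting (here a simple frequency count per concept).
        Map<String, Integer> conceptWeights = new HashMap<>();
        for (String term : terms) {
            String concept = thesaurus.get(term);
            if (concept != null) {
                conceptWeights.merge(concept, 1, Integer::sum);
            }
        }
        // e.g., {PHYSICAL_ASSAULT=2, BODILY_INJURY=2} (order may vary)
        System.out.println(conceptWeights);
    }
}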

3. A TEXT SEARCH ENGINE LIBRARY IN INFORMATION
RETRIEVAL - APACHE LUCENE
Information retrieval (IR) refers to the process of searching documents,
information within documents or metadata about documents. Lucene lets you
add searching capabilities to your applications. It is a mature, free, open-
source project implemented in Java; it’s a project in the Apache Software
Foundation, licensed under the liberal Apache Software License. As such,
Lucene is currently, and has been for quite a few years, the most popular free
IR library [27].
Figure 8. Basics of search engine.

Applications such as Amazon are among the commercial applications that
use Lucene for indexing and effective searching [28]. Lucene is able to
index text from various formats such as PDF, HTML and Microsoft Word,
and also in various languages [29]. One of the Lucene concepts is presented
in Figure 8.
The process of indexing consists of several procedures and operations that
make up the Lucene indexing method [30]. All these operations are separated
into three operational groups:

• extracting text from the document,
• analysis,
• adding to the index.

Essentially, each of these groups is a quite different and relatively complex
operation. The first step in indexing is to extract the text from the original
document content. Then, the extracted text is used to create the document. The
resulting document is made up of fields. The text fields created in this way are
analyzed and form a set of tokens. The last step in the indexing of text
documents is to combine the tokens with the corresponding indices.
In order to index a document using Lucene, we have to convert it into
plain text format for Lucene processing, and then create a Lucene document.
To create an index for a document in PDF format, it is first necessary to use
a method that extracts the information in the form of text from the PDF, and
the extracted text is then used to create documents. A similar approach is used
for indexing Word documents, or any other document that is not in plain text
format. Also, for XML or HTML documents using plain text characters, you
need to properly prepare the data for indexing. Once you have the text that
you want to index and have created a document with its fields, the text needs
to undergo a process of analysis.
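A minimal indexing sketch (not taken from the authors' application), assuming
the text has already been extracted to a plain string and using the Lucene 5.x
API mentioned later in the chapter, could look like this:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Text already extracted from a PDF/Word/HTML source.
        String extractedText = "Teska telesna povreda iz clana 121 Krivicnog zakonika.";

        // Open (or create) an index directory on disk.
        FSDirectory dir = FSDirectory.open(Paths.get("D:/Na"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // A Lucene document is a set of fields; the analyzed text goes
            // into a TextField so it is tokenized and added to the inverted index.
            Document doc = new Document();
            doc.add(new TextField("contents", extractedText, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}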
Analysis converts text data into basic units called tokens. This is the
process of converting raw text into tokens. In Lucene, this is achieved by
using the Analyzer, Tokenizer and TokenFilter classes. The Tokenizer is
responsible for breaking the input into component pieces, the tokens. Token
filters can further modify the tokens produced by the Tokenizer.
Once you create the Lucene document fields, you can invoke the
IndexWriter. After that, Lucene first analyzes the text, dividing the text data
into tokens, and then performs a number of additional operations. Using a
Lucene filter, you can, for example, search for a specific word or set of words
regardless of whether it is written in lowercase or capital letters.
During the analysis, text data passes through several operations: the
removal of common words, ignoring punctuation, stemming of words to
reduce them to their root form, changing words to lowercase, etc.
Analysis takes place immediately prior to indexing and querying. Analysis
converts text data into tokens, and these tokens are added as terms to the
Lucene index.
The Lucene library contains a variety of built-in analyzers. Some of them are:
SimpleAnalyzer, StandardAnalyzer, StopAnalyzer and SnowballAnalyzer.
They differ in the way they treat text, in their mode of application and in the
type of filters used. Such analysis can have both advantages and drawbacks:
the removal of words before indexing decreases the size of the index, but can
have a negative impact on processing precise queries. A custom analyzer
gives the application more control over the analysis process.
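As an illustration (a sketch, not the authors' code), assuming Lucene 5.x, the
analysis step can be inspected by running an analyzer manually and printing
the produced tokens:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // The analyzer lowercases the text and splits it into tokens;
        // stop-word removal depends on the configured stop-word set.
        try (TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("Laka telesna povreda, clan 122."))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString()); // e.g., laka, telesna, povreda, clan, 122
            }
            stream.end();
        }
    }
}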
After analyzing the input text and creating its representation, the Lucene
index is updated. Lucene uses a data structure known as an inverted index.
The inverted index uses disk space efficiently while enabling faster look-up
of keys. Its structure is inverted because the tokens extracted from an input
document are used in the form of look-up keys. This mechanism ensures that
the document is not treated as the central entity: a concrete word is looked up
directly in the index instead of scanning the entire document.
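The idea of an inverted index can be sketched in a few lines (a simplified
illustration, not Lucene's actual data structure): each token maps to the list of
documents in which it occurs.

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = {
                "teska telesna povreda",   // doc 0
                "laka telesna povreda",    // doc 1
                "prinuda sila pretnja"     // doc 2
        };

        // token -> sorted set of document ids containing that token
        Map<String, SortedSet<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].split("\\s+")) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }

        // Look up a word directly instead of scanning every document.
        System.out.println(index.get("povreda")); // [0, 1]
        System.out.println(index.get("sila"));    // [2]
    }
}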
Lucene’s default scoring system works very well for most cases. It uses
seven different variables to determine the final ranking of each document.
Along with the TF and IDF variables, there are (from lucenetutorial.com):

• coord = number of terms in the query found in the document,
• lengthNorm = measure of the importance of a term according to the
total number of terms in the field,
• queryNorm = normalization factor so that queries can be compared,
• boost (index) = boost of the field at index-time,
• boost (query) = boost of the field at query-time.

The similarity measure uses mathematical techniques to estimate the
degree of semantic similarity between two documents or terms/concepts and
is based on the meaning of terms/concepts to find the similarity value. For the
default similarity measure (DefaultSimilarity) Lucene uses cosine similarity.
This default similarity function is based on TF-IDF similarity. The following
formula (from Lucene’s Similarity documentation) illustrates the basic
factors used to score a document (Equation 5):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_(t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]    (5)
Before starting to analyze the results, we want to clarify the evaluation of
query results for those readers who have little or no knowledge about the
criminal code. Suppose that we have a query (one version of the root word
"kazna") that should retrieve three articles of the law as documents (i.e., the
set of relevant documents is of size 3). One can select a single search result
in Luke, a GUI front-end to Lucene, on the Explain control panel. This pops
up another window which explains the query score. In the query details for
the 3 documents, we see that the input has been parsed into the text "kazna".
Here is the text of the explanation:

0.2813 = weight(contents:kazna in 8) [DefaultSimilarity], result of:
  0.2813 = fieldWeight in 8, product of:
    1.0000 = tf(freq=1.0), with freq of:
      1.0000 = termFreq=1.0
    1.1252 = idf(docFreq=14, maxDocs=17)
    0.2500 = fieldNorm(doc=8)

where the ranking score is calculated as 1.0000 * 1.1252 * 0.2500 = 0.2813.
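The same explanation can be obtained programmatically (a sketch under the
assumption of a Lucene 5.x index with a "contents" field stored at the index
path used above; not the authors' code) by calling IndexSearcher.explain():

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ExplainScore {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("D:/Na")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TermQuery query = new TermQuery(new Term("contents", "kazna"));
            TopDocs hits = searcher.search(query, 3);
            for (ScoreDoc hit : hits.scoreDocs) {
                // The Explanation shows the tf, idf and fieldNorm factors behind the score.
                Explanation explanation = searcher.explain(query, hit.doc);
                System.out.println(explanation);
            }
        }
    }
}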

4. CASE STUDY: INFORMATION RETRIEVAL IN CRIMINAL
CODE OF THE REPUBLIC OF SERBIA
As noted above, the Serbian Criminal Code, in its 36 chapters, regulates the
matter of guilt and sanction provisions in relation to criminal offenses, and
one segment of it regulates the matter of bodily injury. For the purpose of
the research, we focused on the offenses defined in the Criminal Code as
bodily injury and consequently selected three articles of the Code: a) Serious
Bodily Injury (Article 121), b) Light Bodily Injury (Article 122) and c)
Coercion (Article 135) of the Criminal Code of the Republic of Serbia. For
the interpretation of this Code it is essential to define the meaning of physical
injury as understood by its major users.
As its primary feature, Lucene uses a great deal of intelligence to determine
which documents are the most important for a given query. For the
purpose of our experiment, we used the RS Criminal Code written in Latin
script and in Serbian language, which, in contrast to English, contains the
following characters: č, ć, ž, đ and š.
To execute the process of indexing, these characters need to be identified
and supported by Lucene for Serbian. This feature is available in Lucene
version 5.2.0, which we have used. This version of Lucene also has an editor
where it is possible to monitor the indexing and later retrieval: LUKE 5.2.0.
Luke is a handy development and diagnostic tool which provides access to
already existing Lucene indexes and allows you to display and modify their
content, reconstruct the original document fields, edit them and re-insert them
into the index.
For the purpose of normalization in the context of indexing, the following class is used:

public class SerbianNormalizationFilterFactory extends TokenFilterFactory

In this experiment we used the StandardAnalyzer of Lucene 5.2.0. If you use
StandardAnalyzer(), then you use STOP_WORDS_SET, which is a set of
basic English stop-words.
Since the text is in the Serbian language, we could not directly use
StandardAnalyzer() with its defaults. For our purposes, it was necessary to
include stop-words in the Serbian language. This was done by passing a new
argument to StandardAnalyzer(), defined as a CharArraySet, as follows:

// Load the Serbian stop-word list from a text file into a CharArraySet
CharArraySet set = IndexFiles.fromFileToCharArraySet(stopwPath);
// Pass the custom stop-word set to the StandardAnalyzer
Analyzer analyzer = new StandardAnalyzer(set);

Using CharArraySet, it is possible to read the stop-words from the text
file stop.txt, whose location is given by stopwPath. In fact, we have only
defined the paths of the .txt files. The code is as follows:

String docsPath = “D:/Na1”;
String stopwPath = “D:/NaS/stop.txt”;
String indexPath = “D:/Na”;

docsPath is the path where we place the input documents to be indexed by
Lucene, stopwPath points to the input document with Serbian stop-words,
and indexPath is where the Lucene index files are placed.
Figure 9. Stop-words list.

We have added about 700 of the most common words as stop-words: koju,
koje, ... Part of this list is shown in Figure 9.
Experimenting with the scoring system in Lucene, we can determine the
relevance of a document to the query. In Luke we can see how often a query
term appears in a document. Using Luke, we can open our index and see
its content. It gives us statistics on the elements, the number of documents
indexed, etc. Luke can also be used to run raw Lucene searches against our
index [31]. Under "Top ranking terms" on the right, we see the terms stored
most frequently in the field named _content.
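Outside of Luke, the same kind of raw search can be issued from code; the
sketch below is an illustrative assumption based on the Lucene 5.x API, the
index path defined above and a field named "contents", not the authors' exact
implementation:

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        // The same analyzer (with the Serbian stop-word set) must be used
        // at query time as was used at indexing time.
        Analyzer analyzer = new StandardAnalyzer(); // stop-word set omitted for brevity
        try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("D:/Na")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", analyzer).parse("telesna povreda");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println("doc=" + hit.doc + " score=" + hit.score);
            }
        }
    }
}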

4.1. Conceptual Graph Construction

Using a morphosyntactic analyzer (such as the Bag of Words lexically
supported query expansion system VeBRanka), descriptors (terms) of
documents in Serbian can be extracted. This is an extraction of terms which
best represent the informational content of documents. Generally, a frequent
word in a document denotes an important concept. The arrows in the
conceptual graph go from higher-weight concepts to lower-weight concepts.
The corresponding concepts are then found either manually or via projection
onto a semantic resource (such as the EuroVoc thesaurus). In this way, we
obtain a conceptual graph as follows:
During the search phase (user querying) we want to find the most relevant
documents by applying the TF-IDF weighting method. The operative term will
be "povreda" or some root or variant thereof. We will search for this term in a
half dozen simple documents, as illustrated below.

Figure 10. Multi-graph of concepts.

Figure 11. Multi-graph of concepts.


Each of the three columns illustrates a polysemy, a word or phrase with
multiple, related meanings. Column 1 associates "nastupiti" and "smrt" with
"napad", "sposobnost" and "unakaženost", while column 3 associates
"nastupiti" and "smrt" with "ubistvo", "sila" and "otmica". Consequently, the
terms can be transformed into concepts. For each concept we calculate its
weight with respect to all documents, and the resulting weights are stored in
the index of the concept. Then we project the documents onto this graph. The
idea is to schematize the relations between concepts and documents. We then
get a graph as follows:

Figure 12. Multi-graph of concepts enriched with documents.

CONCLUSION
The aim of this study was to present the possibilities of Lucene
indexing and searching of data and unstructured text documents related to
crime in the Serbian language, in order to find elements of crime in cyberspace.
This research is based on the Vector Space Model, where the TF-IDF measure
is applied to the query and the Lucene index in order to improve the data
searching process for creating a conceptual model. The emphasis is on three
articles of the Criminal Code of the Republic of Serbia relating to physical
violation. Considering that criminal activity is also present in cyberspace, fast
search techniques are important for the detection and processing of criminal
offenses in order to increase the level of security in the Republic of Serbia.
In addition, the paper presents the possibilities of Lucene
indexing and searching of unstructured data and documents within the e-
Government of the Republic of Serbia in the Serbian language. First, we
presented the architecture of data mining and of the Lucene library, as well
as the Lucene core, and then the possibilities of its application.
We based our research on 540 questions asked to users concerning these
articles of the law. The paper presents the search results and the ranking of
the three documents (articles of the Criminal Code of the Republic of Serbia)
related to personal injury.

REFERENCES
[1] The strategy and action plan for the development of electronic
administration until 2013 (“RS Official Gazette”, Nos. 55/05, 71/05-
correction, 101/07 and 65/08).
[2] Nikolić, V.; Đikanović, P.; Batoćanin, D. e-Government Republic of
Serbia: The registration of motor vehicles and trailers, YU INFO, 2013.
[3] Nikolić, V.; Protić, J.; Đikanović, P. G2G integration in MOI of
Republic of Serbia with the e-Government PORTAL, ETRAN, 2013.
[4] Randjelović, D.; Popović, B.; Nikolić, V.; Nedeljković, S. Intelligent
search terms in the case of police services in eGovernment, New
information technology for analytical decision-making in the biological,
economic and social systems, State University in Novi Pazar, 2014.
[5] Dragović, R.; Ivković, J.; Dragović, D.; Klipa, Đ.; Radišić, D.; Nikolić, V.
Decision support system to support the strategic management of the state
administration, YU INFO, 2015.
[6] Zhong, N.; Li, Y.; Wu, S.-T. “Effective Pattern Discovery for Text
Mining”, IEEE Transactions on Knowledge and Data Engineering,
vol. 24, no. 1, pp. 30-44, January 2012, doi:10.1109/TKDE.2010.211.
[7] Teufl, P.; Payer, U.; Lackner, G. From NLP (Natural Language
Processing) to MLP (Machine Language Processing), Computer
Network Security, (2010).
[8] Stevic, Z.; Rajcic-Vujasinovic, M.; Radovanovic, I.; Nikolic, V.
Modeling and Sensing of Electrochemical Processes upon Dirac
Potentiostatic Excitation of Capacitive Charging/Discharging, Int. J.
Electrochem. Sci., 10 (2015) 6020-6029.
[9] Lucene (http://lucene.apache.org/).
[10] Nikolić, V.; Markoski, B.; Ivković, M.; Kuk, K.; Djikanović, P.
Information retrieval for unstructured text documents in Serbian into the
crime domain, p. 6, CINTI, 2015.
[11] The EU’s multilingual thesaurus, http://eurovoc.europa.eu/.
[12] Nabil, R.; Amine, L.; Issam, A.; Ben Lahmar, E. H.; Labriji, E. H.
A New Approach that improves TF-IDF Weighting Measure,
International Journal of Information and Communication Technology
Research, Volume 5, No. 10, October 2015.
[13] Roussey, C.; et al. Une méthode d'indexation sémantique adaptée aux
corpus multilingues, L’Institut National des Sciences Appliquées de
Lyon, N° d’ordre 01 ISAL 0059, (2001).
[14] http://www.iskoi.org/doc/development.htm.
[15] http://www.fizyka.umk.pl/publications/kmk/12-Wordnet-glosses.pdf.
[16] Apidianaki, M. Discovering semantic relations by means of
unsupervised sense clustering, Proceedings of the LREC Workshop
“Semantic relations. Theory and Applications”, 18 May, Valletta, Malta,
pp. 3-11., (2010).
[17] http://sm.jerteh.rs/Default.aspx.
[18] http://serbian-dictionary.com/wordnet.
[19] http://www.megaputer.com/down/dm/white_papers/crime_pattern_case.
pdf.
[20] Elyezjy, N. T.; Elhaless, A. M. Investigating Crimes using Text
Mining and Network Analysis, International Journal of Computer
Applications (0975 – 8887), Volume 126, No. 8, September 2015.
[21] Helbich, M.; Hagenauer, J.; Leitner, M.; Edwards, R. Exploration of
unstructured narrative crime reports: an unsupervised neural network
and point pattern analysis approach, Cartography and Geographic
Information Science, 2013.
[22] Soucy, P.; Mineau, G. W. (2005), “Beyond TFIDF Weighting for Text
Categorization in the Vector Space Model”, IJCAI’05 Proceedings of
the 19th international joint conference on Artificial intelligence.
[23] https://moz.com/blog/determining-relevance-how-similarity-is-scored.
[24] Seydoux, F. (2006), Exploitation de connaissances sémantiques
externes dans les représentations vectorielles en recherche
documentaire, École Polytechnique Fédérale de Lausanne.
[25] Roelleke, T.; Tsikrika, T.; Kazai, G. A general matrix framework for
modelling information retrieval. J. Information Processing and
Management, 42, 4-30, 2006.
[26] Church, K. W.; Hanks, P. (1990). “Word association norms, mutual
information and lexicography”. Computational Linguistics, vol. 16,
no. 1, March 1990, pp. 22-29.
[27] Hatcher, E.; Gospodnetić, O.; McCandless, M. Lucene in Action,
Manning Publications, 2009.
[28] http://wiki.apache.org/lucene-java/PoweredBy.
[29] Paul, T. (2004). The Lucene Search Engine. http://www.javaranch.com/
journal/2004/04/Lucene.html.
[30] Nikolić, V.; Nedeljković, S.; Djikanović, P. Information retrieval for
unstructured text documents: Lucene indexing, EUROBREND, 2015.
[31] Nikolić, V.; Ivković, M.; Nedeljković, S.; Djikanović, P. Information
retrieval for unstructured text documents: Lucene searching, AIIT, 2015.
In: Knowledge Discovery in Cyberspace ISBN: 978-1-53610-566-7
Editors: K. Kuk and D. Ranđelović © 2017 Nova Science Publishers, Inc.

Chapter 7

DEVELOPMENT OF THE ANDROID-BASED
SECURE COMMUNICATION DEVICE

Aleksandar Jevremović1,*, PhD, Mladen Veinović1, PhD,
Goran Šimić2, PhD, Nikola Savanović1, MD
and Dragan Ranđelović3, PhD
1Singidunum University, Belgrade, Serbia
2Military Academy, Belgrade, Serbia
3The Academy of Criminalistic and Police Studies, Belgrade, Serbia

ABSTRACT
The possibility of achieving protected communication has long been
a privilege only for professional services and systems that can afford great
investments in the development of specialized devices for this purpose.
Today, the popularization of the open-source development model enables
a significant reduction of development costs while maintaining high levels
of security. This development implies the inclusion of existing
components, which makes it possible to verify the implemented principles
if necessary. This paper discusses key issues related to the development of
mobile devices for secure communication based on the Android platform.

Keywords: custom cipher algorithm, network system security

* Corresponding author: A. Jevremović, Email: ajevremovic@singidunum.ac.rs.
INTRODUCTION
The possibility of achieving protected communication has long been a
privilege only of professional services and systems that are able to make big
investments in the development of specialized devices. At this level, only
communication based on a cipher algorithm developed by the end user can
be considered secure. In addition, the principles of operation of such cipher
algorithms may not be known to anyone other than the system end user. From
this point of view, the popular cipher algorithms (DES, AES, etc.) cannot be
considered secure ones [1].
Therefore, already-built, out-of-the-box communication protection systems
are out of the question. Cryptographic solutions require a reliable (secure)
platform for implementation (hardware and system software) which can be
verified (open-source systems). The method of implementation and confidence
in the cryptographic synchronization and resynchronization procedures are key
factors for confidence in a custom cryptographic solution. Such an
implementation prevents the existence of back doors through which the
cipher keys can “leak”. Moreover, the procedures for cryptographic key
manipulation are essential for such considerations (storage, selection,
distribution, deletion, etc.).
Today, the popularization of the open-source development model enables a
significant reduction of development costs while maintaining high levels of
security. This development implies the use of already-built components, which
makes it possible to verify the implemented principles if necessary.
This paper discusses key issues related to the development of mobile
devices for secure communication based on the Android platform. This
research is based on the experience collected [2-9] in the development of top-
level communication protection systems based on the Linux platform.

NETWORK LEVEL IMPLEMENTATION


Mobile computer networks, including 4G mobile telephony, use
standardized protocol stacks, OSI and TCP/IP. This standardization allows
chaining protocols at different levels in order to achieve optimal performance
of various network applications over a variety of infrastructures. From the
aspect of telecommunications, such stratification allows the implementation
of the cipher system at various levels (from the physical to the application
layer), depending on the capabilities and needs of a particular
telecommunications system.
An encryption system implementation based on the physical layer requires
the use of special telecommunication infrastructure, and such a system could
not be used on the Internet. In addition, these systems are implemented at the
hardware level, which can significantly increase the cost of development.
Such systems are not flexible, requiring adaptation to the physical interfaces
of the protected device. Also, they are specialized in a non-standardized way
and therefore need significant funds to be invested in development and
implementation. On the other hand, the implementation of encryption
functions at the hardware level leaves minimal space for a “back door”.
Implementation of the encryption system at the application layer is a
fairly easy objective in cases where developers develop the complete
application protocol, or when they can modify the source code of an existing
one. Otherwise, when it is necessary to protect closed application protocols,
this may represent an impossible task. Moreover, it is necessary for users to
manage some cryptographic functions (e.g., selection of the encryption
system, algorithm, the encryption key, etc.), which significantly burdens their
work and increases the possibility of errors. An implementation of the
protection system at the application level is not reusable for more than one
application protocol, which represents another disadvantage of such an
implementation.

Figure 1. Different communication levels for implementation of encryption systems.

However, it should be noted that for the implementation of a protection
system at these levels, in order to achieve maximum performance and
safety, it is necessary to have the ability to modify the kernel of the operating
system.
Implementation of the encryption system at the network or transport layer
represents the optimal choice from the aspect of compatibility with a variety
of telecommunication infrastructures and the Internet, as well as from the
aspect of all the application protocols that the device uses. In this case
preference is given to the network layer, since the encryption function
implementation at this level protects [10] the different protocols of the
transport layer (TCP, UDP, etc.). Further, the IPsec protection subsystem at
the network layer can easily be used in combination with a custom cipher
algorithm. However, it should be noted that for the realization of a protection
system at these levels, it is necessary to have the ability to modify the kernel
of the operating system in order to achieve maximum performance and safety.

SYSTEM LEVEL IMPLEMENTATION


One of the key issues for the implementation of a custom encryption system
is the choice of the computer system level at which it will be implemented. In
general, there are three possible levels of implementation: so-called user space
(in the form of an application or service), the level of the kernel of the
operating system, and the hardware level.

Figure 2. Levels of implementation of custom encryption system.

The basic advantage of implementing a custom cipher algorithm in user
space is the ability to use a wide range of programming languages, libraries
and architectures. Further, such a system is easy to install because there is no
need to modify the operating system or hardware. On the other hand,
implementations of a custom cipher algorithm in user space are limited.
Routing the communication data of other applications and services through
the encryption system is generally far from simple and often impossible. In
addition, realization at this level will have a negative impact on the
performance of the encryption system.
Implementation of a custom cipher algorithm at the hardware level may
represent a good solution if there is a requirement for high performance and
the use of resources that are located outside the computer system. On the other
hand, such a realization is usually very expensive. In addition, any changes to
the system are much more difficult and expensive to implement than software
changes. Finally, the user must have the ability to independently design and
produce the desired hardware, or must be able to thoroughly oversee this
process if it is run by somebody else. At the hardware level, the maximum
speed of encryption and decryption can be achieved via a dedicated crypto
processor. Dedicated processor design techniques are well known. However,
only a few countries in the world possess the technology for the realization of
these processors, and the rest of the world does not trust their realizations.
Our experience indicates that the kernel of the operating system represents
the optimal place for the implementation of a custom cipher algorithm (the
full list of references is cited in the introductory part). Improved performance
can be achieved due to the reduced number of system calls. In addition, the
cryptosystem in the core of the operating system can easily be paired with the
protocols of the transport and network layers as well as the data link layer.
However, such an approach requires access to the source code of the kernel,
which is available for the Linux operating system.

COMPUTATIONAL AND ABSOLUTE SECURITY


The protection level is one of the first issues that should be determined at
the very beginning of the project phase. There are two possible solutions:
computationally secure (processing) encryption systems and absolutely secure
encryption systems.
Computationally secure (processing) ciphering algorithms are designed to use
keys of limited length (usually 128 to 4,096 bits). Due to the limited key
length, breaking the cipher text is possible by trying all possible combinations
of bits (keys) to find the corresponding one (a total search of the space of all
possible keys). The estimated number of combinations that must be tried to
find the used key is 2^(n-1), i.e., half of the 2^n possible keys, where n
represents the length of the used key. Breaking a cipher text encrypted with
keys longer than 256 bits is practically irrational due to the large amount of
CPU time needed, so these encryption systems are considered efficient. For
breaking cipher texts encrypted using commercial ciphering algorithms (DES,
3DES, AES, etc.) there is no confidence that the complete cipher key search is
necessary. It is suspected that there are shortened ways for that, and there is
also reasonable suspicion that these shortened procedures are known to those
who designed the listed algorithms.
via the Internet. A particular problem is the use of asymmetric encryption
system (RSA) due to fact that scientifically (mathematically) it has not been
proven that there are no shortened procedures for breaking algorithm.
Once designed, cipher systems become obsolete quickly. The development of
processors, computers and networks facilitates faster searching of keys, and
such systems have to be re-examined and changed after a few years of use.
Computationally secure encryption systems are also attacked by exploiting
weaknesses of computer protocols, built-in backdoors, or human mistakes.
Unlike computationally secure ciphering algorithms, content encrypted
using absolutely secure encryption systems cannot be broken, regardless of
the amount of processing power engaged for this purpose. More precisely,
this is achieved by using a unique cipher key (the so-called one-time pad).
This further implies that the key length must be equal to or greater than the
length of the message to be encrypted. Moreover, a new key has to be used for
each message. Due to the everyday improvement of storage media (more
capacity at a reduced cost), it is becoming more realistic that absolutely secure
encryption systems can be successfully applied for the protection of real-time
communication. For instance, one-time pad systems can be used for the
protection of standard voice-coder systems instead of the ordinarily used
computationally secure ciphering algorithms.
In both cases, regardless of which encryption system is used, a secure
communication channel has to be established for exchanging the keys between
the communicating sides. The development of special protocols which enable
the exchange of secret keys through an unsafe channel [11-13] can be used as
an alternative way for this purpose. However, according to published results,
the performance of such protocols is still unsatisfactory, as is the confidence
in them. These are the main reasons why they are not used in absolutely secure
encryption systems.
A simple and practically usable method for absolutely secure encryption is
bitwise processing of the message with the XOR logical function (exclusive
disjunction), in which a bit array that came out of a random number generator
is used as the second operand [14] (so-called sequential or bit-for-bit
encryption). It is well known that such systems are the most resistant to errors
in the channel (one false bit in the cipher text affects only one incorrectly
decoded bit in the open message), which is important for mobile
communications. A precondition for such a system is that both sides have to
have the same secret random bit array, of the same or greater length than the
length of the messages to be exchanged. The same simple function (XOR) is
used in both the encryption and the decryption of a message. Random bit
arrays can be segmented and their segments indexed so that the same segment
will never be used twice.
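A minimal sketch of this bit-for-bit scheme (an illustration under the
assumptions above, not the authors' implementation): the same XOR operation
both encrypts and decrypts, provided the key segment is as long as the message
and is never reused.

import java.security.SecureRandom;
import java.util.Arrays;

public class OneTimePadXor {
    // XOR of message and key; encryption and decryption are the same operation.
    static byte[] xor(byte[] data, byte[] keySegment) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ keySegment[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] message = "laka telesna povreda".getBytes();

        // Stand-in for a segment of the pre-shared random bit array;
        // in the described system it would come from the hardware noise generator.
        byte[] keySegment = new byte[message.length];
        new SecureRandom().nextBytes(keySegment);

        byte[] cipherText = xor(message, keySegment);
        byte[] decrypted = xor(cipherText, keySegment);

        System.out.println(Arrays.equals(message, decrypted)); // true
        System.out.println(new String(decrypted));             // "laka telesna povreda"
    }
}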

Figure 3. Implementation model for absolutely secure communication.

The weakest part of the proposed solution is the necessity of a secret key
exchange channel. The second one is the length of the secret key - it should be
equal to or greater than the total length of all messages to be exchanged within
the planned period of use (e.g., one communication session). In the case of an
intensive exchange of voice or video messages, the length of this series can be
measured in tens or even hundreds of gigabytes for a planned use period of
one month between only two communicating points. However, given the
level of protection which is thus obtained, the problems that are avoided
(connections, session key exchange, etc.) and the low CPU requirements, the
proposed solution should be taken into consideration for practical applications.

LINUX AND ANDROID


The kernel of the Linux operating system is used as the basis for Android
devices. In this way, numerous functions of the Linux operating system can be
used in Android devices. In addition, since the code of the Linux kernel is
publicly available, this core part of the operating system can be controlled,
modified and extended in order to improve security features. Of course, in the
development of security systems the practice is that all components that are
not required are removed, to simplify the system and to reduce the potential
risk of the installation of back-doors.

Figure 4. Android architecture.

Support for the IPsec protection system already exists in the Linux kernel, in
the form of the AH and ESP protocols. The cipher algorithms used in these
protocols are also implemented in the kernel, and they can be accessed via the
standardized Cryptographic API. This means that a user-defined cipher
algorithm can be implemented as a kernel module and used for communication
protection via the IPsec protection system.

Figure 5. Model of applying user defined cipher algorithm by using Cryptographic API
of Linux kernel.
Another advantage of implementing a user-defined cipher algorithm as
a kernel module is that its exploitation is not limited to communication
protection. It can be used for many other purposes. For instance, it can be used
for the protection of the file system (device theft scenario). Additionally, the
algorithm can easily be relocated to external hardware [15] (a case in which
the cipher algorithm acts as a device driver).

Trusting the Compiler, Hardware and Firmware

After checking the kernel source code and the necessary Android components,
and modifying them by incorporating the custom cipher algorithm, the
following steps are compilation and installation on the devices to be protected.
The first potential problem occurs at the first step, i.e., compiling the source
code. It has to be done on an absolutely clean computer system (a system that
does not hold undocumented functionality), using a proven compiler.
Otherwise, back-door functionality can be inserted at this stage, undoing all
efforts to develop a secure system.
Hardware devices procured on the market are also unreliable. Attaching a
“back door” at the hardware level has long been a subject of suspicion among
researchers in this field, and recent discoveries further confirm such doubts
[16]. Accordingly, the hardware of the device must also be thoroughly checked
before use, or independently developed. The same method should be followed
for all the firmware incorporated in the device.

RANDOM ARRAY GENERATOR


A specific hardware device was constructed in order to obtain appropriate
random arrays for the XOR function. This is a noise generator whose signal is
analyzed to examine whether it can be used for encryption purposes. In
generating true random numbers (TRNG) it is very important that there is a
source which continuously emits random information. To get a quick and
efficient true random array, we constructed a generator that emits noise from a
bipolar transistor of type NPN BC547A. The generator was directly connected
to the sound card of a computer. This approach allowed us to convert the
analog noise emitted from the generator into a digital signal and then examine
the generated noise.
Figure 6. The architecture of the proposed solution.

It is important to define the two types of random number generators and their
characteristics. Random numbers are divided into two basic classes, called
deterministic and non-deterministic. Deterministic random numbers are
produced by using a specific algorithm which performs a sequence of
operations initialized by an input parameter ordinarily called a seed.
Non-deterministic generators produce outputs that are unpredictable and
depend on a physical origin that is beyond human control. The following is a
description of the constructed noise generator, operating in the active mode.
The electrical circuit of the generator is based on paired symmetrical
BC547A transistor circuits [17] (Figure 6).
This electrical circuit with a parallel connection represents a simple solution
in which the transistors, as active semiconductor devices, generate random
noise, as they are not absolutely identical even though they are of the same
type (BC547A), series and producer.
CONDUCTED ANALYSIS AND EXPERIMENTS


The Shannon theorem represents the basis for measuring the entropy of the
generated random array. Different tools were used for this purpose: an audio
card for A/D signal conversion, the Sound Forge suite for digital sound
processing, MATLAB for data processing, and finally Java-implemented tools
for performing the statistical NIST tests [18]: the Maurer test, serial test, poker
test, frequency and gap tests, and tests of entropy with overlapping and
non-overlapping. The noise generated from the generator, sampled at sixteen
bits (16-bit) at a frequency of 44,100 Hz in stereo mode, is represented in the
next illustration (Figure 7).
The presented signal was generated for a period of one minute. The generated
noise was recorded as a wav bit stream. This stream was captured and loaded
into MATLAB for further manipulation. It should be pointed out that the bit
stream still represents the output levels of the real signal captured from the
BC547A; its double values are just encoded by the A/D conversion. In the next
stage, ‘binary array with redundancy’ acquisition was performed, resulting in
a wavbinary 10848600 x 16 matrix (Figure 8). This means that there are
10848600 16-bit samples generated in one minute.

Figure 7. Recorded stereo analog noise.


Figure 8. Binary array with redundancy.

In order to improve the entropy of the given digital signal, the wavbinary
matrix is converted into a hexadecimal string, which is considered as the
signal of interest. This conversion reduces the wavbinary matrix size 32 times.
As a result, the signal at this stage is presented by a 2712150 x 2 matrix. In the
next step, the hexadecimal characters are converted into double-precision
floating-point numbers stored in a matrix named wavedata of the same size
(2712150 x 2). Finally, the real numbers are converted into a binary sequence,
with uniform distribution and without redundancy, ready to be used for
encryption purposes (Figure 9).

Figure 9. Real numbers obtained by conversion of the hex string into double (left), the
resulting binary sequence (right).
Figure 10. 3D representation of the resulting binary sequence.

The randomness of such a signal is confirmed by a series of standard NIST
tests. It can also be verified by footprint analysis. Figure 10 shows the
resulting binary sequence in 3D space; it clearly has a uniform
distribution (a horizontal plane). The distribution of signal levels is
also uniform, which was verified by repeating the statistical correlation
measurements until a satisfactory confidence level was obtained (the
entropy calculated on the resulting binary sequence is 1). In conclusion,
the proposed solution provides a fully random (totally disordered) bit
sequence that is not predictable.
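For reference, the entropy value mentioned above can be computed directly from the bit proportions. The following minimal Java sketch evaluates the Shannon entropy of a binary sequence; the sample array in main() is an arbitrary illustration, not data from the experiment.

    public class BitEntropy {
        // Shannon entropy in bits per symbol: H = -p0*log2(p0) - p1*log2(p1).
        // H equals 1 only for a perfectly balanced (maximally disordered) sequence.
        public static double entropy(int[] bits) {
            double ones = 0;
            for (int b : bits) ones += b;
            double p1 = ones / bits.length;
            double p0 = 1.0 - p1;
            double h = 0.0;
            if (p0 > 0) h -= p0 * (Math.log(p0) / Math.log(2));
            if (p1 > 0) h -= p1 * (Math.log(p1) / Math.log(2));
            return h;
        }

        public static void main(String[] args) {
            System.out.println(entropy(new int[]{0, 1, 1, 0, 1, 0, 0, 1}));   // 1.0
        }
    }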
Further, a sequence of statistical tests is performed. Maurer's universal
statistical test [19] examines whether an array can be compressed without
loss of information. An array that can be significantly compressed is not
considered an array of random bits, because compression is more efficient
when the array shows periodic behaviour. The following table shows the
results of Maurer's statistical test, which is recommended by NIST. Six
seconds of noise were generated at 16 bits and a frequency of 44,100 Hz,
so there is a sufficient amount of information for this statistical test.
Below is a tabulation of the completed test, whose focus was whether the
sequence can be compressed without losing any information. On the basis of
the obtained P-value it can be assumed that the sequence came from a random
source; the observed sum was also acceptable, its value being higher than
387,840 (which makes the test applicable).
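To make the tabulated quantities concrete, the sketch below outlines Maurer's universal statistical test in Java for block length L = 6, reusing the expected value (5.2177052) and variance (2.954) quoted in Table 1. The parameter choices, the assumption of at least 387,840 input bits (the NIST recommendation for L = 6) and the erfc approximation are part of this illustration only, not the authors' implementation.

    public class MaurerUniversalTest {
        static final int L = 6;                      // block length
        static final double EXPECTED = 5.2177052;    // expected fn for L = 6 (Table 1)
        static final double VARIANCE = 2.954;        // variance for L = 6 (Table 1)

        // Assumes at least 387,840 input bits, the NIST minimum for L = 6.
        public static double pValue(int[] bits) {
            int Q = 10 * (1 << L);                   // initialization blocks
            int K = bits.length / L - Q;             // test blocks
            long[] lastSeen = new long[1 << L];      // last position of each L-bit block
            double sum = 0.0;
            for (int i = 1; i <= Q + K; i++) {
                int block = 0;
                for (int j = 0; j < L; j++) block = (block << 1) | bits[(i - 1) * L + j];
                if (i > Q) sum += Math.log((double) (i - lastSeen[block])) / Math.log(2);
                lastSeen[block] = i;
            }
            double fn = sum / K;                     // compare with 'Observed fn' in Table 1
            double c = 0.7 - 0.8 / L + (4 + 32.0 / L) * Math.pow(K, -3.0 / L) / 15.0;
            double sigma = c * Math.sqrt(VARIANCE / K);
            return erfc(Math.abs(fn - EXPECTED) / (Math.sqrt(2) * sigma));
        }

        // Abramowitz-Stegun 7.1.26 approximation of the complementary error function.
        static double erfc(double x) {
            double t = 1.0 / (1.0 + 0.3275911 * x);
            double poly = t * (0.254829592 + t * (-0.284496736
                    + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
            return poly * Math.exp(-x * x);
        }
    }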
The performed test requires the generation of 264600 non-redundant bits
(approximately 33 KB), which takes at least 6 seconds. During this period
the audio card practically produces twice as much - 529200 bytes (through
the stereo channels) - from the generated noise. For further analysis, the
generated random bit array was compared with appropriate arrays available
on the Internet; the web services offered at random.org were used for this
purpose.
At the analog physical level, the footprints of the noise signals are
different (Figures 11 and 12), which implies a difference in their
randomness. The noise signal produced by the constructed generator has a
more uniform distribution of bits than the one downloaded from the Web
(random.org). Further statistical analysis and comparison confirm these
observations.

Table 1. Maurer test of constructed generator

Maurer’s test
Observed sum 456584.6075373749
Observed fn 5.21453411988779
Expected fn 5.2177052
Variance 2.954
P-Value 0.9985278848454993
P-Value (Decimal) 0.99852788484549925840383

Figures 11 and 12. Generated noise (constructed generator - left, random.org - right).

Table 2. Entropy overlapping

Type of test    random.org generator result    Constructed generator result    random.org < constructed generator
Monobit         0.9999999970453606             0.9999999851224279              False
Bigram          0.9999938028778617             0.999998712659246               True
Trigram         0.9999776564462013             0.9999971314264103              True
4x4 Matrix      0.9999632491279191             0.9999918362253594              True

Tables 2 and 3 present comparative statistical tests of entropy with
overlapping and non-overlapping matching for two generated binary strings:
one obtained through the website random.org and one produced by the
constructed generator. The results of the entropy overlapping test show
that the sequence is statistically random and contains no recognizable
patterns. The random.org noise generator gives a better result only on the
Monobit test, where the expected 50% frequency is observed; the Monobit
test focuses on the proportion of zeroes and ones in a given sequence.
When the m-gram probabilities of random.org and of the constructed
generator are compared, the results of the constructed generator are
better. The matrix test checks linear dependence. The test was repeated
for entropy with non-overlapping (non-periodic) matching, as shown in
Table 3.
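As a hedged illustration of the Monobit (frequency) test mentioned above, the following Java sketch computes the NIST frequency statistic and its P-value; it uses the same Abramowitz-Stegun approximation of the complementary error function as the Maurer sketch and is not the authors' test code.

    public class MonobitTest {
        // NIST frequency (monobit) test: for a random sequence the proportion of
        // ones should be close to 1/2; large deviations give a small P-value.
        public static double pValue(int[] bits) {
            double s = 0;
            for (int b : bits) s += (b == 1) ? 1 : -1;        // map 0 -> -1, 1 -> +1
            double sObs = Math.abs(s) / Math.sqrt(bits.length);
            return erfc(sObs / Math.sqrt(2));
        }

        // Same Abramowitz-Stegun erfc approximation as in the Maurer sketch.
        static double erfc(double x) {
            double t = 1.0 / (1.0 + 0.3275911 * x);
            double poly = t * (0.254829592 + t * (-0.284496736
                    + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
            return poly * Math.exp(-x * x);
        }
    }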
In both the overlapping and the non-overlapping scenario, the randomness
obtained by the proposed solution with the BC547A generator is better than
that of the referenced website (random.org), except for the Monobit test
type. Even in that case, the difference between the calculated entropies
is minor.

Table 3. Entropy non-overlapping

Type of test    random.org generator result    Constructed generator result    random.org < constructed generator
Monobit         0.9999999970453606             0.9999999851224279              False
Bigram          0.9999850632203415             0.9999935149038766              True
Trigram         0.9999443542760401             0.9999798584606167              True
4x4 Matrix      0.9999632491279191             0.9999918362253594              True

CONCLUSION

Highly secure systems for communication protection cannot depend on
already built solutions, regardless of the fact that their producers have
published almost all details of the implementation. This chapter presents
a model for the development of a custom system for secure communication
based on custom encryption algorithms. The basic requirement is that all
system components which can affect security have to be developed, tested
and approved by the authors themselves. The Linux/Android platform is
proposed as a solution because it provides most of the necessary functions
and its complete source code is publicly available.
The cipher algorithms will first be incorporated at the telecommunication
level as part of the development of a secure network system. Such a
solution aggregates the particular implementations for each application
protocol. More precisely, the proposal relies on the IPsec security
extensions, since implementations below the network level would make the
use of the Internet impossible.
On a computer system, a custom cipher algorithm can be implemented in user
(application) space, in the core of the operating system, or at the
hardware level. Although implementation in user space is the simplest
approach, such a realization makes it difficult to protect the growing
number of application protocols, and lower performance is also to be
expected. Implementation at the hardware level would be appropriate with
regard to performance and security level, but such an approach is
difficult to implement and rigid with respect to modifications. Therefore,
the solution presented in the chapter proposes building the custom
algorithm in the form of Linux kernel modules.
This chapter also analyzes the choice between computationally secure
(processing-based) encryption systems and absolutely secure encryption
systems. With regard to the possibilities of modern Android-based devices,
and in particular the capacity of modern storage media, absolutely secure
encryption systems are partially preferred.
The final problem in the implementation of a secure communication device
is how to obtain a "clean" platform on which the source code of the cipher
algorithm, the core of the operating system and the other necessary
software can be compiled. In addition, the hardware platform on which such
software is deployed should be "clean" - in other words, it has to be
without "back doors". Only in this way does the user have full control
over the system.
For the solution proposed at the OS level, a specific hardware device is
designed in order to obtain the bit array used for XOR message encryption.
This device acts as a
truly random number generator, owing to the use of incidental processes in
a semiconductor circuit which produce electrical noise. Such a signal is
post-processed in several stages to obtain a format appropriate for
encryption and to improve its randomness. The latter property is proved by
standard statistical NIST tests, and the obtained entropy is at a
satisfactory level (almost one).
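As a minimal illustration of the XOR (one-time pad) encryption for which the generated bit array is intended, the following Java sketch encrypts and decrypts a short message; the sample message and pad bytes are arbitrary and, in the real system, the pad would come from the constructed noise generator.

    public class XorPad {
        // One-time-pad style XOR: ciphertext = message XOR pad. Applying the same
        // operation with the same pad restores the message; the pad must be truly
        // random, at least as long as the message, and never reused.
        public static byte[] xor(byte[] data, byte[] pad) {
            byte[] out = new byte[data.length];
            for (int i = 0; i < data.length; i++) {
                out[i] = (byte) (data[i] ^ pad[i]);
            }
            return out;
        }

        public static void main(String[] args) {
            byte[] message = "secret".getBytes();
            // Arbitrary example pad; in practice it would be generator output.
            byte[] pad = {12, -7, 33, 101, -58, 90};
            byte[] cipher = xor(message, pad);
            System.out.println(new String(xor(cipher, pad)));   // prints "secret"
        }
    }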
In future work the focus will be on implementation, case studies and
system evaluation through different use case scenarios. Performance and
energy consumption measurements will be analyzed as well, since encryption
processes need extra processing power and solutions have to be optimized
according to concrete needs. The sustainability of the solution requires
the ability of the system to adapt to each particular encryption scenario.

ACKNOWLEDGMENT
The authors of this chapter were participants in the scientific projects
TR32054, III44006, III44007 and ON174008, funded by the Ministry of
Education, Science and Technological Development of the Republic of
Serbia.

REFERENCES
[1] Jevremović, Aleksandar et al. 2006. “IP Security under Linux OS”, Proceedings of the 50th ETRAN Conference, Belgrade, Serbia, pp. 114-117.
[2] Jevremović, Aleksandar et al. 2006. “IPsec – Analyzing Influence of Cryptographic Algorithm on LAN Networks Traffic”, 14th Telecommunications Forum TELFOR, IEEE, Belgrade, Serbia.
[3] Jevremović, Aleksandar et al. 2008. “Custom Cipher Algorithm for AJAX Requests Protection in Web Applications”, Proceedings of the 52nd ETRAN Conference, Belgrade, Serbia.
[4] Jevremović, Aleksandar et al. 2009. “Zaštita bežičnih komunikacija korišćenjem sopstvenog šifarskog algoritma” [Protection of Wireless Communications Using a Custom Cipher Algorithm], 17th Telecommunications Forum TELFOR, Belgrade, Serbia.
[5] Jevremović, Aleksandar. 2011. “Integracija sopstvenih kriptoloških sistema u standardnu računarsko-telekomunikacionu infrastrukturu” [Integration of Custom Cryptographic Systems into the Standard Computer and Telecommunication Infrastructure], Univerzitet Singidunum, Belgrade, Serbia, pp. 1-122.
[6] Jevremović, Aleksandar et al. 2009. “Modifikacija IKEv2 protokola u cilju izbora radnog tajnog ključa simetričnih šifarskih sistema” [Modification of the IKEv2 Protocol for Selecting the Working Secret Key of Symmetric Cipher Systems], Zbornik radova 53. konferencije za elektroniku, telekomunikacije, računarstvo, automatiku i nuklearnu tehniku - ETRAN, Belgrade, Serbia.
[7] Snyder, Bill. 2014. “Snowden: The NSA planted backdoors in Cisco products”, InfoWorld.
[8] Appelbaum, Jacob. 2013. “To Protect and Infect - The Militarization of the Internet”, 30C3: 30th Chaos Communication Congress, Hamburg, Germany.
[9] Jevremović, Aleksandar et al. 2008. “Analysis and Implementation of Custom Cipher Algorithm for IPsec under Linux OS”, International Journal of Computer Science and Network Security, IJCSNS, Vol. 8, No. 7, pp. 80-86.
[10] Milosavljević, Milan et al. 2013. “Protokol za generisanje i razmenu apsolutno tajnih kriptoloških ključeva putem javnih kanala u savremenim računarskim mrežama” [A Protocol for Generating and Exchanging Absolutely Secret Cryptographic Keys over Public Channels in Modern Computer Networks], Zbornik radova 57. konferencije za elektroniku, telekomunikacije, računarstvo, automatiku i nuklearnu tehniku - ETRAN, Zlatibor, Serbia.
[11] Tatović, Milomir et al. 2014. “One method for generating uniform random numbers via civil air traffic”, Sinteza 2014, Belgrade, Serbia, pp. 606-609.
[12] Saarinen, Markku-Juhani. 2004. “Linux for the Information Smuggler”, Technical Aspects of Network Centric Warfare, Vol. 17, Finnish National Defence College, pp. 228-239.
[13] Jevremović, Aleksandar et al. 2008. “Model for Implementation of Custom Cipher and Steganographic Algorithms in Case of Web Image Gallery”, 16th Telecommunications Forum TELFOR, IEEE, Belgrade, Serbia.
[14] Jevremović, Aleksandar et al. 2008. “Implementation of Proprietary Cipher Algorithm on Linux Operating System”, Singidunum revija, Univerzitet Singidunum, Vol. 5, No. 1, pp. 92-102.
[15] Šarac, Marko et al. 2012. “Analiza sigurnosti SSL saobraćaja u bežičnim računarskim mrežama” [Analysis of the Security of SSL Traffic in Wireless Computer Networks], XI međunarodni naučno-stručni simpozijum INFOTEH-JAHORINA 2012, Jahorina, Republika Srpska.
[16] Diffie, Whitfield and Hellman, Martin. 1976. “New directions in cryptography”, IEEE Transactions on Information Theory, pp. 644-654.
[17] Eather, David. 2004. “Random Number Generation with a Simple Transistor Junction Noise Source”, open access Internet resource at http://imotp.sourceforge.net/noise.pdf.
[18] Rukhin, Andrew et al. 2010. “A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications”, NIST Special Publication 800-22, Rev. 1a, Computer Security, U.S. Department of Commerce, 131 pages.
[19] Coron, Jean-Sébastien and Naccache, David. 2002. “An Accurate Evaluation of Maurer’s Universal Test”, Lecture Notes in Computer Science, Vol. 1556, pp. 57-71.
AUTHOR CONTACT INFORMATION

Kristijan Kuk, PhD


Assistant Professor
Academy of Criminalistic and Police Studies,
Department of Informatics and Computer Sciences,
196 Cara Dusana street, 11080 Belgrade, Serbia
Email: kristijan.kuk@kpa.edu.rs, kukkristijan@gmail.com

Dragan Ranđelović, PhD


Full Professor and Head of the Department
Academy of Criminalistic and Police Studies,
Department of Informatics and Computer Sciences,
196 Cara Dusana street, 11080 Belgrade, Serbia
Email: dragan.randjelovic@kpa.edu.rs
INDEX

aspiration, 100
A assault, 66, 154, 156, 160, 162
assessment, 32
abstraction, 156
association rule, vii, 5, 7, 9, 12, 13, 14
abuse, 55, 58, 68, 71, 86, 124, 140, 148, 154
association rules algorithms, vii, 7
access, 2, 26, 40, 42, 47, 57, 58, 59, 61, 62,
asymmetry, 111
63, 65, 69, 78, 123, 125, 126, 127, 130,
atoms, 156
137, 142, 143, 144, 168, 171, 179, 192
attachment, 76, 82
accessibility, 20
attacker, 65
accountability, 42
attitudes, viii, 121, 122, 123, 124, 127, 128,
accounting, 112, 117
129, 145, 146, 147
adaptation, 177, 191
Attorney General, 82
administrators, 60
audit, 112, 117
adolescents, 123
authentication, 61
adults, 126
authorities, 60, 78, 82, 112
advertisements, 61, 64
awareness, 25, 48, 74, 77, 79, 125, 127, 147
adware, 54, 62
age, 12, 24, 41, 123, 125, 128, 131, 134,
142, 145, 146 B
agencies, ix, 1, 4, 12, 15, 20, 21, 22, 24, 25,
29, 37, 41, 42, 48, 153 back-doors, 182
algorithm, viii, ix, 7, 9, 10, 11, 12, 13, 14, bandwidth, 37
20, 22, 39, 65, 153, 175, 176, 177, 178, banking, 4
179, 180, 182, 183, 184, 190 banks, 28, 68
alternative hypothesis, 100 base, 22, 38, 39, 45, 81, 87, 88, 89, 91, 92,
android platform, ix, 175, 176, 190 94, 97, 98, 99, 102, 103, 104, 105, 106,
annotation, 156 112, 152, 165
armed forces, 54 behaviors, 147
arrest, 50, 83 behaviour and attitudes towards privacy,
artificial intelligence, 6, 55 121

Belarus, ix commercial, 31, 40, 64, 121, 122, 123, 164,


benefits, vii, 29, 47, 49, 128, 129 179
Benfords’ distributions, 86 communication, ix, 47, 55, 121, 122, 175,
black market, 82 176, 177, 180, 181, 182, 183, 190
blogger, 11, 13 communication technologies, 47
blogs, 2, 12, 13 community, 14, 20, 22, 23, 25, 40, 41, 146
bottom-up, 11 compatibility, 162, 178
brain, 47 complexity, 15, 20
broadband, 74 compliance, 21, 133
browser, 58, 63, 67, 78 comprehension, 127
browsing, 61 compression, 187
Bulgaria, 73, 75 computation, 160
Bureau of Justice Assistance, 48 computer, viii, 2, 3, 5, 6, 16, 21, 24, 42, 46,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 71, 72, 73, 75, 76, 77, 78, 79,
C 82, 111, 112, 113, 118, 152, 176, 178,
179, 180, 183, 190
cables, 54
computer fraud, 71
calculus, 126, 147
computer search, 4
campaigns, 67
computer systems, 55, 111, 113
case study, ix, 149, 191
computer use, 59, 63
categorization, 5, 44, 157
computerized data search and comparison, 1
category d, 157
computing, 46, 112, 160
causality, 28
conceptual model, 171
causation, 157
conference, 173
challenges, vii, 54, 77, 79, 80
confidentiality, 41, 55, 59, 66
child pornography, 56, 71
conformity, 116
children, 60
connectivity, 75
Chile, 17
consent, 60
China, ix, 118
consumer goods, 74
cipher systems, 180
consumers, 57, 64, 68, 124, 125, 129, 147,
cities, 42, 48, 116
148, 149
citizens, 35, 40, 47, 54, 77, 121, 123, 151,
containers, 161
153
contingency, 68
civil rights, 4
contour, 37
clarity, 5
contradiction, 127, 145
classes, 6, 7, 44, 64, 102, 111, 165, 184
controversial, 41
classification, ix, 6, 7, 45, 56, 153
convention, 55
cleaning, 64
cooperation, 78, 129
cluster analysis, 28
correlation, 28, 42, 48, 70, 71, 72, 82, 134,
cluster techniques, 5
136, 140, 143, 144, 145, 187
clustering, 7, 28, 37, 44, 152, 156, 157, 173
cosmetic, 116
clusters, 7, 34, 37, 44, 156
cost, 3, 25, 177, 180
CNN, 67
Council of Europe, 55, 80
coding, 117, 176
covering, 2, 37, 113
commerce, 124

CPU, 179, 181 database, 5, 6, 10, 14


credit card fraud, 56, 71 DDoS, 58, 59, 66
crime analysis, 5, 6, 20, 21, 22, 23, 24, 25, decision makers, 153
26, 29, 34, 39, 42, 47, 48, 50 decomposition, 162
crime investigation, 1, 2, 4, 12, 15, 81 deficiencies, 112
crime mapping, 20, 21, 25, 26, 29, 37, 40, democracy, 81
41, 47, 48, 49, 50 demographic characteristics, 132
crimes, 5, 15, 23, 25, 36, 39, 41, 43, 44, 46, demographic data, 123, 131
47, 51, 154, 155 demography, 131
criminal activity, 40, 46, 171 denial, 54, 55
criminal acts, 15 Department of Justice, 51
criminal behavior, 29, 68 dependent variable, 69
criminal investigations, 4, 15 deployments, 26
Criminal Offenses, 154 depth, 152
criminals, 22, 41, 53, 54, 57, 60, 62, 67, 68, destruction, 59, 154
78, 79 detection, vii, viii, 5, 7, 13, 86, 112, 118,
critical infrastructure, 26, 55 120, 152, 158, 171
critical value, 101 developing countries, 4
Croatia, ix, 73, 75 deviation, 70, 101, 102, 142
cryptography, 177, 192 dichotomy, 127, 144
currency, 28 diffusion, 29
current prices, 108, 109 disclosure, 56, 127, 147, 149
cyber-attack, vii, 57, 78 discomfort, 58
cyber-criminals, 53, 54, 57 discrete random variable, 88
cybersecurity, 74 disorder, 22, 86, 100
cyberspace, vii, viii, 53, 54, 55, 69, 77, 78, dispersion, 37
79, 81, 152, 171 displacement, 29, 49, 51
Czech Republic, 118 distribution, viii, 20, 25, 26, 28, 39, 47, 48,
50, 71, 88, 89, 108, 110, 113, 115, 116,
135, 137, 138, 144, 176, 186, 187, 188
D DNA, 149
DOI, 115, 116, 117, 118, 119, 120
danger, viii, 53, 155
drugs, 5, 56, 69
data analysis, 2, 9, 69, 71
dynamic systems, 111
data collection, 22, 125
data mining, ix, 1, 2, 4, 5, 6, 7, 8, 9, 16, 44,
45, 48, 172 E
data mining algorithms, 7
data mining techniques, 1, 4, 5, 6, 15, 45, 48 earnings, 76, 116
data processing, 47, 185 e-banking, 75, 76
data search and comparison, 3, 6 eBay, 62
data set, 6, 7, 10, 71, 102, 113 e-commerce, 146
data structure, 166 economic development, 35
data suitable for association rules, 9 education, 12, 14, 24, 77, 123, 132
data surveillance, 2, 3 e-Government, 151, 152, 153, 172
data transfer, 53 election, 67, 120

electronic surveillance, 3 financial, 55, 57, 58, 62, 76, 86, 111, 112,
e-mail, 6, 57, 58, 63, 65, 76, 77, 126, 130, 113
145 financial institutions, 57, 62
emerging markets, 116 financial reports, 112
employees, 60, 66, 77, 79 fingerprints, 5
employment, 128, 133, 140, 144, 145 flaws, 61
employment status, 128, 133, 140, 144, 145 flooding, 58
encryption, ix, 65, 177, 178, 179, 180, 186, food, 154
190 force, 154
endangered, 154 forecasting, 29, 120
enemies, 78 forensics, 1, 86, 112, 113, 117, 118, 119
energy, 191 formation, 79, 86
energy consumption, 191 formula, 160, 166
enforcement, 1, 29, 42 fouling, 154
engineering, 54, 57, 66, 78 foundations, 2
enlargement, 105 fraud, viii, 58, 64, 66, 71, 76, 86, 112, 113
entropy, 86, 100, 113, 185, 186, 187, 189, frequency distribution, 87
191 friendship, 127
environment, 16, 32, 37, 69, 78, 112, 125, funds, 55, 68, 177
126
equipment, 61, 78
espionage, 56, 78 G
ethnicity, 41
gambling, 28
European Union, 121
gangs, 5
evidence, 15, 56, 113, 116, 124
GAO, 83
evolution, 55, 56, 78
Geographic Information System, 21, 39
execution, 40, 56
geography, 45
expertise, 20
geology, 112
exploitation, 173, 183
Germany, 53, 120, 192
exposure, 154, 155
GIS, viii, 20, 21, 26, 27, 29, 30, 31, 32, 34,
extraction, 5, 151, 158, 161, 162, 163, 169
37, 41, 42, 47, 49
extracts, 159, 162
globalization, 57
Google, 128, 161
F governance, 80
governments, 16, 54, 57
Facebook, 58, 62, 121, 122, 123, 126, 127, GPS, 34, 62
130, 134, 144, 146, 147, 149 grants, 51
families, 44 graph, 31, 156, 162, 169, 170, 171
FBI, 49, 68, 82 gravity, 30
fear, 80, 124 Greece, 73, 75, 118
Federal Bureau of Investigation (FBI), 66, Gross Domestic Product, 114
82 grouping, 7, 156
fiber, 54 growth, 1, 3, 53, 121, 122
fights, 155 guilt, 154, 167
filters, 165

injuries, 154
H institutions, 2, 54, 66, 77, 79, 155
integration, 4, 34
hacking, 6, 54, 57, 66, 68, 80
integrity, 55, 59, 154
hair, 154
intellectual property, 56, 60, 66
harassment, 122, 124, 126
intellectual property rights, 66
health, 43, 154, 155
intelligence, 15, 23, 24, 25, 42, 62, 78, 167,
height, 39
173
high school, 131, 132, 133
International Monetary Fund, 86, 108
hiring, 15
intervention, 25, 34, 43
history, 5, 23, 50, 78
investment, 112, 175, 176
horses, 59, 61, 77
IP address, 63, 69, 78, 79
hotspots, 23, 25, 28, 29, 34, 36, 37, 42, 44,
Iran, 12
45, 47, 49, 76
issues, ix, 5, 14, 15, 22, 24, 41, 50, 68, 130,
human, 2, 15, 25, 29, 47, 180, 184
136, 175, 176, 178, 179
human behavior, 25
Italy, 114
hypothesis, 87, 98, 99, 100

J
I
Japan, 68
ICC, 147
Java, 58, 77, 163, 185
ICS, 81
Jordan, ix, 119
identification, 5, 6, 25, 28, 29, 36, 41, 47,
judiciary, 2
57, 111, 118
justification, 32, 79
identity, viii, 12, 24, 41, 55, 58, 63, 76, 97,
juveniles, 60
123
image, 5, 65, 66, 112, 118, 126, 144, 160
improvements, 29 K
incarceration, 26
income, 24, 116 keylogging, 60, 66
income tax, 116
independent variable, 69
indexing, ix, 151, 153, 155, 156, 159, 161, L
162, 163, 164, 165, 166, 168, 171, 172,
174 labeling, 41
individuals, 1, 2, 24, 44, 54, 69, 112, 124, language processing, 152, 153
125, 126, 127, 129, 145 languages, vii, 152, 155, 156, 157, 164
industry, 60, 123 laptop, 67
infection, 73 law enforcement, viii, ix, 12, 20, 21, 27, 29,
Information and Communication 37, 41, 42, 43, 47, 48, 153, 158
Technologies, 81 laws, viii, 58, 77, 86, 100, 124, 129, 137,
information retrieval, 174 153
information sharing, 127, 130, 146 laws and regulations, 124
information technology, 2, 19, 124, 172 lawyers, 3
infrastructure, 34, 77, 78, 177 lead, 29, 42, 47, 53, 67, 123
learning, 7, 8, 9, 14, 24, 46

legislation, 4 medicine, 112, 155


level of education, 128 memory, 79
lexical knowledge, 156 messages, 6, 43, 57, 62, 64, 65, 66, 67, 77,
linear, viii, 7, 45, 54, 69, 71, 73, 79, 80, 82, 113, 181
115, 116, 189 metamorphosis, 81
linear dependence, 189 methodology, 37, 46, 124
linear systems, 115 Microsoft, 151, 153, 164
Linux operating system, 179, 181 Microsoft Word, 151, 153, 164
local community, 29, 50 military, 67
local government, 54, 77 Ministry of Education, 191
longitudinal study, 51 misunderstanding, 45
Lucene, ix, 151, 153, 159, 163, 164, 165, misuse, 55, 56, 128, 145
166, 167, 168, 169, 171, 172, 173, 174 mobile communication, 181
mobile device, ix, 75, 122, 147, 175, 176
mobile telephony, 176
M modelling, 161, 174
models, 4, 7, 29, 30, 31, 39, 50, 160, 161
Macedonia, ix, 53, 54, 68, 69, 71, 72, 73,
modern society, 2, 20
74, 79, 83
modernization, 21
machine learning, 7, 8, 9, 12, 16, 48, 153
modifications, 183, 190
machinery, 55
modules, ix, 190
magnitude, 111
modus operandi, 23
majority, 125, 131, 132, 133, 134, 135, 137,
mortality, 87
138, 140, 144, 145, 146
mortality rate, 87
malicious software, viii, 54, 55, 57, 58, 59,
motivation, 126
61, 64, 66, 67, 78, 79
multidimensional, 126
malware, 54, 58, 61, 63, 64, 66, 67, 75
multimedia, 66
management, viii, 2, 20, 21, 25, 26, 29, 42,
multiplication, 86, 113
47, 116, 148
music, viii, 12
manipulation, 176, 185
mapping, 20, 21, 25, 26, 29, 37, 40, 41, 42,
43, 44, 45, 47, 48, 49, 50 N
marital status, 128
marketing, 7, 60, 64, 123, 124, 149 national security, 54
Mars, 174 national strategy, 74
mass, 4, 12 natural evolution, 42
mass media, 12 natural language processing (NLP), 152,
materials, 5 153, 172
mathematical methods, vii neglect, 154
mathematics, 46, 111 networking, 122, 146
matrix, 8, 162, 174, 185, 186 neural network, 7, 44, 112, 117, 173
measurements, 5, 111, 113, 159, 187 New Zealand, 16
media, viii, 2, 5, 11, 12, 13, 14, 23, 42, 68, nodes, 31
148 Norway, 68
median, 31, 111 NSA, 192
medical, 124, 154

predicate, 157
O prevention, vii, 22, 23, 25, 26, 29, 47, 48,
49, 50, 51, 54, 79, 158
obstacles, 34
principles, 175, 176
offenders, 15, 20, 23, 39, 44, 57
prisoners, 34
Office of Justice Programs, 50
privacy concerns, ix, 122, 123, 124, 125,
online social networks, 126, 128, 147, 148
127, 128, 146, 148, 149
operating system, 58, 59, 61, 64, 65, 67,
private information, 137
178, 179, 181, 190
private sector, 54, 79
operations, 2, 22, 39, 59, 165, 184
probability, 39, 87, 88, 89, 91, 92, 94, 95,
opportunities, 29, 34
96, 99, 101, 108, 109, 113, 188
optimal performance, 176
probability distribution, 87, 88, 89, 91, 92,
optimization, 20, 21, 29, 30, 50
94, 95, 96, 101
organ, 6, 155
problem solving, 21, 22, 26
OSN users, viii, 121, 123, 127, 128, 129,
procurement, 68
131, 137, 144, 145
producers, 190
overlay, 26
professionals, 37
profit, 68
P programming, 30, 61, 178
programming languages, 178
pairing, 3, 6 project, 163, 171, 179
parallel, 155, 184 propaganda, 55
parents, 130, 137, 142, 143, 144, 145 propagation, 59
participants, 127, 129, 131, 191 protection, 2, 4, 41, 57, 60, 63, 66, 68, 75,
password, 58, 65, 75, 76, 79 76, 79, 81, 126, 129, 144, 176, 177, 178,
peer review, ix 179, 180, 181, 182, 183, 190
pensioners, 131, 138 psychology, 112
perpetrators, 44, 55, 56 public administration, 1, 3, 4
personal computers, 69 public opinion, vii, 67
personal surveillance, 3, 4 public safety, 45
physics, 112 Puerto Rico, 120
piracy, 56 P-value, 72
plants, 111
platform, ix, 128, 175, 176, 190
Q
playing, 144
police, vii, viii, 2, 5, 15, 19, 20, 21, 22, 23,
quality of life, 22, 35
24, 25, 26, 28, 29, 31, 32, 34, 40, 42, 43,
quantitative technique, 22, 43
46, 47, 48, 49, 50, 51, 68, 69, 82, 85, 172
query, 37, 40, 159, 160, 161, 162, 165, 166,
policy, 26, 50, 74, 77
167, 169, 171
political crisis, 79
questionnaire, 128, 142
political party, 14
quizzes, 134, 144
politics, 14
population, 74, 111, 121, 123
population growth, 111
pop-up ads, 62, 77

R S

race, 24 SaaS, 46
radiation, 119 safety, 40, 67, 74, 79, 137, 176, 178, 179,
radio, 34 180, 181, 190
radius, 37 scatter, 70, 71
ramp, 34 scatter plot, 70, 71
random numbers, 184, 192 school, 131, 132, 133
rape, 41 science, 2, 22, 82, 112
rationality, 79 scope, 15, 55, 156
reading, vii, 49 search space, 179
real numbers, 186 search terms, 172
real time, 25, 34, 48, 67 secret key, 180, 181
reality, 1, 2, 148 secure communication, ix, 175, 176, 190
recall, 8 security, vii, viii, 15, 20, 21, 53, 54, 57, 58,
recidivism, 26 62, 68, 74, 75, 76, 77, 78, 79, 80, 81, 82,
recognition, 7, 47 83, 111, 112, 148, 152, 171, 175, 176,
recovery, 77 181, 190
recurrence, 111 security practices, vii
redundancy, 185, 186 security threats, 81
regression, viii, 28, 45, 54, 69, 70, 71, 72, seed, 184
73, 79, 80, 82 self-esteem, 149
regression equation, 71 semantics, 162
regression line, 70, 71, 72 semiconductor, 184, 191
regression method, 45 senses, 156, 157
regression model, viii, 79 sequence pattern, 5
regulations, 4 Serbia, ix, 1, 19, 50, 51, 53, 81, 85, 121,
relevance, 4, 169, 173 122, 123, 124, 128, 137, 144, 145, 151,
reliability, 9, 119 152, 153, 160, 167, 171, 172, 175, 191,
religion, viii, 12 192
reputation, 126 servers, 54, 58, 63
requirement, 21, 34, 179, 181, 190 sex, 24
researchers, 3, 5, 6, 12, 36, 37, 96, 111, 183 sexual harassment, 123
resource allocation, 29, 118 sexual orientation, 126
resource management, 7 shape, 69
resource utilization, 29 shock, 64
resources, 20, 21, 22, 23, 24, 26, 29, 30, 32, showing, 28, 102, 105, 108, 113, 127
34, 40, 42, 43, 44, 45, 47, 50, 59, 60, 61, side effects, 29
79, 156, 161, 162, 179 signals, 188
response, 29, 40, 42, 142, 159 significance level, 134, 138
robberies, 28, 38 signs, vii, 61, 167, 168
root, 61, 98, 165, 167, 170 simple, viii, 27, 32, 40, 54, 69, 79, 82, 118,
routes, 26 162, 170, 178, 180, 184, 192
rule discovery, 5, 7 simple linear regression, viii, 54, 69, 79
rules, viii, 5, 7, 10, 11, 13, 67, 74, 79, 129 simulation, 86

simulations, 113 stratification, 176


sleep deprivation, 154 stress, viii, 53, 54
Slovakia, 85 structure, 11, 120, 131, 146, 156, 162, 166
SMS, 76 styles, 112
social activities, 2 success rate, 7
social capital, 122 supervision, 60
social events, 100, 112, 130, 133, 144 suppression, 4, 5
social media, viii, 2, 5, 12, 23, 148 surveillance, 2, 3, 4, 21
social network, ix, 58, 61, 62, 76, 121, 122, synchronization, 176
126, 127, 128, 130, 134, 147, 148, 149 Syria, 67
social phenomena, 12, 113
social relations, 126
social relationships, 126 T
social security, 111
tactics, 24, 78
social situations, 128
Taiwan, 116
society, 1, 2, 21, 53, 54, 77, 80, 87, 146,
target, vii, 20, 21, 26, 29, 56, 67, 78, 158
147, 149
teams, 62, 79
software, viii, 3, 15, 20, 24, 27, 31, 34, 36,
techniques, vii, ix, 1, 3, 4, 5, 6, 15, 16, 21,
39, 40, 42, 43, 45, 46, 48, 54, 56, 57, 58,
22, 23, 24, 30, 36, 37, 39, 42, 43, 44, 45,
59, 61, 64, 65, 66, 67, 68, 70, 75, 76, 78,
46, 48, 78, 118, 146, 152, 153, 158, 159,
79, 102, 108, 112, 176, 179, 190
166, 171, 179
solution, ix, 26, 30, 42, 100, 151, 153, 176,
technology, viii, 3, 4, 6, 16, 20, 21, 22, 26,
179, 181, 184, 187, 189, 190, 191
27, 29, 30, 34, 37, 40, 42, 47, 49, 54, 80,
solution space, 30
125, 146, 148, 158, 172, 179
South Africa, 147
telecommunications, 176
spam, 6, 56, 75, 76
telephone, 3, 58
speech, 7
territory, 34
spending, 122, 124
terrorist groups, 57
spyware, 60
terrorist organization, 66
stability, 48, 55
terrorists, 60
staffing, 23
testing, 86, 101, 102, 105, 108, 114
stakeholders, 54
text mining, 152, 155, 158
standard deviation, 70, 71, 72, 86, 102, 103,
theft, 56, 58, 59, 63, 64, 67, 123, 183
104, 105, 106, 134, 142
Theory of Everything, 96, 100
standardization, 155, 176
thesaurus, 155, 156, 162, 169, 173
stars, 127
threats, 53, 54, 57, 58, 66, 77, 78, 79, 123,
statistic technique, 36
125, 126
statistics, viii, 6, 37, 42, 54, 86, 87, 131,
time constraints, 34
133, 134, 137, 149, 169
time periods, 13
stigmatized, 41
time series, 112
stock, 111
time use, 38
stock exchange, 111
tourism, 11
storage, 112, 176, 180, 190
trade, 68
storage media, 180, 190
trafficking, 56
strategic management, 172
training, 15

transactions, 4, 5, 10, 11, 15, 111, 112, 127 victims, 20, 22, 23, 39, 41, 44, 68, 79
transfer of money, 76 videos, 42, 133, 135, 145
transgression, 41 violence, 48
transistor, 183, 184 virus, 54, 59, 63, 64, 65, 73, 75, 77, 82
translation, 155, 158 viruses, 55, 56, 57, 59, 61, 65, 66
transmission, 56, 66 visualization, viii, 7, 20, 21, 26, 27, 32, 42,
transparency, 122, 123, 127 47, 48, 51, 158
transport, 34, 178, 179 vocabulary, 155
turbulent flows, 119 vulnerability, 65
Turkey, 1, 73, 75
turnover, 11, 13, 14
W

U war, 54, 55, 67


Washington, 17, 48, 49, 50, 51, 83
U.S. Department of Commerce, 193 weapons, 69
uniform, 62, 186, 187, 188, 192 web, 12, 29, 40, 41, 58, 62, 63, 67, 69, 75,
United Nations, 19 77, 78, 125, 126, 161
United States (USA), 39, 46, 48, 66, 67, 83, web browser, 63, 75, 77
116, 146 web pages, 62, 63, 77
universe, 100 web space, viii, 2
unstructured data, 42, 48, 152, 153, 172 websites, 57, 58, 61, 62, 63, 64, 77, 148
updating, 75, 79 WEKA, 8, 13, 14, 15
urban, 38 White Paper, 50
user data, 60, 64 Wi-Fi, 76
user space, 178, 190 windows, 64
wireless sensor networks, 117
word meanings, 156
V working memory, 59
world order, 78
validation, 7
worldwide, 19, 69, 121, 122, 124, 144
valuation, 23
worms, 55, 57, 59, 61
Valuation, 148
variables, 7, 26, 69, 71, 113, 138, 166
vector, 37, 44, 160, 161, 162, 173 Y
vehicles, 5, 32, 68, 82, 172
Venezuela, 120 yield, 3, 47
versatility, 15 young people, 138, 146
victimization, 24, 34, 41