In recent years, the Smart City concept has become popular for its promise to improve the
quality of life of urban citizens. Smart City services typically comprise a set of applications with
data-sharing options. Most Smart City services are in fact mashups that combine data from
several sources, which makes access to all available data vital to these services.
Seamless access to Smart Network services requires resource migration, in the form of VM
migration during the offloading process, to ensure QoS for the user. An Ant Colony Optimization
(ACO) based joint VM migration technique is proposed in which user mobility is also
considered. We propose an ACO-based joint VM migration model for a heterogeneous,
MCC-based Smart Network system in a Smart City environment. In this model, the user's
mobility and the VM resources provisioned in the cloud together define the VM migration
problem. At the terminals where data are collected, processing modes such as DSMS, CEP, and
batch-based MapReduce are applied selectively on FPGA, GPU, CPU, and ASIC hardware. The
data are then structured, uploaded to the cloud server, and processed with MapReduce using the
powerful computing capabilities of the cloud architecture. We present a thorough performance
evaluation to investigate the effectiveness of the proposed model compared with state-of-the-art
approaches.
CHAPTER 1
INTRODUCTION
In the previous section, it was discussed that big data introduces new security and privacy
issues. For the Network care sector these issues are amplified, because Network care data are
considered privacy-sensitive, and traditional security and privacy methods for protecting such
data seem insufficient or even obsolete. This is a problem because personal information can
unwillingly be derived from these Network information systems and end up in the wrong hands.
Individuals have rights against intrusion into their personal information, and in the wrong hands
personal information can potentially harm them.
On the other hand, weak security and privacy methods can hinder the adoption of big data in
Network care. There can be public resistance from individuals or governments against the use of
big data in Network care when there is no trust in the protection of personal information.
Hindering the adoption of big data in Network care would also forgo the potential benefits big
data could bring, such as improved quality of Network services.
The owners of the problem are therefore the hospitals and other organizations in the Network
that can potentially benefit from adopting big data in Network care. These organizations
have to deal with hurdles such as privacy legislation and the public perception of privacy before
they can successfully adopt big data.
The objective is to select the optimal cloud server for a mobile VM while minimizing the
total number of VM migrations and reducing task-execution time. A Honey Bee Optimization
(HBO) algorithm is used to identify the optimal target cloudlet.
Big Data Challenges in Network
Thesis Contribution:
This thesis reports that seamless access to Smart Network services requires resource
migration, in the form of VM migration during the offloading process, to ensure QoS for the
user. A Honey Bee Optimization (HBO) based joint VM migration technique is proposed in
which user mobility is also considered.
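The full joint-migration model is developed in later chapters. Purely as an illustrative sketch, and not the thesis algorithm itself, the core ant-colony loop for picking a low-latency target cloudlet can be outlined as follows (all parameter values are hypothetical):

```python
import random

def aco_select_cloudlet(latencies, n_ants=20, n_iters=30, alpha=1.0, beta=3.0,
                        rho=0.5, q=1.0, seed=0):
    """Pick the index of a low-latency cloudlet with a simplified ACO loop.

    Ants choose a cloudlet with probability proportional to
    pheromone^alpha * (1/latency)^beta, then deposit pheromone
    inversely proportional to the chosen cloudlet's latency.
    """
    rng = random.Random(seed)
    n = len(latencies)
    pheromone = [1.0] * n
    for _ in range(n_iters):
        deposits = [0.0] * n
        for _ in range(n_ants):
            weights = [(pheromone[i] ** alpha) * ((1.0 / latencies[i]) ** beta)
                       for i in range(n)]
            r = rng.random() * sum(weights)
            choice = n - 1          # fallback guards against float round-off
            acc = 0.0
            for i, w in enumerate(weights):
                acc += w
                if r <= acc:
                    choice = i
                    break
            deposits[choice] += q / latencies[choice]
        # evaporate old pheromone, then add this iteration's deposits
        pheromone = [(1 - rho) * pheromone[i] + deposits[i] for i in range(n)]
    return max(range(n), key=lambda i: pheromone[i])
```

With clearly separated latencies the pheromone concentrates on the fastest cloudlet, which is the behaviour the joint-migration technique relies on.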
According to Oracle, big data is the inclusion of additional sources to augment existing
data-analytics operations. Exploiting big data entails dealing with multiple sources and
combining structured and unstructured data. Various sources advise against discarding existing
BI infrastructures and capabilities; rather, current capabilities should be integrated with the new
requirements of big data. Big data technologies work best in cooperation with the original
enterprise data warehouses, as used with Business Intelligence.
Every generation introduces new data types, which require new capabilities to deal with
them. The first generation of BI&A applications and research focused mostly on structured
data collected by companies through legacy systems and stored in relational database
management systems (RDBMS).
Analytical techniques used in this generation of BI&A are rooted in statistical methods
and data-mining techniques developed in the 1970s and 1980s, respectively.
The second generation of BI&A is a result of the development of the internet; it
encompasses analysis of web-based unstructured content.
The third generation of BI&A is emerging as a result of smartphones, tablets and other
sensor-based information supplies, and includes location-based, person-centered and
context-relevant analysis.
HADOOP FRAMEWORK
Unlike RDBMS and NoSQL, Hadoop does not refer to a type of database, but rather to a
software platform that allows for massively parallel computing. Hadoop is an open-source
software framework consisting of several software modules targeted at processing big
data, that is, large volumes and a high variety of data. The core modules of the Hadoop
ecosystem are the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Below we
describe the most popular modules of the Hadoop framework.
HDFS is the software module that arranges the storage in a Hadoop big data ecosystem.
HDFS breaks down data into pieces and distributes these pieces to multiple nodes of
physical data storage in a system. The main advantages of HDFS are that it is designed to
be scalable and fault tolerant. Additionally, by dividing data into pieces HDFS prepares
data for parallel processing. Other modules in the Hadoop framework are designed to
take advantage of distributed data over multiple nodes.
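As a toy sketch of the splitting-and-replication idea, and not HDFS's real placement policy (block size and node names here are hypothetical), the behaviour described above can be illustrated as:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Real HDFS placement is rack-aware; round-robin just shows that every
    block lives on several nodes, which is what gives fault tolerance.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement
```

Joining the blocks back together recovers the original data, and each block surviving on multiple nodes is what lets the cluster tolerate node failures.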
MapReduce is a software framework that provides a programming model that takes
full advantage of parallel processing. Tasks programmed in MapReduce are divided
into smaller tasks, which are sent to the relevant nodes in the system. The MapReduce
framework takes care of the whole process: managing communication between nodes,
running tasks in parallel, and providing redundancy and fault tolerance.
HBase is a software module that runs as a non-relational database on top of HDFS. HBase
is a NoSQL database that stores data according to a key-value model. As a NoSQL database,
it requires low-level programming to query. Like other software modules of
Hadoop, HBase is open source and is modeled after Google's BigTable database.
Hive is essentially a data warehouse that runs on top of HDFS. Hive structures data into
concepts such as tables, columns, rows and partitions, similar to a relational database. Data
in a Hive database can be queried using a (limited) SQL-like language named HiveQL.
OVERVIEW OF MOBILE CLOUD
Here we assume a three-tier Mobile Cloud Computing (MCC) environment, where a set of M
access points (APs) comprises the backbone network. Tier one represents the master cloud, which
consists of several public cloud providers, such as Google App Engine, Microsoft Azure, and
Amazon EC2. A set of high-speed interconnected cloudlets constitutes tier two, the
backbone layer of the mobile cloud architecture. Smartphones, wearable devices and other mobile
devices constitute tier three, the user layer. Users access the nearest cloud resources using
devices from tier three. The set of cloudlets is controlled and monitored by the master cloud (MC).
All cloudlets route their hypervisor information to the master cloud, and they are connected to
the MC with a high-speed network connection.
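The tier-three rule of attaching to the nearest tier-two cloudlet can be written as a small sketch, assuming Euclidean distance as a stand-in for network proximity (the names and coordinates are hypothetical):

```python
from dataclasses import dataclass
import math

@dataclass
class Cloudlet:
    """A tier-two cloudlet with a planar position used as a proximity proxy."""
    name: str
    x: float
    y: float

def nearest_cloudlet(user_x: float, user_y: float, cloudlets):
    """Return the cloudlet closest to the user's current position."""
    return min(cloudlets, key=lambda c: math.hypot(c.x - user_x, c.y - user_y))
```

In the full model the master cloud would make this decision from the hypervisor information the cloudlets report, rather than from raw coordinates.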
Latency
The amount of time required by a certain application between the event happening and the event
being acquired by the system.
Energy consumption
The energy consumed for executing a certain application locally or remotely.
Throughput
The amount of bandwidth required by a specific application to be reliably executed in the
smart city environment.
Cloud Computing
The amount of computing processes requested by a certain application.
Exchanged data
The amount of input, output, and code information to be transferred by means of the
wireless network.
Storage
The amount of storage space required for storing the sensed data and/or the processing
application.
Users
The number of users needed to achieve reliable service.
Challenges of Present Cloud
OBJECTIVES
Definition of Mobile Cloud Computing: the Mobile Cloud Computing (MCC) term was
introduced after the concept of Cloud Computing. Basically, MCC refers to an infrastructure
where both data storage and data processing happen outside the mobile device.
Under this definition, mobile applications move computation power and storage from
mobile phones to the cloud. MCC can be thought of as a combination of cloud computing and
the mobile environment. The cloud can be used for processing power and storage, as mobile
devices do not have powerful resources compared to traditional computing devices.
Mobile Network
There are many reasons to use cloud computing with mobile applications. MCC provides
solutions to the obstacles that mobile subscribers usually face.
Battery Life
Battery life is one of the main concerns in the mobile environment. There are already
several solutions for extending battery life, such as enhancing CPU performance or using the
disk and screen efficiently to reduce power consumption. But these solutions generally require
changes in the mobile device's structure or new hardware, which increases cost.
Computation or data offloading techniques are therefore suggested to migrate large and complex
computations from resource-limited devices, such as mobile phones, to powerful machines such
as cloud servers. This avoids long application execution times on mobile devices, which
result in a large amount of power and/or read-write time consumption [4]. Many
evaluations show the effectiveness of these techniques.
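A commonly used simplified energy model captures this trade-off: offloading pays off when the radio energy to ship the task's input is below the CPU energy to run it locally. This sketch is not taken from reference [4]; all parameters are hypothetical:

```python
def should_offload(cycles: float, local_freq_hz: float, local_power_w: float,
                   data_bits: float, bandwidth_bps: float, tx_power_w: float) -> bool:
    """Decide whether a task should be offloaded to the cloud.

    E_local   = CPU power * (cycles / frequency)    -> energy to compute locally
    E_offload = radio power * (bits / bandwidth)    -> energy to transmit input
    """
    e_local = local_power_w * (cycles / local_freq_hz)
    e_offload = tx_power_w * (data_bits / bandwidth_bps)
    return e_offload < e_local
```

A compute-heavy task with a small input (many cycles, few bits) favours offloading, while a data-heavy task on a slow link favours local execution, which matches the intuition above.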
Another obstacle is the storage capacity of mobile devices, which is generally limited. To
overcome this problem, MCC can be used to access, query or store large data on the cloud
through wireless networks. Several widely used examples exist, such as Amazon Simple Storage
Service (Amazon S3), which provides file storage on the cloud. In addition, MCC reduces the
time and energy consumption of compute-intensive applications, which is particularly relevant
for resource-limited devices.
Reliability:
With the help of the CC paradigm, reliability can be improved, since data and applications
are stored and backed up on a number of computers in the cloud. This reduces the chance of
data loss on mobile devices. In addition, copyrighting digital content and preventing illegal
distribution of music or video becomes more feasible in this model. Security services such as
virus-detection applications can also be provided and used efficiently without affecting mobile
device performance. Furthermore, the scalability and elasticity advantages of CC apply to MCC
as well, since cloud flexibility is applicable to the infrastructure as a whole.
Privacy:
Privacy is an important issue when dealing with private data. As in the CC era, the
same trust problem arises with mobile network providers and cloud providers: they can
monitor all the communication and data stored in the cloud or at the network provider, even
though encryption mechanisms exist to protect data in transit and at rest. From this perspective,
privacy remains a major problem to be solved.
Communication:
The communication path is composed of multiple parts, from the mobile subscriber to the cloud
provider. Therefore there can be problems such as poor network speed or limited bandwidth.
This is a serious concern because the number of mobile and cloud users is increasing
dramatically. As mentioned in the previous section, Mobile Cloud Computing has many benefits
and good application examples for mobile users and service providers. On the other hand, there
are also challenges related to cloud computing and mobile network communication. This section
explains these obstacles and their solutions.
Mobile Side Challenges
On the mobile network side, the main obstacles and solutions are listed below:
Low Bandwidth:
Bandwidth is one of the important issues in the mobile cloud environment, because mobile
network resources are much smaller compared with traditional networks. Therefore, P2P media
streaming has been proposed for sharing limited bandwidth among users located nearby in the
same area who request the same content, such as the same video [6]. With this method, each user
transmits or exchanges parts of the same content with the other users, which results in improved
content quality, especially for videos.
Availability:
Network failures, out-of-signal errors, and poor performance due to high traffic are the
main threats that prevent users from connecting to the cloud. There are, however, solutions to
help mobile users in the case of disconnection from the cloud. One of them is Wi-Fi based
multihop BDSA, a distributed content-sharing protocol for situations without any infrastructure
[7]. In this mechanism, nearby nodes are detected in case of failure of the direct connection to
the cloud; instead of a direct link to the cloud, the mobile user connects through neighboring
nodes. Although there are some concerns about security issues for such mechanisms, these
issues can also be solved.
Heterogeneity:
Various types of networks are used simultaneously in the mobile environment, such as
WCDMA, GPRS, WiMAX, CDMA2000, and WLAN. As a result, handling such heterogeneous
network connectivity becomes very hard while satisfying mobile cloud computing requirements
such as always-on connectivity, on-demand scalable connectivity, and energy efficiency of
mobile devices. This problem can be addressed by using standardized interfaces and messaging
protocols to reach, manage and distribute content.
Pricing:
Using multiple services on a mobile device requires agreements with both the mobile network
provider and the cloud service provider. However, these providers have different payment
methods and prices for services, features and facilities. This can lead to many problems, such as
how to determine the price, how the price should be shared among the providers or parties, and
how the subscribers should pay. For example, when a mobile user wants to run a paid mobile
application on the cloud, three stakeholders are involved: the application provider for the
application license, the mobile network provider for the data communication from user to cloud,
and the cloud provider for providing and running the application on the cloud.
LITERATURE REVIEW
Marcos D. Assuncao et al. [1] have discussed approaches and environments for carrying
out analytics on clouds for big data applications. They have identified possible gaps in
technology and provided recommendations for the research community on future directions in
cloud-supported big data computing and analytics solutions.
Khairul Munadi et al. [2] have proposed a conceptual image-trading framework that
enables secure storage and retrieval over internet services. The aim is to facilitate secure
storage and retrieval of original images for commercial transactions, while preventing
untrusted server providers and unauthorized users from gaining access to the true contents.
Rupali S. Khachane et al. [3] have focused on a privacy homomorphism technique that
aims to secure query processing from the client side, using a cloud-hosted R-tree index
query and a distance re-coding algorithm.
Badrish Chandramouli et al. [4] have proposed a new progressive analytics system based
on a progress model called Prism that allows users to communicate progressive samples to the
system, and supports efficient and deterministic query processing over samples.
Satoshi Tsuchiya et al. [5] have discussed two fundamental technologies, distributed
data stores and complex event processing, and workflow description for distributed
data processing.
Divyakant Agrawal et al. [6] have presented an organized picture of the challenges
faced by application developers and DBMS designers in developing and deploying internet-
scale applications.
Preeti Tiwari et al. [7] have shown that the performance of distributed query
optimization is improved when ACO is integrated with other optimization algorithms.
Haibo Hu et al. [8] have proposed a holistic and efficient solution that comprises a
secure traversal framework and an encryption scheme based on privacy homomorphism. The
framework is scalable to large datasets by leveraging an index-based approach. Based on this
framework, they devise secure protocols for processing typical queries such as k-nearest-
neighbor (kNN) queries on an R-tree index.
Ku Rahane et al. [9] have proposed a framework for big data clustering that
utilizes grid technology and an ant-based algorithm.
Sudipto Das et al. [10] have clarified some of the critical concepts in the design space
of big data and cloud computing, such as the appropriate systems for a specific set of
application requirements on mobile data.
Related Papers
Title: A Genetic Algorithm for Virtual Machine Migration in Heterogeneous Mobile Cloud
Computing.
Author: Md. Mofijul Islam, Md. Abdur Razzaque and Md. Jahidul Islam
Year: 2016
Description:
Mobile Cloud Computing (MCC) improves the performance of a mobile application by
executing it at a resourceful cloud server.
Virtual Machine (VM) migration in MCC brings cloud resources closer to a user so as to
further minimize the response time of an offloaded application.
The key challenge is to find an optimal cloud server for migration that offers the
maximum reduction in computation time.
The goal of GAVMM is to select the optimal cloud server for a mobile VM and to
minimize the total number of VM migrations, resulting in a reduced task execution time.
Advantages:
Mass storage capacity and high-speed computing power.
Multiple tasks are assigned to VMs with larger bandwidth, while VMs with smaller
bandwidth are rarely assigned tasks.
Load balancing of the entire system can be handled dynamically by using
virtualization technology.
Disadvantages:
The VM placement problem is central to scheduling and management in a cloud data center.
Limited bandwidth and other limited resources.
The VM placement problem needs to consider the influence of network factors.
Algorithm:
Genetic algorithm based virtual Machine migration (GAVMM).
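GAVMM's real chromosome encoding covers user mobility and hypervisor state. As a heavily simplified, hypothetical sketch of the genetic-selection idea only, a chromosome here is just a candidate target-server index and fitness is its (inverted) migration cost:

```python
import random

def ga_select_server(costs, pop_size=10, generations=50, mut_rate=0.3, seed=0):
    """Toy GA: each chromosome is a candidate server index; lower cost is fitter."""
    rng = random.Random(seed)
    n = len(costs)
    pop = [rng.randrange(n) for _ in range(pop_size)]
    best = min(pop, key=lambda s: costs[s])
    for _ in range(generations):
        # tournament selection: the cheaper of two random candidates survives
        parents = [min(rng.sample(pop, 2), key=lambda s: costs[s])
                   for _ in range(pop_size)]
        # mutation: occasionally jump to a completely different server
        pop = [rng.randrange(n) if rng.random() < mut_rate else p for p in parents]
        pop[0] = best  # elitism keeps the best solution found so far
        best = min(pop, key=lambda s: costs[s])
    return best
```

With elitism the best candidate can never be lost, so over enough generations the search settles on the cheapest server, which is the selection behaviour GAVMM exploits at far larger scale.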
Title: A Survey of Mobile Cloud Computing Application Models.
Author: Atta ur Rehman Khan, Mazliza Othman, Sajjad Ahmad Madani and Samee Ullah Khan.
Year: 2014
Description:
Smartphones are now capable of supporting a wide range of applications, many of which
demand an ever increasing computational power.
This poses a challenge because smartphones are resource-constrained devices with
limited computation power, memory, storage, and energy.
The cloud computing technology offers virtually unlimited dynamic resources for
computation, storage, and service provision.
The traditional smartphone application models do not support the development of
applications that can incorporate cloud computing features; this requires specialized
mobile cloud application models.
Advantages:
eXCloud transfers only the top stack frames, unlike traditional process migration
techniques in which full-state migrations are performed.
MAUI provides a programming environment where independent methods can be marked
for remote execution.
The model provides a wide range of elasticity patterns to optimize the execution of
applications according to the users' desired objectives.
Disadvantages:
The sharing of data and states between the weblets that execute at distributed locations
is prone to security issues.
The data replication may give rise to data synchronization and integrity issues.
The latency issue is very crucial in mobile cloud application models.
Algorithm:
Application partitioning algorithms such as
All-step
K-step
Title: Big Data-Driven Service Composition Using Parallel Clustered Particle Swarm
Optimization in Mobile Environment.
Author: M. Shamim Hossain, Mohd Moniruzzaman, Ghulam Muhammad, Ahmed Ghoneim and
Atif Alamri.
Description:
Mobile service providers support numerous emerging services with differing quality
metrics but similar functionality.
The mobile environment is ambient and dynamic in nature, requiring more efficient
techniques to deliver the required service composition promptly to users.
Selecting the optimum required services in a minimal time from the numerous sets of
dynamic services is a challenge.
By using parallel processing, the optimum service composition is obtained in
significantly less time than alternative algorithms.
Advantages:
The performance of this algorithm can be improved by using efficient optimization
techniques like PSO.
Qualities of the mobile environment demand efficient optimization and clustering
techniques.
Disadvantages:
The issue of parallel and distributed data operations where the structure of data is multi-
dimensional.
Dynamic QoS and the rapidly changing nature of services in the mobile environment.
Algorithm:
Particle swarm optimization
k-means clustering
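As a minimal, hypothetical sketch of the underlying PSO step, without the parallel clustering the paper describes, a one-dimensional global-best swarm minimizing a QoS cost function might look like:

```python
import random

def pso_minimize(cost, lo, hi, n_particles=15, iters=40,
                 w=0.7, c1=1.5, c2=1.5, seed=1):
    """Global-best PSO on a 1-D search interval [lo, hi].

    Each particle is pulled toward its own best position (pbest) and the
    swarm-wide best position (gbest), with inertia weight w.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                 # each particle's best position so far
    gbest = min(xs, key=cost)     # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            vs[i] = (w * vs[i]
                     + c1 * rng.random() * (pbest[i] - xs[i])
                     + c2 * rng.random() * (gbest - xs[i]))
            xs[i] = min(hi, max(lo, xs[i] + vs[i]))  # clamp to the interval
            if cost(xs[i]) < cost(pbest[i]):
                pbest[i] = xs[i]
        gbest = min([gbest] + pbest, key=cost)
    return gbest
```

In the paper the search space is a set of candidate service compositions and the swarm runs in parallel over clusters; the velocity/position update shown here is the common core.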
Title: CloneCloud: Elastic Execution between Mobile Device and Cloud
Author: Byung-Gon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik and Ashwin Patti.
Description:
Mobile applications are becoming increasingly ubiquitous and provide ever richer
functionality on mobile devices.
Such devices often enjoy strong connectivity with more powerful machines ranging from
laptops and desktops to commercial clouds.
CloneCloud uses a combination of static analysis and dynamic profiling to partition
applications automatically at a fine granularity while optimizing execution time and
energy use for a target computation and communication environment.
At runtime, the application partitioning is effected by migrating a thread from the mobile
device at a chosen point to the clone in the cloud, executing there for the remainder of the
partition, and re-integrating the migrated thread back to the mobile device.
Advantages:
Brings the computing power of machines like desktops and laptops to devices that place
demands on an extremely limited supply of energy.
The granularity of partitioning is coarse since it is at class level, and it focuses on static
partitioning.
Supporting native method calls was an important design choice, which increases its
applicability.
Disadvantages:
Web page Consistency problem.
Optimization problem.
Algorithm:
DEFLATE compression algorithm
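Python's standard zlib module implements DEFLATE, so the compression step listed above can be demonstrated directly; treating a repetitive byte string as a stand-in for migrated thread state is purely illustrative:

```python
import zlib

def deflate(data: bytes) -> bytes:
    """Compress with zlib, which uses the DEFLATE algorithm internally."""
    return zlib.compress(data)

def inflate(blob: bytes) -> bytes:
    """Recover the original bytes from a zlib/DEFLATE stream."""
    return zlib.decompress(blob)
```

Compressing state before shipping it between the device and its clone shrinks redundant payloads considerably, at the price of some CPU time on both ends.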
Title: Federated Internet of Things and Cloud Computing Pervasive Security Network
Monitoring System
Author: Jemal H. Abawajy and Mohammad Mehedi Hassan
Year: 2017
Description:
In the conventional hospital-centric Network system, Security are often tethered to
several monitors.
The authors develop an inexpensive but flexible and scalable remote Network-status
monitoring system that integrates the capabilities of IoT and cloud technologies for remote
monitoring of a Security's Network status.
It addresses Network spending challenges by substantially reducing inefficiency and waste, as
well as enabling Security to stay in their own homes and receive the same or better care.
To demonstrate the suitability of the proposed PPHM infrastructure, a case study on real-
time monitoring of a Security suffering from congestive heart failure using ECG is
presented.
Advantages:
A flexible, energy-efficient, and scalable remote Security Network status monitoring
framework.
A Network data clustering and classification mechanism to enable good Security care.
Performance analysis of the PPHM framework to show its effectiveness.
Disadvantages:
IoT-cloud convergence is a crucial issue in Network applications.
Access control, location privacy, and data confidentiality remain open concerns.
Algorithm:
Rank correlation coefficient algorithm.
Classification algorithm.
Advantages:
We are likely to see an increasingly diverse set of stakeholders involved, spanning the
technical, Network, and policy domains.
Big data tools offer merits that facilitate the execution of specified tasks in the
Network ecosystem.
Disadvantages:
Security, integrity and privacy violations of these data can cause irremediable damage to
the Network, or even death, of the individual and loss to society.
The standardization and format of big data, big data transfer and processing, searching
and mining of big data, and management of services.
Security with similar symptoms and diseases can share their experiences through social
media to get ad-hoc counseling, which constitutes a big data problem.
Algorithm:
Support vector machines (SVM)
Extreme learning machine (ELM)
Gaussian mixture model (GMM)
Title: Migrate or not? Exploiting dynamic task migration in Mobile cloud computing systems.
Author: Lazaros Gkatzikis and Iordanis Koutsopoulos
Year: 2013
Description:
Contemporary mobile devices generate heavy loads of computationally intensive tasks,
which cannot be executed locally due to the limited processing and energy capabilities of
each device.
Cloud facilities enable mobile devices-clients to offload their tasks to remote cloud
servers, giving birth to Mobile Cloud Computing (MCC).
The challenge for the cloud is to minimize the task execution and data transfer time to the
user, whose location changes due to mobility.
Providing quality-of-service guarantees is particularly challenging in the dynamic MCC
environment, due to the time-varying bandwidth of the access links.
Advantages:
The elasticity of resource provisioning and the pay-as-you-go pricing model.
We delineate the performance benefits that arise for mobile applications and identify the
peculiarities of the cloud that introduce significant challenges in deriving optimal
migration strategies.
Reducing the energy consumption of individual servers by moving the processes from
heavily loaded to less loaded servers (load balancing).
Disadvantages:
A strategy that does not consider migration cost and download time.
No migration.
Title: Mobiles on Cloud Nine: Efficient Task Migration Policies for Cloud Computing Systems.
Author: Lazaros Gkatzikis and Iordanis Koutsopoulos
Year: 2014
Description:
Due to limited processing and energy resources, mobile devices outsource their
computationally intensive tasks to the cloud.
Clouds are shared facilities and hence task execution time may vary significantly.
It investigates the potential of task migrations to reduce contention for the shared
resources of a mobile cloud computing architecture in which local clouds are attached to
wireless access infrastructure.
It devises online migration strategies that at each time make migration decisions
according to the instantaneous load and the anticipated execution time.
Advantages:
Modification of the program to incorporate state-capture and recovery functions.
Simplified IT management and maintenance capabilities.
Enormous computing resources available on demand.
Disadvantages:
Classifying current computation offloading frameworks. Analyzing them by identifying
their approaches and crucial issues.
Process migration applications are strongly connected with the system in the form of
sockets.
Application development complexity and unauthorized access to remote data demand
a systematized plenary solution.
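The online policies in these papers decide from instantaneous load and anticipated execution time. As a simplified, hypothetical break-even rule (all parameters invented for illustration), migration pays off only when finishing at the target, including the time to transfer the task state, beats staying put:

```python
def should_migrate(remaining_here_s: float, remaining_there_s: float,
                   state_mb: float, link_mbps: float) -> bool:
    """Migrate only if target completion time, including state transfer, is faster.

    remaining_here_s  : estimated time to finish on the current server
    remaining_there_s : estimated time to finish on the candidate server
    state_mb          : task/VM state to move, in megabytes
    link_mbps         : backhaul link rate, in megabits per second
    """
    transfer_s = state_mb * 8.0 / link_mbps  # MB -> megabits, then seconds
    return transfer_s + remaining_there_s < remaining_here_s
```

A lightly loaded target with a fast link wins easily, while a large state or slow backhaul tips the decision toward not migrating, which mirrors the trade-off the papers analyze.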
Title: Smart City Solution for Sustainable Urban Development
Author: Mostafa Basiri, Ali Zeynali Azim, Mina Farrokhi.
Year: 2017
Description:
Large, dense cities can be highly efficient, which makes green development and
future-oriented planning most desirable.
The influx of new citizens creates rapidly advancing challenges that demand attention
from city leadership.
With the globalization of urban economies, cities increasingly have to compete directly with
worldwide and regional economies for international investment to generate employment,
revenue and funds for development.
Smart Cities are those towns which use information technology to improve both the
quality of life and accessibility for their inhabitants.
Advantages:
Reducing resource consumption, notably energy and water, hence contributing to
reductions in CO2 emissions.
Improving commercial enterprises through the publication of real-time data on the
operation of city services.
The growing penetration of fixed and wireless networks that allow such sensors and
systems to be connected to distributed processing centers and for these centers in turn
to exchange information among themselves.
Disadvantages:
Where there are threats of serious or irreversible damage, lack of full scientific
certainty shall not be used as a reason for postponing cost-effective measures to
prevent environmental degradation.
The substitutability of capital.
Sustainable development problem.
Technique:
Information management technique
CHAPTER 3
NETWORK BIG DATA SOURCE ECOSYSTEM
Network big data is a revolutionary tool in the Network industry, and is becoming vital in
current Security-centric care. Owing to the massive growth of data in the Network industry,
diverse data sources have been aggregated into the Network big data ecosystem. These data
sources are used by a Network provider to make decisions and provide appropriate care.
Major data sources, along with the challenges involved, are discussed below:
1) Physiological data.
These data are huge in terms of volume and velocity. Regarding volume, a variety of
signals is collected from heterogeneous sources to monitor Security characteristics, including
blood pressure, blood glucose, and heart rate; sources include the electroencephalogram,
electrocardiogram, and electroglottogram. Data velocity can be observed in the growing rate
of data generation from continuous monitoring; especially for Security in a critical condition,
these signals must be processed in real time for decision making. These signals need to be
extracted efficiently and processed with suitable machine learning algorithms to provide
meaningful data for effective Security care. Efficient and comprehensive methods are also
required to analyze and process the collected signals to provide usable data to Network
professionals and other related stakeholders. Combining EHR and physiological signals
may increase the precision of the data based on the surrounding context of the Security.
2) EHRs/EMRs.
EHRs, or electronic medical records (EMRs), are digitized structured Network data about a
Security. EHRs are collected from and shared among hospitals, research centers,
government agencies, and insurance companies. Security, integrity and privacy violations of
these data can cause irremediable damage to the Network, or even the death, of the individual
and loss to society. Thus, big Network data security is now a key research topic.
3) Medical images.
These images generate a huge volume of data that assists Network professionals in identifying
or detecting disease, treatment, and the prediction and monitoring of Security. Medical imaging
techniques such as X-ray, ultrasound, and computed tomography play a crucial role in diagnosis
and prognosis. Owing to the complexity, dimensionality and noise of the collected images,
efficient image-processing methods are required to provide clinically suitable data for Security
care.
4) Sensed data
Sensed data from Security are collected using different wearable or implantable
devices, environment-mounted devices, ambulatory devices, and sensors and smartphones at
home or in hospitals. Sensed data form a key part of Network big data, as these sensors are
used to capture critical events or provide continuous monitoring. However, sensed data must be
collected, pre-processed, stored, shared and delivered correctly and in a reasonable time to be of
use to Network providers when making clinical decisions. Owing to the enormous volume of
data collected, automated algorithms are required to reduce noise and to allow deployment
with big data analytics so that computation time can be reduced. Moreover, it is a challenge to
collect and collate multimodal sensed data from multiple sources at the same time.
5) Clinical notes.
The clinical notes, claims, recommendations, and decisions constitute one of the largest
unstructured sources of Network big data. Owing to the variety in format, reliability,
completeness, and accuracy of the clinical notes, it is challenging to ensure the Network care
provider has the correct information. Efficient data mining and natural language processing
techniques are required to provide meaningful data.
Fig: Big Network data source ecosystem.
6) Gene data.
The genome data makes a major contribution to Network big data. The human genome has a
huge number of genes; collecting, analyzing, and classifying data on these genes has taken years.
These gene data have now been integrated from the genetic level to physiological level of a
human being.
The Map Reduce programming model divides computation into map and reduce phases, as
shown below.
The map phase partitions input data into many input splits and automatically stores them
across a number of machines in the cluster. Once input data are distributed across the cluster, the
runtime creates a large number of map tasks that execute in parallel to process the input data.
The map tasks read in a series of key-value pairs as input and produce one or more intermediate
key-value pairs. A key-value pair is the basic unit of input to the map task. We use the Word
Count application, which counts the number of occurrences of each word in a series of text
documents, as an example. In the case of processing text documents, a key-value pair can be a
line in a text document. The user can customize the definition of a key-value pair.
public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    ...
}
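Hadoop mappers such as MyMapper above are written in Java against the Hadoop API. As a language-neutral sketch of the same map/reduce logic, the Word Count flow can be simulated in plain Python (the function names below are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_phase(line):
    # Treat each line as one input key-value pair; emit (word, 1) per word.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Group the intermediate pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big network", "data network data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(intermediate)
# result == {"big": 2, "data": 3, "network": 2}
```

In the real framework the grouping between the two phases is done by Hadoop's shuffle stage, not by user code.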
Map Reduce runtime is an open source implementation of the Map Reduce model first
proposed by Google. It automatically handles task distribution, fault tolerance and other aspects
of distributed computing, making it much easier for programmers to write data parallel
programs. It also enables Google to exploit a large number of commodity computers to achieve
high performance at a fraction of the cost of a system built from fewer but more expensive
high-end servers. Map Reduce scales performance by scheduling parallel tasks on nodes that
store the task inputs. Each node executes the tasks with loose communication with other nodes.
Hadoop is
an open source implementation of Map Reduce. To use the Hadoop Map Reduce framework, the
user first writes a Map Reduce application using the programming model we described in the
previous section. The user then submits the Map Reduce job to a job tracker, which is a Java
application that runs in its own dedicated JVM. The job tracker is responsible for coordinating
the job run. It splits the job into a number of map/reduce tasks and schedules the execution of the
tasks.
Fig : Hadoop runs a Map Reduce job
Task trackers have a fixed number of slots for map tasks and for reduce tasks. Each slot
corresponds to a JVM executing a task. Each JVM only employs a single computation thread. To
utilize more than one core, the user needs to configure the number of map/reduce slots based on
the total number of cores and the amount of memory available on each node. The configuration
can be set in the mapred-site.xml file. The relevant properties are the map and reduce slot
maximums. The example in Figure 3.5 shows how to set four map slots and two reduce slots on
each compute node.
setting can be used to express heterogeneity of the machines in the cluster. This setting can be
different for each compute node. The reason is that different machines in the cluster can have a
different number of cores and differing amounts of memory.
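For the classic MRv1 Hadoop releases discussed here, a mapred-site.xml fragment along the following lines sets four map slots and two reduce slots per node (property names are from Hadoop 1.x; newer YARN-based releases configure resources differently):

```xml
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```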
For example, a typical hash join application requires each map task to store a copy of the lookup
table in memory. Duplicating the lookup table will decrease the amount of memory available to
each map task. To make sufficient memory available to each map task, memory intensive
applications are often forced to restrict the number of JVMs created to be smaller than the number
of cores in a node at the expense of reducing CPU utilization. For example, in a machine with
four cores and 4 GB of RAM, the system needs to create four map tasks to use the four cores.
However, if 1 GB of RAM is insufficient for each map task, the Hadoop MapReduce system can
create only two map tasks, with 2 GB of RAM available to each task. With two map tasks, the
runtime system utilizes only two of the four available cores, or 50 percent of the CPU resources.
Fig: Hadoop Map Reduce on a four-core system.
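The trade-off between cores and memory above can be captured in a small back-of-the-envelope calculation (a sketch, not part of Hadoop itself):

```python
def usable_map_tasks(cores, total_ram_gb, ram_per_task_gb):
    # A node can run no more tasks than it has cores,
    # and no more than its memory can hold.
    return min(cores, total_ram_gb // ram_per_task_gb)

# The example from the text: four cores, 4 GB RAM, 2 GB needed per task.
tasks = usable_map_tasks(cores=4, total_ram_gb=4, ram_per_task_gb=2)
cpu_utilization = tasks / 4  # fraction of cores kept busy
# tasks == 2, cpu_utilization == 0.5
```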
Map Reduce for Big Data Analysis
• There is a growing trend of applications that should handle big data. However, analyzing
big data is a very challenging problem today.
• For such applications, the Map Reduce framework has recently attracted a lot of
attention. Google’s Map Reduce or its open source equivalent Hadoop is a powerful tool
for building such applications
• Effective management and analysis of large-scale data poses an interesting but critical
challenge.
Recently, big data has attracted a lot of attention from academia, industry as well as government.
Consider a big data proof-of-value and data lake build-out for a Network provider. The goal was a
big data solution for predictive analytics and enterprise reporting. The challenge was that the
organization did not have an enterprise data warehouse to consolidate data from their enterprise
operational systems. Further, key business stakeholders were not provided with tools needed to
access data. Lastly, teams tasked with creating analytics spent 90% of their time integrating data
in SAS, leaving them little time to analyze the data or create predictive analytics. The successful
project involved meeting with key business and IT stakeholders to determine reporting and
analytic challenges and priorities, and also performing a Current State Assessment, along with
metadata discovery, profiling and outlier analysis of source data. Dell EMC proposed a data lake
architecture to address enterprise reporting and predictive analytic needs. The solution also
initiated a governance program to ensure data quality and to establish stewardship procedures.
Finally, the project identified federated business data lake hardware and Pivotal big data suite
software as the target platform for the data lake.
The results of the project included a new client analytics environment that facilitated the
execution of analytics and reporting activities to reduce time to insight. Further, the client
governance structure ensured that metadata for new data sources in the data lake was shared
with users. The environment also supported the rapid creation of sandboxes to support analytics.
• boosting Security care, service levels and efficiency by simplifying data access
• staff can view Security information with high reliability
• saving money by avoiding data migrations and upgrades
• increasing agility by breaking free of proprietary file constraints
Hadoop is a strong example of a technology that allows Network to store data in its native
form. If Hadoop didn’t exist, decisions would have to be made about what can be incorporated
into the data warehouse or the electronic medical record (and what cannot). Now everything can
be brought into Hadoop, regardless of data format or speed of ingest. If a new data source is
found, it can be stored immediately. No data is left behind. By the end of 2017, the number of
Network records of millions of people is likely to increase into the tens of billions. Thus, the
computing technology and infrastructure must be able to render a cost efficient implementation
of:
Hadoop technology is successful in meeting the above challenges faced by the Network industry
as the Map Reduce engine and Hadoop Distributed File System (HDFS) have the capability to
process thousands of terabytes of data. Hadoop makes use of highly optimized, yet inexpensive
commodity hardware making it a budget friendly investment for the Network industry.
• Integrating Network data dispersed among different Network organizations and social
media.
• Providing a shared pool of computing resources that is capable of storing and analyzing
Network big data efficiently to take smarter decisions at the right time.
• Providing dynamic provision of reconfigurable computing resources which can be scaled
up and down on user demand. This will help reduce the cost of cloud-based Network
systems.
• Improving user and device scalability and data availability and accessibility in Network
systems.
Big Data: storing and analyzing data from all possible sources
Until now, the collection of data has been limited to the major available resources in the
Network sector. However, with the advent of smartphone apps and wearables, data is now
everywhere, and this allows practitioners to know a Security's Network conditions in a more
precise manner. Apps that act like pedometers to measure your steps, the calorie counter for your
diet, the app for monitoring and recording heart rate, blood pressure and blood sugar levels, and
wearable devices like Fitbit, Jawbone, etc. are all sources of data nowadays. In the near future,
the Security will share this data with the doctor who can utilize it as a diagnostic toolbox to
provide better treatment in less time.
CHAPTER 4
BIG DATA NETWORK USING ANT COLONY OPTIMIZATION
Big Data Network is the drive to capitalize on growing Security and Network system data
availability to generate Network innovation. By making smart use of the ever-increasing amount
of data available, we can find new insights by re-examining the data or combining it with other
information. In Network this means not just mining Security records, medical images, biobanks,
test results, etc., for insights, diagnoses and decision support advice, but also continuous
analysis of the data streams produced for and by every Security in a hospital, a doctor’s office, at
home and even while on the move via mobile devices.
Current medical hardware, monitoring everything from vital signs to blood chemistry, is
beginning to be networked and connected to electronic Security records, personal Network
records, and other Network systems.
The resulting data stream is monitored by Network professionals and Network software systems.
This allows the former to care for more Security, or to intervene and guide Security earlier
before an exacerbation of their (chronic) diseases. At the same time data are provided for bio-
medical and clinical Smart Network care to mine for patterns and correlations, triggering a
process of “data-intensive scientific discovery”, building on the traditional uses of empirical
description, theoretical computer-based models and simulations of complex phenomena.
Big Data has been characterized as raising five essentially independent challenges:
Volume,
Velocity,
Variety,
Veracity, and
Value.
As elsewhere, in Big Data Network the data volume is increasing and so is data velocity
as continuous monitoring technology becomes ever cheaper. With so many types of tests and the
existing wide range of medical hardware and personalized monitoring devices, Network data
could not be more varied; yet data from this variety of sources must be combined for processing
to reap the expected rewards. In Network, veracity of data is of paramount importance, requiring
careful data curation and standardization efforts but at the same time seeming to be in opposition
to the enforcement of privacy rights.
Finally, extracting value out of big Network data for all its beneficiaries (clinicians,
clinical Smart Network care, pharmaceutical companies, Network policy-makers, etc.) demands
significant innovations in data discovery, transparency and openness, explanation and
provenance, summarization and visualization, and will constitute a major step towards the
coveted democratization of data analytics.
ACO was first used to solve the traveling salesman problem (TSP). Because of its characteristics
of distributed computing, self-organization and positive feedback, ACO has been used in prior
works for routing in Sensor Networks. "Node Potential" is the heuristic used to evaluate the
potential of next-hop selection based on three factors: the candidate's distance to the sink node,
its distance to the nearest aggregation node, and its data correlation with the current node.
In this algorithm, random searching for the destination (sink node) is needed in early
iterations. A simpler heuristic considers only the distance to the sink node. The algorithm is
composed of path construction, path maintenance, and aggregation schemes, including a
synchronization scheme, a loop-free scheme, and a collision-avoidance scheme.
Although route discovery overhead can be reduced, those algorithms do not take into
consideration the limitations of WSNs, especially the energy limits of Sensor nodes and the
number of agents required to establish the routing. Repeatedly using the same optimal path
exhausts the relaying nodes' energy quickly.
Relatively frequent efforts to maintain the Network and to explore new paths are needed.
Therefore, this approach is not energy efficient and results in shorter Sensor nodes’ lifetime and
consequently Network lifetime. Algorithms that separate path establishment and data delivery
processes suffer from this problem.
Fig: Double bridge experiment. (a) Ants start exploring the double bridge. (b) Eventually
most of the ants choose the shortest path.
While a single ant is in principle capable of building a solution (i.e., of finding a path between
nest and food source), it is only the colony of ants that presents the "shortest path finding"
behavior. In a sense, this behavior is an emergent property of the ant colony.
The VM migration is provisioned more resources than required. Therefore, this over-
provisioned resource greatly decreases the system objectives, as it reduces the number of
provisioned VMs in the cloudlets. Furthermore, the joint VM migration approach, where a set of
VMs is remapped based on the VM task execution time and over-provisioned resources, can
help to effectively increase the overall system objectives. In contrast to the joint VM migration
approach, single VM migration can only improve particular user objectives but not the system
objectives.
Cloud-wide task migration, where the task-migration decision is made by a central cloud,
which maximizes the objectives of a cloud provider.
Server-centric task migration, where all migration decisions are made by the server,
where the task is currently executing.
Task-based migration, where migration is initiated by the task itself.
7. When moving from node f to neighbor node g, the agent updates the pheromone trail τ(f,g) on
the edge (f,g).
8. Once the data is retrieved from the cloud, the agent can retrace the same path backward,
update pheromone trails and close the operation.
Algorithm Overview
In ACO algorithms, a colony of artificial ants is used to construct solutions guided by the
pheromone trails and heuristic information. The original idea of ACO comes from observing the
exploitation of food resources among ants. Ants explore the area surrounding their nest initially
in a random manner. As soon as an ant finds a source of food (source node), it evaluates the
quantity and quality of the food and carries some of it to the nest (sink node). During the
backtracking, the ant deposits a pheromone trail on the ground. The quantity of deposited pheromone,
which may depend on the quantity and quality of the food, will guide other ants to the food
source. The pheromone trails are simulated via a parameterized probabilistic model. The
pheromone model consists of a set of parameters. In general, the ACO approach attempts to find
the optimal routing by iterating the following two steps:
1. Solutions are constructed using a node selection model based on a predetermined
heuristic and the pheromone model, a parameterized probability distribution over the solution
space.
2. The solutions that were constructed in earlier iterations are used to modify the
pheromone values in a way that is deemed to bias the search toward high quality solutions.
The algorithm runs in two passes: forward and backward. In the forward pass, the route is
constructed by a group of ants, each of which starts from a unique source node. In the first
iteration, an ant searches a route to the destination randomly. Later, an ant searches the nearest
point of the previously discovered route. This could take many iterations before the ant can find a
correct path of reasonable length. A solution is to flood the sink node ID from the sink to all
the Sensor nodes in the Network before any ant starts. The points where multiple ants join are
aggregation nodes. In the backward pass every ant starts from sink node and travels back to the
corresponding source node by following the path discovered in the forward pass. Pheromone is
deposited hop by hop during the traversal.
Nodes of the discovered path are given weights as a result of node selection depending
on the node potential, which indicates heuristics for reaching the destination. Pheromone trails
are the heuristic by which ants communicate the discovered route to other ants. The trail followed by ants
most often gets more and more pheromone and eventually converges to the optimal route.
Pheromone on non-optimal routes evaporates with time. The aggregation points on the
optimal tree identify where data aggregation takes place.
The procedure is mainly composed of forward and backward passes. In the forward pass,
an ant tries to explore a new path based on the heuristic rule and the pheromone amount on the
edges. Backtracking is used in the forward pass when an ant finds a dead end or is running into a
loop. In the backward pass, the ant updates the pheromone amount on the path constructed in the
forward pass. Other important components in the algorithms include data-aggregation, loop
control, and Network maintenance. In WSN, each node has a unique identity. Every node is able
to calculate and remember its current heuristic value. Initially, the sink node floods its identity to
all the nodes in the Network. After a node receives the packet, it computes its hop-count to the
sink node and correspondingly its initial heuristic value.
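The initial flooding of the sink's identity amounts to a breadth-first traversal that assigns each node its hop-count to the sink. A minimal sketch, with an illustrative adjacency-list topology:

```python
from collections import deque

def hop_counts(neighbors, sink):
    # Breadth-first flood from the sink: a node's hop-count is one more
    # than that of the neighbor which first delivered the packet to it.
    hops = {sink: 0}
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for nbr in neighbors[node]:
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return hops

net = {"sink": ["a", "b"], "a": ["sink", "c"], "b": ["sink"], "c": ["a"]}
# hop_counts(net, "sink") == {"sink": 0, "a": 1, "b": 1, "c": 2}
```

Each node's initial heuristic value can then be derived from its hop-count.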
Forward Pass
Each ant is assigned a source node. After that, an ant starts from the source node and
moves towards the sink node using ad-hoc routing. The forward pass ends only when all the ants
have arrived at the sink node. Single ant-based solution construction uses the following steps:
• If the node has been visited in the same iteration, follow a previous ant's path.
• Otherwise, use a node selection rule.
• If all the neighbors have been visited, use the shortest path.
• If there are no neighbor nodes, backtrack to the previous node.
• If there are no neighbor nodes and the previous node is dead, record the Network
lifetime and exit the program.
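The fallback chain above can be sketched as a single selection function; the `select_rule` and `shortest_next` callbacks stand in for the node selection rule and the shortest-path step, and path reuse for already-visited nodes is omitted for brevity (all names are illustrative):

```python
def next_hop(node, visited, neighbors, select_rule, shortest_next):
    # Fallback chain from the forward-pass rules above.
    alive = neighbors[node]
    if not alive:
        return None                         # no neighbors: backtrack
    unvisited = [n for n in alive if n not in visited]
    if unvisited:
        return select_rule(node, unvisited)  # probabilistic node selection
    return shortest_next(node)               # all visited: take shortest path

# Illustrative stand-ins for the two rules:
choice = next_hop("a", visited={"b"}, neighbors={"a": ["b", "c"]},
                  select_rule=lambda n, cands: cands[0],
                  shortest_next=lambda n: "b")
# choice == "c"
```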
The current node sends the packet. The selected node receives the packet. Both nodes
update the residual energy after transmission. If the current node does not have enough energy to
send, this transmission fails. The Network is maintained afterwards. Transmission failure is
mostly prevented by doing a receiving and sending energy check in the node selection step.
Backward Pass
Ants start from the sink node and move towards their source nodes. The ants follow the
paths discovered in the forward pass. Before an ant arrives at its source node, the algorithm
repeats:
Each Sensor node maintains two queues to store packets: a receiving queue and a sending
queue. The packet sending process includes:
Network Maintenance
When a node does not have sufficient energy to send or receive (the "dead node"), it is
removed from the neighbor list of its neighborhood nodes. Nodes with more hop-count than the
"dead node" recalculate their hop-count and heuristic value. If the "dead node" is a source node,
the algorithm finds the node with the maximum energy in the Network as the new source.
Afterwards, it updates the source node of the ant.
If the "dead node" is the sink node, recharge the node with more energy. The sink node is
different from other nodes because it needs to perform more frequent transmission and
computation for the purposes of the application. Therefore, it is assumed that the sink node has
plenty of energy to last until the Network dies.
Leading Exploration
Among all the neighborhood nodes, select the first node with the highest probability, even
if there are multiple nodes with the same probability. This method is deterministic. In every
iteration, an ant always discovers the same path to the sink node until one of the intermediate
nodes dies. If the same Network topology is tested repeatedly, the total energy cost and Network
lifetime are the same.
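Because the choice is a deterministic first-maximum, it can be sketched in a few lines (illustrative, not the thesis's code):

```python
def leading_exploration(probabilities):
    # Deterministically pick the FIRST node with the highest probability,
    # so repeated runs over the same topology give the same path.
    best = max(probabilities.values())
    for node, p in probabilities.items():
        if p == best:
            return node

# Ties are broken by iteration order, so the choice is repeatable:
# leading_exploration({"a": 0.2, "b": 0.5, "c": 0.5}) == "b"
```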
Combined Rule
Node selection is divided into sessions. Each session includes one or more iterations. A
node discovered from the current or a previous iteration is used. Similar to “Leading
Exploration," the probability of each neighborhood node is calculated. A group of nodes with
the highest probability is stored in a cache. In each iteration, one node is randomly selected and
removed from the cache. When the cache is empty, the probability calculation of all the alive
neighborhood nodes is repeated.
Probability Calculation
When a node is ready to send a packet, it calculates the probability of all the neighbors
using the equation below.
p_ij^k = (τ_ij · [η_ij]^β) / Σ_{l ∈ N_i} (τ_il · [η_il]^β), for j ∈ N_i        (1)
In equation (1), an ant k carrying a data packet at node i chooses to move to node j on its way to
the sink node, where τ is the pheromone, η is the heuristic, Ni is the set of neighbors of node i,
and β is a parameter which determines the relative importance of pheromone versus distance (β > 0).
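Equation (1) translates directly into code; the τ and η values below are illustrative:

```python
def transition_probs(i, tau, eta, neighbors, beta=2.0):
    # p_ij = tau_ij * eta_ij**beta, normalized over the neighbors of i.
    weights = {j: tau[(i, j)] * eta[(i, j)] ** beta for j in neighbors[i]}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

tau = {("i", "a"): 1.0, ("i", "b"): 1.0}
eta = {("i", "a"): 2.0, ("i", "b"): 1.0}
p = transition_probs("i", tau, eta, {"i": ["a", "b"]})
# With beta = 2 the weights are 4:1, so p["a"] == 0.8 and p["b"] == 0.2.
```

A larger β shifts the choice further toward the node favored by the heuristic.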
Value η is calculated using equation (2). Multiple factors can be used and each one is weighted.
The pheromone value is associated with the link (edge) between two nodes. Each edge
has a pheromone value, which is initially all the same. The value is updated in each iteration in
order to bias the node selection process in the next iteration. The value is updated twice in each
iteration.
In the backward pass, each ant deposits or reduces the pheromone value on its own
solution path. This step is different from the conventional ACO algorithm, in which pheromone
is always deposited at the same rate. Encouraging or discouraging a node choice in the
forward pass depends on comparing the performance of the forward pass with that of the
best iteration found so far. The new pheromone is calculated using equation (4). Equations (5)
and (6) are used to support equation (4).
Δτ_ij = ζ (h_i − h_j) / Δω_j        (5)
Δω_j = Σ R_j        (6)
In equation (4), ρ is the pheromone decay parameter, τij is the pheromone value on the
edge between nodes i and j, and e0 is the encouraging or discouraging rate derived from the
forward pass. A path resulting in less energy consumption and smaller total hop-count is
preferred. The best iteration is one with the least energy consumption and hop-count among all
previous iterations. It is used as a control to calculate the e0 in the current iteration. If the
forward pass is a failed path exploration, or used more hop-count and energy consumption than
the best iteration, the path is discouraged. A very small amount of pheromone is deposited on the
edge to differentiate it from links that have not been visited, and e0 is set to a predetermined
"PunishRate," which is a relatively low rate between 0 and 1.
If the forward pass found a path using the same hop-count and energy consumption as
the best iteration, e0 is set to a relatively higher rate between 0 and 1, the "encourageRate." If
the forward pass found a path with the same hop-count but less energy consumption than the
best iteration, e0 = 1.5 × encourageRate. If the forward pass found a path using less hop-count
and energy consumption than the best iteration, e0 = hop-count difference × encourageRate.
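The e0 selection rules above can be collected into one function (the rates and the handling of mixed cases the text leaves unspecified are assumptions):

```python
def encouragement_rate(hops, energy, best_hops, best_energy,
                       punish_rate=0.1, encourage_rate=0.6, failed=False):
    # Choose e0 for the backward-pass pheromone update by comparing the
    # forward pass against the best iteration found so far.
    if failed or (hops > best_hops and energy > best_energy):
        return punish_rate                       # discourage the path
    if hops == best_hops and energy == best_energy:
        return encourage_rate                    # matched the best iteration
    if hops == best_hops and energy < best_energy:
        return 1.5 * encourage_rate              # same hops, less energy
    if hops < best_hops and energy < best_energy:
        return (best_hops - hops) * encourage_rate  # scaled by hop savings
    return punish_rate  # mixed cases not covered by the text: discourage

# A path two hops shorter (and cheaper) than the best iteration:
# encouragement_rate(8, 4.0, 10, 5.0) == 2 * 0.6 == 1.2
```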
In equation (5), ζ is a positive number, hi is the hop-count between node i and the sink,
and hj is the hop-count between node j and the sink. If the value of (hi - hj) is greater than zero,
it can be concluded that node j is closer to the sink node than node i. Therefore, the algorithm
rewards the path from node i to node j by depositing more pheromone. If the value equals
zero, it means that both nodes i and j have the same hop-count to the sink, so the algorithm
lays little pheromone on the path. If the value is less than zero, the algorithm does not lay
pheromone on this path. In equation (6), Rj is the total hop-count of a source before visiting
node j. Therefore, Δωj is the total hop-count of the sources that reach the sink through node j.
The smaller the total hop-count, the larger the amount of pheromone added on the path from
node i to node j, as shown in equation (5).
This means that more ants are encouraged to follow this path. When an ant moves to an
aggregation node, that node updates the pheromone levels of all its neighbors by equation (4). If a
node is not visited by any ant within a limited time, its pheromone evaporates according to equation (3).
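Reading equations (5) and (6) together with this description, the per-edge pheromone reward can be sketched as follows; this is a reconstruction under stated assumptions (ζ, the small sideways deposit, and the Δω_j value are illustrative), not the thesis's exact formula:

```python
def deposit(h_i, h_j, delta_omega_j, zeta=1.0, small=0.01):
    # Reward moves that reduce the hop-count to the sink; scale the reward
    # down when many source hops already route through node j (delta_omega_j).
    if h_i - h_j > 0:
        return zeta * (h_i - h_j) / delta_omega_j
    if h_i == h_j:
        return small   # little pheromone for a sideways move
    return 0.0         # no pheromone for moving away from the sink

# One hop closer to the sink, with delta_omega_j = 4:
# deposit(3, 2, delta_omega_j=4) == 0.25
```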
CHAPTER 5
HONEY POTS DETECTION USING SMART NETWORK CARE
History of Honeypots:
The concept of Honeypots was first described by Clifford Stoll in 1990. The book is an
account of a real story which happened to Stoll. He discovered a hacked computer and
decided to learn how the intruder had gained access to the system. To track the hacker back to his
origin, Stoll created a fake environment with the purpose of keeping the attacker busy. The idea
was to track the connection while the attacker was searching through prepared documents. Stoll
did not call his trap a Honeypot; he just prepared a network drive with faked documents to keep
the intruder on his machine. Then he used monitoring tools to track the hacker’s origin and find
out how he came in.
In 1999 that idea was picked up again by the Honeynet Project, led and founded by
Lance Spitzner. Over years of development the Honeynet Project created several papers on
Honeypots and introduced techniques to build efficient Honeypots. The Honeynet Project is a
non-profit research organization of security professionals dedicated to information security.
Types of Honeypots:
To describe Honeypots in greater detail it is necessary to define types of Honeypots. The type
also defines their goal, as we will see in the following. A very good description on those can also
be found in.
The concept of Honeypots in general is to catch malicious network activity with a prepared
machine. This computer is used as bait. The intruder is meant to find the Honeypot and try
to break into it. The type and purpose of the Honeypot then specify what the attacker will be
able to perform. Often Honeypots are used in conjunction with Intrusion Detection Systems. In
these cases Honeypots serve as Production Honeypots and only extend the IDS. But in the
concept of Honeynets the Honeypot is the major part. Here the IDS is set up to extend the
Honeypot’s recording capabilities
A common setup is to deploy a Honeypot within a production system. The figure above
shows the Honeypot colored orange. It is not registered in any naming servers or any other
production systems, i.e. domain controller. In this way no one should know about the existence
of the Honeypot. This is important, because only within a properly configured network, one can
assume that every packet sent to the Honeypot is suspect for an attack. If misconfigured packets
arrive, the amount of false alerts will rise and the value of the Honeypot drops.
Production Honeypot:
Production Honeypots are primarily used for detection. Typically they work as an
extension to Intrusion Detection Systems, performing an advanced detection function. They also
prove whether existing security functions are adequate, i.e. if a Honeypot is probed or attacked the
attacker must have found a way to the Honeypot. This could be a known way, which is hard to
lock, or even an unknown hole. However measures should be taken to avoid a real attack. With
the knowledge of the attack on the Honeypot it is easier to determine and close security holes.
A Honeypot helps justify the investment in a firewall. Without any evidence that there were
attacks, someone from the management could assume that there are no attacks on the network.
Therefore that person could suggest stopping investing in security as there are no threats. With a
Honeypot there is recorded evidence of attacks. The system can provide information for statistics
of monthly attacks.
The Honeypot Security Analytics Suite provides AI Engine rules that perform real-time,
advanced analytics on all activity captured in the honeypot, including successful logins to the
system, observed successful attacks, and attempted/successful malware activity on the host. As a
result, the Honeypot suite allows AI Engine to also detect when similar activity captured from
the honeypot is observed on the production network. For example, if an observed attacker
interaction on the honeypot is followed by a subsequent interaction with legitimate hosts within
the environment such as production web servers, LogRhythm can generate an alarm alerting IT
and security personnel to the suspicious activity.
Transferring data from one remote system to another under the control of a local system
is remote uploading. Remote uploading is used by some online file hosting services. It is also
used when the local computer has a slow connection to the remote systems, but they have a fast
connection between them. Without remote uploading functionality, the data would have to first
be downloaded to the local host and then uploaded to the remote file hosting server, both times over
slow connections.
Algorithm Implementation
INPUT: VM set V and accessible cloudlet set Cu for each VM u; system parameters.
Step 14: until every VM is provisioned to a cloudlet
Step 15: for (VM k ∈ Vs) do
Production Honeypots:
These honeypots are deployed by organizations as a part of their security infrastructure.
These add value to the security measures of an organization. These honeypots can be used to
refine an organization’s security policies and validate its intrusion detection systems. Production
honeypots can provide warnings ahead of an actual attack. For example, lots of HTTP scans
detected by a honeypot are an indicator that a new HTTP exploit might be in the wild. Normally
commercial servers have to deal with large amounts of traffic and it is not always possible for
intrusion detection systems to detect all suspicious activity. Honeypots can function as early
warning systems and provide hints and directions to security administrators on what to look out
for.
The real value of a honeypot lies in its being probed, scanned and even compromised, so it
should be made accessible to computers on the Internet or at least as accessible as other
computers on the network. As far as possible the system should behave as a normal system on
the Internet and should not show any signs of it being monitored or of it being a honeypot. Even
though we want the honeypot to be compromised it shouldn’t pose a threat to other systems on
the Internet. To achieve this, network traffic leaving the honeypot should be regulated and
monitored.
Security Issues:
Honeypots don’t provide security (they are not a securing tool) for an organization but if
implemented and used correctly they enhance existing security policies and techniques.
Honeypots can be said to generate a certain degree of security risk and it is the administrator’s
responsibility to deal with it. The level of security risk depends on their implementation and
deployment. There are two views of how honeypot systems should handle their security risks.
Honeypots that fake or simulate: There are honeypot tools that simulate or fake services or
even fake vulnerabilities. They deceive an attacker into thinking they are accessing a particular
system or service. A properly designed tool can be helpful in gathering more information about a
variety of servers and systems. Such systems are easier to deploy and can be used as alerting
systems and are less likely to be used for further illegal activities.
Honeypots that are real systems: This viewpoint states that honeypots should not be anything different from actual systems, since the main idea is to secure the systems that are in use. These honeypots fake or simulate nothing and are implemented using actual systems and servers that are used in the real world. Such honeypots reduce the chances of the hacker knowing that he is on a honeypot, but they carry a high risk factor and cannot be deployed everywhere. They need a controlled environment and administrative expertise: a compromised honeypot is a potential risk to other computers on the network, or for that matter the Internet.
Legal issues:
To start with, a honeypot should be seen as an instrument of learning, though there is a viewpoint that honeypots can be used to "trap" hackers. Such an idea can be considered entrapment. The legal definition is: "Entrapment is the conception and planning of an offense by an officer, and his procurement of its commission by one who would not have perpetrated it except for the trickery, persuasion, or fraud of the officers."
Prevention:
Prevention means keeping the bad guys out. Normally this is accomplished by firewalls and well-patched systems, and the value Honeypots add to this category is small. If a random attack is performed, Honeypots can detect that attack but not prevent it, as the targets are not predictable. One case where Honeypots help with prevention is when an attacker is directly hacking into a server: here a Honeypot can cause the hacker to waste time on a worthless target and so help prevent an attack on a production system. But this assumes that the attacker attacks the Honeypot before attacking a real server, not the other way round.
Detection:
Detecting intrusions in networks is similar to an alarm system protecting a facility: someone breaks into a house and an alarm goes off. In the realm of computers this is accomplished by Intrusion Detection Systems.
The problems with these systems are false alarms and undetected attacks. A system might alert on suspicious or malicious activity even if the data was valid production traffic. Due to the high network traffic on most networks it is extremely difficult to process every packet, so the chance of false alarms increases with the amount of data processed. High traffic also leads to undetected attacks: when the system is not able to process all data, it has to drop certain packets, which leaves them unscanned. An attacker can benefit from such high loads of network traffic.
Response:
After successfully detecting an attack we need information to prevent further threats of the same type. Or, in case an institution has established a security policy and one of its employees violated it, the administration needs proper evidence. Honeypots provide exact evidence of malicious activity: as they are not part of the production systems, any packet sent to them is suspicious and is recorded for analysis. The difference from a production server is that there is no regular traffic, such as traffic to and from a web server. This reduces the amount of data recorded dramatically and makes evaluation much easier. With that specific information it is fairly easy to start effective countermeasures.
So far, Honeypots have been described by their field of application. To describe them in greater detail it is necessary to look at their level of interaction with the attacker.
Low-interaction Honeypots:
A low-interaction Honeypot emulates network services only to the point that an intruder can log in but perform no actions; in some cases a banner is sent back to the origin, but not more. Low-interaction Honeypots are used only for detection and serve as production Honeypots.
Like IDS systems, low-interaction Honeypots log and detect attacks; furthermore they are capable of responding to certain login attempts, whereas an IDS stays passive. The attacker only gains access to the emulated service, and the underlying operating system is not touched in any way. Hence this is a very secure solution which poses little risk to the environment it is installed in.
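A minimal low-interaction Honeypot along these lines can be sketched in a few lines of Python: it accepts one connection, returns a fake service banner, and records whatever the client sends. The SMTP-style banner and the log format are invented for the illustration; a real tool would run this in a loop under tighter sandboxing.

```python
import datetime
import socket
import threading

BANNER = b"220 mail.example.net ESMTP ready\r\n"   # invented fake greeting

def serve_once(listener, log):
    """Accept one connection, send the banner, record what the client sends."""
    conn, addr = listener.accept()
    conn.sendall(BANNER)                 # the only "service" offered
    conn.settimeout(2.0)
    try:
        data = conn.recv(4096)           # everything the "attacker" types
    except socket.timeout:
        data = b""
    log.append((datetime.datetime.utcnow().isoformat(), addr[0], data))
    conn.close()

# Demo: bind an ephemeral localhost port and probe the honeypot ourselves.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
log = []
worker = threading.Thread(target=serve_once, args=(listener, log))
worker.start()

probe = socket.create_connection(("127.0.0.1", listener.getsockname()[1]))
greeting = probe.recv(1024)              # the fake banner comes back
probe.sendall(b"HELO attacker\r\n")
probe.close()
worker.join()
listener.close()
print(greeting)
print(log[0][1:])                        # source address and captured input
```

Note that the operating system is never exposed: the attacker talks only to this tiny emulation, which is exactly why the risk stays low.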
Medium-interaction Honeypots:
Medium-interaction Honeypots are further capable of emulating full services or specific vulnerabilities; for example, they could emulate the behavior of a Microsoft IIS web server. Their primary purpose is detection, and they are used as production Honeypots.
Like low-interaction Honeypots, they are installed as an application on the host operating system and only the emulated services are exposed to the public. But the emulated services on medium-interaction Honeypots are more powerful, so the chance of failure is higher, which makes the use of medium-interaction Honeypots more risky.
High-interaction Honeypots:
These are the most elaborate Honeypots. They either emulate a full operating system or use a real installation of an operating system with additional monitoring. High-interaction Honeypots are used primarily as research Honeypots but can also serve as production Honeypots.
As they offer a full operating system, the risk involved is very high: an intruder could easily use the compromised platform to attack other devices in the network or cause bandwidth losses by creating enormous traffic.
Types of attacks:
There are many attacks on networks, but only two main categories of attack.
Random attacks:
Most attacks on the Internet are performed by automated tools, often used by unskilled users, the so-called script kiddies, which search for vulnerabilities or already installed backdoors (see introduction). This is like walking down a street and trying to open every car by pulling the handle: by the end of the day at least one car will be found unlocked. Most of these attacks are preceded by scans of entire IP address ranges, which means that any device on the net is a possible target.
Direct attacks:
A direct attack occurs when a Black hat wants to break into a system of his choice, such as an e-commerce web server containing credit card numbers. Here only one system is targeted, often through unknown vulnerabilities. A good example is the theft of 40 million credit card details at MasterCard International: CardSystems Solutions, a third-party processor of payment data, encountered a security breach which potentially exposed more than 40 million cards of all brands to fraud. "It looks like a hacker gained access to CardSystems' database and installed a script that acts like a virus, searching out certain types of card transaction data." Direct attacks are performed by skilled hackers and require experienced knowledge. In contrast to the tools used for random attacks, the tools used by experienced Black hats are not common; often the attacker uses a tool which is not published in the Black hat community. This increases the threat of those attacks.
Security categories:
To assess the value of Honeypots we will break down security into three categories as defined by Bruce Schneier in Secrets and Lies. Schneier breaks security into prevention, detection and response.
Prevention:
As noted earlier, the preventive value of Honeypots is small: they can detect random attacks but not prevent them, and they only delay a direct attacker who happens to hit the Honeypot before a real server. One additional aspect: if an institution publishes the information that it uses a Honeypot, this might deter attackers from hacking. But this lies more in the field of psychology and is too abstract to add measurable value to security.
Hadoop Server
Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the center of a growing
ecosystem of big data technologies that are primarily used to support advanced analytics
initiatives, including predictive analytics, data mining and machine learning applications.
Hadoop can handle various forms of structured and unstructured data, giving users more
flexibility for collecting, processing and analyzing data than relational databases and data
warehouses provide.
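The MapReduce model at the heart of Hadoop can be illustrated with a toy word count in plain Python. The three functions below mimic the map, shuffle and reduce phases without any Hadoop machinery; the input lines are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(key, counts):
    # Reduce: sum all counts emitted under the same key.
    return (key, sum(counts))

def mapreduce(lines):
    # Shuffle: sort and group the intermediate pairs by key, as Hadoop
    # does between the map and reduce phases.
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    return dict(reduce_phase(key, (c for _, c in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

result = mapreduce(["big data needs big clusters", "big data"])
print(result)  # {'big': 3, 'clusters': 1, 'data': 2, 'needs': 1}
```

Hadoop runs the same three phases, but distributes the map and reduce tasks across the cluster and persists the intermediate data, which is what makes it scale to data sets that do not fit on one machine.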
Configuration for Hadoop 1.x:
1. Fetch Hadoop using a version control system (Subversion or Git) and check out branch-1 or the particular release branch. Otherwise, download a source tarball from the CDH3 releases or the Hadoop releases.
2. Generate Eclipse project information using Ant via the command line:
For Hadoop (1.x or branch-1), "ant eclipse"
For CDH3 releases, "ant eclipse-files"
3. Pull the sources into Eclipse:
1. Go to File -> Import.
2. Select General -> Existing Projects into Workspace.
3. For the root directory, navigate to the top directory of the downloaded source.
Fig : Hadoop Initialized
1. Generate Eclipse project information using Maven: mvn clean && mvn install -DskipTests && mvn eclipse:eclipse. Note: mvn eclipse:eclipse generates a static .classpath file that Eclipse uses; this file is not automatically updated as the project or its dependencies change.
2. Pull sources into Eclipse:
1. Go to File -> Import.
2. Select General -> Existing Projects into Workspace.
3. For the root directory, navigate to the top directory of the above downloaded
source.
Execute tar -xzf hadoop-0.19.1.tar.gz at the Cygwin prompt; this will unpack the Hadoop distribution. Once this is done, a newly created directory called hadoop-0.19.1 is displayed.
Verify that unpacking succeeded by executing cd hadoop-0.19.1 and then ls -l, which produces the output mentioned below and confirms that everything was unpacked correctly.
Fig : Hadoop Advanced Settings Configuration
In the next step, click on the Configure Hadoop Installation link displayed on the right side of the project configuration window. The project preferences window is shown in the image below. Fill in the location of the Hadoop directory under Hadoop Installation Directory in the preferences, click OK, and then close the project window after clicking Finish.
Scenario I – unprotected environment
In an unprotected environment any IP address on the Internet is able to initiate connections to any port on the Honeypot. The Honeypot is accessible from the entire Internet.
Scenario II – protected environment
In this scenario the Honeypot is connected to the Internet through a firewall. The firewall limits access to the Honeypot: not every port is accessible from the Internet, and not every IP address on the Internet is able to initiate connections to the Honeypot. This scenario does not state the degree of connectivity; it only states that there are some limitations, which can be either strict, allowing almost no connections, or loose, denying only a few. The firewall can be a standard firewall or a firewall with NAT capabilities (see chapter 3.3). However, a public IP address is always assigned to the firewall.
Fig: Protected Environment
This scenario also focuses on the IP address of the Honeypot: here the Honeypot is assigned a private address. Private addresses are specified in [RFC 1918]. In contrast to public addresses, private IPs cannot be addressed from the Internet; packets with private addresses are discarded on Internet gateway routers. To connect to a private address, a host needs to be located within the same address range, or it needs a gateway with a route to the target network.
The Internet Assigned Numbers Authority (IANA) reserved three blocks of IP addresses, namely 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16, for private internets. For interconnecting private and public networks an intermediate device is used. That device needs to implement Network Address Port Translation (NAPT) [RFC 3022]. NAPT translates many IP addresses and their related ports to a single IP address and related ports, which hides the addresses of the internal network behind a single public IP. Outbound access is transparent to most applications. Unfortunately, some applications depend on the local IP address sent in the payload; for example, FTP sends a PORT command [RFC 959] containing the local IP. Those applications require an Application Layer Gateway which rewrites the IP in the payload. The applications on the Honeypot are therefore not aware of the public IP and are limited by the functionality of the intermediate network device.
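A toy model of such a NAPT table, assuming a single public IP and an invented public-port range, can make the mechanism concrete. This is only a sketch of the mapping logic, not a packet-level implementation:

```python
import itertools

class Napt:
    """Toy NAPT table: many private (ip, port) pairs share one public IP."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = itertools.count(first_port)
        self.out = {}    # (private_ip, private_port) -> public_port
        self.back = {}   # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip, private_port):
        # Outbound: allocate (or reuse) a public port for this private pair.
        key = (private_ip, private_port)
        if key not in self.out:
            port = next(self.next_port)
            self.out[key] = port
            self.back[port] = key
        return (self.public_ip, self.out[key])

    def translate_in(self, public_port):
        # Inbound: deliverable only if an outbound mapping already exists;
        # unsolicited packets have no entry and are dropped.
        return self.back.get(public_port)

napt = Napt("203.0.113.7")
print(napt.translate_out("192.168.1.10", 5555))  # ('203.0.113.7', 40000)
print(napt.translate_out("192.168.1.11", 5555))  # ('203.0.113.7', 40001)
print(napt.translate_in(40001))                  # ('192.168.1.11', 5555)
print(napt.translate_in(49999))                  # None
```

The last line is exactly why a private-address Honeypot is hard to reach from the outside: with no prior outbound mapping, inbound packets have nowhere to go.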
A Honeypot allows external addresses to establish connections, which means that packets from the outside are replied to. Without a Honeypot there would be no such response, so a Honeypot increases traffic on purpose, especially traffic which is suspected to be malicious.
Security mechanisms need to make sure that this traffic does not affect the production systems. Moreover, the amount of traffic needs to be controlled: a hacker could use the Honeypot to launch a DoS or DDoS attack. Another possibility would be to use the Honeypot as a file server for stolen software, called warez in hacker terms. Both cases would increase bandwidth usage and slow production traffic.
As hacking techniques evolve, an experienced Black hat could launch a new kind of attack which is not recognized automatically. It could be possible to bypass the controlling functions of the Honeypot and misuse it; such activity could escalate and turn the Honeypot into a severe threat. A Honeypot operator needs to be aware of this risk and therefore check the Honeypot on a regular basis.
Scenario VI – Honeypot-out-of-the-box
A Honeypot-out-of-the-box is a ready-to-use solution, which could also be thought of as a commercial product. The question is which features are needed. As shown in the previous chapters there is a wide range of eventualities. To be sufficient, a complete product needs to cover security, hide from the attacker, offer good analyzability, provide easy access to captured data, and include automatic alerting functions.
The data owner splits the access right to the encrypted data into n pieces, with each legitimate user holding one piece of the access right. This can effectively reduce the risk of information leakage in big data storage.
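One simple way to realize such an n-piece split is n-of-n XOR secret sharing, sketched below. This is an illustrative scheme chosen for clarity, not necessarily the construction used in the work described here:

```python
import os
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key, n):
    """Split `key` into n shares; all n are required to reconstruct it."""
    shares = [os.urandom(len(key)) for _ in range(n - 1)]
    # The last share is chosen so that the XOR of all shares equals the key.
    shares.append(reduce(xor_bytes, shares, key))
    return shares

def combine(shares):
    return reduce(xor_bytes, shares)

key = os.urandom(16)
shares = split_key(key, 5)           # one share per legitimate user
assert combine(shares) == key        # all five pieces recover the key
assert combine(shares[:4]) != key    # fewer pieces reveal nothing
                                     # (except with negligible probability)
```

Because every incomplete subset of shares is uniformly random, a storage provider or attacker holding fewer than all n pieces learns nothing about the access key; threshold schemes such as Shamir's secret sharing generalize this to k-of-n access.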
Validate Result
In the TeraSort experiment an interesting case (Case 2) has been observed. It has been analyzed and verified that the execution time first decreases and then increases when the data size is kept constant and the number of nodes is increased from 1 to 4. This happens because the load (data set size) is constant while the communication overhead between nodes grows as the number of nodes increases from 1 to 4.
Fig: Data Centers Optimization Time
The performance modeling framework also enables automated tuning of job settings (e.g., the number of reduce tasks) across applications defined as sequential Map Reduce workflows, optimizing both the completion time and the resource usage of the workflows. A further goal is to develop novel performance models and resource allocation strategies that can take into consideration the high degree of variance in highly virtualized environments.
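As a hedged illustration of what such a performance model computes, the classic makespan bounds for one phase of n independent tasks on k slots (in the spirit of the ARIA-style bounds often used for Map Reduce completion-time estimation) can be written in a few lines; the task durations below are invented:

```python
def phase_bounds(durations, slots):
    """Makespan bounds for n independent tasks greedily scheduled on k slots.

    The phase cannot finish faster than perfect load balance, n*avg/k,
    and (by the classic greedy-scheduling argument) it finishes no later
    than (n-1)*avg/k + max(duration).
    """
    n = len(durations)
    avg = sum(durations) / n
    low = n * avg / slots
    up = (n - 1) * avg / slots + max(durations)
    return low, up

# A hypothetical map phase: 8 task durations (seconds) on 4 map slots.
low, up = phase_bounds([10, 10, 12, 8, 10, 9, 11, 10], slots=4)
print(low, up)  # 20.0 29.5
```

Chaining such per-phase bounds along the jobs of a workflow is one way to estimate end-to-end completion time for a given resource allocation, and hence to search for the smallest allocation that still meets a deadline.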
Conclusion
This dissertation centers on performance modeling and resource management for Map Reduce applications. It introduces a performance modeling framework for estimating the completion time of complex Map Reduce applications, defined as a DAG of Map Reduce jobs, when they are executed on a given platform with different resource allocations and different input data sets. Based on this performance modeling framework, we further introduce resource allocation strategies as well as a customized deadline-driven scheduler for estimating and controlling the amount of resources that should be allocated to each application to meet its (soft) deadline.