
AFROMEDIA BIG DATA COMMUNICATIONS LTD



HARNESSING BIG DATA FOR SECURITY: INTELLIGENCE IN THE ERA OF CYBER WARFARE
WRITTEN BY CHRISTOPHER ALVIN MOKAYA



BASICS: DEFINING BIG DATA AND RELATED TECHNOLOGIES.
According to Gartner, Big Data is "high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making."
Are there examples of Big Data in action at major global IT giants?
It is evident from allprogrammingtutorials.com that Big Data generates value from the storage
and processing of very large quantities of digital information that cannot be analyzed with
traditional computing techniques. It requires different techniques, tools, algorithms and
architectures. Some of the Big Data tools and technologies are Apache Hadoop, Apache Spark,
the R language and Apache ZooKeeper.
Facebook and Google rely heavily on Big Data. Let us base our first case study on Facebook.
Every time one of the 1.2 billion people who use Facebook visits the site, they see a completely
unique, dynamically generated home page. There are several different applications powering this
experience, and others across the site, that require global, real-time data fetching. In this
section, we will discuss some of the tools, frameworks and applications that Facebook developed
to overcome the challenge of processing this huge data:
1. RocksDB - RocksDB is an embeddable persistent key-value store for fast storage.
RocksDB can also serve as the foundation for a client-server database, but its primary
focus is on embedded workloads. RocksDB builds on LevelDB to be scalable to run on servers
with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory
and write-once workloads, and to be flexible enough to allow for innovation.
Facebook built RocksDB for storing and accessing hundreds of petabytes of data and is
constantly improving and overhauling its tools to make this as fast and efficient as
possible (a minimal usage sketch follows after this list).
2. Corona - Corona is a scheduling framework developed by Facebook to overcome
the limitations of the Apache Hadoop MapReduce scheduling framework, which is
responsible for two functions: cluster resource management and job tracking.
Facebook noticed that the MapReduce scheduling framework was not able to cope well
with peak data loads. Facebook solved this problem with the Corona scheduling
framework, which separates cluster management from job tracking, allowing it to
process peak data loads at optimal speeds (a conceptual sketch of this separation
follows after this list).
3. Presto - Presto is an open source distributed SQL query engine for running interactive
analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Presto allows querying data where it lives, including Hive, Cassandra, relational
databases or even proprietary data stores. A single Presto query can combine data from
multiple sources, allowing for analytics across your entire organization.
Facebook uses Presto for interactive queries against several internal data stores,
including its 300PB data warehouse. Over 1,000 Facebook employees use Presto daily
to run more than 30,000 queries that in total scan over a petabyte per day (a sample
query over JDBC follows after this list).
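
To make these tools more concrete, here is a minimal sketch of embedded RocksDB usage
through its RocksJava bindings. The database path, key and value are illustrative only:

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class RocksDbDemo {
        static { RocksDB.loadLibrary(); } // load the native library once per process

        public static void main(String[] args) throws RocksDBException {
            // Open (or create) an embedded key-value store in a local directory
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
                db.put("user:1001".getBytes(), "alice".getBytes()); // write
                byte[] value = db.get("user:1001".getBytes());      // point lookup
                System.out.println(new String(value));
            }
        }
    }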
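
Corona itself is Facebook-internal, so there is no public API to demonstrate. The sketch
below is purely hypothetical: every interface and type name is invented for illustration,
and it only shows the separation of cluster management from per-job tracking described above:

    // Hypothetical sketch: these interfaces are invented, not Corona's real API.
    interface ClusterManager {
        // Tracks nodes and free compute slots; knows nothing about individual jobs.
        Slot requestSlot(int memoryMb, int cpuCores);
        void releaseSlot(Slot slot);
    }

    interface JobTracker {
        // One tracker per job: schedules only that job's tasks into granted slots.
        void scheduleTask(String taskName, Slot slot);
        boolean isJobComplete();
    }

    final class Slot {
        final String host;
        final int id;
        Slot(String host, int id) { this.host = host; this.id = id; }
    }

Because resource management no longer waits on per-job bookkeeping, slots can be granted
to new jobs even while existing jobs are at peak load, which is the behavior the text
above describes.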
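
Presto is typically queried over its JDBC driver. A minimal sketch follows; the coordinator
host, catalog, schema and the page_views table are placeholders for whatever exists in your
own cluster:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PrestoQueryDemo {
        public static void main(String[] args) throws Exception {
            // URL format: jdbc:presto://<coordinator>:<port>/<catalog>/<schema>
            String url = "jdbc:presto://coordinator.example.com:8080/hive/default";
            try (Connection conn = DriverManager.getConnection(url, "analyst", null);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT country, count(*) AS visits FROM page_views "
                     + "GROUP BY country ORDER BY visits DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("country") + " " + rs.getLong("visits"));
                }
            }
        }
    }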



Google probably processes more information than any other company on the planet and tends to
have to invent tools to cope with that data. As a result, its technology runs a good five to
ten years ahead of the competition. Google has come up with quite a few big data processing
frameworks, such as MapReduce and FlumeJava, on which many big data technologies such as
Hadoop have been built. In this section, we will discuss some of the big data technology
stack at Google:
1. Google Mesa - Mesa is a highly scalable analytic data warehousing system that stores
critical measurement data related to Google's Internet advertising business. Mesa is
designed to satisfy a complex and challenging set of user and systems requirements,
including near real-time data ingestion and queryability, as well as high availability,
reliability, fault tolerance, and scalability for large data and query volumes. Specifically,
Mesa handles petabytes of data, processes millions of row updates per second, and serves
billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across
multiple datacenters and provides consistent and repeatable query answers at low latency,
even when an entire datacenter fails.
2. Google File System - Google File System is a scalable distributed file system for large
distributed data-intensive applications. It provides fault tolerance while running on
inexpensive commodity hardware, and it delivers high aggregate performance to a large
number of clients. It is widely deployed within Google as the storage platform for the
generation and processing of data used by Google services, as well as research and
development efforts that require large data sets.
Google File System is the basis of Hadoop's HDFS, which is actively used by many big
data tools and databases such as HBase, Hive and Spark (a short HDFS read/write sketch
follows after this list).
3. BigTable - Bigtable is a distributed storage system for managing structured data that is
designed to scale to a very large size: petabytes of data across thousands of commodity
servers. Many projects at Google store data in Bigtable, including web indexing, Google
Earth and Google Finance. These applications place very different demands on Bigtable,
both in terms of data size (from URLs to web pages to satellite imagery) and latency
requirements (from backend bulk processing to real-time data serving). HBase, the
open-source system modeled on the Bigtable paper, is sketched after this list.
4. Google Flume - FlumeJava is a Java framework developed at Google for MapReduce
computations. MapReduce enables distributed computing, but not every real-life problem
can be described as a single MapReduce task. Instead, most real-life problems require a
chain of MapReduce tasks for complete processing, which in turn requires intermediate
code to pipeline those tasks. FlumeJava resolves this problem by providing pipelining of
MapReduce tasks out of the box.
In the open-source world, Apache Crunch is modeled directly on FlumeJava (the similarly
named Apache Flume is an unrelated log-collection service), and FlumeJava's ideas also
live on in Google Cloud Dataflow and Apache Beam (a Crunch pipeline sketch follows
after this list).


5. Google MillWheel - MillWheel is a framework for building low-latency data-processing
applications that is widely used at Google. Users specify a directed computation graph
and application code for individual nodes, and the system manages persistent state and
the continuous flow of records, all within the envelope of the framework's fault-tolerance
guarantees. MillWheel's programming model provides a notion of logical time, making it
simple to write time-based aggregations. MillWheel was designed from the outset with
fault tolerance and scalability in mind. In practice, Google finds that MillWheel's unique
combination of scalability, fault tolerance and a versatile programming model lends
itself to a wide variety of problems (an event-time windowing sketch follows after this list).
6. Dremel - Dremel is a scalable, interactive ad-hoc query system for analysis of read-only
nested data. By combining multi-level execution trees and columnar data layout, it is
capable of running aggregation queries over trillion-row tables in seconds. The system
scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.
The Dremel paper presents a novel columnar storage representation for nested records and
discusses experiments on few-thousand-node instances of the system.
Google offers a cloud analytics platform called BigQuery, based on Dremel, that enables
companies to process huge volumes of structured data at lightning-fast speeds (a BigQuery
query sketch follows after this list).
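
First, a minimal sketch of writing and reading a file through Hadoop's HDFS, the
open-source descendant of the Google File System design. The path is illustrative,
and the cluster address (fs.defaultFS) is assumed to come from core-site.xml:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/hdfs-demo.txt");
            try (FSDataOutputStream out = fs.create(path, true)) { // write (overwrite)
                out.writeBytes("GFS ideas, HDFS implementation\n");
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path)))) {       // read it back
                System.out.println(in.readLine());
            }
        }
    }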
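
Bigtable itself is Google-internal (today exposed as Cloud Bigtable), so the sketch below
uses Apache HBase, the open-source system built on the Bigtable paper. The webtable layout
echoes the paper's running example; the table, family and qualifier names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WebTableDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("webtable"))) {
                // Row key is the URL; column family "contents" holds the page body
                Put put = new Put(Bytes.toBytes("com.example.www/index.html"));
                put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                              Bytes.toBytes("<html>...</html>"));
                table.put(put);

                Result result = table.get(
                    new Get(Bytes.toBytes("com.example.www/index.html")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
            }
        }
    }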
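
Next, a word-count pipeline in Apache Crunch, the open-source library modeled on FlumeJava;
Crunch plans this high-level program into a chain of MapReduce jobs automatically. Input and
output paths are taken from the command line:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Pipeline pipeline = new MRPipeline(WordCount.class);
            PCollection<String> lines = pipeline.readTextFile(args[0]);

            // Tokenize each line; Crunch decides how many MapReduce jobs to chain
            PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
                public void process(String line, Emitter<String> emitter) {
                    for (String word : line.split("\\s+")) {
                        emitter.emit(word);
                    }
                }
            }, Writables.strings());

            PTable<String, Long> counts = words.count(); // aggregate per word
            pipeline.writeTextFile(counts, args[1]);
            pipeline.done(); // plan and execute the underlying MapReduce chain
        }
    }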
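
MillWheel is likewise internal, but its notion of logical time survives in Apache Beam,
whose model descends from FlumeJava and MillWheel via Google Cloud Dataflow. The sketch
below counts events per one-minute window of event time; it assumes an unbounded
PCollection of event IDs arriving from some source such as KafkaIO:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowedCounts {
        // Assumes an unbounded PCollection<String> of event IDs, e.g. from KafkaIO
        static PCollection<Long> countPerMinute(PCollection<String> events) {
            return events
                // Assign each record to a one-minute window of logical (event) time
                .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
                // Count within each window; the runner manages the persistent state
                .apply(Count.<String>globally().withoutDefaults());
        }
    }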
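
Finally, Dremel's public face is BigQuery. A minimal query sketch with the Google Cloud
Java client follows; it assumes Application Default Credentials are configured and queries
a public sample dataset:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.TableResult;

    public class BigQueryDemo {
        public static void main(String[] args) throws Exception {
            // Uses Application Default Credentials and the active GCP project
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
            QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                    "SELECT name, SUM(number) AS total "
                    + "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
                    + "GROUP BY name ORDER BY total DESC LIMIT 5")
                .setUseLegacySql(false)
                .build();
            TableResult result = bigquery.query(query); // interactive, Dremel-style scan
            for (FieldValueList row : result.iterateAll()) {
                System.out.println(row.get("name").getStringValue()
                        + " " + row.get("total").getLongValue());
            }
        }
    }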



HARNESSING BIG DATA FOR SECURITY: INTELLIGENCE IN THE ERA OF CYBER WARFARE.
There can be information without security, but there can be no security without information. This is
unequivocally supported by the great French military strategist and tactician Napoleon Bonaparte,
who rightly asserted that "war is ninety percent information."
In the Digital Era, which has heralded the dawn of the Information Age and the frontier of
cyber-terrorism, Napoleon would surely have noted that there can be no Command without Cyber Command!
Blind movement on the World Wide Web must not be confused with progress. Intelligent navigation
requires real Predictive Analytics. We must not only detect events, but also study, analyze and
foresee future patterns of occurrence.
With a combination of powerful information technology tools and methodologies such as Data Mining,
Predictive Analytics, Artificial Intelligence and Machine Learning, it is possible to fight a smart war
against bandits, terrorists, cyber fraudsters and other gangsters at the click of a button.
Will the construction of a physical wall along the border with Somalia stop Al-Shabaab terrorists
from attacking Kenya?

As seen recently in a shooting in Texas, USA, even walls in the form of vast oceans between the
Middle East and America have not stopped the Islamic State from striking American soil!
Petersen gives a snapshot of what it takes to prepare for intelligence-led missions:
Now looking back over nearly 40 years, I think I have learned the following six things.

First, how one thinks about the mission affects deeply how one does the mission.
Second, intelligence failures come from failing to step back to think about underlying
trends, forces and assumptions, not from failing to connect dots or to predict the future.
Third, good analysis makes the complex comprehensible, which is not the same as
simple.
Fourth, there is no substitute for knowing what one is talking about, which is not the
same as knowing the facts.
Fifth, intelligence analysis starts when we stop reporting on events and start explaining
them.
Sixth, managers of intelligence analysts get the behavior they reward, so they had better
know what they are rewarding.



BIG DATA OR SMART DATA?
Today, data is growing by leaps and bounds in volume, variety and velocity. There can be no meaningful
data analysis without proper management of Big Data.
There is no telling who controls the Internet, nor who exactly contributes to the millions of websites
being created every moment. The more traditional, rigid and bureaucratic government departments are
finding it hard to keep pace with modern, young minds full of dynamism and terrific zeal. The most
horrifying fact is that these young minds are being tapped, funded and indoctrinated by jihadist and
terrorist organizations to further their evil agenda: the agenda of eliminating any individual who
does not accept their ideology.
Therefore, it is crystal clear that for security agencies and governments to fight terrorism
effectively, they must equally invest in a dynamic pool of digital talent that will ignite a seamless
network of smart, agile, adaptive and disruptive cyber geniuses. It is possible! Perhaps you would be
happy to work in such an exciting pool of digital talent.
So, how can governments leverage the power of Big Data for the security needs of our Digital Age?
It is not all about abracadabra solutions. Think tanks must be created, digital resources must be
mobilized, and brows must be knit as the mind retires into depths of thought, yielding a remarkable
new streak of innovations that will not only analyze the huge Big Data piles around us, but also
invent brand-new intelligence tools that work smart around the clock to process Big Data into
actionable, smart information that enhances security.
Wait a minute! Aren't there enough technologies for Big Data already?
Yes, there are technologies such as Apache Hadoop, and it is Open Source! However, it is not enough,
for it can be developed into a more advanced innovation that keeps pace with the dynamism of our
digital era. Let's have a flashback. Recently, an eminent expert gave a great definition of data
security:

Security. What is security? Dan Geer defined it best. Keynoting at the Recorded Future User
Network (RFUN) Conference in Washington, D.C., Geer said:
"Security is about the absence of surprises that can't be mitigated. As such, security that is well
thought out is security that changes the probability of surprise while foregoing as little as possible."
Therefore, there is no taking chances when it comes to matters of security. Planning must be
comprehensive. Implementing data security plans must be surgically thorough.
For instance, America is at the forefront of the Data Mining/Big Data battleground through the Data
Analysis and Research for Trade Transparency System (DARTTS), a system within the Department of
Homeland Security that works well in the prevention of money laundering and trade-related crimes.
Why then can't Kenya use the platform of Big Data to detect, deter and decimate terrorists?



REFERENCES
Department of Homeland Security. (2011). Data Mining Report 2011. Retrieved 8 May 2015 from
http://www.au.af.mil/au/awc/awcgate/dhs/privacy_rpt_datamining_2011.pdf
Gartner, Inc. (2013). IT Glossary: Big Data. Retrieved 11 May 2015 from
http://www.gartner.com/it-glossary/big-data
Recorded Future. (2015). Dan Geer Keynote: Data and Open Source Security. Retrieved from
https://www.recordedfuture.com/dan-geer-keynote/
Sain Technology Solutions. (2015). Big Data Fundamentals. Retrieved 11 May 2015 from
http://www.allprogrammingtutorials.com/big-data-fundamentals/
