WWW.DBTA.COM

BIG DATA SOURCEBOOK
DECEMBER 2014
From the publishers of Database Trends and Applications

CONTENTS

introduction
The Big Data Frontier
Joyce Wells

industry updates
How Businesses Are Driving Big Data Transformation
John O'Brien

The Enabling Force Behind Digital Enterprises
Joe McKendrick

Data Integration Evolves to Support a Bigger Analytic Vision
Stephen Swoyer

Turning Data Into Value Using Analytics
Bart Baesens

As Clouds Roll In, Expectations for Performance and Availability Billow
Michael Corey, Don Sullivan

Social Media Analytics Tools and Platforms: The Need for Speed
Peter J. Auditore

The Big Data Challenge to Data Quality
Elliot King

Building the Unstructured Big Data/Data Warehouse Interface
W. H. Inmon

Big Data Poses Security Risks
Geoff Keston

PUBLISHED BY Unisphere Media, a Division of Information Today, Inc.

EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055

Thomas Hogan Jr., Group Publisher; 609-654-6266; thoganjr@infotoday
Joyce Wells, Managing Editor; 908-795-3704; Joyce@dbta.com
Joseph McKendrick, Contributing Editor; Joseph@dbta.com
Norma Neimeister, Production Manager
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Alexis Sopko, Advertising Coordinator; 908-795-3703; asopko@dbta.com
Sheila Willison, Marketing Manager, Events and Circulation; 859-278-2223; sheila@infotoday.com
Adam Shepherd, Editorial and Advertising Assistant; 908-795-3705
DawnEl Harris, Director of Web Events; dawnel@infotoday.com
Celeste Peterson-Sloss, Lauree Padgett, Alison A. Trotta, Editorial Services

ADVERTISING
Stephen Faig, Business Development Manager; 908-795-3702; Stephen@dbta.com

INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Thomas H. Hogan, President and CEO
Thomas Hogan Jr., Vice President, Marketing and Business Development
Richard T. Kaser, Vice President, Content
Bill Spence, Vice President, Information Technology

BIG DATA SOURCEBOOK is published annually by Information Today, Inc., 143 Old Marlton Pike, Medford, NJ 08055.

POSTMASTER
Send all address changes to: Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055.

Copyright 2014, Information Today, Inc. All rights reserved.
PRINTED IN THE UNITED STATES OF AMERICA

The Big Data Sourcebook is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the big data phenomenon and the need to better manage and extract value from large quantities of structured, unstructured, and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.

No part of this magazine may be reproduced by any means (print, electronic, or any other) without written permission of the publisher.

COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.

Creation of derivative works, such as informative abstracts, unless agreed to in writing by the copyright owner, is forbidden.

Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.

The views in this publication are those of the authors and do not necessarily reflect the views of Information Today, Inc. (ITI) or the editors.

© 2014 Information Today, Inc.

The Big Data Frontier


By Joyce Wells

The rise of big data, cloud, mobility, and the proliferation of connected devices, coupled with newer data
management approaches, such as Hadoop, NoSQL, and
in-memory systems, are increasing the opportunities for
enterprises to harness data. However, with this new frontier there are challenges to be overcome. As they work to
maintain legacy applications and systems, IT organizations must address new demands for more timely access
to more data from more users, in addition to maintaining continuous availability of IT systems, and enforcing
appropriate data governance.
It's a lot to think about. How can companies choose the
right approach to leverage big data while keeping newer
technologies in line with budgetary, application availability, and security concerns?
Over the past year, Unisphere Research, a division of
Information Today, Inc., has conducted surveys among IT
professionals to gain insight into the challenges organizations are facing.
The information overload is already taking its toll on
IT organizations and professionals. According to a Unisphere Research report, "Governance Moves Big Data From Hype to Confidence," the percentage of organizations with big data projects is expected to triple by the
end of 2015. However, while organizations are investing
in increasing the information at their disposal, they are
finding that they are committing more time to simply
locating the necessary data, as opposed to actually analyzing it. In addition, the report, based on a survey of 304
data management professionals and sponsored by IBM,
found that respondents tend to be less confident about
data gathered through social media and public cloud
applications.
With all this data, there are also concerns about the
ability to maintain the high availability mandated by
today's stringent service level agreements. According to
another Unisphere Research survey sponsored by EMC,
and conducted among 315 members of the Independent Oracle Users Group (IOUG), close to one-fourth
of respondents' organizations have SLAs of four nines of
availability or greater, meaning that they can have only 52
minutes or less of downtime a year. The survey, "Bringing Continuous Availability to Oracle Environments," found
that more than 25% of respondents dealt with more than
8 hours of unplanned downtime during the previous year, which they attributed to network outages, server failures, storage failures, human error, and power outages.
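
For readers who want to verify the "four nines" figure, the downtime budget follows directly from the availability percentage. A minimal sketch in Python, assuming a non-leap year of 525,600 minutes:

# Downtime budget implied by an availability SLA, in minutes per (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability):
    """Annual downtime allowance for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

print(round(downtime_minutes(0.9999), 1))   # "four nines" -> 52.6 minutes per year
print(round(downtime_minutes(0.99999), 1))  # "five nines" -> 5.3 minutes per year
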
As data management and access becomes more critical
to business success, Unisphere Research finds that IT professionals are embracing their expanded roles and relish
the opportunity to work with new technologies. Increasingly, they want to be at the center of the action, and are
assuming roles associated with data science, but too often
they see themselves being forced into the job of firefighting rather than strategic, high-value tasks. The benefits of
ongoing staff training and use of cloud and database automation are some of the approaches cited in the report,
"The Vanishing Database Administrator," sponsored by
Ntirety, a division of HOSTING.
Indeed, the increasing size and complexity of database environments is stretching IT resources thin, causing organizations to seek ways to automate routine tasks
to free up assets such as tapping into virtualization and
cloud. According to "The Empowered Database," a report
based on a survey of 338 IOUG members, and sponsored
by VMware and EMC, nearly one-third of organizations
are using or considering a public cloud service, and almost
half are currently using or considering a private cloud.
Still, we are just at the beginning of the changes to
come as a result of big data. In a recent Unisphere Research
Quick Poll, close to one-third of enterprises, or 30%,
report they have deployed the Apache Hadoop framework
in some capacity while another 26% said they planned
to adopt Hadoop within the next year. Strikingly, 91% of
respondents at Hadoop sites will be increasing their use
of Hadoop over the next 3 years, and one-third describe
expansion plans as significant. Key functions or applications supported by Hadoop projects include analytics and
business intelligence, working with IT operational data,
and supporting special projects.
To help shed light on the expanding territory of big
data, DBTA presents the second annual Big Data Sourcebook, a guide to the key enterprise and technology matters
IT professionals are grappling with as they take the journey to becoming data-driven enterprises. In addition to
articles penned by subject matter experts, leading vendors
also showcase their products and approaches to gaining
value from big data projects. Together, this combination
of articles and sponsored content provides insight into the
current big data issues and opportunities. n

sponsored content

Operational Big Data


It's the other big data. In fact, it's
a source of big data. Today, operational
databases must meet the challenges of
variety, velocity, and volume with millions
of users and billions of machines reading
and writing data via enterprise, mobile, and
web applications. The data is stored in an
operational database before it's stored in an
Apache Hadoop distribution.
It's audits, clickstreams, customer
information, financial investments and
payments, inventory and parts, locations,
logs, messages, patient records, plays and
scores, sensor readings, scientific data, social
interactions, user and process status, user
and visitor profiles, and more.
It drives the eCommerce, energy,
entertainment, finance, gaming,
healthcare, insurance, retail, social media,
telecommunications industries, and more.
Today, operational databases must read
and write billions of values, maintain low
latency, and sustain high throughput to
meet the challenges of velocity and volume.
They must sustain millions of operations
per second, maintain sub-millisecond
latency, and store billions of documents
and terabytes of data. They must be able to
support the evolution of data in the form of
new attributes and new types.
The ability to meet these challenges is
necessary to support an agile enterprise.
By doing so, the agile enterprise extracts
actionable intelligence. However, time is
of the essence. When a new type of data
emerges, operational databases must store
it without delay. When the number of users
and machines increases, the operational
database must continue to provide data access
without performance degradation. When the size of the data set increases, the operational database must continue to store data.

These challenges are met by a)


supporting a flexible data model and b)
scaling out on commodity hardware. They
are met by NoSQL databases. They are met
by Couchbase Server. It's a scalable, high-performance document database engineered
for reliability and availability. By supporting a
document model via JSON, it can store new
attributes and new types of data without
modification, index the data, and enable
near-real-time, lightweight analytics. By
implementing a shared-nothing architecture
with no single point of failure and consistent
hashing, it can scale with ease, on-demand,
and without affecting applications. By
integrating a managed object cache and
asynchronous persistence, it can maintain
sub-millisecond response times and sustain
high throughput. Couchbase Server was
engineered for operational big data and its
requirements.
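
As a generic illustration of the flexible data model described above (this is not Couchbase-specific code, and the document fields are hypothetical), a JSON document store can accept a record that adds new attributes and nested types without any schema change:

import json

# Two generations of the same kind of document; the newer one simply adds fields.
order_v1 = {"type": "order", "id": "order::1001", "total": 42.50}
order_v2 = {"type": "order", "id": "order::1002", "total": 18.75,
            "channel": "mobile",                   # new attribute, no migration required
            "geo": {"lat": 40.7, "lon": -74.0}}    # new nested type

for doc in (order_v1, order_v2):
    print(json.dumps(doc))  # both serialize and store side by side
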

While operational databases provide real-time data access and lightweight analytics, they must integrate with Apache Hadoop distributions for predictive analytics, machine learning, and more. While operational data feeds big data analytics, big data analytics feed operational data. The result is continuous refinement. By analyzing the operational data, it can be updated to improve operational efficiency. The result is a big data feedback loop.

Couchbase provides and supports a Couchbase Server plugin for Apache Sqoop to stream data to and from Apache Hadoop distributions. In fact, Cloudera certified it for Cloudera Enterprise 5. In addition, Couchbase provides and supports a Couchbase Server plugin for Elasticsearch to enable full text search over operational big data.

Finally, operational databases must meet the requirements of a global economy in the information age. Today, users and machines read and write data to enterprise, mobile, and web applications from multiple countries and regions. To maintain data locality, operational databases must support deployment to multiple data centers. To maintain the highest level of data locality, operational databases must extend to mobile phones/tablets and connected devices.

Couchbase Server supports both unidirectional and bidirectional cross data center replication. It enables the agile enterprise to deploy an operational database to multiple data centers in multiple regions and in multiple countries. It moves the operational database closer to users and machines. In addition, Couchbase Server can extend to mobile phones/tablets and connected devices with Couchbase Mobile. The platform includes Couchbase Lite, a native document database for iOS, Android, Java/Linux, and .NET, and Couchbase Sync Gateway to synchronize data between local databases and remote database servers. The combination of cross data center replication and mobile synchronization enables the agile enterprise to extend global reach to individual users and machines. If deployed to cloud infrastructure like Amazon Web Services or Microsoft Azure, there is no limit to how far Couchbase Server can scale or how far the agile enterprise can reach.

COUCHBASE
www.couchbase.com

industry updates

The State of Big Data in 2014

How Businesses
Are Driving Big Data
Transformation
By John O'Brien

In 2014, we continued to watch how big data


is enabling all things big about data and its
business analytics capabilities. We also saw the
emergence (and early acceptance) of Hadoop
Version 2 as a data operating platform, with
cornerstones of YARN (Yet Another Resource
Negotiator) and HDFS (Hadoop Distributed
File System). The ecosystem of Apache Foundation projects has continued to mature at a
rapid pace, while vendor products continue
to join, mature, and benefit from Hadoop
improvements.
In last year's Big Data Sourcebook we highlighted several items in "The State of Big Data" article worth recapping. First, we


referenced the battle over persistence for
data architectures, primarily in enterprise
adoption that dealt with the promise of
everything in Hadoop pundits and the its
OK to have another data platform. In 2014,
we witnessed the acceptance of these multitiered, specific workload capability architectures that, at Radiant Advisors, we refer to
as the modern data platform. With gaining
acceptance, Hadoop is here to stay and many
analysts refer to its role as inevitable. This,
naturally, is tempered with its maturity, the
ability for enterprises to find and/or train

resources, and specifying the proper first use


case project and long term strategy, such as
the data lake or enterprise data hub strategies.
We also discussed how companies needed
to understand how "data is data" when
approaching big data with big eyes. For
the most part, in 2014 we saw mainstream
companies shift from a "the sky is falling if I don't start a big data project" mindset to distinguishing big data projects as those for situations where the data wasn't typically relationally structured, or when it had volatile
schemas. "Schema on read" versus "schema on write" benefits and situations became much better understood in 2014, too.


And, more importantly, we have seen an
increasing understanding that all data can
be valuable and the need to explore data
for discovery and insights.
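
A small sketch may make the distinction concrete. Schema-on-write validates and reshapes records before they are stored; schema-on-read stores raw records and imposes structure only at query time. The example below is illustrative only, with hypothetical event records:

import json

# Raw, newline-delimited JSON events stored as-is (schema-on-read).
raw_lines = [
    '{"user": "a1", "action": "click"}',
    '{"user": "a2", "action": "view", "device": "mobile"}',  # extra field is fine
]

def views_by_device(lines, default_device="unknown"):
    """Each query decides which fields it cares about; structure is applied here."""
    counts = {}
    for line in lines:
        record = json.loads(line)
        device = record.get("device", default_device)
        counts[device] = counts.get(device, 0) + 1
    return counts

print(views_by_device(raw_lines))  # {'unknown': 1, 'mobile': 1}
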
Last year, we said that 2014 would be
the "race for access" hill as companies
demanded better access to data in Hadoop
by business analysts and power users and
that this access no longer be restricted to
programmers. As SQL reasserted itself as
the de facto standard for common knowledge users and existing data analysis and integration tools, the SQL access capabilities of Hadoop were under incredible
pressure to improve both in performance
and capability. Continued releases by Hortonworks with Hive/Tez, Cloudera Impala,
and MapR's Drill initiative delivered order-of-magnitude performance improvements for SQL access. The race was on: Actian's Vortex made a splash at the Hadoop Summit in June, and others, such as IBM and Pivotal, made significant improvements,
too. The race in 2014 continues going into
2015 with more SQL analytic capabilities
and performance improvements.

Hadoop 2 Ushers in
the Next Generation
The significance of Hadoop 2 has
recently started to resonate with companies and enterprise architects. Moving away from its batch-oriented origins,
YARN has clearly positioned the data
operating system as two separate fundamental architecture components.
While the HDFS will continue to evolve
as the caretaker of data in the distributed
file system architecture with improved
name node high availability and performance, YARN, introduced in Hadoop 2, completely changes the paradigm of data
engines and access. Though the primary
role of YARN is still that of a resource negotiator for the Hadoop cluster and focused
on managing the resource needs of tens of
thousands of jobs in the cluster, it has also
now established a new framework.
The YARN framework serves as a pluggable layer of YARN-certified engines designed to work with the data in different
ways. Previously, MapReduce was the primary programming framework for developers to create applications that leveraged
the parallelism of the data nodes. As other
project and data engines could work with
HDFS directly without MapReduce, a
centralized resource manager was needed
that would also enable innovation for new
data engines. MapReduce became its own
YARN engine for existing Hadoop 1 legacy
code, and Hive decoupled to work with
the new Tez engine. Long recognized as
ahead of the curve, Google caused quite a
fury when it announced that MapReduce
was dead and that they would no longer
develop in it. YARN was positioned for the
future of next-generation engines.
Sometimes in 2014 we felt that the
booming big data drum was starting to
die down. And, sometimes we wondered
if it only seemed that way because everyone was chanting "Storm" just a bit louder.
Another major driver in the Hadoop
implementations was that big data didn't
mean fast data. The industry wanted
both big and fast: The Spark environment
is where both early adopters were writing
new applications, and the development
community was quickly developing Spark
to be a high-level project to meet those

needs. The Spark community touts itself


as "lightning-fast cluster computing," primarily leveraging the in-memory capabilities
of the data nodes, but also a newer, faster
framework than MapReduce on disk.
While Spark was in its infancy in 2013,
we saw this need for big data speed being
tackled by two-tier distributed in-memory
architectures. Today, Spark is a framework
for Spark SQL, Spark Streaming, Machine
Learning, and GraphX running on
Hadoop 2s YARN architecture. In 2014,
this has been very exciting for the industry,
but many of the mainstream adopters are
patiently waiting for the early adopters to
do their magic.
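
For readers who have not yet touched Spark, a minimal PySpark job gives a feel for the model the article describes; it assumes a Spark installation submitted to a YARN cluster via spark-submit, and the HDFS path and log format are hypothetical:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("event-counts")  # pass --master yarn to spark-submit
sc = SparkContext(conf=conf)

# Count events per type from comma-delimited log lines, largely in memory.
counts = (sc.textFile("hdfs:///logs/events/*.log")      # hypothetical HDFS path
            .map(lambda line: (line.split(",")[0], 1))   # event type is the first field
            .reduceByKey(lambda a, b: a + b))

for event_type, n in counts.collect():
    print(event_type, n)

sc.stop()
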

Two Camps: Early Adopters


and Mainstream Adopters
For years, overwhelming data volumes,
complexity, or data science endeavors were
the primary drivers behind early big data
adopters. Many of these early adopters
were in internet-related industries, such
as search, e-commerce, social networking,
or mobile applications that were dealing
with the explosion of internet usage and
adoption.
In 2014, we saw mainstream adopters
become the next wave of big data implementations that are expected to be multiple times larger than that of the early adopters. We
define mainstream adopters as those businesses that seek to modernize their data
platforms and analytics capabilities for
competitive opportunities and to remain
relevant in a fast changing world, but are
tempered with some time to research, analyze, and adopt while maintaining current
business operations. Mainstream adopters have had pilots and proof of concepts
for the past year or two with one or two


Hadoop distributors and now are deciding how this also fits within their overall
enterprise data strategy.
Leading the way for mainstream adopters is, by consequence, meeting enterprise
and IT requirements for data management,
security, data governance, and compliance
in a new, more complicated set of data
that includes public social data, private
customer data, third-party data enrichment, and storage in cloud and on-premises. Over the past year, it has often felt like
the fast-driving big data vehicle hit some
pretty thick mud to plow through, and
some in the industry argued that forcing Hadoop to meet the requirements of
enterprise data management was missing
the point of big data and data science. For
now, we have seen most companies agree
that risk and compliance are things that
they must take seriously moving forward.

Mainstream Adopters Redefining


Commodity Hardware
As mainstream adopters worked
through data management and governance
hurdles for enterprise IT, next up was the
startling exclamation: "I thought you said that was cheap commodity hardware?!"
This has become an interesting reminder
of the roots of big data and the difference
with IT enterprise-class hardware.
The explanation goes like this. Early
developers and adopters were driven to
solve truly big data challenges. In the simplest of terms, big data meant big hardware
costs and, in order to solve that economic
challenge, big data needed to run on the
lowest cost commodity hardware and
software that was designed to be fault-tolerant to cope with high failure rates without disrupting service. This is the purpose
of HDFS, though HDFS does not differentiate how a data node is configured and
this is where IT's standard order list differs.
Enterprise infrastructure organizations have been maintaining the data center needs of companies for years and have
efficiently standardized orders with chosen
vendors. In this definition of commodity
servers, it's more about industry standards
in parts, and no proprietary hardware


could limit the use of these servers as data
nodes (or any other server needs in the data
center). While big data implementations
with hundreds to thousands of servers per
cluster strive for the lowest cost white box
servers from less recognized industry vendors with the lowest cost components, their
commodity servers can be as low as $2,000
per server. Similar servers from industry
recognized big names with their own components or industry best of breed components touting stringent integration and
quality testing have averaged $25,000 per
server in several recent Hadoop implementations that we have been involved with. We
have started to coin these servers as "commodity-plus" for mainstream companies operationalizing Hadoop clusters, and they don't seem to mind.

Another discussion that continues
from the early adopters is how a data
node should be configured. Some implementations concerned with truly big data
configure data nodes with 25 front-loading bays and multi-terabyte slower SATA
drives for the highest capacity within
their cluster. Other implementations are
more concerned with performance and
opt for faster SAS drives at lower capacities but balanced with more servers in the
cluster for further increased performance
from parallelism. Some hyper-performance-oriented clusters will even opt for
faster SSD drives in the cluster. This also
leads to discussions regarding multi-core
CPUs and how much memory should
be in a data node. And, there have been
equations for the number of cores related
to the amount of memory and number of
drives for optimal performance of a data
node. We have seen that enterprise infrastructure has leaned more toward fewer

nodes in a production cluster (8 to 32 data


nodes) rather than 100-plus nodes. Their
reasoning is twofold: More powerful data
nodes are actually more interchangeable
with data centers also converging data
virtualization and private cloud strategies. Second, ordering more of the powerful servers can yield increased volume
discounts and maintain standardization
of IT servers in the data center.
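
As a back-of-the-envelope check on the price points cited above, the arithmetic below compares total hardware outlay for the two cluster shapes discussed; the node counts are taken from the ranges mentioned and are purely illustrative, ignoring drives, memory, support, and facilities costs:

def cluster_hardware_cost(nodes, price_per_server):
    """Total server spend for a cluster of identical nodes."""
    return nodes * price_per_server

print(cluster_hardware_cost(100, 2000))   # white-box cluster, 100 nodes -> $200,000
print(cluster_hardware_cost(16, 25000))   # "commodity-plus", 16 nodes   -> $400,000
print(cluster_hardware_cost(32, 25000))   # upper end of the 8-32 range  -> $800,000
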

The Data Lake Gains Traction


In 2014, we saw more acceptance of
the term data lake as an enterprise data
architecture concept pushed by Hortonworks and its modern data architecture
approach. The enterprise data hub is a
similar concept promoted by Cloudera
and also has some of the industry mindshare. Informally, we saw the data lake term
used most often by companies seeking to
understand an approach to enterprise data
strategy and roadmaps. However, we also
saw backlash from industry pundits that
called the data lake a "fallacy" or "murky." Terms such as "data swamp" and "data dump" were also thrown around to describe how
things could go wrong without a good
strategy and governance in place. Like the
term big data, the data lake has started
out as a high-level concept to drive further
definition and patterns going forward.
Throughout 2014, we worked with
companies ready to define a clear, detailed
strategy based on the data lake concept for
enterprise data strategy. While this is profound, it is very achievable with data management principles that require answers to
new questions regarding a new approach
to data architecture. Some issues are simple and more technical, such as keeping
online archiving of historical data warehouse data still easily accessible by users
with revised service-level agreements.
Some issues are more fundamental, such as
the data lake serving as a single repository of
all data including being a staging area for
the enterprise data warehouse (with lower
cost historical persistence for other uses as
data scientists are more interested in raw
unaltered data). Other concerns are a bit
more complex, such as persisting customer


or other privacy-compliant data in the data


lake for analysis purposes. Data governance
is concerned with who has access to privacy-controlled data and how it is used. Data
management questioned the duplication
of enterprise data and consistency.
These are hard data management and
governance decisions for enterprises to
make, but they are making them, and
acknowledging that patience and adaptability are key for the coming years as
data technologies continue to evolve and
change the landscape. The data lake will
continue to prove itself and make a fundamental shift in enterprise architecture
in the coming years. When you take a step
back and watch the business and IT drivers, momentum, and technology development, you can see how the data lake will
become an epicenter in enterprise data
architecture. If you take two steps back,
you will see how 2015 developments could
begin the evolution that transforms the
data lake into a data operating system for
the enterprise, evolving beyond business
intelligence and analytics into operational
applications and further realization of service-oriented architectures.

What's Ahead
In 2015, the mainstream adoption with
enterprise data strategies and acceptance
of the data lake will continue as data management and governance practices provide
further clarity. The cautionary tale of 2014, to ensure business outcomes drive big data adoption rather than the hype of previous years, will likewise continue. Hadoop
is clearly here to stay and inevitable,
and will have its well-deserved seat at the
enterprise data table, along with other
data technologies. Hadoop won't be taking over the world any time soon, and principle-based frameworks (such as our own modern data platform) recognize the evolution of both data technologies and computing price/performance on modern data architecture. Besides the usual
maturing and improvements overall and
for existing big data tools, we predict some
major achievements in big data for 2015
that we're keeping an eye on.

The Apache Spark engine will continue to mature, improve, and gain acceptance in 2015. With this adoption and the
incredible capabilities that it delivers, we
could start to see applications and capabilities beyond our imagination. Keep an eye out
for these early case studies as inspiration
for your own needs.
With deepening acceptance and recognition of YARN as the standard for operating Hadoop clusters, open-source projects
and existing vendors will port their products to YARN certification and integration.
This will not only close the gap between existing data technologies and Hadoop clusters, but more exciting will be
to see data technologies port over to YARN
so that they can operate and improve their
own capabilities within Hadoop. New
engines and existing engines running on
YARN in 2015 will further influence and
drive the adoption of Hadoop in enterprise data architecture.

In 2014, we saw mainstream companies requiring data management features
such as security and access control. These
first steps will be critical to keep an eye on
during 2015 for your own companys data
management requirements. Our concern
here is that the sexy high-performance
world of Spark and improved SQL capabilities will get the majority of attention,
while the less sexy side of security and governance will not mature at the same rate.
There is significant pressure to do so with
the mountain of mainstream adopters
waiting, so we'll keep an eye on this one.
Finally, our most exciting item to watch
in 2015 will be Hadoop's subtle transformation as business drivers move it beyond a primary write-once/read-many reputation to that of full create/read/update/delete (CRUD) operational capability at
big data scale. The benefits of the Hadoop
architecture with YARN and HDFS go
well beyond big data analytics, and enterprise data architects can start thinking
about what a YARN data operating system
can do with operational systems. In a few
years, this could also redefine the data lake
or we'll simply create another label for
the industry to debate. Once big data, high
performance, and CRUD requirements are
met within Hadoop, enterprise architects
will start thinking about the economies
of scale and efficiency gained from this
next-generation architecture. n

John O'Brien is principal and CEO of Radiant Advisors. With more than
25 years of experience
delivering value through
data warehousing and
business intelligence programs, O'Brien's unique perspective
comes from the combination of his roles
as a practitioner, consultant, and vendor
CTO in the BI industry. As a globally recognized business intelligence thought
leader, O'Brien has been publishing articles and presenting at conferences in
North America and Europe for the past 10
years. His knowledge in designing, building, and growing enterprise BI systems
and teams brings real-world insights to
each role and phase within a BI program. Today, through Radiant Advisors,
O'Brien provides research, strategic advisory services, and mentoring that guide
companies in meeting the demands of
next-generation information management,
architecture, and emerging technologies.
In Q1 2014, Radiant Advisors released its
"Independent Benchmark: SQL on Hadoop Performance," which captured the current
state of options and widely varying performance. Radiant Advisors plans to release
the next benchmark 1 year later in Q1 2015
to quantify those efforts.

sponsored content

Big Data for Tomorrow


The landscape of enterprise solutions
has changed. It has become distributed and real-time. The famous New York Times writer Thomas Friedman summarizes it succinctly: "The World Is Flat." In addition to this
technological advancement, the compute
and online world is demanding real-time
answers to questions. These ever growing
and disparate data sources need to be
efficiently connected to enable new discovery
and more insightful answers.
To maintain competitive advantage in
this new landscape, organizations must be
prepared to weed out the hype and focus
on proven ways to future-proof existing
systems while efficiently integrating with
new technologies to provide the required
value of real-time insight to users and
decision-makers. Companies need to focus
on the following key requirements for new
technologies to take advantage of data and
find unique business value and new revenues.

DISTRIBUTED
The world is moving towards distributed
architectures. Memory is becoming a
commodity; the Internet is easily accessible
and fairly inexpensive and with more sources
of data creating an increase in information it
is easy to understand how organizations will
require multiple, distributed data centers to
store it all.
With distributed architectures comes a
need for distributed features such as parallel
ingest or the ability to quickly obtain data
using multiple resources/locations to enable
real-time application access to information
that is being processed. Then there is a
need for distributed task processing, which
helps to move the processes closer to the
locations where data is stored, thus saving
time and improving query performance as a
side effect. Finally, there is a need for
distributed query as well. This is the ability
to perform a search of data across different

locations, quickly in order to find hidden


value within the data for improved business
decision support.

SCALABLE
The next requirement revolves around
ease of scalability. When working with
distributed architecture, it is inevitable that
companies will need to eventually scale out
their applications across multiple locations
in order to keep up with growing data
demands. Technology that is easily scalable/
adaptable is very important in long-term
success and helps with managing ROI.

FLEXIBLE
Another requirement, due to the many
different types of data being collected, is the
ability to handle multiple data types. If a
technology is too limited in the way it needs
to collect information from structured,
unstructured, and semi-structured sources,
organizations will find it difficult to grow
their solution long-term due to concerns
with data type limitations. On the other
hand, a technology that is able to natively or
alternatively store and access many types of
information from multiple data sources will
be key to enabling long-term competitive
advantage and growth.

COMPLEMENTARY
And finally, there is a need to address
existing and legacy solutions already
implemented at a large scale. Most
enterprises will not be tearing out widely
implemented solutions spanning across
their organization. It is important to require
that any new technologies being assessed
have the ability to complement existing
legacy solutions as well as any potential new
technologies that may add benefit to the
business, its customers, and its solutions/services.
Today's enterprise success depends on the
ability to obtain key information quickly and

accurately and then apply that knowledge


to your business to make more reliable
decisions. Utilizing technology that is able
to offer the peace of mind to be successful
through distributed, scalable, flexible and
complementary features is priceless.
For over a quarter century, Objectivity,
Inc.'s embedded database software has helped discover and unlock the hidden value in Big Data for improved real-time intelligence and decision support.
Objectivity focuses on storing, managing
and searching the connection details
between data. Its leading edge technologies,
InfiniteGraph, a unique distributed, scalable
graph database, and Objectivity/DB, a
distributed and scalable object management
database, enable unique search and
navigation capabilities across distributed
datasets to uncover hidden, valuable
relationships within new and existing data
for enhanced analytics and facilitate custom
distributed data management solutions for
some of the most complex and missioncritical systems in operation around the
world today.
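
As a generic illustration of what searching the connection details between data means in practice (this is not InfiniteGraph or Objectivity/DB code; the records and relationships are hypothetical), a relationship graph can be walked to find how two records are connected:

from collections import deque

# Relationships stored as an adjacency list keyed by record identifier.
edges = {
    "customer:42": ["order:7", "device:ipad-9"],
    "order:7": ["product:sofa"],
    "device:ipad-9": ["session:abc"],
}

def shortest_path(start, goal):
    """Breadth-first search over the relationship graph."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("customer:42", "product:sofa"))
# ['customer:42', 'order:7', 'product:sofa']
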
By working with a well-established
technology provider with long-term, proven
Big Data implementations, enterprise
companies can feel confident that the future
requirements of their organizations will be
met along with the ability to take advantage
of new technological advances to keep ahead
of the market.
For more information on how to get
started with evaluating technologies for your
business, contact Objectivity, Inc. to inquire
about our complimentary 2-hour solution
review with a senior technical consultant.
Visit our website at www.objectivity.com for
more information.

OBJECTIVITY, INC.
www.objectivity.com

industry updates

The State of Big Data Management

The
Enabling Force
Behind Digital
Enterprises
By Joe McKendrick

For decades, data management was part of


a clear and well-defined mission in organizations. Data was generated from transaction
systems, then managed, stored, and secured
within relational database management systems, with reports built and delivered to business decision makers' specs.
This rock-solid foundation of skills,
technologies, and priorities served enterprises well over the years. But lately, this
arrangement has been changing dramatically. Driven by insatiable demand for IT
services and data insights, as well as the
proliferation of new data sources and formats, many organizations are embracing
new technology and methods such as cloud,
database as a service (DBaaS), and big data.
And, increasingly, mobile isn't part of a vendor's pitch sheet or a futuristic overview at a conference presentation. It's part of today's
reality, a part of everyday business. Many
organizations are already providing faster
delivery of applications, differentiated products and services, and some are building


new customer experiences through social,


mobile, analytics, and cloud.
Over the coming year, 2015, we will
likely see the acceleration of the following
dramatic shifts in data management:

1. More Automation to Manage


the Squeeze
There is a lot of demand coming from the
user side, but data management professionals often find themselves in a squeeze. Business demand for database services as well as
associated data volumes is growing at a rate
of 20% a year on average, a survey by Unisphere Research finds. In contrast, most IT
organizations are experiencing flat or shrinking budgets. Other factors such as substantial
testing requirements and outdated management techniques are all contributing to a cost
escalation and slow IT response.
Database professionals report that they
spend more time managing database lifecycles than anything else. A majority still overwhelmingly perform a range of tasks manually, from patching databases to performing


upgrades. Compliance remains important
and requires attention. As databases move
into virtualized and cloud environments,
there will be a need for more comprehensive enterprise-wide testing. Another recent
Unisphere Research study finds that for more
than 50% of organizations, it takes their IT
department 30 days or more to respond to
new initiatives or deploy new solutions. For
a quarter of organizations, it takes 90 days
or more. In addition, more than two-thirds
of organizations indicate that the number
of databases they manage is expanding. The
most pressing challenges they are facing as
a result of this expansion are licensing costs,
additional hardware and network costs, additional administration costs, and complexity.
(The Empowered Database: 2014 Enterprise
Platform Decisions Survey, September 2014)
As data professionals find their time
and resources squeezed between managing
increasingly large and diverse data stores,
increased user demands, and restrictive

budgets, there will be greater efforts
to automate data management tasks.
Expect a big push to automation in the
year ahead.

2. Big Data Becomes Part of


Normal Day-to-Day Business
Relational data coming out of transactional systems is now only part of the enterprise equation, and will share the stage to
a greater degree with data that previously
could not be cost-effectively captured, managed, analyzed, and stored. This includes
data coming in from sensors, applications,
social media, and mobile devices.
With increased implementations
of tools and platforms to manage this
dataincluding NoSQL databases and
Hadooporganizations will be better
equipped to prepare this data for consumption by analytic software. A recent
survey of Database Trends and Applications
readers finds 26% now running Hadoop
within their enterprises, up from 12% 3
years ago. A majority, 63%, also now operate NoSQL databases at their locations
(DBTA Quick Poll: New Database Technologies, April 2014).

3. Cloud Opens Up


Database as a Service
More and more, data managers and
professionals will be working with cloudbased solutions and data, whether associated with a public cloud service, or an
in-house database-as-a-service (DBaaS)
solution. This presents many new opportunities to provide new capabilities to
organizations, as well as new challenges.
Moving to cloud means new programming and data modeling approaches will
be needed. Integration between on-premises and off-premises data also will be
intensifying. Data security will be a front-burner issue.

Recent Unisphere Research surveys


find that close to two-fifths of enterprises
either already have or are considering running database functions within a private
cloud, and about one-third are currently
using or considering a public cloud service. For more than 25% of organizations,
usage of private-cloud services increased
over the past year.
Cloud and virtualization are being
seamlessly absorbed into the jobs of most
database administrators, and in some
cases, reducing traditional activity while
expanding their roles. Database as a service (DBaaS), or running databases and
managing data within an enterprise private cloud setting, offers data managers
and executives a means to employ shared
services to manage their fast-growing
environments. The potential advantage
of DBaaS is that database managers need
not re-create processes or environments
from scratch, as these resources can be
pre-packaged based on corporate or compliance standards and made readily available within the enterprise cloud. Close to
half of enterprises say they would like
to see capacity planning services offered
through private clouds, while 40% look
for shared database resources. A similar number would value cloud-based
services providing automated database
provisioning.

4. Virtualization and Software-Defined


Data Centers on the Way
Until recently, mentioning the term
"platform" brought images of Windows,
mainframe, and Linux servers to mind.
However, for most enterprises, "platform"
has become irrelevant. This extends to
the database sphere as well; many of the
functions associated with specific databases can be abstracted away from underlying hardware and software.

The use of virtualization is helping


to alleviate strains being created by the
increasing size and complexity of database environments. The use of virtualization within database environments is
increasing. Almost two-thirds of organizations in a recent Unisphere Research
survey say there have been increases over
the past year. Nearly half report that more
than 50% of their IT infrastructure is
virtualized. The most common benefits
organizations report as a result of using
virtualization within their database environments are reduced costs, consolidation,
and standardization of their infrastructure
(The Empowered Database: 2014 Enterprise Platform Decisions Survey, September 2014).
Another emerging trend (software-defined data centers, software-defined storage, and software-defined networking) promises to take this abstraction to a
new level. Within a software-defined environment, services associated with data
centers and database services (storage, data management, and provisioning) are
abstracted into a virtual service layer. This
means managing, configuring, and scaling
data environments to meet new needs will
increasingly be accomplished from a single control panel. It may take some time
to reach this stage, as many of the components of software-defined environments
are just starting to fall into place. Expect
to see significant movement in this direction in 2015.

5. Data Managers and Professionals


Will Lead the Drive to Secure
Corporate Data
One need only look at recent headlines
to understand the importance of data
security: major enterprises have suffered data breaches over the past year, and
in some cases, have taken CIOs and top

executives down with them. The rise of


big data and cloud, with their more complex integration requirements, accessibility, and device variety, has increased the
need for greater attention to data security
and data governance issues.
Data security has evolved into a top business challenge, as villains take
advantage of lax preventive and detective
measures. In many ways, it has become an
enterprise-wide issue in search of leadership. Senior executives are only too painfully aware of what's at stake for their businesses, but often don't know how to approach the challenge. This is an opportunity for database administrators and security professionals to work together, take a leadership role, and move the enterprise to take action.

Over the coming year, database managers and professionals will be called upon to be more proactive and lead their companies to successfully ensure data privacy, protect against insider threats, and address regulatory compliance. An annual survey by Unisphere Research for the Independent Oracle Users Group (IOUG) finds there is more awareness than ever of the critical need to lock down data environments, but also organizational hurdles in building awareness and budgetary support for enterprise data security (DBA Security Superhero: 2014 IOUG Enterprise Data Security Survey, October 2014).

6. Mobile Becomes an Equal Client


Mobile computing is on the rise, and
increasingly mobile devices will be the client of choice with enterprises in the year
ahead. This means creating ways to access
and work with data over mobile devices.
More analytics, for example, is being
supported within mobile apps. Some of
the leading BI and analytics solutions vendors now offer mobile apps with dashboards, often configurable, that provide
insight and visibility into operational
trends to decision makers who are outside
of their offices. While industry watchers
have been predicting the democratization of data analytics across enterprises
for years, the arrival of mobile apps as front

end clients to BI and analytics systems may


be the ultimate gateway to easy-to-use
analytics across the enterprise. By their
very nature, mobile apps need to be
designed to be as simple and easy to use
as possible. Over the coming year, mobile
app access to key data-driven applications
will become part of every enterprise.
The ability to access data from any and
all devices, of course, will increase security concerns. While many enterprises
have tacitly approved the bring your own
device (BYOD) trend in recent years,
some are looking to move to corporate-issued devices that will help lock down
sensitive data. The coming year will see
increased efforts to better ensure the security of data being sent to mobile devices.

7. Storage Enters the Limelight


Storage has always been an unappreciated field of endeavor. It has been almost
an afterthought, seen in disk drives and
disk arrays running somewhere in the
back of data centers. This is changing rapidly, as enterprises recognize that storage is
shaping their infrastructures capabilities.
There's no question that many organizations are dealing with rapidly expanding data stores. Much of today's data growth, coming out of enterprise applications, is being exacerbated by greater volumes of unstructured, social media, and
machine-generated data making their way
into the business analytics platform. Many
enterprises are also evolving their data
assets into "data lakes," in which enterprise data is stored up front in its raw form
and accessed when needed, versus being
loaded into purpose-built, siloed data
environments.
The question becomes, then, where
and how to store all this data. The storage
approach that has worked well for organizations over the decades (produce data within a transaction system, then send it downstream to a disk, and ultimately, a tape system) is being overwhelmed by today's data demands. Not only is the
amount of data rapidly growing, but
more users are demanding greater and
more immediate access to data, even

when it may be several weeks, months, or


years old.
Over the coming year, there will be a
push by enterprises to manage storage
smartly, versus simply adding more
disk capacity to existing systems or purchasing new systems from year to year. A
recent survey by Unisphere Research finds
growing impetus toward smarter storage
solutions, which include increased storage efficiency through data compression,
information lifecycle management and
consolidation, or deployment strategies
such as tiered storage. At the same time,
storage expenditures keep rising, eating a
significant share of IT budgets and impeding other IT initiatives. For those with
significant storage issues, the share storage takes out of IT budgets is even greater
(Managing Exploding Data Growth in
the Enterprise: 2014 IOUG Database Storage Survey, May 2014).

What's Ahead
The year 2015 represents new opportunities to expand and enlighten data
management practices and platforms to
meet the needs of the ever-expanding
digital enterprise. To be successful, digital business efforts need to have solid
data management practices underneath.
As enterprises go digital, they will be relying on well-managed and diverse data to
explore and reach new markets. n

Joe McKendrick is an

author and independent


researcher covering innovation, information technology trends, and markets.
Much of his research work
is in conjunction with Unisphere Research, a division of Information
Today, Inc. (ITI), for user groups including
SHARE, the Oracle Applications Users Group,
the Independent Oracle Users Group, and the
International DB2 Users Group. He is also
a regular contributor to Database Trends
and Applications, published by ITI.

industry updates

The State of Data Integration

Data Integration
Evolves to Support
a Bigger Analytic Vision
By Stephen Swoyer

What has traditionally made data a


hard problem is precisely the issue of accessing, preparing, and producing it for machine
and, ultimately, for human consumption.
What makes this a much harder problem in
the age of big data is that the information
we're consuming is vectored to us from so
many different directions.
The data integration (DI) status quo is
predicated on a model of data-at-rest. The
designated final destination for data-at-rest is
(and, at least for the foreseeable future, will
remain) the data warehouse (DW). Traditionally, data of a certain type was vectored
to the DW from more or less predictable
directions (viz., OLTP systems, or flat files) and at the more or less predictable velocities
circumscribed by the limitations of the batch
model. Thanks to big data, this is no longer the case. Granted, the term big data is
empty, hyperbolic, and insufficient; granted,
there's at least as much big data hype as big
data substance. But still, as a phenomenon,
big data at once describes 1) the technological capacity to ingest, store, manage, synthesize, and make use of information to an
unprecedented degree and 2) the cultural
capacity to imaginatively conceive of and
meaningfully interact with information in
fundamentally different ways. One conse-


quence of this has been the emergence of a


new DI model that doesn't so much aim to
supplant as to enrich the status quo ante. In
addition to data-at-rest, the new DI model is
able to accommodate data-in-motion, i.e.,
data as it streams and data as it pulses: from
the logs or events generated by sensors or
other periodic signalers to the signatures or
anomalies that are concomitant with aperiodic events such as fraud, impending failure,
or service disruption.
Needless to say, comparatively little of
this information is vectoring in from conventional OLTP systems. And that, as poet Robert Frost might put it, makes all the
difference.

Beyond Description
We're used to thinking of data in terms
of the predicates we attach to it. Now as ever,
we want and need to access, integrate, and
deliver data from traditional structured
sources such as OLTP DBMSs, or flat and/
or CSV files. Increasingly, however, we're alert to, or we're intrigued by, the value of the information that we believe to be locked into multi-structured or so-called "unstructured" data, too. (Examples of the former
include log files and event messages; the latter is usually used as a kitchen-sink category

to encompass virtually any data-type.) Even


if we put aside the philosophical problem
of structure as such (semantics is structure;
schema is structure; a file-type is structure),
were confronted with the fact that data integration practices and methods must and
will differ for each of these different types.
The kinds of operations and transformations we use to prepare and restructure the
normalized data we extract from OLTP systems for business intelligence (BI) reporting
and analysis will prove to be insufficient (or
quite simply inapposite) when brought to
bear against these different types of data.
The problem of accessing, preparing, and
delivering unconventional types of data from
unconventional types of sources, as well as of making this data available to a new class of unconventional consumers, requires new
methods and practices, to say nothing of new
(or at least complementary) tools.
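
To make the point concrete, consider the kind of preparation a multi-structured source needs before it can be analyzed. The sketch below parses a web-server-style log line; the pattern and field names are illustrative, not drawn from any tool mentioned in this article:

import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

line = '203.0.113.7 - - [01/Dec/2014:10:15:32 +0000] "GET /cart HTTP/1.1" 200 512'

match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()              # structure is imposed at read time
    record["status"] = int(record["status"])
    print(record)
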
This has everything to do with what might
be called a much bigger analytic vision.
Inspired by the promise of exploiting data
mining, predictive analytics, machine learning, or other types of advanced analytics on a
massive scale, the focus of DI is shifting from
that of a static, deterministic discipline, in
which a kind of two-dimensional world is
represented in a finite number of well-defined

sponsored content

Unleashing the Value of Big Data & Hadoop


Traditionally, organizations have relied
upon a single data warehouse to serve
as the center of their data universe. This
data warehouse approach operated on a
paradigm in which the data revealed a single,
unified version of the truth. But today, both
the amount and types of data available have
increased dramatically. With the advent
of Big Data, companies now have access
to more business-relevant information
than ever before, resulting in many data
repositories to store and analyze it.

THE CHALLENGES OF
MOVING BIG DATA
However, to use Big Data, you must
be able to move it, and the challenges of
moving Big Data are multi-faceted. Out of
the gate, the pipes between data repositories
remain the same size, while the data grows
at an exponential rate. The issue worsens
when traditional tools are used to attempt to
access, process and integrate this data with
other systems. Yet, companies cannot rely on
traditional data warehouses alone.
Thus, companies are increasingly
turning to Apache Hadoop, the free, open source, scalable software for distributed computing that handles both structured and unstructured data. The movement towards Hadoop is indicative of something bigger: a new paradigm that's taking over the business world, that of the modern
data architecture and the data supply
chain that feeds it. The data supply chain
describes a new reality in which businesses
find themselves coordinating multiple data
sources rather than using a single data
warehouse. The data from these sources,
which often varies in content, structure, and
type, has to be integrated with data from
other departments and other target systems
within an enterprise. Big Data is rarely used
en masse. Instead, different types of data tell
different stories, and companies need to be
able to integrate all of these narratives to
inform business decisions.

HADOOP'S ROLE IN THE DATA SUPPLY CHAIN
In this new world, companies must
constantly move data from one place to
another to ensure efficiency and lower costs.
Hadoop plays a significant role in the data
supply chain. However, it's not a be-all and end-all solution. The standard Hadoop toolsets
lack several critical capabilities, including
the ability to move data between Hadoop
and relational databases. The technologies
that exist for data movement across
Hadoop are cumbersome. Companies need
solutions that make data movement to and
from Hadoop easier, faster, and more cost
effective.
While open source tools like Sqoop are
designed to deal with large amounts of data,
they are often not enough by themselves.
These tools can be difficult to use, require
specialized skills and time to implement,
typically focus only on certain types of data,
and cannot support incremental changes or
real-time feeds.

EFFECTIVELY MOVING BIG DATA INTO AND OUT OF HADOOP
The most effective answer to this
challenge is to implement solutions that are
specifically designed to ease and accelerate
the process of data movement across a broad
number of platforms. These technologies
allow IT organizations to easily move data
from one repository to another in a highly visible manner. The software should also unify and integrate data from all platforms within an enterprise, not just Hadoop, and it should include change data capture (CDC) technology to keep the target data up to date in a way that's sensitive to network bandwidth.
Attunity offers a solution for companies
looking to turbocharge the flows across their
data supply chain while fully supporting
a modern data architecture. Attunity
Replicate features a user-friendly GUI, with
a Click-to-Replicate design and drag-and-

drop functionality to move data between


repositories. Attunity supports Hadoop
as a source and as a target, as well as every
major commercial database platform and
data warehouse available. It is scalable and
manageable and can be used to move data
to and from the cloud when combined with
Attunity CloudBeam.

MAKING BIG DATA & HADOOP WORK FOR YOU!
Attunity enables companies to improve
their data flows to capitalize on all their data,
including Big Data sources. Their solutions
limit the investment a company needs to
make by reducing the hardware and software
needed for managing and moving data
across multiple platforms out of the box.
Additionally, Attunity solutions are high
performance and provide an easy-to-use
graphical interface that helps companies
make timely and fully-informed decisions.
Using high-performance data movement
software like Attunity, companies can not
only unleash the full power of Hadoop but
also the power of all their other technologies
to enable real-time analytics and true
competitive advantage.

To learn more,
download
this Attunity
whitepaper:
Hadoop and
the Modern Data
Supply Chain

http://bit.ly/HadoopWP

ATTUNITY
www.attunity.com

industry updates

The State of Data Integration


dimensions, to a polygonal or probabilistic discipline with a much greater number of dimensions. The static stuff will still
matter and will continue to power the
great bulk of day-to-day decision making,
but this will in turn be enriched, episodically, with different types of data. The
challenge for DI is to accommodate and
promote this enrichment, even as budgets
hold steady (or are adjusted only marginally) and resources remain constrained.

Automatic for the People


What does this mean for data integration? For one thing, the day-to-day work
of traditional DI will, over time, be simplified, if not actually automated. This
work includes activities such as 1) the
exploration, identification, and mapping
of sources; 2) the creation and maintenance of metadata and documentation;
3) the automation or acceleration, insofar
as feasible, of testing and quality assurance; and, crucially, 4) the deployment of
new OLTP systems and data warehouses,
as well as of BI and analytic applications
or artifacts. These activities can and will
be accelerated; in some cases (as with the
generation and maintenance of metadata
or documentation) they will, for practical,
day-to-day purposes, be more or less completely automated.
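As a rough illustration of the kind of metadata generation such tooling automates, the sketch below uses Python's built-in sqlite3 module against a throwaway, in-memory schema (the table and column names are invented for the example) to emit a rudimentary data dictionary. Commercial DI suites perform the same catalog introspection at far greater depth, but the mechanism is the same.

import sqlite3

# Throwaway schema standing in for a source system (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT, amount REAL);
""")

# Walk the catalog and emit a simple data dictionary: the sort of metadata and
# documentation that DI tooling increasingly generates automatically.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    print(f"Table: {table}")
    for _, col_name, col_type, notnull, _, pk in conn.execute(f"PRAGMA table_info({table})"):
        flags = (" NOT NULL" if notnull else "") + (" PRIMARY KEY" if pk else "")
        print(f"  - {col_name}: {col_type or 'ANY'}{flags}")
conn.close()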
This is in part a function of the maturity of the available tooling. Most DI and
RDBMS vendors ship platform-specific
automation features (pre-fab source connectivity and transformation wizards; data
model design, generation, and conversion
tools; SQL, script, and even procedural
code generators; scheduling facilities; in
some cases even automated dev-testing
routines) with their respective tools. Similarly, a passel of smaller, self-styled data
warehouse automation vendors market
platform-independent tools that purport
to automate most of the same kinds of
activities, and which are also optimized for


multiple target platforms. On top of this,
data virtualization (DV) and on-premises-to-cloud integration specialists can
bring intriguing technologies to bear, too.
Most DI vendors offer DV (or data federation) capabilities of some kind; others
market DV-only products. None of these
tools is in any sense a silver bullet: custom-fitting and design of some kind is
still required and, frankly, always will be required. The catch, of course, is that even though such tools can likewise help to accelerate key aspects of the day-to-day work of building, managing, optimizing, maintaining, or upgrading OLTP and BI/decision support systems, they can't and won't replace human creativity and ingenuity. The important thing is that they give us the capacity to substantively accelerate much of the heavy lifting of data integration.

Big Data Integration: Still a Relatively New Frontier
This just isn't the case in the big data world. As Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, might put it, traditional data integration tools or services are mature and robust in exactly the way that big data DI tools aren't.
At this point, guided and/or self-service features (to say nothing of management-automation amenities) are still mostly missing from the big data offerings. As a result, organizations will need more developers and more technologists to do more hands-on work when they're doing data integration in conjunction with big
data platforms.
Industry luminary Richard Winter
tackled this issue in a report entitled "The Real Cost of Big Data," which highlights
the cost disparity between using Hadoop
as a landing area and/or persistent store
for data versus using it as a platform for

business intelligence (BI) and decision


support workloads. As a platform for data
ingestion, persistence, and preparation,
the research suggests, Hadoop is orders of
magnitude cheaper than a conventional
OLTP or DW system. Conversely, the cost
of using Hadoop as a primary platform
for BI and analytic workloads is orders of
magnitude more expensive.
An issue that tends to get glossed over is that of Hadoop's efficacy as a data management platform. Managing data isn't simply a question of ingesting and storing it; it's likewise, and to a much greater extent, a question of retrieving just the right data, of preparing it in just the right format, and of delivering it at more or less the right time. In other words, big data tools aren't only less productive than those of traditional BI and decision support, but big data management platforms
are themselves comparatively immature,
too. Generally speaking, they lack support
for key database features or for core transaction-processing concepts, such as ACID
integrity. The simple reason for this is that many platforms either aren't databases or eschew conventional DBMS reliability and concurrency features to address scaling-specific or application-specific requirements. The upshot, then, is that the human focus of data integration is shifting, and will continue to shift, to Hadoop and other big data platforms, not least because these platforms tend to
require considerable human oversight and
intervention.
This doesn't mean that data, applications, and other resources are shifting or will shift to big data platforms, never to return or to be recirculated. For one thing, there's cloud, which is having no less profound an impact on data integration and data management. Data must be
vectored from big data platforms (in the
cloud or on-premises) to other big data

platforms (in the cloud or on-premises), to the cloud in general, i.e., to SaaS, platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) resources, and, last but not least, to good old on-premises resources like applications and databases.
There's no shortage of data exchange formats for integrating data in this context (JSON and XML foremost among them), but the venerable SQL language will continue to be an important and even a preferred mechanism for data integration in on-premises, big data, and even cloud environments. The reasons for this
are many. First, SQL is an extremely efficient and productive language: According to a tally compiled by Andrew Binstock, editor-in-chief of Dr. Dobb's Journal, SQL trails only legacy languages such as .ASP and Visual Basic (at numbers 1 and 2, respectively) and Java (at number 3) productivity-wise. (Binstock based his tally on data sourced from the International Software Benchmarking Standards Group, or ISBSG, which maintains a database of more than 6,000 software projects.) Second, there's a surfeit of available SQL query interfaces and/or adapters, along with (to a lesser extent) SQL-savvy coders. Third, open source software (OSS) and
proprietary vendors have expended a simply shocking amount of effort to develop
ANSI-SQL-on-Hadoop technologies. This
is a very good thing, chiefly because SQL
is arguably the single most promising tool
for getting the right data in the right format out of Hadoop.
Two years ago, for example, the most
efficient ways to get data out of Hadoop
included:
1. Writing MapReduce jobs in Java
in order to translate the simple
dependency, linear chain, or directed
acyclic graph (DAG) operations
involved in data engineering into map
and reduce operations;
2. Writing jobs in Pig Latin for Hadoop's Pig framework to achieve basically the
same thing;
3. Writing SQL-like queries in Hive
Query Language (HiveQL) to achieve
basically the same thing; or
4. Exploiting bleeding-edge technologies (such as Cascading, an API layered on top of Hadoop that's supposed to make it easier to program/manage) to achieve basically the same thing.
Today, there's no shortage of mechanisms to get data from Hadoop. Take Hive, an interpreter that compiles HiveQL queries into MapReduce jobs. As of Hadoop 2.x, Hive can leverage either Hadoop's MapReduce engine or the new Apache Tez framework. Tez is just one of several designs that exploit Hadoop's new resource manager, YARN, which makes it easier to manage and allocate resources for multiple compute engines, in addition to MapReduce. Thus, Apache Tez, which is optimized for the operations, such as DAGs, that are characteristic of data transformation workloads, now offers features such as pipelining and interactivity for ETL-on-Hadoop. There's also Apache Spark, a cluster computing framework that can run in the context of Hadoop. It's touted as a high-performance complement and/or alternative to Hadoop's built-in MapReduce compute engine; as of version 1.0.0, Spark is paired with Spark SQL, a new, comparatively immature SQL interpreter. (Spark SQL replaces a predecessor project, dubbed Shark, which was conceived as a Hive-oriented SQL interpreter.) Over the last year, especially, Spark has become one of the most hyped of Hadoop-oriented technologies; many DI or analytic vendors now support Spark to one degree or another in their products. Generally speaking, most vendors now offer SQL-on-Hadoop options of one kind or another,
while others also offer native (optimized)
ETL-on-Hadoop offerings.
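To give a flavor of what SQL-on-Hadoop looks like in practice, here is a minimal PySpark sketch. It uses Spark's current SparkSession API (the 2014-era equivalent would have gone through SQLContext or HiveContext), and the HDFS paths, view name, and columns are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop-sketch").getOrCreate()

# Land raw JSON event logs from HDFS and expose them to SQL.
events = spark.read.json("hdfs:///data/raw/events/")
events.createOrReplaceTempView("events")

# The kind of DAG-shaped filter/project/aggregate work that once required
# hand-written MapReduce or Pig jobs, expressed as a single query.
daily = spark.sql("""
    SELECT device_id,
           to_date(event_ts) AS event_date,
           COUNT(*)          AS event_count
    FROM events
    WHERE event_type = 'sensor_reading'
    GROUP BY device_id, to_date(event_ts)
""")

# Write a prepared, columnar extract back to the cluster for BI consumption.
daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_device_counts/")
spark.stop()

The point is less the particular engine than the pattern: express the transformation in SQL and let the framework compile it down to cluster jobs.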

What's Ahead
Cloud is a critical context for data integration. One reason for this is that most providers offer export facilities or publish APIs that facilitate access to cloud data. Another reason, as I wrote last year, is that doing DI in the cloud doesn't invalidate (completely or, even, in large part) existing best practices: if you want to run advanced analytics on SaaS data, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. What you do in the former scenario winds up looking a lot like what you do with traditional DI. And the good news is
that you can do a lot more with traditional
DI tools or platforms than used to be the
case. Most data integration offerings can
parse, shred, and transform the JSON
and XML used for data exchange; some
can do the same with formats such as
RDF, YAML, or Atom. Several prominent
database providers offer support for in-database JSONs (e.g., parsing and shredding JSONs via a name-value-pair function or landing and storing them intact
as variable character text), while others
offer some kind of support for in-database
storage (and querying) of JSON data. DV
vendors are typically no less accomplished
than the big DI platforms with respect
to their capacity to accommodate a wide
variety of data exchange formats, from
JSON/XML to flat files.
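The name-value-pair shredding idea is simple enough to sketch in a few lines of plain Python (the sample document is invented). Production DI tools and database engines implement the same flattening far more efficiently, but the output shape is essentially this.

import json

def shred(obj, prefix=""):
    # Flatten a nested JSON document into (path, value) pairs,
    # roughly what a name-value-pair shredding function produces.
    pairs = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs.extend(shred(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            pairs.extend(shred(value, f"{prefix}{i}."))
    else:
        pairs.append((prefix.rstrip("."), obj))
    return pairs

doc = json.loads('{"order_id": 10042, "customer": {"name": "ACME", "region": "EMEA"},'
                 ' "items": [{"sku": "A-1", "qty": 2}]}')
for path, value in shred(doc):
    print(path, "=", value)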
Any account of data integration and
big data is bound to be insufficient simply because there is so much happening.
As noted, the Hadoop platform is by no means the only game in town, nor, for that matter, the most exciting one. Apache Spark, which (a) runs in the context of Hadoop and which (b) can both persist data (to HDFS, the Hadoop Distributed File System) and run in-memory (using Tachyon), last year emerged as a bona
fide big data superstar. Spark is touted as
a compelling platform for both analytics
and data integration. Several DI vendors
already claim to support it to some extent.
Spark, like almost everything else in the
space, will bear watching. And so it goes. n


Stephen Swoyer is a
technology writer with
more than 16 years of
experience. His writing
has focused on business
intelligence, data warehousing, and analytics
for almost a decade. He's particularly
intrigued by the thorny people and process problems most BI and DW vendors
almost never want to acknowledge, let
alone talk about. You can contact him at
stephen.swoyer@gmail.com.



industry updates

The State of Business Intelligence and Advanced Analytics

Turning Data Into Value Using Analytics
By Bart Baesens

Data is everywhere. IBM projects that


every day we generate 2.5 quintillion bytes of
data. In relative terms, this means 90% of the
data in the world has been created in the last
2 years. Gartner projects that by 2015, 85%
of Fortune 500 organizations will be unable
to exploit big data for competitive advantage
and about 4.4 million jobs will be created
around big data. These massive amounts
of data yield an unprecedented treasure of
internal customer knowledge, ready to be
analyzed using state-of-the-art analytical techniques to better understand and exploit customer behavior by identifying new business
opportunities together with new strategies.
Big data and analytics are all around these
days, and if firms arent doing so already, they
should plan to invest in it. Lets consider a few
examples. Financial institutions use credit
scoring models on a daily basis to gauge the
creditworthiness of their customers for the
next 12 months on all their credit products
(mortgages, credit cards, installment loans).
They will use this score to do debt provisioning, Basel II/Basel III capital calculation, and


marketing (e.g., increase/decrease the limit
on a credit card in case of a good/bad credit
score). Telco operators run churn prediction
models using all recent call behavior data, to
see whether customers are likely to churn or
not in the next 1 to 3 months. The resulting
retention score will then be used to set up
marketing campaigns to prevent customers
from churning (unless they would not be
profitable). Facebook and Twitter posts are
continuously analyzed using social media
analytics to study both their content and
sentiment so as to better understand brand
perception, and/or further fine-tune product and/or service design. Online retailers
(such as Amazon and Netflix) continuously
analyze purchases to decide upon product
bundling and the next best offer as part of a
recommender system. Credit card companies
use sophisticated analytical fraud detection
models to see whether payments are legitimate or fraudulent as a result of activities
such as identity theft. The government uses

analytics to predict tax evasion, optimize the


allocation of social benefits, improve public
safety by analyzing transport data, and guarantee national security by identifying terrorism threats. As this article is made available
online, it will be analyzed and categorized
by Google and other search engines and
included into their search results.
Analytics is all around and is getting more
and more pervasive and directly embedded
into our daily lives. Businesses ranging from
international firms to SMEs are jumping
on the big data and analytics bandwagon to
create added value. It comes as no surprise that this brings about not only a whole new series of opportunities but also challenges.
There are some critical success factors that
companies must come to grips with as they
approach this seemingly Herculean task of
creating added value out of data.

Setting Up an Analytics Project


Data is the key ingredient for any analytical model. When starting an analytics project,

it is important to meticulously list all data


within the enterprise that could potentially be beneficial to the analytical exercise. The more data, the better is the rule
here. Analytical models have sophisticated
built-in facilities to automatically decide
what data elements are important for the
task at hand and which ones can be left
out from further analysis. The best way to
improve the performance of any analytical
model is by investing in data. This can be
done by working on both the quantity and
quality simultaneously. Regarding the former, a key challenge concerns the aggregation of structured (e.g., stored in relational databases) and unstructured (e.g.,
textual) data to provide a comprehensive
and holistic view on customer behavior.
Closely related to this is the integration of
offline and online data, an issue that many
companies are struggling with nowadays.
Furthermore, companies can also look
beyond their internal boundaries and consider the purchase of external data from
data poolers to complement their internal analytical models. Extensive research
has indicated that this is very beneficial in
order to both perfect and benchmark the
analytical models developed.
Although data is typically available in large quantities, its quality is often a more painful issue. Here the GIGO principle applies: garbage in, garbage out, or bad data yields bad models. This may sound obvious at first. However, good data quality is often the Achilles' heel in
many analytical projects. Data quality can
be evaluated by various dimensions such
as data accuracy, data completeness, data

timeliness, and data consistency, to name


a few. To be successful in big data and
analytics, it is necessary for companies to
continuously monitor and remedy data
quality problems by setting up master data
management programs and creating new
job roles such as that of data auditor, data
steward, or data quality manager.
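As a small, hedged illustration of what continuous data quality monitoring can look like, the pandas sketch below computes a handful of the dimensions named above (completeness, consistency, accuracy, timeliness) over an invented customer extract. A data steward would run checks of this kind on a schedule and track the results over time.

import pandas as pd

# Hypothetical customer extract with the usual quality problems.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "last_updated": pd.to_datetime(["2014-01-15", "2013-02-01", "2014-06-30", "2014-09-12"]),
})

report = {
    # Completeness: share of non-missing values per column.
    "completeness": customers.notna().mean().round(2).to_dict(),
    # Consistency: duplicated business keys.
    "duplicate_ids": int(customers["customer_id"].duplicated().sum()),
    # Accuracy (crude proxy): entries that are missing or fail a simple format rule.
    "bad_emails": int((~customers["email"].str.contains("@", na=False)).sum()),
    # Timeliness: records not refreshed since a fixed cutoff date.
    "stale_records": int((customers["last_updated"] < "2013-12-31").sum()),
}
print(report)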
Analytics should always start from a
business problem rather than from a specific technological solution. However, this
comes with a chicken and egg problem.
To identify new business opportunities,
one needs to be aware of the technological
potential first. As an example, think about
the area of social media analytics. By first
understanding how this technology works,
a firm can start thinking about how to
leverage this to study its online brand
perception or perform trend monitoring.
To bridge the gap between technology
and the business, continuous education
is important. It allows companies to stay
ahead of the competition and spearhead
the development of new analytical applications. At this point, the academic world
should make a mea culpa, since the offering of Master of Science programs in the
area of big data and analytics is currently
falling short of the demand.
Another important component for
turning data into concrete business
insights and adding value using analytics concerns the proper validation of the
analytical models built. Quotes such as "if you torture the data long enough, it will confess" and terms such as "data massage" have cast a negative perspective on the field of analytics. It goes without saying that analytical models should be properly audited and validated, and many mechanisms, procedures, and tools are available to do this. That's why more and more firms are splitting up their analytical teams into a model development and a model validation team. Good corporate governance
then dictates the construction of a Chinese
wall between both teams, such that models developed by the former team can be
objectively and independently evaluated
by the latter team. One might even contemplate having the validation performed
by an external partner. By setting up an
analytical infrastructure whereby models
are critically evaluated and validated on
an ongoing basis, a firm is capable of continuously improving its analytical models and thus can better target its customers.
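A minimal sketch of that independent validation step, using scikit-learn on synthetic data standing in for a credit-scoring exercise (all features and figures are invented): the development team fits the model on one sample, and the validation team evaluates it on a holdout the developers never touched.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic applicant data: four invented features and a default flag.
X = rng.normal(size=(5000, 4))
y = (X @ np.array([1.2, -0.8, 2.0, 0.5]) + rng.normal(size=5000) > 0).astype(int)

# The model development team fits on its own sample...
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression().fit(X_dev, y_dev)

# ...and the validation team scores the model on data the developers never saw.
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Holdout AUC: {auc:.3f}")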
Analytics is not a one-shot exercise. In fact, the frustrating thing is
that once an analytical model has been
built and put into production, it is outdated. Analytical models constantly lag
behind reality, but the gap should be as
minimal as possible. Just think about it:
An analytical model is built using a sample of data, which is gathered at a specific
snapshot in time given a specific internal
and external environment. However, these
environments are not static, but continuously change because of both internal
(new strategies, changing customer behavior) as well as external effects (new economic conditions, new regulations). Think
about a fraud detection model whereby
criminals try to continuously outperform
the model to gain financial advantage.
Another example is a credit-scoring model
which is heavily dependent upon the current state of the macro economy (upturn
versus recession). Hence, to be successful
and create added value, analytical models
should be accompanied by monitoring
and back-testing facilities that can facilitate the decision about when to tweak or
rebuild them.
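One common back-testing check, sketched below in plain NumPy on invented score distributions, is the population stability index (PSI), which compares the score distribution observed when the model was built with the distribution being scored today; a sustained rise signals that the model may need to be tweaked or rebuilt.

import numpy as np

def population_stability_index(expected, actual, bins=10):
    # Bin today's scores ("actual") using the deciles of the build-time scores
    # ("expected") and measure how far the two distributions have drifted apart.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)   # guard against empty bins
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(7)
scores_at_build = rng.normal(600, 50, 10_000)   # hypothetical scores at model build time
scores_today = rng.normal(585, 55, 10_000)      # today's population has shifted downward

psi = population_stability_index(scores_at_build, scores_today)
print(f"PSI = {psi:.3f}")  # a common rule of thumb treats values above 0.25 as a significant shift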

Key Underlying Technologies


In order to set up an analytics environment, firms need to make decisions about
the hardware and software technologies
to be adopted. Hardware-wise, big data
requires specialized infrastructures (e.g.,
Hadoop and the related software stack)
to store, integrate, clean, and manage the
data. In order to limit this investment,
firms may opt for storing data in the cloud
and use the approach whereby big data
is offered as a service. Obviously, when transferring data outside a company's context, adequate cautionary measures should be taken to guarantee the confidentiality and privacy of the data collected.
Software-wise, many vendors are currently providing commercial solutions
for big data and analytics. There are also
more and more open source, free software
solutions being offered in the market.
Although these solutions are getting very
popular, they are not mature enough yet
to handle both the diversity and volume
of data sources needed to compete using
analytics. Successful analytical software
should provide integrated, comprehensive, vertical business solutions rather than
focus too much on a cross-industry, horizontal approach. Big data and analytics is
increasingly becoming part of a business's DNA, and needs to be genetically configured as such. Think about government entities, the financial sector, and the pharmaceutical industry, which each have their
own footprint, data specificities, business
problems, and even regulations.
Given the technological intricacies of
setting up an analytical environment, one
might contemplate outsourcing the whole
exercise. However, since internal company
data, and especially the analytical insights
that come with it, are the most valuable strategic assets which constitute the company's DNA, it is strongly discouraged to
give third parties full access to it. On the
contrary, firms should build the necessary
in-house skills and centers of excellence
to serve the organization-wide analytical needs. Obviously, this should also be
properly managed. So it is also important to involve the board of directors and
senior management in the development of
an analytics environment. Adding a Chief
Analytics Officer (CAO) to the C-suite is
an option which more and more companies are exploring in this context. This
person is then responsible for setting up a corporate-wide analytics environment and
infrastructure, and continuously supervise
the development, audit, and deployment
of the analytical models across all business
units in the enterprise.

As a final note, there are also more
small- and medium-sized enterprises
(SMEs) becoming interested in leveraging big data and analytics. Since these
firms typically have only limited budgets,
they are particularly interested in off-the-shelf, pre-packaged software solutions that can be directly used to analyze their data. Actually, the most popular technologies in use among SMEs are web analytics tools
to study how their websites are being discovered and used, improve their search
engine ranking, or decide upon their optimal organic-versus-paid-search online
marketing mix.

What's Ahead
To fully leverage the power of big data and analytics, organizations should:
• Simultaneously invest both in data quantity and data quality;
• Embrace continuous education to bridge the gap between new analytical technology and emerging business opportunities;
• Create analytical teams that include independent model development and model validation teams; and
• Understand that analytics is more than just model development and validation, and should also be accompanied by monitoring and back-testing facilities.
From a technological perspective, companies should:
• Consider the option of using cloud services for big data and analytics;
• Adopt the right vertical, industry-specific focus when selecting software solutions and be careful with open source software; and
• Avoid outsourcing their analytical activities, and instead focus on building expertise in-house accompanied by appropriate senior management oversight. n

Bart Baesens is a professor at KU Leuven (Belgium), and a lecturer at


the University of Southampton (United Kingdom).
He has done extensive
research on analytics,
customer relationship management, web
analytics, fraud detection, and credit risk
management. His findings have been
published in well-known international
journals, and presented at international
top conferences. He is also author of
Credit Risk Management: Basic Concepts,
published by Oxford University Press in
2008; and Analytics in a Big Data World,
published by Wiley in 2014. His research
is summarized at www.dataminingapps.com. He also regularly tutors, advises,
and provides consulting support to international firms with respect to their analytics and risk management strategy.

sponsored content

Big Data Infrastructure Management: Not Quite Nirvana Yet?
The application economy has resulted
in a data explosion of unprecedented proportion: data is growing in format, type, and volume more quickly than ever before. The buzzwords and talking points related to this Big Data explosion, the all-too-familiar four Vs of Volume (the sheer enormity of Big Data), Variety (data of different types: structured, unstructured, semi-structured), Veracity (integrity, trusted and in context), and Velocity (e.g., performing real-time analytics on streaming data), are terms we hear daily, and can relate to. That data could
hold tangible value for your business, but
what is the method behind the madness?
Transitioning from knowing something has
to be done, to taking prescribed action, can be
a tremendous hurdle to overcome. Ensuring
the actions taken to address the needs of your
business do not burden your company with
unnecessary cost and resource requirements
is an even greater challenge. But to remain
competitive in this economy, becoming agile
and intelligent is not an option; it is an imperative. It is critical to your business to
identify your goals, and determine how to use
your data to achieve those goals, by reaching
the truth behind the analysis and applying
the benefits of that analysis to your company.
But too often complexity stands in the way
of successful achievement.
There are so many benefits to be had,
but the reverberating dilemma is how to
make sense of it all in a meaningful, logical, efficient way that won't break your budget.
For enterprises implementing Big Data
infrastructure, often with hundreds of nodes
and clusters within an Apache Hadoop
distributed environment, complexity
abounds. A Big Data project often begins
with just a few clusters, perhaps using
the freely available Apache Hadoop

distribution. Once started, the distributed


environment can quickly grow as business
needs evolve. This growth requires you to
expand the number of clusters and nodes to
provide greater system capacity needed to
run additional Hadoop jobs. Often, other
big data projects are initiated by teams in
the same organization using a different
Hadoop distribution from one of the leading vendors (e.g., Cloudera or Hortonworks), and they will likely download a variety of freely available open source tools. The end result is a mixed (multi-vendor), distributed environment in which the clusters associated with each specific distribution have their own unique manager tool. Considering that these distributed environments often span tens and hundreds of clusters, it's easy to imagine that such an environment can quickly become an administrative nightmare when having to bounce between 10, 20, or more manager (admin) tools.
Consider the administrator who needs to
determine the location and type of system
issues as they arise, before they challenge the
integrity of your environment. This increases
in difficulty as your environment grows and
becomes more complex and disparate. Using
manual processes and relying on alerts from
tens or hundreds of instances of software can be
overwhelming and inefficient. Simplification
is the key to your success.
Many companies feel that the ideal
simplification would involve a single viewpoint into your entire heterogeneous environment: a single dashboard of all
nodes and clusters. This aggregation point
for Big Data infrastructure management
could include monitoring and alert
management integrated with intelligent
automation and issue resolution. This
would allow you to streamline your Big


Data infrastructure management efforts


by centralizing the management role
and arming you with tools to minimize
human error and manual intervention.
A solution that supports your evolving,
growing environment, across your Hadoop
infrastructure choices, would enable you
to tailor your Big Data environment to suit
your needs, using the analytics software
that best fits each individual application or
business focus. n
David Hodgson,
CA Technologies,
Senior Vice President,
Strategy and Product
Management Mainframe
Business Unit
As SVP Product Management &
Strategy, David is currently focused on CA's Mainframe portfolio and keeping it relevant in the evolving world of IT. With a focus on the flagship product, Chorus, he leads a team that is delivering innovations to enable the mainframe to be a critical part of private and hybrid cloud infrastructure and big data analysis, and to be accessible from mobile devices.
David joined CA in 2000 and has 30
years of experience in the software industry
spanning support, services, development,
IT management, process and operations,
and business development.
Prior to CA, David served in executive management positions at Sterling Software, and held management
and consulting positions at a number of
other ISVs and technology companies.
Now a US citizen, David was born and
educated in the UK, where he earned a
bachelor of technology degree with honors
from the University of Bradford, England.

CA TECHNOLOGIES
For more information, visit
www.ca.com/BigData.

industry updates

The State of Cloud Technologies

As Clouds Roll In, Expectations for Performance and Availability Billow
By Michael Corey, Don Sullivan

At a conference several years ago, a


speaker discussed the future of computing
and information technology. This speaker
painted a world of computing devices that would operate in a way similar to the way a telephone worked at the time: A user would plug a simple device into the wall, and it would have access to all the resources and
capabilities of the computers to which the
device could be wired.
Moving forward many years, the speaker's vision now appears astonishingly prescient. Simply replace the word "wall" with "network" and the speaker's vision of the future
maps to the reality of the present like Gene
Roddenberry envisioning a communication
device in the 1960s. But today, there is no
need to plug a device into the network, as the
network is omnipresent and invisible. Today,
we refer to that network as the cloud.
Today, for example, if someone moves
to a new home, technicians dutifully arrive
to install the cable and internet. To make
certain that access to the cloud is not interrupted, a wireless router from a company
such as Comcast may provide an additional
public network to customers needing access


to the internet. According to The New York


Times, a combined Comcast-Time Warner agreement (which looms as of this writing) would aggregate the companies' combined
customer base to over 30 million. The consequence of this deal would be a potential 30
million cloud endpoints, not including every
coffee shop, airport, library, and any other
place one might consider that offers Wi-Fi.
What is unambiguously clear is that we live in
a world where access to the cloud and high-speed bandwidth is ubiquitous, just like the
clouds in the sky.

The Impact of Cloud on Everyday Lives


When high-speed bandwidth is combined with universal access, virtually anyone
anywhere will be capable of changing the
computing landscape. These alterations will
occur at a ferocious pace. On a personal level,
we all now have multiple devices to access the power of the cloud, be they our smartphones, tablets, TVs, intelligent thermostats, or Xboxes, and we will continue to amass more
next-generation cloud-ready devices in the
future. A new level of video games has been
created, allowing players all over the globe to

interact with each other in real-time action.


Whole new classes of business applications
housed in the cloud such as Salesforce.com
have emerged. The music industry has been
completely turned upside down due to the
cloud, and entirely new types of social media
sites have been launched such as TripAdvisor,
Facebook, Twitter, and Yelp.
One could also argue that a major factor that contributed to the Arab Spring uprisings was cloud computing. The combination of social media and access to the world at their fingertips, combined with activists' ability to communicate with each other instantly,
helped fuel the demonstrations that brought
about precipitous changes.

The Impact of Cloud on Business


As we use the cloud each day of our lives
and as we leave a trail of information about
ourselves, companies are also finding new
ways to harvest that information. Today, we
don't talk about data, we talk about big data. The cloud is contributing to a data explosion unlike anything we have ever seen. One way to think of data today is in terms of the "dataverse." No longer is data housed in just a


SQL Server or Oracle database. The questions are: Who owns this dataverse? Who
runs this dataverse?
One can easily describe the obvious
developments that are based on the cloud,
but one cannot speculate on the future as
easily as the aforementioned speaker. For
every new cloud technology, there are millions of new users who did not invent that
technology and who did not grow up in a
world in which communication with their
best friend from high school required dialing 10 digits and paying exorbitant fees to
a phone company. These millions have for
their entire lives existed in a world in which
they simply could reach for any device anywhere and access any application at any
time, and communicate in milliseconds
with anyone they wanted to. For these
users, communication came at virtually no
cost and it was often with people that they
had never seen or spoken to before. Take this a step further and imagine the conversation with the CFO of 2030 when she wants to know why that 75-year-old salesman needs to have a face-to-face meeting with the customer. Will the answer be, "He was born before the internet"?

The Changing Role of the DBA


Traditionally, we relied on our database
administrators to ensure the data was safe,
secure and accessible. However, according
to a recent career survey, approximately
70% of the DBAs are over 45 years old, and
20% of those surveyed are within 10 years
of retirement ("The Vanishing Database Administrator: Survey of Data Professionals' Career Aspirations," September 2014).
The results are not surprising when
one considers that the first relational database product was introduced over 30 years

ago. Unlike traditional technologies, which would be getting close to the end of their
usefulness as the technologists themselves
were leaving the workforce, databases are
continuing to explode as a direct result
of all the new data resulting from cloud
computing.
Yet, even as we see the baby boom generation of DBAs moving toward retirement, we are living in a cloud-fueled world
of application proliferation coupled with a
data explosion that is creating additional
demand for these critical DBA skills. The
result is that the demand for DBAs is outpacing the supply at an alarming rate.

The Emerging Role of Cloud in Data Management
The approaches that companies have
used to manage data in the past are not
sustainable as we move into the future.
This means the database administrator who wants to remain relevant in the
future will have to manage the dataverse.
This includes managing big data, data
stored in databases, and unstructured
data wherever it exists. If the company
needs to collect data, analyze it, and act
upon it in whatever forms it exists, the
future DBA will need to be the custodian
of that dataverse. Over time, the future DBA will evolve more and more into a data scientist or data expert who is also responsible for the data.
As the baby boomer generation of
DBAs leaves the industry, companies will
have other challenges to overcome. How
will they find enough of these DBAs to
keep ahead of this data and application
explosion? What are the future DBA skills
that will be needed, keeping in mind that
DBA expertise takes years to acquire and

that, as an industry, we are not growing


fast enough? Added to that is the fact that,
today, a typical organization requires different types of DBAs to support its business. We now have many DBA silos. There
is the production DBA, the development
DBA, the cloud DBA, the OS DBA, the
application DBA, the virtualization DBA
(vDBA), and the list goes on and on. This
further increases the complexity of the
problem.
More and more, companies will be
forced to turn to the cloud for their DBAs
since they will be unable to meet this
demand internally. Companies that provide DBA expertise to customers all over
the globe via the cloud will become standard bearers for future staffing and managed services companies attempting to
deal with the data explosion being fueled
by the cloud.
For years, the industry collectively
believed technology itself would replace
the DBA. However, time has proven that
the combination of application and data
proliferation has created a human capital challenge centered on the data professional. Due to the ever-increasing complexity of the technology, demand will
continue to outstrip supply and there is
no technology in the near future that will
make the DBA obsolete.

The Always-On Enterprise


With high-speed bandwidth available
globally, a plethora of players has emerged
that are attempting to commoditize computing resources in the cloud. Microsoft
Azure, Amazon Web Services (AWS), and
VMware Hybrid Cloud Services (vCHS) are
three of the bigger players. If an in-house
IT department cannot provide a business
community with the infrastructure it needs


at the time it needs it, no problem; anyone with a credit card can go out into the cloud and purchase "by the drink" whatever computing resources are required.
But although it was thought by many
that these public clouds would eliminate
the need for traditional hosting companies, as time goes on, it seems the opposite is true. In actuality, what has happened is that properly positioned hosting
companies have made an adjustment in
their business models to stay competitive. Recognizing that a commoditized
infrastructure has limited longevity, these
vendors have developed deep expertise
in housing business-critical applications
in the cloud in market segments such as
healthcare. By focusing on these businesscritical areas, the hosting companies are
able to provide organizations a level of
service, subject matter expertise, and customization that a public cloud cannot. The
reason people dont do all of their shopping at Walmart is the same reason that
companies will not obtain all their computing infrastructure from players who are
simply commoditizing infrastructure.
More organizations are dipping their
toes in the cloud through public offerings and recognizing the advantages of
doing business this way, and they are also
using hosting companies to provide additional services required when they move
their most demanding applications into
the private cloud. In the future, public
clouds offerings will continue to expand
the demand for private cloud-based
infrastructure. These same companies
will migrate toward offering the services
of a platform such as Microsoft Azure so
that you will have the best of both worlds
under one roof.
Years ago, Massachusetts had "blue laws," which placed restrictions on business operations on Sundays and holidays. Today, the majority, if not all, of these laws have been repealed. People want to be able
to shop 7 days a week.
In the cloud, we expect and demand
infrastructure availability 7 days a week,
365 days a year. Vendors who cannot rise
to meet this standard will continue to


suffer. Recognizing the need to provide
high performance systems that are always
available in the cloud has led to a new
class of systems.


The Hyper-Converged Infrastructure


From this demand, the era of hyper-converged infrastructure has emerged. Think of a hyper-converged infrastructure as a tightly integrated network, computing, storage, and virtualization hypervisor, all in one device. Some examples
of hyper-converged infrastructure would
include Nutanix and SimpliVity. The lure
of these systems is simpler administration
since they no longer have to be managed
on each level of the stack independently. In
addition, they are all built with high availability as a cornerstone of their capabilities. When you tie all this together with a high-performance virtualization platform and hypervisor, you have a powerful new computing paradigm that is cloud-ready. The best attribute of this approach is that when the time comes to scale, organizations can simply add another hyper-converged device.
As we move forward into the future, the
business case continues getting stronger for
organizations to move their entire infrastructure into the cloud. Hyper-converged
infrastructures will become a critical piece
of the cloud infrastructure over time.

What's Ahead
There is no doubt that cloud is here to
stay and will continue to change our lives
and our businesses each and every day.
New classes of cloud-ready devices and
applications will also continue to emerge.
These new applications and devices
will further fuel the data explosion, helping companies to collect, analyze, and then act upon this data. There is no short-term
answer to the DBA shortage, and this
problem will drive more and more companies to cloud-managed data services.
Commoditization is not going to stop
specialization. Just as we don't do all our
shopping at Walmart, we are not going to
purchase our entire infrastructure from
Amazon Web Services or Microsoft Azure.
Yet, these commoditized services will continue to evangelize the benefits of a cloud
infrastructure and drive more and more
companies into the cloud.
With bandwidth and accessibility no
longer the challenge, organizations will
expect more and more from their cloud
infrastructure. This will drive them to
wider adoption of converged infrastructures to meet customer demand and
expectations. n

Michael Corey is president of Ntirety, a Division of Hosting (www.ntirety.com). VMware has named
Corey a vExpert, Microsoft
has named him a SQL
Server MVP, and Oracle
has named him an Oracle Ace. Corey's
newest book is Virtualizing SQL Server
with VMware. He is a past president
of the Independent Oracle Users Group,
and helped found the Professional
Association of SQL Server.
Don Sullivan has been with

VMware (www.vmware.com) since 2010 and is
the product line marketing manager for Business
Critical Applications. He is
an Oracle Certified Master, co-author of Virtualizing Oracle Database on vSphere, a VMware CTO Ambassador, and a VMware vExpert. In addition,
Sullivan was the co-creator of the Oracle
Certified Master Practicum in 2002.

Best Practices and Thought Leadership Reports

Get the inside scoop on the hottest topics in data management and analysis:
• Big Data technologies, including Hadoop, NoSQL, and in-memory databases
• Solving complex data and application integration challenges
• Increasing efficiency through cloud technologies and services
• Tools and techniques reshaping the world of business intelligence
• New approaches for agile data warehousing
• Key strategies for increasing database performance and availability

For information on upcoming reports: http://iti.bz/dbta-editorial-calendar
To review past reports: http://iti.bz/dbta-whitepapers

industry updates

The State of Social Media

Social Media Analytics Tools and Platforms: The Need for Speed

By Peter J. Auditore

Social media networks have radically


changed the way we do business and are now
vibrant and dynamic channels of influence
and content aggregation. They are virally
creating communities within communities
that are driving brand recognition and experience, product innovation, and everything
else associated with communications. Social
media networks now facilitate and automate
vast channels of interactions, connections,
and networks of people by enabling collaboration with colleagues, clients, and suppliers
anywhere and at any time.
These new channels (or ecosystems of
influence) have been greatly enhanced by a
rich suite of evolving Web 2.0 applications
that make it easy to participate in social media
environments. This is often overlooked in the
grand scheme of things. Most importantly,
however, within these new channels are
individual influencer ecosystems with their own dynamics, interrelationships, characteristics, and influence models. As management expert Gary Hamel once said, "Influence is like water, always flowing somewhere." And endorsement is the interaction that flows into Twitter, Facebook, LinkedIn, Yelp, and
TripAdvisor, for example.
Social media networks are now evolving
beyond Twitter, Facebook, and LinkedIn,
and collaboration has become less about
exchange and more about endorsement.
Social media analytical tools and platforms
must go beyond monitoring and analyzing text strings and search terms. They now
need to encompass evolving channels such as
TripAdvisor, YouTube, Yelp, Pinterest, Instagram, and Vine, for example.


The Global Bazaar and Wharton Consumer Analytics Initiative
The impact of far-reaching social networks on business and government is staggering. Every day, millions of consumers,
partners, suppliers, and businesses discuss
and share their brand experience. Enter the
new business-to-person (B2P) paradigm.
According to Satish Nambisan and Mohanbir Sawhney, co-authors of The Global Brain, social customers are driving innovation; they are empowered and collaborative; and they are the drivers and initiators of innovation and are increasingly viewed as a strategic asset to companies. Today, customers are looking for a personalized experience and relationship, demanding solutions rather than products, in what the authors call the "global bazaar."
One of the more interesting academic and
industry alliances is the newly created Wharton Consumer Analytics Initiative, where
companies are giving Wharton data to analyze (http://wcai.wharton.upenn.edu). This is
unique in that the majority of business intelligence tools and data warehousing practices
were developed in industry, not academia.
In the 2013 Big Data Sourcebook article
on Social Media Analytics, we discussed how
social media networks were creating large
datasets that provided organizations with an
opportunity to gain competitive advantage
and improve performance by getting closer
to their customers and constituents. We
explored how social media datasets provide
important insights into real-time customer
behavior, brand reputation, and the overall
customer experience. And how data-driven
organizations were monitoring and collecting these data from owned channels such as SAP's SDN or Salesforce's Chatter and open channels such as LinkedIn, Twitter,
Facebook, and others. For the most part,
this is still the case; although, in 2014 organizations are beginning to harvest and stage
social media data for collection and content analysis, in addition to employing even more new and easier-to-use listening tools
and analytic platforms that deliver on the
need for speed.
This year, we will focus on the top 10
challenges and issues associated with social
media analytics and characteristics of influence channels, and provide a more comprehensive view of technology/vendor trends
and business objectives, rather than dig
into the history of analytics. However, I
strongly encourage a visit to last year's Social Media Analytics article at www.dbta.com/DBTA-Downloads/SourceBook/BigData-Sourcebook-Your-Guide-to-the-DataRevolution-Free-eBook-4216.aspx.

Top 10 Social Media Technology Trends and Challenges:
1. Content analytics on owned channels
is king
2. Content analytics on open channels
with listening and monitoring tools
3. The emergence of the analytic database
with in-memory functionality
4. Datasets are huge and often
unstructured
5. Finding relevant data or data that
matters is difficult
6. The need for speed and scalability
is paramount


7. Data is not consistent and needs to be cleansed
8. The rise of new data platforms:
Revolution R, Hive
9. The data scientist is emerging
10. Lack of data analyst talent in most
organizations

The Global Business of Social Media


Last year, the majority of organizations were not harvesting and staging data
from social media channels, but now new
scalable and fast analytic databases are
enabling in-house and cloud staging of
social media datasets for social media listening, analytics, and brand management.
Many were employing public relations
agencies to execute these new business
processes and this trend continues, primarily because they lack internal talent,
such as data architects or hybrid data
analysts, that are part psychologist and/or
anthropologist, and can identify what data
to act on and how to act on it.
PR agencies are now investing heavily in technology. For example, Edelman
recently launched Berland, a new specialist research and analytics subsidiary in
Greater China. Agencies with large client
bases are leveraging technology as competitive differentiator and providing their
clients with nearly all that is needed from a
social media listening, analytics, and brand
management perspective, so for them its a
battle for share of wallet.

Content Is King
One of the most important aspects of
content management across social platforms is understanding how consumers
are engaging with content. Social media
tools and platforms are innovating in areas
such as social community content management, enabling businesses to manage
content around how consumers engage in
their own channels. However, some social
media content monitoring tools don't let
you drill down into the conversation. The
next frontier for many of these products
will be mixed media modeling.
Still, significant regulatory issues are
associated with harvesting, staging, and
hosting social media content, and apply

to nearly all data types in regulated industries. Data protection, security, governance, and compliance have entered an
entirely new frontier with in-house and/or
cloud-based management of social data.
Many social media products don't incorporate processes for governance, compliance, and data security.

Accelerating Decision Making


In today's social media networks, people trust the information enough that it is accelerating business decision making, making processes such as sentiment analysis and fast access to data analysis even more important. In-memory processing and analytical databases now enable fast and efficient processing and cleansing of data; in the old world of data warehousing, this is called ETL (data extraction, transformation, and loading). New products such

New products have emerged


to handle data, and are often
10 times faster than the
normal analytics.

as analytic engines have also emerged to


handle data, and are often 10 times faster
than the normal analytics that can also
handle streaming analytics required for
computing credit scores, tracking transactions, and identifying fraud. FICO, for
example, employs in-house-built streaming analytics engines.
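The ETL-style cleansing described above can be pictured with a toy sketch like the one below; the record fields, rules, and sample posts are invented for illustration and are not tied to any particular product.

```python
import re
from datetime import datetime

# Hypothetical raw posts pulled from a social listening API (field names are illustrative).
raw_posts = [
    {"user": " @Anna ", "text": "Loving the new phone!!! #happy", "ts": "2014-11-02T10:31:00"},
    {"user": "@bob", "text": "Loving the new phone!!! #happy", "ts": "2014-11-02T10:31:00"},  # duplicate text
    {"user": "@carla", "text": "battery died after 2 hrs :(", "ts": "2014-11-02T11:05:00"},
]

def clean(post):
    """Transform step: normalize the handle, strip noise, pull hashtags, parse the timestamp."""
    return {
        "user": post["user"].strip().lower(),
        "text": re.sub(r"\s+", " ", post["text"]).strip(),
        "hashtags": re.findall(r"#(\w+)", post["text"]),
        "ts": datetime.fromisoformat(post["ts"]),
    }

# Cleanse, then de-duplicate on (text, ts) before loading into the analytic store.
seen, staged = set(), []
for p in map(clean, raw_posts):
    key = (p["text"], p["ts"])
    if key not in seen:
        seen.add(key)
        staged.append(p)

print(f"{len(staged)} cleansed records ready to load")
```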

Social Media Tools and Platforms


Apart from the industry's leading analytics platforms such as IBM
(Cognos-SPSS), Oracle, SAP (Business
Objects), and SAS, which support social
media analytics, there are at least 37 vendors that now have platforms or analytic
modules that specialize in the analysis of
social media data on-premise or in the
cloud. The market for these products has
exploded along with the data volumes, and there are some lists on the web with 50 or more social media analytics vendors. This is in comparison to our short list last year of only nine.
The majority of the products are marketed as providing social media intelligence, monitoring, and analysis. Many of
the analytic modules support enterprise BI vendor platforms; however, many are not evolving fast enough to meet the needs of business. Most support the usual suspects (Facebook, Twitter, LinkedIn, YouTube, Instagram, and Pinterest), and some specialize in competitive intelligence, for example. Competitive intelligence modules can monitor competitor website traffic and identify and monitor channel performance.
More advanced analytic modules are
focusing on specific areas such as consumer behavior and are most effective
when used with the new class of high performance analytical databases.
Some analytic modules are specializing in competitive intelligence only, while
others offer more comprehensive suites
of functionality and enable you to place
content and/or engage in the conversation.
In addition, the ability to drill down into
the sentiment and contextual and noncontextual data from multiple sources
is important as social media continues
to evolve beyond the social media giants
Facebook, Twitter, and LinkedIn.

Key Functionality of Social Media Analytic Modules or Platforms
Twitter-certified
Monitoring of brand, keyword, and
hashtags
Standard reporting tools
Content monitoring and tracking
Competitive analysis and
benchmarking
Competitor website and channel
monitoring
SEO analytics
Influencer identification and tracking
and reporting
Multi-channel analysis
Multiple dashboards and landscapes
Social media platform traffic and
real-time monitoring
Automated alerts


Social Analytic Tools and Platforms


Adobe Social www.adobe.com
Agorpulse www.agorpulse.com
(monitor your Facebook and Twitter)
Autonomy www.autonomy.com
Attensity http://attensity.com
Brandwatch www.brandwatch.com
BambooEngine www.manumatix.com
(consumer engagement-sales)
BlueYonder www.blue-yonder.com
Buffer www.bufferapp.com (services
for agencies, business and enterprise)
Crimson Hexagon
www.crimsonhexagon.com
Crowdbooster www.crowdbooster.com
(Facebook and Twitter)
Conversocial www.conversocial.com
(cloud-based solution)
Dataminr www.dataminr.com
(financial and government focus)
Gnip www.gnip.com
Google Analytics www.google.com/
analytics (Facebook, Twitter, LinkedIn,
and Google+)
Hootsuite www.hootsuite.com
InfiniGraph www.infinigraph.com
Lithium www.lithium.com
Kapow http://kapowsoftware.com
Moz Analytics www.moz.com
Netbase www.netbase.com
Quintly www.quintly.com (Facebook,
YouTube, Twitter & Google+)
Revolution R
www.revolutionanalytics.com
Rival IQ www.rivaliq.com (competitive
tracking of Facebook, Twitter,
Google+, LinkedIn)
Salesforce Exact Target Marketing
Cloud www.salesforce.com
Sysomos www.sysomos.com
Simplymeasured
http://simplymeasured.com
(The Amazons, plus Vine & others)
SocialBakers www.socialbakers.com
(Facebook, YouTube, Twitter, and
Google+)

Socialmetrix www.socialmetrix.com
(Latin America)
SproutSocial http://sproutsocial.com
Topsy www.topsy.com
Visible www.visible.com
Wayin www.wayin.com (BYOD
analysis, curate, and distribute content)
Zuum www.zuumsocial.com
(Facebook, Twitter, YouTube,
Instagram, and Google+)
33Across www.33across.com

Net/Net
Now more than ever, we live in what
management advisor Joe Pine coined
as the experience economy, and social
media channels deliver more than that one
memorable event for the customer that
becomes the product (The Experience
Economy: Work Is Theater & Every Business a Stage, Pine and Gilmore, 1999).
Data processing has changed and many of
the legacy platforms (including the analytics) of the 1980s to 2000 are challenged
to handle the waves of data created by the
internet and, most notably, social media
channels. The open source community
has emerged to address this challenge
through Hadoop and Apache Hive, along
with a new breed of analytic databases and
social media analytics and platforms. The
bullet list below describes the majority of
business activities enabled by social media
analytics modules and platforms.

Business Activities Enabled By Social Media Analytic Tools
Brand and sentiment analysis: watch sentiment in near real time
Identification, ranking, and now
tracking of key influencers
Campaign tracking and measurement
Product launch measurement
Product innovation through
crowd-sourcing
Digital channel influence

Purchase intent
Customer care
Risk management
Competitive intelligence and tracking
Partner monitoring
Category analysis

What's Ahead
Looking ahead, it will be important
for social media analytics tools to advance
into mixed media modeling with the ability to drill down into conversations, incorporate procedures for data security and
governance, and evolve with greater agility
to address changing business needs.
The missing link now in social media
analysis is not the data; it's the lack of
expertise in the form of data scientists.
This presents another widening chasm
between line professionals who are already
frustrated with IT, and IT professionals.
While you can't expect IT to understand and know how to act on consumer
behavior psychology on a 24/7 basis, technology has always been a competitive differentiator in business. Now, social media
analytic tools and platforms have become
a competitive weapon for smart businesses
and organizations.
Peter J. Auditore is currently
the principal researcher at
Asterias Research, a boutique consultancy focused
on information management, traditional and
social analytics, and big
data (www.thedatadog@wordpress.com).
Auditore was a member of SAP's Global
Communications team for 7 years and most

recently head of the SAP Business Influencer Group. He is a veteran of four technology startups: Zona Research (co-founder);
Hummingbird (VP, marketing, Americas);
Survey.com (president); and Exigen Group
(VP, corporate communications).
industry updates

The State of Data Quality and Master Data Management

The Big Data Challenge to Data Quality
By Elliot King

Changes in paradigms in information technology do not happen nearly as quickly as people usually suggest. Mainframes dominated the IT world in the 1950s and 1960s. They gave way first to minicomputers, then to personal computers and networks in the 1970s and 1980s, and to laptops in the 2000s. In some way, shape, or
form, distributed computing dominated
the IT landscape for nearly 30 years. Now,
devices such as smartphones and tablets
are in ascendance.
Each change in computing infrastructure has a huge impact on the data produced, accessed, and applied. In the mainframe world, data was closely linked to the
programs that produced it. The creation of
relational databases in the 1970s severed
the link between the application and the
data it produced and set off the first great
data explosion. Suddenly, applications were
creating structured data that could be used
as a resource by other applications. Structured data could be combined, analyzed,
and redeployed in ways that were only limited by users' imaginations. As the relational database became ubiquitous in the
enterprise, data growth rates accelerated.
But that was only the beginning. The
personal computer revolution and the
extension of compute power throughout
not only the enterprise but also people's
homes, triggered huge growth of structured data as well as semi-structured data
such as email and unstructured data such
as text. And the proliferation of computer
networks meant that data of all sorts could
fly instantly to all sorts of places.
And then came the internet. The internet meant that not only could more people
produce more data and move it to more
places more quickly than ever before, but
also that clever new applications could
produce new kinds of data that people
could analyze and use to their advantage.
Social media turned the whole world into
data producers and many enterprises are
currently salivating about the possible

uses for social media information. To use a metaphor: mainframe computing was like exploring a planet; relational databases opened up a solar system; and personal computing and the internet allowed folks to investigate a whole galaxy of information. Each step led to an exponential, heretofore unimaginable, jump in data growth,
data types, and potential data uses.
But each shift in data also posed new
problems for data quality. The records
for structured data could be inconsistent,
incomplete, or obsolete. Generated by
different applications managed by different internal groups, structured data had
to be centralized and then transformed
into a useful format to be analyzed and
deployed. Data from social media is even
more problematic. Among the many challenges associated with social media data is
that its provenance is unclear and suspect.
Who actually created the information
and how reliable is it? Can it be trusted?
A recent study titled "Governance Moves Big Data from Hype to Confidence," conducted by Unisphere Research, a division of Information Today, Inc., and sponsored
by IBM, found that more than one-third
of the respondents had less confidence in
data gleaned from outside sources than
they had for internal data and 61% were
not confident in the data garnered from
social media. Nonetheless, around one-third of respondents also indicated that
they were willing to use lower quality data
for analysis. Indeed, it seems that data
quality efforts have always trailed the creation, dissemination, availability, and use
of new data and that is the way it has been
for several data cycles.

Big Data Challenge


The most recent explosion in the volume, variety, and velocity of data has led
to the use of the term big data. And the
growth of big data has led to several new
challenges to data quality. At its most
fundamental, data quality has been premised on the idea that people first identify the questions they want answered and
then identify the data needed to answer
those questions. Once the dataset or sets
have been determined, data quality professionals typically define data quality
benchmarks to ensure the data is appropriate for the intended use. The process
entails measuring key data attributes,
which generally include validity, accuracy,
timeliness, conformity, and completeness
among other attributes. Best practices call
for data quality to be managed as close to
the data source as possible and all the elements of the data quality process should
be documented.
Big data often flips large parts of that
formula. One of the mechanisms associated with big data is to gather data first and
32

BI G D ATA SOU RC EBO O K 2014

figure out the questions later. Frequently,


it is very difficult to manage big data flows
at the source, since an enterprise may
not control the source of the data. Moreover, by its very nature, more often than
not, big data includes a wide variety of
data types, some of which may be highly
unstructured.
Consequently, data quality efforts in
big data environments have to focus on
different elements than those used in more
traditional scenarios. Metadata standards
and requirements have to be defined, clear,
applied at very basic levels, and integrated
across the enterprise to ensure that complex data from multiple sources can be used
throughout the enterprise. Data classification and categorization are essential. For
example, is the data made up of personal
information, financial data, product attributes, and so on? Finally, data exchange
standards are critical to data quality for big
data. Data exchange standards can play a
central role in the data acquisition process
as they enable the mapping of data across
multiple data sources.
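As a rough illustration of the metadata, classification, and exchange-standard discipline described above, the toy sketch below tags each incoming dataset with a small metadata record before it is admitted; the schema, category names, and formats are hypothetical, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical metadata record attached to each incoming dataset.
@dataclass
class DatasetMetadata:
    name: str
    source: str                       # where the data came from (provenance)
    classification: str               # e.g., "personal", "financial", "product"
    exchange_format: str              # agreed exchange standard, e.g., "CSV", "JSON"
    received: datetime = field(default_factory=datetime.utcnow)

ALLOWED_CLASSES = {"personal", "financial", "product", "machine"}

def register(meta: DatasetMetadata, catalog: list) -> None:
    """Reject datasets whose classification falls outside the agreed categories."""
    if meta.classification not in ALLOWED_CLASSES:
        raise ValueError(f"Unknown classification: {meta.classification}")
    catalog.append(meta)

catalog = []
register(DatasetMetadata("twitter_mentions", "social_api", "personal", "JSON"), catalog)
print(f"{len(catalog)} dataset(s) registered with enterprise-wide metadata")
```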

Master Data Management


One of the more common tools applied
to data quality efforts for big data projects
is master data management (MDM). MDM
allows enterprises to link all of their critical
data in a single file. Implementing a master
data management platform streamlines the
sharing of data across the enterprise and
allows for the use of data by multiple platforms and applications. In the past, MDM
platforms were used to create a holistic view
of an entity, often a customer, from the
data generated by a variety of internal systems ranging from the CRM application to
the financials. The promise of MDM was
to construct a single version of the truth.

While a single version of the truth may


be harder to construct in big data scenarios,
the application of MDM can increase confidence in the trustworthiness of big data
analysis.
According to a survey conducted by the
analyst firm Information Difference, about
60% of the respondents who indicated
that they had big data projects underway
also had implemented master data management. And more than 50% of those
with both big data projects and MDM
programs said that the two were linked.
Interestingly, in many ways, MDM and
big data can interact synergistically. On
the one hand, an MDM file on customer
data could potentially be combined with
social media data to provide additional
insight into consumer behavior. On the
other hand, that analysis of big datasets
could potentially generate new data for the
MDM implementation. According to the
Information Difference survey, however,
only 17% of the respondents indicated
that they garnered new data for MDM
via the use of big data. Instead, MDM was
driving the use of big data.
Most observers believe that MDM provides a necessary component for big data
projects by establishing a meaningful context for analysis as well as a wide-angle lens
for critical data entities such as customers,
products, and employees. It can be used to
store attribute-level data such as Facebook
IDs or phone numbers. Transactional-level
data such as likes or conversations can be
stored in the big data repository.
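The split described above, attribute-level data in the master record and transactional-level events in the big data store, can be sketched roughly as follows; the customer attributes and social events are invented for illustration.

```python
# Toy MDM-style golden record: attribute-level data lives in the master record,
# while transactional-level events (likes, conversations) stay in a "big data" store.
master = {
    "cust-001": {"name": "Jane Doe", "phone": "+1-555-0100", "facebook_id": "jane.doe.42"},
}

# Hypothetical events harvested from social channels (illustrative shape).
social_events = [
    {"facebook_id": "jane.doe.42", "type": "like", "object": "ProductX launch post"},
    {"facebook_id": "unknown.user", "type": "comment", "object": "ProductX launch post"},
]

# Join the two worlds: attach each event to the matching master record, if any.
by_facebook_id = {rec["facebook_id"]: cid for cid, rec in master.items()}

for event in social_events:
    cid = by_facebook_id.get(event["facebook_id"])
    owner = master[cid]["name"] if cid else "<unmatched>"
    print(f"{event['type']:8s} on {event['object']!r} -> {owner}")
```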
While MDM may be positioned to play
an enhanced role in the data infrastructure, that transition will take place over
time as more big data projects come on
stream. Indeed, the IBM survey revealed
that only 30% of the respondents had

sponsored content

A 3-Step Process to Prepare and Blend Big Data for Big Insights
Bigger isn't always better when it
comes to data, no matter how big the buzz
that surrounds Big Data. The trick to Big
Data is ensuring that insight derived is
accurate and optimized. Enter Big Data
Quality and the Big Data Blend. With three
data quality steps, which culminate in the
Big Blend, you can have a winning recipe for
actionable, meaningful analytics and insight
at the highest level.
The gargantuan amounts of data from
disparate sources create a complex Big
Data stream that needs data quality at the
get-go to help gather meaningful, actionable
analytics (worth repeating) and business
intelligence. After all, the driving force
behind Big Data is analytics and data mining
to understand customers, prevent fraud,
improve sales and marketing, create better
business decisions, and ultimately gain a
single, accurate view of the customer.
Too often Big Data meets the GIGO
scenario (garbage in and garbage out).
Output based on bad data leads to
untrustworthy conclusions and faulty
decision-making without the three-step process that ends with the Big Data Blend. About 60% of IT leaders say their organizations lack accountability for data quality of any kind, and more than 50% doubt the validity of their data, rendering it unusable.
The GIGO fate and any doubts about
data validity can be avoided when Big
Data meets Big Data Quality and reaches
the Big Data Blend where unstructured
data becomes useful. Data quality is an
important consideration for relational
data, and even more so with Big Data/
unstructured data. Unstructured data can
come from anywhere. And that creates perhaps the all-encompassing challenge of Big Data, obtaining value from the data deluge, from financial and Web transactions, documents/emails, Social Media/Weblogs, machine devices, and even scientific data points.

Welcome Big Data Blend to the process.


Big Data blending encompasses data quality
principles by using authoritative reference
datasets to enrich and validate data that is
then matched, linked, and merged to create
the best data sets for Big Data analytics.

THE STEPS TO ACHIEVE THE BIG DATA BLEND FOR THE ULTIMATE BIG DATA ENRICHMENT:
1. Profile, parse, and extract entities.
Entity extraction needs to be able to pull
customer and other structured data points
from unstructured data. For example,
emails, names, IP address, and URLs can
be matched with the customer record.
2. Cleanse, merge, and link. Data Quality
and Entity Matching/Merge delivers
timely, trusted, relevant Master Data to
the process. What applies in small data (Discover, Model, Cleanse, Recognize, Resolve, Relate/Match, Govern) also applies to Big Data. Identifying duplicates and merging/purging them helps provide a single, accurate view of the customer: what customers want to buy, what they have bought already, and their sentiments
toward the products. This information
can drive decisions, marketing, and
promotions.

3. Blend and enrich. Now that you have clean, consolidated data, it's time to blend in authoritative reference data (geographic, demographic, psychographic, and firmographic) to enrich the Big Data stream. For instance, you can't get a precise rooftop latitude/longitude without a valid address. Once you have that, you
can append many different types of data
to model your customer profile.
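A compressed sketch of the three steps just outlined, using invented records and a made-up reference dataset; it is only meant to show the shape of the flow, not a real data quality pipeline.

```python
import re

# 1. Profile, parse, and extract entities from unstructured text (toy example;
#    the regex and record shapes are illustrative only).
raw_notes = [
    "Order issue from jane.doe@example.com, lives at 12 Main St",
    "jane.doe@example.com called again about the same order",
]
records = [{"email": re.search(r"[\w.]+@[\w.]+", n).group(), "note": n} for n in raw_notes]

# 2. Cleanse, merge, and link: collapse duplicates onto one customer key.
merged = {}
for r in records:
    merged.setdefault(r["email"].lower(), []).append(r["note"])

# 3. Blend and enrich with a hypothetical authoritative reference dataset.
reference = {"jane.doe@example.com": {"geo": (40.7128, -74.0060), "segment": "urban"}}
golden = {
    email: {"notes": notes, **reference.get(email, {})}
    for email, notes in merged.items()
}
print(golden)
```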
The bigger the data and the greater the
number of sources, the more crucial it is to
use the Big Data Blend to create a golden version of the truth: a complete and accurate view of the customer. Linking
and merging allows businesses to perform
accurate customer segmentation and
sentiment analysis and gives a better
identification of who's legit and who's not.
Big Data can be many things or it can be
a bunch of nothing if fed into the machine
to become part of the noise. Follow a three-step process that incorporates the principles of Big Data Quality and Big Data Blending, and the result is the sweet sound of success and extraordinary customer insight.
MELISSA DATA CORP.
www.melissadata.com

industry updates

The State of Data Quality and Master Data Management

MDM platforms in production and 22% were not planning to move to MDM at all.

Potential Pitfalls
Implementing data quality processes
within the context of big data projects
presents many potential pitfalls. The first
emerges when a big data project is ready
to go live. At that point, the project team
should have defined what constitutes
good data and those rules should be rigorously applied. Since big data almost
by definition involves incorporating data
from many different sources, if the proper
data quality standards are not in place
prior to the initial load, data quality problems will surely emerge over time. Indeed,
frequently the lead-up to the initial load
may be the only time the entire project
team comprised of both IT and business
stakeholders can cooperatively define
what constitutes good data.
In general, the quality of data can be
assessed using standard quality metrics
such as completeness, consistency and so
on. But for big data projects, an additional
measure should be added: relevancy. Just
because data is available does not mean it
needs to be captured or used, though, with
big data the temptation is to do just that.
The second pitfall lies in managing the
sources of the data. In most big data projects, data flows in from a range of applications, both internal and external, and
the enterprise often does not have control
of all the sources. One approach to this
problem is creating a so-called data firewall. Not unlike an internet firewall, a data
firewall applies the data quality rules and
standards developed for the initial data
load to all the data coming into the system.
Some enterprises also reach out to external
data providers to encourage them to provide data at the necessary quality.
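The "data firewall" idea can be pictured with a minimal sketch like the one below; the quality rules and record layout are hypothetical stand-ins for whatever rules were agreed for the initial load.

```python
# Hypothetical quality rules applied at the boundary, mirroring the rules used
# for the initial load (completeness, basic validity, relevancy).
REQUIRED_FIELDS = {"customer_id", "event", "ts"}

def passes_firewall(record: dict) -> bool:
    """Return True only if the incoming record meets the agreed quality rules."""
    if not REQUIRED_FIELDS.issubset(record):
        return False                      # completeness
    if not str(record["customer_id"]).strip():
        return False                      # validity
    if record["event"] not in {"purchase", "complaint", "signup"}:
        return False                      # relevancy: capture only what is needed
    return True

incoming = [
    {"customer_id": "c-9", "event": "purchase", "ts": "2014-10-01"},
    {"customer_id": "",    "event": "purchase", "ts": "2014-10-01"},
    {"customer_id": "c-7", "event": "pageview", "ts": "2014-10-02"},
]
accepted = [r for r in incoming if passes_firewall(r)]
print(f"accepted {len(accepted)} of {len(incoming)} records")
```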
The final challenge and pitfall for data
quality in a big data environment is data
maintenance. In too many cases, once data
has been loaded, the rules and standards
for data are no longer updated. But not
infrequently, as companies merge with or
acquire other companies, they will have to
bulk-load data again.
In addition to preserving the initial data
quality structure, companies must implement an ongoing program to identify and correct faulty data. People make mistakes;
data is entered incorrectly. An ongoing data
maintenance program is essential.

What's Ahead
While companies are just starting to adjust their data quality programs to meet the demands of big data, the next data shock to the system is already apparent: the Internet of Things. To return to the earlier metaphor, if the mainframe and mini-computing were the earth, personal
computing a solar system, and the internet
and social media a galaxy, the Internet of
Things will be like a new universe of data.
It will be so vast it is hard even to imagine
its limits. And it will always be expanding.

Social media meant that billions of people could suddenly create useful information from which enterprises would like to
derive value; the Internet of Things means
that hundreds of billions of devices will be
producing useful data on basically a nonstop basis.
Fitbit, one of the first truly popular wearable computing devices, offers a
small glimpse of what is to come. For the
uninitiated, Fitbit is a health tracker worn
around the wrist that constantly monitors
the steps people take, the quality of their
sleep, and how long they have been active.
All that data can eventually flow to a host
of places including a health record, an
insurance company, or a sports coach. Fitbit produces data that was not easily captured before but will be of value to a wide
range of stakeholders.
And Fitbit is just a tiny example of the
start of the Internet of Things. According
to a survey of more than 1,600 futurists
and other experts conducted by the Pew

Research Center, the next decade will witness an explosion of machines communicating with other machines, wearable
technologies and the widespread embedding of sensors in virtually everything.
The unprecedented explosion of data will
be accompanied by a demand to analyze
this data to gain insight and advantage.
The ultimate goal of Fitbit is to improve
health. Data generated by road sensors, for
example, could be used to manage traffic
flows. And sensors attached to inventory
can form the basis to restructure purchasing decisions.
For those goals to be realized, new ways
of aggregating data and ensuring its quality will have to be developed. Data lakes
are one of the new approaches to assembling huge amounts of data. The idea is
to gather data together in their raw formats without going through an extract,
transform, and load process. The raw data
would be called on and transformed as
needed. But as analysts from Gartner have
pointed out, the data lake approach poses
significant problems for data quality. By
definition, data lakes accept data without
oversight or governance, two central elements of a sound data quality approach.
Ensuring data quality has always been
a struggle. Under the best conditions,
data decays, errors are introduced, and, in
general, data quality degrades over time.
However, the new waves of data growth
represented by big data and the Internet
of Things promise to make it even more
challenging to ensure data quality. Ultimately, additional strategies, tactics, and
techniques will have to be developed to
enhance the existing approaches.

Elliot King has reported

on IT for 30 years. He is
the chair of the Department of Communication
at Loyola University Maryland, where he is a
founder of an M.A. program in Emerging Media. He has written
six books and hundreds of articles about
new technologies. Follow him on Twitter
@joyofjournalism. He blogs at emergingmedia360.org.

industry updates

The State of Data Warehousing

BUILDING THE UNSTRUCTURED BIG DATA/DATA WAREHOUSE INTERFACE
By W. H. Inmon

One of the central issues of the data


architect/data analyst in 2014 going into
2015 is that of understanding how the world
of data warehousing will interface with big
data. Indeed, big data is everywhere and new
applications are being created daily. But most
organizations already have a data warehouse
environment in place. Data warehousing is at
the heart of the corporate analytical processing. The big question then becomes how does
the world of big data work in cooperation
and in conjunction with the world of data
warehouse?
One option is to have no interface
between these two environments. In this
case, the two environments sit side by side
and have no interaction or interchange of
information. This lack of interaction is a
viable option but is not a very good one.
Both big data and data warehouse are capable of supporting decision making and both
worlds entail data, so it makes sense that
there should be a cooperative, constructive
interface between these two worlds. Acting
as if these worlds have no relationship to
each other is simply not productive or realistic, even if it is a real possibility.

What Does the Interface Look Like?


So the question then becomes how
should the two worlds interface with each
other? How can an organization get the
best out of both worlds? Figure 1 outlines
how big data and data warehouse should
interface to create a constructive, nonoverlapping world.
The starting point of the architecture is
big data. Big data can be divided into two kinds of data: repetitive data and nonrepetitive data. This division is indicated by the red line bisecting raw big data in the
diagram. Repetitive data in big data is data
that repeats itself in large numbers where the
structure of the data is very similar and where
even the content may be identical. Typical
repetitive data might include metering data,
click stream data, telephone call record detail,
log data, and so forth. Structured data is
often placed in the big data environment and
resides in big data as another form of repetitive data.
Nonrepetitive data in big data is data
where the content and structure of each unit
of data is dissimilar. Each nonrepetitive record
is unique in terms of structure and content. If

any two records happen to have similar structure and/or content in nonrepetitive data, it is
an accident. Some examples of nonrepetitive
data in big data include email, corporate call
center records, corporate contracts, warranty
claims, insurance claims, medical records,
and so forth.
There are (at least!) two good reasons for
making this fundamental division of data in
big data. The first reason is that the technology required to process the different types of
big data is very different. Repetitive data can
be read and treated in a very simple manner.
You can load, read, and write repetitive data
in big data very simply. But nonrepetitive
data in big data requires the technology of
textual disambiguation in order to be transformed into a usable state.
Textual disambiguation is the technology
that reads nonrepetitive data and turns it
into structured data that can be managed in
a simple manner. Textual disambiguation of
nonrepetitive data is as fundamental to doing
business analysis on nonrepetitive data as the
basic storage method of data, Hadoop.
A second reason why there is such a distinction between the two types of data found in big data is that when it comes to business value, the vast preponderance of business value is in nonrepetitive data. Stated
differently, while there may be a lot of
repetitive data in big data, there is limited
business value there. Far and away the preponderance of data that contains business
value is nonrepetitive data.
Because of these two large differences,
it is necessary to treat big data as if it were
two different kinds of data. There are very
divergent paths of processing that apply to
repetitive data and nonrepetitive data.

Nonrepetitive Data
and Textual Disambiguation
Nonrepetitive data is read by textual disambiguation (also known as textual ETL)
and is prepared for analytical processing. A
simplistic perspective of textual disambiguation is that textual disambiguation parses
the nonrepetitive data into a state where the
data can be easily analyzed. However, thinking of textual disambiguation as a parsing
functionality misses the point about what
textual disambiguation really does because
textual disambiguation does much more
than parse nonrepetitive data.
Textual disambiguation does such
important activities as standardize data,
edit data, transform homographs, resolve
acronyms, correct spelling errors, and
so forth.
Textual disambiguation accomplishes
many things, but the single most important thing done by textual disambiguation
is that textual disambiguation identifies
and establishes context for the nonrepetitive data. Without context, nonrepetitive
data is almost useless.
In fact, textual disambiguation does
much more than establish context for
nonrepetitive data. But identifying and
establishing context is the most important
thing done by textual disambiguation.
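To make the idea concrete, here is a toy sketch of standardizing text, resolving acronyms, and attaching context to nonrepetitive records. It is emphatically not Textual ETL itself; the lookup tables, rules, and record shapes are invented for illustration.

```python
# Invented lookups standing in for the standardization and context rules described above.
ACRONYMS = {"acct": "account", "mgr": "manager"}
CONTEXT_RULES = {"refund": "complaint", "thank you": "praise"}

def disambiguate(doc_id: str, text: str) -> list:
    """Turn one nonrepetitive document into structured, context-tagged rows."""
    words = [ACRONYMS.get(w.lower().strip(".,"), w.lower().strip(".,"))
             for w in text.split()]
    normalized = " ".join(words)
    context = [label for phrase, label in CONTEXT_RULES.items() if phrase in normalized]
    return [{"doc": doc_id, "text": normalized, "context": c} for c in context] or \
           [{"doc": doc_id, "text": normalized, "context": "unclassified"}]

rows = disambiguate("call-0001", "Customer wants a refund on acct 1234, spoke to mgr")
print(rows)   # structured rows ready to load into a relational table
```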

Repetitive Data and


Distillation and Filtering
Repetitive data is treated in two ways.
Repetitive data can be either distilled or it
can be filtered. When repetitive data is distilled, many records of repetitive data are
analyzed and the results are refined down

to a single value or a very finite set of values. As an example of distillation, many


bank accounts are studied and the bank
decides to raise interest rates. The result
of the distillation is a one-half percentage
point raise in the rates charged by the bank.
The filtering of repetitive data is somewhat similar. In the filtering of repetitive
data, repetitive records are read, and the
records that are of interest are filtered
out and reformatted. Typically there are
many records that pass through the filtering process.
The difference between the distillation process and the filtering process is
that the distillation process produces only
one record as its output, while the filtering process produces multiple records as a
result of its execution.
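A minimal sketch of the two treatments, with invented records: distillation reduces many records to a single value, while filtering passes many reformatted records through.

```python
# Toy repetitive records (illustrative): one row per account transaction.
transactions = [
    {"account": "a1", "amount": 120.0}, {"account": "a2", "amount": 75.5},
    {"account": "a1", "amount": 430.0}, {"account": "a3", "amount": 12.0},
]

# Distillation: many records are analyzed and refined down to a single value,
# e.g., an average used to support one pricing decision.
average_amount = sum(t["amount"] for t in transactions) / len(transactions)

# Filtering: records of interest are selected and reformatted; many records
# typically pass through.
large = [{"acct": t["account"], "amt": t["amount"]}
         for t in transactions if t["amount"] > 100]

print(f"distilled value: {average_amount:.2f}")
print(f"filtered records: {large}")
```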

Placing the Results in a Database


The result of the distillation process,
the filtering process, and textual disambiguation is data that can be fed into a standard DBMS. There may be a considerable
volume of data that is fed into the database, but the volume is still a fraction of
the data that is found in big data.
By the time the data is fed into the
database, it takes the form of a standard
database (usually a relational database).
The database looks no different than any
other database. The only difference is that

the data in the database comes from a very


different source of data than traditional
data found in a data warehouse.
Because the data coming into the database comes entirely from big data, and
because all data in big data is unstructured, the database that has been created
can be called an unstructured database.
This terminology is somewhat a misnomer because all the data in the unstructured database is actually structured. But
the source of data for the unstructured
database is all unstructured.
Notwithstanding the confusion of
terms, it is possible to create a structured
database that reflects the terms and words
found in the big data environment.
It is noteworthy that data coming out
of textual disambiguation does not have to
be placed into a database. Data coming out
of textual disambiguation can be placed
back into big data. But when the disambiguated data is placed back into big data, it
is placed there in a context-enriched state. There are then two kinds of big data: raw big data and context-enriched big data.

The Unstructured Database


The creation of a structured database
based on unstructured data coming from
big data has several major advantages. The
first advantage is that the data found in the
unstructured database can be processed
by standard analytic technology. You can use Excel. You can use Cognos. You can
use Tableau. In fact there are many, many
analytic packages that can be used against
the unstructured database.
But there is a second, perhaps even
larger advantage to creating an unstructured database. That advantage is that
the unstructured database can freely be
mixed with other standard structured
databases.
Most organizations already have a
collection of standard structured databases. These databases typically have been
derived from the legacy systems environment and are often called a data warehouse. The raw legacy data has passed
through ETL processing and the resulting
data is integrated. The source for practically all the data found in the structured
database environment (the classic data warehouse environment) is the transaction processing application environment.
Once the structured data has been
placed into a database the data is formatted into a standard relational environment.
So organizations already have a collection of databases that contain data that
comes from the application, transaction
processing environment. Now they have
data that has come from the unstructured,
big data environment. And the data that
exists in the unstructured database is in a format completely compatible with the existing data warehouse environment.

Analytical Processing
It is a simple and time-tested proposition for SQL to access the two types of data
together, allowing the analyst to start to
do query and analytical processing against
structured data and unstructured data at
the same time in the same query.
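The kind of combined query described here can be sketched as follows, using SQLite in memory purely as a stand-in for the relational environment; the table and column names are invented.

```python
import sqlite3

# In-memory stand-in for the relational environment; tables/columns are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales(cust_id TEXT, product TEXT, amount REAL);        -- structured
    CREATE TABLE feedback(cust_id TEXT, context TEXT, snippet TEXT);    -- from textual ETL
    INSERT INTO sales VALUES ('c1', 'WidgetX', 99.0), ('c2', 'WidgetX', 99.0);
    INSERT INTO feedback VALUES ('c1', 'complaint', 'widgetx stopped working');
""")

# One query spanning both worlds: sales (structured) joined to disambiguated
# customer feedback (originally unstructured).
rows = con.execute("""
    SELECT s.cust_id, s.product, s.amount, f.context, f.snippet
    FROM sales s
    JOIN feedback f ON f.cust_id = s.cust_id
""").fetchall()

for row in rows:
    print(row)
```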

What's Ahead
The possibilities for analytical processing are endless. Now, the analyst can look
at data in a perspective that before has not
been possible. Now, the analyst can do
such things as:
Analyze customer sales data along
with customer feedback data
38

BI G D ATA SOU RC EBO O K 2014

Analyze warranty data, creating a database of manufacturing malfunctions


Analyze call center data along with
customer sales data
Analyze corporate contract data
along with customer purchases
And the list goes on. This is just the tip
of the tip of the iceberg.
Now, useful meaningful business analysis can take place looking at perspectives
that were before impossible to examine.

Different Types of Analysis
Once the two types of databases (that
now constitute the data warehouse) are
built, different kinds of analysis can occur.
These types of analysis are the following:
Analysis of structured data
Analysis of both structured and
unstructured data
Analysis of just unstructured data
The analysis of just structured data is
the analysis that has occurred since there
was the first data warehouse. The analysis of structured and unstructured data
is new. Now, combining the two types of
analysis produces analytical opportunities
combining different perspectives of data.
The third type of analysis is also new.
The third kind of analysis is analysis of just
unstructured data.
The most pregnant place to find business value is here. For many reasons there
is very high business value in looking at
the feedback and comments that customers have made.

Archiving Data Warehouse Data


There is another interesting and powerful path that data takes in the big data/data warehouse architecture. That path is from the data warehouse back to the big
data environment. This path is seen at the
top of the figure.
Over time, as data ages, the probability of access to the data in the data
warehouse diminishes. As the probability
of access diminishes (usually a function of
age), it no longer makes sense to keep the
data in the data warehouse. The data that
has aged is then sent to big data.
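A toy sketch of the age-based movement just described; the three-year threshold and record layout are invented, and a real implementation would depend on the warehouse platform.

```python
from datetime import date, timedelta

# Invented example: rows older than ~3 years are unlikely to be accessed and
# are moved out of the warehouse into the (cheaper) big data environment.
ARCHIVE_AFTER = timedelta(days=3 * 365)
today = date(2014, 12, 1)

warehouse = [
    {"id": 1, "loaded": date(2010, 5, 1), "fact": "orders Q2 2010"},
    {"id": 2, "loaded": date(2014, 9, 1), "fact": "orders Q3 2014"},
]

big_data_archive = [r for r in warehouse if today - r["loaded"] > ARCHIVE_AFTER]
warehouse = [r for r in warehouse if today - r["loaded"] <= ARCHIVE_AFTER]

print(f"kept {len(warehouse)} row(s) hot, archived {len(big_data_archive)} row(s)")
```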
There is a doubly beneficial effect to
the movement of data from the data warehouse to big data:
The cost of the data warehouse is
lowered, and
Performance of the warehouse is
enhanced.
The movement of data whose probability of access has diminished has long
been known and recognized. In an earlier
day this was called the movement of dormant data to the worlds of big data. This
data management practice was first discussed in the works on DW 2.0, published
in 2010. Today the practice is embodied in
the data warehouse/big data architecture
that has been described here.

W. H. Inmon, the father of the data warehouse, has written 52 books published in nine languages. His latest book is Data Architecture: A Primer for the Data Scientist (Elsevier Press). Inmon speaks at conferences regularly. His latest adventure is the building of Textual ETL/textual disambiguation, technology that reads raw text
and allows raw text to be analyzed. Textual
disambiguation is used to create business
value from big data. Inmon was named
by ComputerWorld as one of the 10 most
influential people in the history of the computer profession. He lives in Castle Rock,
Colo., where he also founded Forest Rim
Technology (www.forestrimtech.com).

sponsored content

The Modern
Database Landscape

Technology Innovations Power the Convergence of Transactions and Analytics


Emerging business innovations focused
on realizing quick business value from new
and growing data sources require hybrid
transactional and analytical processing
(HTAP), the notion of performing analysis
on data directly in an operational data store.
While this is not a new idea, Gartner reports
that the potential for HTAP has not been
fully realized due to technology limitations
and inertia in IT departments.
Traditionally, databases have been loosely
categorized into two groups: those optimized
for online transactional processing (OLTP)
and those optimized for online analytical
processing (OLAP). Until recently,
technological limitations have undermined
the efficacy and prevented adoption of a
unified HTAP platform. However, advances
in in-memory computing technology are
making HTAP a reality. HTAP is performing
transactional and analytical operations in
a single database of record, often doing
time-sensitive analysis of streaming data.
An HTAP-oriented system will most likely
complement an organization's existing analytics infrastructure rather than replace it entirely.
The main benefits of HTAP can be
grouped into two main categories: 1)
enabling new sources of revenue and 2)
reducing administrative and development
overhead by simplifying your computing
architecture.
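Setting product specifics aside, the core HTAP pattern, analytics running directly against the live operational database of record, can be pictured with this toy sketch; SQLite is used only as a stand-in, and the schema is invented.

```python
import sqlite3

# SQLite stands in for an HTAP-capable store purely for illustration: the point
# is that the analytical query runs directly on live operational data, with no
# copy to a separate analytics system.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders(id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Transactional side: operational writes as they arrive.
with db:  # implicit transaction, committed on success
    db.executemany("INSERT INTO orders(region, amount) VALUES (?, ?)",
                   [("east", 40.0), ("west", 75.0), ("east", 10.0)])

# Analytical side: real-time aggregate over the same database of record.
for region, total in db.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
```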

NEW SOURCES OF REVENUE


Many databases promise to speed up
applications and analytics. However, there
is a fundamental difference between simply
speeding up existing business infrastructure
and actually opening up new channels of
revenue. True real-time analytics does not
just mean faster analytics, but analytics that
capture the value of data before it reaches a
critical time threshold.
Technological limitations have
necessitated maintaining separate workload-specific data stores, which introduces latency
and complexity and prevents businesses
from capturing the full value of real-time
data. HTAP systems address these issues and
allow enterprises to harvest new business
value such as automating data-driven
decision making, performing real-time
operational analytics, and detecting and
responding to anomalies in real-time.

SIMPLIFIED COMPUTING
ARCHITECTURE
Because of limitations in legacy database
technology, organizations have turned
to in-memory computing tools like data
grids, stream processing engines, and other
distributed computing frameworks to
process data in a real-time window. While
these tools have their uses, they introduce
additional complexity to an organization's computing infrastructure. Additionally, they can be misused when companies don't fully understand their intended use case.
Often they are used to try to compensate for
database latency. However, when possible, it
is preferable to use a more powerful database
rather than separate data processing tools in
order to preserve simplicity.
An HTAP system can dramatically
simplify an organization's data processing
infrastructure. For many companies, an
HTAP-capable database becomes the core
of their data processing infrastructure and
handles most of their day-to-day operational
workload. It serves as a database of record,
but it is also capable of analytics.
There are many advantages to
maintaining a simple computing
infrastructure: increased uptime, reduced
latency, and faster development cycles, to
name a few.
In addition to the generic benefits of
simple infrastructure, HTAP systems in
particular provide some unique benefits:
1. Save development time and prevent
disasters by eliminating the need for

synchronized copies of live operational data.


Synchronizing data is invariably difficult. It
is especially difficult in the situation where
both systems must have current operational
data, and a composite infrastructure has
been chosen specifically because one system
has insufficient throughput. To further
complicate matters, systems like stream
processing engines and data grids may not be
fully fault tolerant, which places additional
importance on the database of record.
An HTAP system eliminates the need for
multiple copies of real-time data.
2. Reduce hardware costs and
administrative overhead by eliminating
unnecessary duplicated data. Maintaining
multiple operational data stores (i.e., not
for disaster recovery) requires paying for
additional hardware and DBAs. HTAP
eliminates the need to manually synchronize
state across separate data stores because
users can run analytics directly in the
database of record. Note that this places the
additional requirement that HTAP systems
be fault-tolerant and ACID compliant.

MEMSQL IS DESIGNED FOR HTAP


HTAP is filling the void where Big Data
promises have fallen short. MemSQL is the
leader in real-time and historical Big Data analytics, otherwise known as HTAP, based on a distributed in-memory database.
MemSQL has a unique combination
of innovative features that allows us to
deliver on the Big Data promises including
in-memory storage, access to real-time
and historical data, code generation and
compiled query execution plans, lock-free data structures and multiversion concurrency control, two-tiered shared-nothing architecture, fault tolerance and
ACID compliance, JSON data type, and
integration. To learn more about MemSQL,
please visit www.memsql.com.
MEMSQL www.memsql.com
industry updates

The State of Data Security and Governance

Big Data Poses Security Risks


By Geoff Keston

Big data refers to the massive amounts of


structured and unstructured data that are
difficult to process using traditional data
management tools and techniques. While
big data can inform enterprise operations, offering business advantages, the present
methods of mining and managing big data
are still evolving and pose serious security
and privacy challenges. Confronting these
challenges is essential if the potential of big
data is to be fully exploited.
According to Gartner, 64% of enterprises have or plan to use big data. Similarly,
NewVantage Partners found, in a 2013 study
that focused heavily on large finance companies, that 68% of respondents had spent at

least $1 million on big data, about twice the


number that gave the same answer in 2012.
The rising popularity of big data, at least
among large enterprises, may force other
companies to keep pace. And even companies that do not choose to employ big data
themselves may use cloud services that do.
Like enterprises with their own big data programs, customers of cloud services will need
to understand the security issues that big data
creates and how they can be mitigated.

Understanding Big Data Security Threats


Big data environments are expansive and
technically complex, characteristics that in
themselves create security problems. The

scope of big data can make it difficult, for


instance, to control and monitor the rights
that users have to access particular files and
resources.
Discussing this problem in Forbes, Davi
Ottenheimer, a senior director at EMC,
explained that with the scale of big data,
problems can easily emerge, but finding the
cause of such problems is difficult. And a
study by the Cloud Security Alliance (CSA)
found that, in addition to their unwieldy size,
these environments have a variety of data
types, and that much of the data is streaming instead of static. Combined, these characteristics render many common security
approaches ineffective.1

sponsored content

EDB Makes Open Call for Postgres NoSQL Performance Benchmark
Marc Linster, Senior Vice President, Products and Services, EnterpriseDB

Postgres was originally architected to


be an object-relational database, designed
specifically to enable extensibility. It
supports objects, classes, custom data types
and methods.
In the early years of the Postgres project,
this was problematic as it slowed down
development cycles because new code had to
be fully integrated so everything would work
with everything else. However, as Postgres
has become more feature-rich over the past
15 years, that original design hurdle has
turned into a unique advantage. The fact
that Postgres is an object-relational database
means new capabilities can be developed and
plugged into the database as needs evolve.
Using this level of extensibility, Postgres
developers have expanded the database to
include new features and capabilities as new
workloads requiring greater flexibility in
the data model emerged. The most relevant
examples in the NoSQL discussion are JSON
and HSTORE. With JSON and HSTORE,
Postgres can support applications that require
a great deal of flexibility in the data model.
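As a small illustration of the schema flexibility being described, the sketch below stores and queries JSONB documents from Python. It assumes a reachable PostgreSQL 9.4+ instance and the psycopg2 driver; the connection settings, table, and fields are placeholders and are not part of EDB's benchmark framework.

```python
import psycopg2
from psycopg2.extras import Json

# Connection parameters are placeholders; adjust for your environment.
conn = psycopg2.connect(host="localhost", dbname="demo", user="demo", password="demo")
cur = conn.cursor()

# A JSONB column lets the document shape vary row to row.
cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, doc jsonb)")
cur.execute("INSERT INTO events (doc) VALUES (%s)",
            [Json({"type": "signup", "user": "jane", "plan": "pro"})])
cur.execute("INSERT INTO events (doc) VALUES (%s)",
            [Json({"type": "click", "user": "jane", "page": "/pricing"})])
conn.commit()

# Query inside the documents with the ->> operator (returns text).
cur.execute("SELECT doc ->> 'type', doc ->> 'page' FROM events WHERE doc ->> 'user' = %s",
            ["jane"])
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```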
EnterpriseDB (EDB) has begun running
comparison tests to help Postgres users better
assess the NoSQL capabilities of Postgres.
The tests compare PostgreSQL (often called
Postgres) with MongoDB as recent advances
have significantly enhanced Postgres' capacity
to support document databases.
We are inviting an open review of our
test results and our framework. We've
made the materials available (see below for
information) for the Postgres and MongoDB
communities to examine the results and
even duplicate our test or develop new
ones. Our goal is to encourage greater
participation in exploring Postgres as its
NoSQL technology continues to expand to
meet evolving enterprise workloads.
In testing, EDB has found that Postgres
outperforms MongoDB in selecting, loading

and inserting complex document data in key


workloads:
Ingestion of high volumes of data was 2.2
times faster with Postgres
MongoDB consumed 35% more disk space
Data inserts took three times longer with
MongoDB than Postgres
Data selection took almost three times as long with MongoDB as with Postgres
The results demonstrate that the
recent JSON enhancements to Postgres for unstructured data perform very well, and
in our initial testing shows performance at
levels better than, or on par with, the more
specialized MongoDB solution. What is
compelling is that these document database
capabilities in Postgres simultaneously
enable developers to take advantage of
Postgres' stability and ACID compliance.
Our customers report that this advantage
eliminates data silos and promotes easier
data governance when compared to the
specialized solution. Postgres provides users
the technology they need to address some
critical new data challenges without taking on
the risks and incremental cost of a NoSQL-only solution, which can introduce a host of
management and data integrity challenges.
What's clear is that enterprises with
evolving unstructured data challenges need
to examine evolving capabilities in Postgres
before introducing the risks and problems
that NoSQL-only solutions bring to data
environments.

DESIGNED FOR EXPANSION


The object-relational foundation of
PostgreSQL means adding new capabilities
is a seamless process and the database is
always expanding to meet the needs of
users, said Jonathan Katz, chief technology
officer of VenueBook and a PostgreSQL
community organizer. JSONB truly reflects

the brilliance of the PostgreSQL community.


We now have a relational database system,
historically known for providing data
integrity and robust functionality, that can
search over non-relational data structures
at speeds that rival, if not surpass, those of
NoSQL database systems.
Recent releases of Postgres have
emphasized JSON capabilities. The
new version due to be released this fall,
PostgreSQL 9.4, will feature JSONB, which
will increase Postgres performance on
document database workloads dramatically.
But building document database capabilities into Postgres is only the newest initiative to enhance NoSQL capabilities in Postgres.
Postgres has, in fact, had the capacity to
support key-value stores since 2006 when
the HStore contrib module was added, years
before some of the leading NoSQL solutions
were released or even developed.

SHARE IN THE EXPLORATION


Many end users are still exploring
the capacity for Postgres to support
unstructured data as their needs evolve.
To that end, EDB has developed and made
available multiple code samples for utilizing
Postgres in NoSQL use cases. EDB has also
developed a white paper that demonstrates
some of the NoSQL capabilities of Postgres
and identifies some of the challenges posed
by introducing NoSQL solutions into the
data environment.
We have made available all of the
materials discussed without registration
requirements, so please visit our Postgres
NoSQL Resources Page (http://info.enterprisedb.com/Postgres-NoSQLBenchmark-Call.html).

ENTERPRISE DB
www.enterprisedb.com
industry updates

The State of Data Security and Governance


The CSA report divides big data


security threats into the following four
categories:
1. Infrastructure Security: Big data
infrastructures are distributed across
many servers and often across multiple networks, so pulling data from
them requires approaches, such as
MapReduce, that are not used in traditional environments. These mapping
technologies are vulnerable to special
types of attacks, such as when hackers
spy on transactions or alter the results of
operations. The data sources themselves
are open to attacks. NoSQL databases
are commonly used in these environments, and they can be targeted by injection attacks in which hackers insert their
own code into a database application.
2. Data Privacy: Concerns about privacy loom over many big data security
discussions. In most cases, a breach
of a big data service will be a privacy
breach. Big data projects typically store
consumer data that most users will
expect to be private. To maintain the
confidentiality of data in such an environment, access control must be managed at a very granular level, which is difficult and takes significant effort.
3. Data Management: The dispersed,
often multi-tiered nature of big data
architectures makes managing data
difficult. In particular, it is difficult to
determine data's provenance, that is,
the source from where it came and the
history of its creation and modifications. But these factors are critical concerns when evaluating the risks posed
by a piece of data and when enforcing
reputation-based security schemes.
Provenance is also an important issue
for complying with regulations such as
PCI and Sarbanes-Oxley.
4. Integrity and Reactive Security: Data that enterprises gather can be
dangerous, possibly because it has
been planted by hackers. This threat
compels organizations to find ways to
validate data. One way is to use real-time analysis, which can alert enterprises to potential problems. Such
analysis relies on algorithms that are
constructed to filter out data that
appears suspicious, either because of
its content or its source.
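A toy version of the kind of rule-based screening described in the fourth category; the trusted sources, suspicious tokens, and records are invented for illustration.

```python
# Invented screening rules: flag incoming records whose source is untrusted or
# whose content looks suspicious, so they can be reviewed before analysis.
TRUSTED_SOURCES = {"crm", "billing", "partner_feed"}
SUSPICIOUS_TOKENS = ("<script", "drop table", "';--")

def is_suspicious(record: dict) -> bool:
    if record.get("source") not in TRUSTED_SOURCES:
        return True
    content = str(record.get("payload", "")).lower()
    return any(token in content for token in SUSPICIOUS_TOKENS)

stream = [
    {"source": "crm", "payload": "address change for cust 881"},
    {"source": "web_form", "payload": "name=<script>alert(1)</script>"},
]
for rec in stream:
    status = "QUARANTINE" if is_suspicious(rec) else "accept"
    print(status, rec["source"])
```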

Apache Hadoop Poses Specific Security Issues
Specific risks emerge from Apache
Hadoop, the most popular big data tool
for dividing datasets across many servers.
Hadoop has been pressed into a broader,
more mission-critical range of uses than
it was originally designed to address. And
the industry has been playing catch up to
add security capabilities that it did not
originally have.
Companies continue to work to
improve Hadoop's security. For instance, in June 2014, cloud services provider Cloudera bought Gazzang, which offers software to secure Hadoop. Cloudera's CEO, Tom Reilly, explains the motivation behind the acquisition as follows:
[C]ompanies that are weighing the value
of putting workloads in public cloud environments against security concerns will
now be able to move forward by putting
in place additional process-based access
controls.2

New Big Data Capabilities Are Changing Security Practices
Early on, big data was used mostly in
specialty fields, notably academic and scientific research. It was too expensive and

complex for most enterprises to use. Most of the tools that were available were developed by experts for experts. They were not
built like commercial software products,
with intuitive interfaces, helpful wizards,
and other features that make using them
easier. Furthermore, each tool performed
only a limited number of functions, so
building a big data program entailed using
several different tools. There was little
guidance available about how to do this,
and doing so often required programming. If the tools were not linked correctly,
security problems could emerge.
But a transition is underway. Developers are adding user-friendly features to big data tools, making them accessible to more IT departments.3 And big data capabilities are being used in a wider range of scenarios. In particular, these capabilities are increasingly found in security information and event management (SIEM) systems as a way to perform identity and access management (IAM). Describing the motivation behind this development, a NetworkWorld article by Jon Oltsik explained that "[s]oftware tools are great at automating and scaling processes but IAM is fraught with complex workflow, multiple identity repositories, and multiple accounts per user. The power of big data analytics enables organizations to manage identities and access rights despite this complexity."
The expansion of big data analytical tools into the realm of typical enterprises will change the way that security is practiced. Data that was once too time-consuming or resource-intensive to analyze will now be inspected in real time, and information from many sources will be correlated in ways that were impossible just a few years ago.4
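As a toy illustration of this kind of multi-source correlation applied to IAM, assume two hypothetical event feeds keyed by user; the Python sketch below merges them and flags users whose activity spans an implausible number of accounts or locations. The feed names, fields, and thresholds are invented for the example.

from collections import defaultdict

# Hypothetical event feeds: (user, account, location) tuples from two separate systems.
vpn_events = [("alice", "alice-admin", "Boston"), ("bob", "bob", "Chicago")]
app_events = [("alice", "alice", "Boston"),
              ("alice", "svc-reporting", "Kyiv"),
              ("bob", "bob", "Chicago")]

def correlate(*feeds):
    # Build one profile per user by merging accounts and locations from every feed.
    profiles = defaultdict(lambda: {"accounts": set(), "locations": set()})
    for feed in feeds:
        for user, account, location in feed:
            profiles[user]["accounts"].add(account)
            profiles[user]["locations"].add(location)
    return profiles

def flag_risky(profiles, max_accounts=2, max_locations=1):
    # Simple thresholds stand in for the behavioral baselines a real system would learn.
    return [user for user, p in profiles.items()
            if len(p["accounts"]) > max_accounts or len(p["locations"]) > max_locations]

print(flag_risky(correlate(vpn_events, app_events)))  # prints ['alice']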
But despite progress, big data technology is still under development. In particular, as described by Arnab Roy of Fujitsu Laboratories of America, people focused on one area of big data security have not extensively collaborated with people investigating other areas. "The absence that I see right now is a systematic, scientifically classified kind of organization of this space, where we can talk about security as a whole," says Roy.5 "Security is a problem of the entire system, and people have been doing it piecemeal. So the practitioners and scientists have to come together and work out holistic strategies."

Decide Whether Big Data Analytics Meet a Real Need
Big data analytics hold great promise, but that promise is not yet easy for a typical enterprise to put into practice.6 Describing this challenge, a member of CSA's Big Data Analytics subgroup, Wilco van Ginkel, noted in a TechRepublic interview that if an enterprise asked him about big data analytics, he would first consider the enterprise's needs.

References
1. Preimesberger, C. "Hadoop Poses a Big Data Security Risk." eWeek. April 2013.
2. Worth, D. "Cloudera Buys Gazzang to Boost Hadoop Big Data Security." V3.co.uk. June 2014.
3. Taylor, B. "How Big Data Is Changing the Security Analytics Landscape." TechRepublic. January 2014.
4. Cardenas, A. A., P. K. Manadhata, and S. P. Rajan. "Big Data Analytics for Security." IEEE Security & Privacy. November-December 2013.
5. Federal News Radio 1500-AM. Interview with Dr. Arnab Roy. Available online at https://cloudsecurityalliance.org/research/big-data.
6. Jordan, J. "The Risks of Big Data for Companies." The Wall Street Journal. October 2013.

This article is based on a comprehensive report published by Faulkner Information Services, a division of Information Today, Inc., that provides a wide range of reports in the IT, telecommunications, and security fields. For more information, visit www.faulkner.com. Copyright 2014, Faulkner Information Services. All Rights Reserved.


"It's not so much 'do we need to jump on the big data analytics bandwagon, yes or no?'" he explains. "The question is: Why should I?" The focus is best placed on the real risks that a particular enterprise faces, said van Ginkel. He argued that many enterprises are tempted to use big data because of the buzz around it, but that the concept is so new that many of them lack the knowledge to do so.
At this stage, van Ginkel sees most companies as just scratching the surface of the technology's potential, working with it in only limited ways. But this is slowly changing. For instance, he explained that some organizations are progressing from analytics that focus on past events to an approach that predicts future events. Overall, enterprises and players in the market are still trying to make sense of how to use big data analytics to meet actual needs, and this process will take time.
The need for specialized knowledge is emphasized by Peter Wood, CEO of security consulting firm First Base Technologies. "People with backgrounds in multivariate statistical analysis, data mining, predictive modeling, natural language processing, content analysis, text analysis, and social network analysis are all in demand," he wrote in ComputerWeekly. These analysts and scientists work with structured and unstructured data to deliver new insights and intelligence to the business. Platform management professionals are also needed to implement Hadoop clusters and to secure, manage, and optimize them.

Geoff Keston is the author of more than 250 articles that help organizations find opportunities in business trends and technology. He also works directly with clients to develop communications strategies that improve processes and customer relationships. Keston has worked as a project manager for a major technology consulting and services company and is a Microsoft Certified Systems Engineer and a Certified Novell Administrator.

industry directory

Attunity is a leading provider of data integration software solutions that make Big Data available where and when needed across heterogeneous enterprise platforms and the cloud. Attunity solutions accelerate mission-critical initiatives including BI/Big Data Analytics, Disaster Recovery, Content Distribution and more. Solutions include data replication, data flow management, test data management, change data capture, data connectivity, enterprise file replication (EFR), managed-file-transfer (MFT), and cloud data delivery. For over 20 years, Attunity has supplied real-time access and availability solutions to thousands of enterprise-class customers worldwide, across the maze of systems making up today's IT environment. Learn more at www.attunity.com.

Attunity
www.attunity.com
See our ad on page 35

CA Technologies helps customers succeed in a future where every business, from apparel to energy, is being rewritten by software. From planning to development to management to security, at CA we create software that fuels transformation for companies in the application economy. With CA software at the center of their IT strategy, organizations can leverage the technology that changes the way we live, from the data center to the mobile device. Our software and solutions help our customers thrive in the new application economy by delivering the means to deploy, monitor, and secure their applications and infrastructure.

CA Technologies
www.ca.com
See our ad on page 7

IBM Cloudant is the world's first globally distributed database-as-a-service (DBaaS) for loading, storing, analyzing, and distributing operational application data for developers of large and/or fast-growing web and mobile applications. Cloudant technology accelerates time-to-market and time-to-innovation because it frees developers from the mechanics of data management so they can focus exclusively on creating great applications. It also offers high availability, elastic scalability, and innovative mobile device synchronization.

Cloudant, an IBM company
Sara Strope
857-206-6018
sbstrope@us.ibm.com
https://cloudant.com
See our ad on Cover 2

CodeFutures Corporation is a provider of AgilData, the first agile platform for real-time Big Data platforms. CodeFutures has been providing innovative high-performance database technologies to leading enterprises, and now with the advent of AgilData has created an entirely new capability for fast, easy access to meaningful data as a means of driving the real-time enterprise. CodeFutures enables new levels of dynamic schema, streaming data processing, predictive analytics and the ability to scale as fast as data volumes increase. Founded by Cory Isaacson (CEO/CTO) and Andy Grove (VP Development) in 2007, with 40+ years of cumulative database experience, CodeFutures has amassed significant expertise and capabilities in the Big Data Scalability arena. Their technologies have been used by some of the world's largest fast-growth social applications, as well as analytics, OLTP and other general-purpose database workloads by customers worldwide.

CodeFutures Corporation
11001 West 120th Avenue, Suite 400
Broomfield, CO 80021
Phone: +1 303 625 4084  Fax: +1 303 460 8228
www.codefutures.com



Continuent is a leading provider of database clustering and real-time data replication, enabling enterprises to run business-critical applications on cost-effective open source software. Continuent Tungsten provides enterprise-class high availability, globally redundant data distribution and real-time heterogeneous data integration. Continuent Tungsten offers real-time data loading from transactional databases (MySQL, Oracle) into data warehouses (Amazon Redshift, Hadoop, Vertica, and Oracle). Continuent customers represent the most innovative and successful organizations in the world, handling billions of transactions daily across a wide range of industries.

Continuent
Follow us on Twitter @Continuent
www.continuent.com

Couchbase provides the world's most complete, most scalable and best performing NoSQL database. Couchbase Server is an industry-leading solution that includes a shared-nothing architecture, a single node-type, a built-in caching layer, true auto-sharding and the world's first NoSQL mobile offering: Couchbase Mobile, a complete NoSQL mobile solution. Couchbase counts many of the world's biggest brands as its customers, including AT&T, Amadeus, Bally's, Beats Music, Cisco, Comcast, Concur, Disney, eBay / PayPal, Neiman Marcus, Orbitz, Rakuten / Viber, Sky, Tesco, and Verizon, as well as hundreds of others.

Couchbase
www.couchbase.com

Embarcadero Technologies, Inc. is the leading provider of data modeling and management software tools that empower data professionals to design, build, and run databases more efficiently in heterogeneous IT environments. ER/Studio is the company's flagship data architecture solution that combines data, business process, and application modeling and reporting in a multi-platform environment. DB PowerStudio is a complete database tool suite with administration, development and performance-tuning capabilities across multiple platforms. With support for numerous relational DBMS platforms and big data platforms, including PostgreSQL, Hadoop Hive, and MongoDB, Embarcadero provides the most comprehensive database tools portfolio for cross-platform environments.

Embarcadero Technologies, Inc.
Phone: (866) 998-3642
www.embarcadero.com
See our ad on Cover 4

EnterpriseDB is the leading worldwide provider of Postgres software and services, enabling enterprises to reduce their reliance on costly proprietary solutions and slash their database spend by 80 percent or more. With powerful performance and security enhancements for PostgreSQL, sophisticated management tools for global deployments and database compatibility for Oracle, EnterpriseDB software supports both mission and non-mission critical enterprise applications. More than 2,500 enterprises, governments and other organizations worldwide use EnterpriseDB software, support, training and professional services to integrate open source software into their existing data infrastructures. For more information, please visit www.enterprisedb.com.

EnterpriseDB
www.enterprisedb.com
See our ad on page 11


Big Data, Small Change.
It's not just the cost, risk, and skills gap surrounding Hadoop, NoSQL, in-memory DBs, and appliances. It's their functionality gaps. Before you shift paradigms, first try proven, cost-effective big data management solutions built on Eclipse that run simultaneously, and multi-threaded, in your current file system.
Integrate. Replicate. Mask. Prototype. Visualize.
Founded in 1978, and renowned for its fast CoSort data transformation software, IRI combines ETL, data migration, protection, test data, and reporting jobs in one I/O. Easily create explicit scripts and flows in Eclipse, and run your jobs anywhere.
From COBOL to CDRs. From Unstructured Files to DBs.
Rake and lake your legacy, dark, and transactional data sources faster. Populate multiple targets and formats at once. Run big DW, federation, compliance, and analytic jobs without big costs or complexity.

IRI, The CoSort Company
1-800-333-SORT
www.iri.com/solutions/big-data
See our ad on Cover 3

Melissa Data offers data quality and enrichment tools that support Big Data insight. Our tools can be used to extract relevant contact information from unstructured data, as well as link and merge duplicate information into a single customer view. With clean, consolidated data, you can then utilize our enrichment solutions to blend in authoritative customer data like demographics and geographics to drive Big Data analytics and reporting. For 30 years Melissa Data has led the way in data quality for contact data management. Our tools work with Pentaho, Talend, SSIS, and other leading Big Data integration tools. Free trials available.

Melissa Data Corp.
800-635-4772
info@melissadata.com
www.melissadata.com

MemSQL is the leader in real-time and historical Big Data analytics based on a distributed in-memory database. MemSQL is purpose-built for instant access to real-time and historical data through a familiar SQL interface and uses a horizontally scalable distributed architecture that runs on commodity hardware. Innovative enterprises use MemSQL to accelerate time-to-value by extracting previously untapped value in their data that results in new revenue.

MemSQL
www.memsql.com
See our ad on page 31

Objectivity, Inc.'s embedded database software helps discover and unlock the hidden value in Big Data for improved real-time intelligence and decision support. Objectivity focuses on storing, managing and searching the connection details between data. Its leading-edge technologies: InfiniteGraph, a unique distributed, scalable graph database, and Objectivity/DB, a distributed and scalable object management database, enable unique search and navigation capabilities across distributed datasets to uncover hidden, valuable relationships within new and existing data for enhanced analytics and facilitate custom distributed data management solutions for some of the most complex and mission-critical systems in operation around the world today.

Objectivity, Inc.
3099 North First Street, Suite 200
San Jose, CA 95134 USA
408-992-7100
info@objectivity.com
www.objectivity.com
See our ad on page 27


Master the leading technologies and techniques with DBTA magazine.
Have questions about what works in the real world? DBTA magazine has your answer! Each issue provides vital information that will help you plan your course forward. Issues include advanced trends analysis and case studies serving the IT and business stakeholders of complex data environments.
Subscribe FREE* today!
dbta.com/Subscribe
*Print edition free to qualified U.S. subscribers.
