BIG DATA
SOURCEBOOK
DECEMBER 2014
EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055
Thomas Hogan Jr., Group Publisher
609-654-6266; thoganjr@infotoday
Norma Neimeister,
Production Manager
Denise M. Erickson,
Senior Graphic Designer
Joseph McKendrick,
Contributing Editor; Joseph@dbta.com
Jackie Crawford,
Ad Trafficking Coordinator
Adam Shepherd,
Editorial and Advertising Assistant
908-795-3705
ADVERTISING
Roger R. Bilboul,
Chairman of the Board
John C. Yersak,
Vice President and CAO
POSTMASTER
Send all address changes to:
Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055
Copyright 2014, Information Today, Inc. All rights reserved.
The Big Data Sourcebook is a resource for IT managers and professionals providing information
on the enterprise and technology issues surrounding the big data phenomenon and the need
to better manage and extract value from large quantities of structured, unstructured and
semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding
range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well
as new capabilities for traditional data management systems. Articles cover business- and
technology-related topics, including business intelligence and advanced analytics, data security
and governance, data integration, data quality and master data management, social media
analytics, and data warehousing.
No part of this magazine may be reproduced by any means (print, electronic, or any other) without written permission of the publisher.
COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use
of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00
per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers,
MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have
been granted a photocopy license by CCC, a separate system of payment has been arranged.
Photocopies for academic use: Persons desiring to make academic course packs with articles
from this journal should contact the Copyright Clearance Center to request authorization
through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same
CCC address as above. Be sure to reference APS.
Creation of derivative works, such as informative abstracts, unless agreed to in writing by the
copyright owner, is forbidden.
Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data
Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by
the contributors and/or authors.
The views in this publication are those of the authors and do not necessarily reflect the views
of Information Today, Inc. (ITI) or the editors.
The rise of big data, cloud, mobility, and the proliferation of connected devices, coupled with newer data management approaches such as Hadoop, NoSQL, and in-memory systems, is increasing the opportunities for enterprises to harness data. However, with this new frontier come challenges to be overcome. As they work to maintain legacy applications and systems, IT organizations must address new demands for more timely access to more data from more users, in addition to maintaining continuous availability of IT systems and enforcing appropriate data governance.
It's a lot to think about. How can companies choose the right approach to leverage big data while keeping newer technologies in line with budgetary, application availability, and security concerns?
Over the past year, Unisphere Research, a division of
Information Today, Inc., has conducted surveys among IT
professionals to gain insight into the challenges organizations are facing.
The information overload is already taking its toll on
IT organizations and professionals. According to a Unisphere Research report, "Governance Moves Big Data From Hype to Confidence," the percentage of organizations with big data projects is expected to triple by the
end of 2015. However, while organizations are investing
in increasing the information at their disposal, they are
finding that they are committing more time to simply
locating the necessary data, as opposed to actually analyzing it. In addition, the report, based on a survey of 304
data management professionals and sponsored by IBM,
found that respondents tend to be less confident about
data gathered through social media and public cloud
applications.
With all this data, there are also concerns about the
ability to maintain the high availability mandated by
today's stringent service level agreements. According to
another Unisphere Research survey sponsored by EMC,
and conducted among 315 members of the Independent Oracle Users Group (IOUG), close to one-fourth of respondents' organizations have SLAs of four nines of availability or greater, meaning that they can have only 52 minutes or less of downtime a year. The survey, "Bringing Continuous Availability to Oracle Environments," found that more than 25% of respondents dealt with more than 8 hours of unplanned downtime during the previous year.
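The four-nines figure is simple arithmetic: a year contains 525,600 minutes, and a 99.99% availability SLA leaves 0.01% of them as a downtime budget. A quick illustrative sketch (ours, not from the survey):

```python
# Allowed annual downtime implied by an availability SLA of N "nines".
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an SLA of `nines` nines (e.g., 4 -> 99.99%)."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    pct = 100 * (1 - 10 ** (-n))
    print(f"{n} nines ({pct:.3f}%): {downtime_minutes_per_year(n):.1f} min/year")
```

Four nines works out to roughly 52.6 minutes a year, matching the "52 minutes or less" cited above; five nines shrinks the budget to about 5 minutes.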
sponsored content
While operational databases provide real-time data access and lightweight analytics,
they must integrate with Apache Hadoop
distributions for predictive analytics,
machine learning, and more. While
operational data feeds big data analytics,
big data analytics feeds operational data. The result is continuous refinement: insights from analyzing the operational data are fed back to improve operational efficiency, creating a big data feedback loop.
Couchbase provides and supports
a Couchbase Server plugin for Apache
Sqoop to stream data to and from Apache
Hadoop distributions. In fact, Cloudera
certified it for Cloudera Enterprise 5. In
addition, Couchbase provides and supports
a Couchbase Server plugin for Elasticsearch.
COUCHBASE
www.couchbase.com
DBTA.COM
industry updates
How Businesses Are Driving Big Data Transformation
By John O'Brien
Hadoop 2 Ushers in
the Next Generation
The significance of Hadoop 2 has
recently started to resonate with companies and enterprise architects. Moving away from its batch-oriented origins,
YARN has clearly positioned the data
operating system as two separate fundamental architecture components.
While the HDFS will continue to evolve as the caretaker of data in the distributed file system architecture, with improved name node high availability and performance
YARN, introduced in
Hadoop 2, completely
changes the paradigm of
data engines and access.
Another discussion that continues
from the early adopters is how a data
node should be configured. Some implementations concerned with truly big data
configure data nodes with 25 front-loading bays and multi-terabyte slower SATA
drives for the highest capacity within
their cluster. Other implementations are
more concerned with performance and
opt for faster SAS drives at lower capacities but balanced with more servers in the
cluster for further increased performance
from parallelism. Some hyper-performance-oriented clusters will even opt for
faster SSD drives in the cluster. This also
leads to discussions regarding multi-core
CPUs and how much memory should
be in a data node. And, there have been
equations for the number of cores related
to the amount of memory and number of
drives for optimal performance of a data
node. We have seen that enterprise infrastructure has leaned more toward fewer
What's Ahead
In 2015, the mainstream adoption of enterprise data strategies and acceptance of the data lake will continue as data management and governance practices provide further clarity. The cautionary tale of 2014, to ensure business outcomes drive big data adoption rather than the hype of previous years, will likewise continue. Hadoop is clearly here to stay, and will inevitably have its well-deserved seat at the enterprise data table, along with other data technologies. Hadoop won't be taking over the world any time soon, and principle-based frameworks (such as our own modern data platform) recognize the evolution of both data technologies and computing price/performance in modern data architecture. Besides the usual maturing and improvements overall and for existing big data tools, we predict some major achievements in big data for 2015 that we're keeping an eye on.
The Apache Spark engine will continue to mature, improve, and gain acceptance in 2015. With this adoption and the
incredible capabilities that it delivers, we
could start to see applications and capabilities beyond our imagination. Keep an eye out for these early case studies as inspiration
for your own needs.
With deepening acceptance and recognition of YARN as the standard for operating Hadoop clusters, open-source projects
and existing vendors will port their products to YARN certification and integration.
This will not only close the gap between existing data technologies and Hadoop clusters; more exciting will be seeing data technologies port over to YARN so that they can operate and improve their own capabilities within Hadoop. New
engines and existing engines running on
YARN in 2015 will further influence and
drive the adoption of Hadoop in enterprise data architecture.
sponsored content
DISTRIBUTED
The world is moving toward distributed architectures. Memory is becoming a commodity; the internet is easily accessible and fairly inexpensive; and with more sources of data creating an increase in information, it is easy to understand how organizations will require multiple, distributed data centers to store it all.
With distributed architectures comes a
need for distributed features such as parallel
ingest or the ability to quickly obtain data
using multiple resources/locations to enable
real-time application access to information
that is being processed. Then there is a
need for distributed task processing, which
helps to move the processes closer to the
locations where data is stored, thus saving
time and improving query performance as a
side effect. Finally, there becomes a need for
distributed query as well. This is the ability
to perform a search of data across different locations.
SCALABLE
The next requirement revolves around
ease of scalability. When working with
distributed architecture, it is inevitable that
companies will need to eventually scale out
their applications across multiple locations
in order to keep up with growing data
demands. Technology that is easily scalable/
adaptable is very important in long-term
success and helps with managing ROI.
FLEXIBLE
Another requirement, due to the many
different types of data being collected, is the
ability to handle multiple data types. If a
technology is too limited in the way it needs
to collect information from structured,
unstructured, and semi-structured sources,
organizations will find it difficult to grow
their solution long-term due to concerns
with data type limitations. On the other
hand, a technology that is able to natively or
alternatively store and access many types of
information from multiple data sources will
be key to enabling long-term competitive
advantage and growth.
COMPLEMENTARY
And finally, there is a need to address
existing and legacy solutions already
implemented at a large scale. Most
enterprises will not be tearing out widely
implemented solutions spanning across
their organization. It is important to require
that any new technologies being assessed
have the ability to complement existing
legacy solutions as well as any potential new
technologies that may add benefit to the
business, its customers and solution/services.
Today's enterprise success depends on the
ability to obtain key information quickly and
OBJECTIVITY, INC.
www.objectivity.com
industry updates
The Enabling Force Behind Digital Enterprises
By Joe McKendrick
What's Ahead
The year 2015 represents new opportunities to expand and enlighten data
management practices and platforms to
meet the needs of the ever-expanding
digital enterprise. To be successful, digital business efforts need to have solid
data management practices underneath.
As enterprises go digital, they will be relying on well-managed and diverse data to
explore and reach new markets.
industry updates
Data Integration Evolves to Support a Bigger Analytic Vision
By Stephen Swoyer
Beyond Description
We're used to thinking of data in terms of the predicates we attach to it. Now as ever, we want and need to access, integrate, and deliver data from traditional structured sources such as OLTP DBMSs, or flat and/or CSV files. Increasingly, however, we're alert to, or we're intrigued by, the value of the information that we believe to be locked into multi-structured or so-called "unstructured" data, too. (Examples of the former include log files and event messages; the latter is usually used as a kitchen-sink category.)
sponsored content
THE CHALLENGES OF MOVING BIG DATA
However, to use Big Data, you must
be able to move it, and the challenges of
moving Big Data are multi-faceted. Out of
the gate, the pipes between data repositories
remain the same size, while the data grows
at an exponential rate. The issue worsens
when traditional tools are used to attempt to
access, process and integrate this data with
other systems. Yet, companies cannot rely on
traditional data warehouses alone.
Thus, companies are increasingly turning to Apache Hadoop, the free, open source, scalable software for distributed computing that handles both structured and unstructured data. The movement toward Hadoop is indicative of something bigger: a new paradigm that's taking over the business world, that of the modern data architecture and the data supply chain that feeds it. The data supply chain
describes a new reality in which businesses
find themselves coordinating multiple data
sources rather than using a single data
warehouse. The data from these sources,
which often varies in content, structure, and
type, has to be integrated with data from
other departments and other target systems
within an enterprise. Big Data is rarely used
en masse. Instead, different types of data tell
different stories, and companies need to be
able to integrate all of these narratives to
inform business decisions.
To learn more, download this Attunity whitepaper: "Hadoop and the Modern Data Supply Chain" (http://bit.ly/HadoopWP).
ATTUNITY
www.attunity.com
What's Ahead
Cloud is a critical context for data integration. One reason for this is that most
providers offer export facilities or publish
APIs that facilitate access to cloud data.
Another reason, as I wrote last year, is that doing DI in the cloud doesn't invalidate (completely or, even, in large part) existing best practices: if you want to run advanced analytics on SaaS data, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. What you do in the former scenario
Stephen Swoyer is a technology writer with more than 16 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He's particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at stephen.swoyer@gmail.com.
industry updates
Turning Data Into Value Using Analytics
By Bart Baesens
which is heavily dependent upon the current state of the macro economy (upturn
versus recession). Hence, to be successful
and create added value, analytical models
should be accompanied by monitoring
and back-testing facilities that can facilitate the decision about when to tweak or
rebuild them.
What's Ahead
To fully leverage the power of big data and analytics, organizations should:
Simultaneously invest both in data
quantity and data quality;
sponsored content
Copyright 2014 CA. All rights reserved. All trademarks,
trade names, service marks and logos referenced herein belong to their respective companies. This document is for your
informational purposes only. CA assumes no responsibility for the accuracy or completeness of the information. To the extent
permitted by applicable law, CA provides this document as is without warranty of any kind, including, without limitation,
any implied warranties of merchantability, fitness for a particular purpose, or noninfringement. In no event will CA be liable
for any loss or damage, direct or indirect, from the use of this document, including, without limitation, lost profits, business
interruption, goodwill or lost data, even if CA is expressly advised in advance of the possibility of such damages.
CA TECHNOLOGIES
For more information, visit
www.ca.com/BigData.
SQL Server or Oracle database. The questions are: Who owns this dataverse? Who
runs this dataverse?
One can easily describe the obvious
developments that are based on the cloud,
but one cannot speculate on the future as
easily as the aforementioned speaker. For
every new cloud technology, there are millions of new users who did not invent that
technology and who did not grow up in a
world in which communication with their
best friend from high school required dialing 10 digits and paying exorbitant fees to
a phone company. These millions have for
their entire lives existed in a world in which
they simply could reach for any device anywhere and access any application at any
time, and communicate in milliseconds
with anyone they wanted to. For these
users, communication came at virtually no
cost and it was often with people that they
had never seen or spoken to before. Take
this a step further and imagine the conversation with the CFO of 2030 when she wants to know why that 75-year-old salesman needs to have a face-to-face meeting with the customer. Will the answer be, "He was born before the internet!"?
What's Ahead
There is no doubt that cloud is here to
stay and will continue to change our lives
and our businesses each and every day.
New classes of cloud-ready devices and
applications will also continue to emerge.
These new applications and devices
will further fuel the data explosion, helping
Sullivan has been with VMware (www.vmware.com) since 2010 and is the product line marketing manager for Business Critical Applications. He is an Oracle Certified Master, co-author of Virtualizing Oracle Database on vSphere, a VMware CTO Ambassador, and a VMware vExpert. In addition, Sullivan was the co-creator of the Oracle Certified Master Practicum in 2002.
Big Data technologies, including Hadoop, NoSQL, and in-memory databases
Solving complex data and application integration challenges
Increasing efficiency through cloud technologies and services
Tools and techniques reshaping the world of business intelligence
New approaches for agile data warehousing
Key strategies for increasing database performance and availability
Content Is King
One of the most important aspects of
content management across social platforms is understanding how consumers
are engaging with content. Social media
tools and platforms are innovating in areas
such as social community content management, enabling businesses to manage
content around how consumers engage in
their own channels. However, some social
media content monitoring tools don't let
you drill down into the conversation. The
next frontier for many of these products
will be mixed media modeling.
Still, significant regulatory issues are associated with harvesting, staging, and hosting social media content, and apply to nearly all data types in regulated industries. Data protection, security, governance, and compliance have entered an
entirely new frontier with in-house and/or
cloud-based management of social data.
Many social media products don't incorporate processes for governance, compliance, and data security.
Socialmetrix www.socialmetrix.com
(Latin America)
SproutSocial http://sproutsocial.com
Topsy www.topsy.com
Visible www.visible.com
Wayin www.wayin.com (BYOD
analysis, curate, and distribute content)
Zuum www.zuumsocial.com
(Facebook, Twitter, YouTube,
Instagram, and Google+)
33Across www.33across.com
Net/Net
Now more than ever, we live in what management advisor Joe Pine coined as the "experience economy," and social media channels deliver more than that one memorable event for the customer that becomes the product (The Experience Economy: Work Is Theater & Every Business a Stage, Pine and Gilmore, 1999).
Data processing has changed, and many of the legacy platforms (including the analytics) of the 1980s to 2000 are challenged to handle the waves of data created by the internet and, most notably, social media channels. The open source community
has emerged to address this challenge
through Hadoop and Apache Hive, along
with a new breed of analytic databases and
social media analytics and platforms. The
bullet list below describes the majority of
business activities enabled by social media
analytics modules and platforms.
Purchase intent
Customer care
Risk management
Competitive intelligence and tracking
Partner monitoring
Category analysis
What's Ahead
Looking ahead, it will be important
for social media analytics tools to advance
into mixed media modeling with the ability to drill down into conversations, incorporate procedures for data security and
governance, and evolve with greater agility
to address changing business needs.
The missing link now in social media
analysis is not the data; it's the lack of
expertise in the form of data scientists.
This presents another widening chasm
between line professionals who are already
frustrated with IT, and IT professionals.
While you can't expect IT to understand and know how to act on consumer
behavior psychology on a 24/7 basis, technology has always been a competitive differentiator in business. Now, social media
analytic tools and platforms have become
a competitive weapon for smart businesses
and organizations.
Peter J. Auditore is currently
the principal researcher at
Asterias Research, a boutique consultancy focused
on information management, traditional and
social analytics, and big
data (www.thedatadog@wordpress.com).
Auditore was a member of SAP's Global
Communications team for 7 years and most
recently head of the SAP Business Influencer Group. He is a veteran of four technology startups: Zona Research (co-founder);
Hummingbird (VP, marketing, Americas);
Survey.com (president); and Exigen Group
(VP, corporate communications).
sponsored content
3. Blend and enrich. Now that you have clean, consolidated data, it's time to blend in authoritative reference data (geographic, demographic, psychographic, and firmographic) to enrich the Big Data stream. For instance, you can't get a precise rooftop latitude/longitude without a valid address. Once you have that, you can append many different types of data to model your customer profile.
The bigger the data and the greater the number of sources, the more crucial it is to use the Big Data Blend to create a "golden version of the truth": a complete and accurate view of the customer. Linking and merging allow businesses to perform accurate customer segmentation and sentiment analysis and give better identification of who's legit and who's not.
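As a small sketch of the blend step, enrichment can be as simple as merging reference attributes keyed on a validated address. The reference table and field names below are hypothetical, invented for illustration:

```python
# Hypothetical reference data keyed on a validated, standardized address.
GEO_REFERENCE = {
    "10 Main St, Medford, NJ": {"lat": 39.90, "lon": -74.82, "county": "Burlington"},
}

def enrich(customer: dict) -> dict:
    """Merge reference attributes into a customer profile when the address matches."""
    ref = GEO_REFERENCE.get(customer.get("address"), {})
    return {**customer, **ref}

profile = enrich({"name": "A. Buyer", "address": "10 Main St, Medford, NJ"})
# profile now carries lat/lon and county alongside the original fields
```

A record whose address fails validation simply passes through unenriched, which is why the earlier cleansing steps come first.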
Big Data can be many things or it can be
a bunch of nothing if fed into the machine
to become part of the noise. Follow a three-step process that incorporates the principles
of Big Data Quality and Big Data Blending
that results in the sweet sound of success
and extraordinary customer insight.
MELISSA DATA CORP.
www.melissadata.com
Potential Pitfalls
Implementing data quality processes
within the context of big data projects
presents many potential pitfalls. The first
emerges when a big data project is ready
to go live. At that point, the project team
should have defined what constitutes
"good data," and those rules should be rigorously applied. Since big data almost
by definition involves incorporating data
from many different sources, if the proper
data quality standards are not in place
prior to the initial load, data quality problems will surely emerge over time. Indeed,
frequently the lead-up to the initial load
may be the only time the entire project team, comprising both IT and business stakeholders, can cooperatively define what constitutes "good data."
In general, the quality of data can be
assessed using standard quality metrics
such as completeness, consistency, and so on. But for big data projects, an additional measure should be added: relevancy. Just because data is available does not mean it needs to be captured or used, though with big data the temptation is to do just that.
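As a rough illustration of how such metrics can be applied, the sketch below scores records for completeness and restricts profiling to fields deemed relevant. The field names and sample records are ours, not drawn from any particular project:

```python
# Illustrative data quality profiling: completeness plus a relevancy filter.
# Records and field names are hypothetical.
records = [
    {"customer_id": "C1", "email": "a@example.com", "fax": None},
    {"customer_id": "C2", "email": None,            "fax": None},
]

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

# Relevancy: profile only fields a downstream consumer actually uses,
# even though "fax" is available in the source.
RELEVANT_FIELDS = {"customer_id", "email"}
profiled = {f: completeness(records, f) for f in sorted(RELEVANT_FIELDS)}
print(profiled)  # {'customer_id': 1.0, 'email': 0.5}
```

The point of the relevancy filter is exactly the one made above: available data is not automatically data worth capturing and measuring.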
The second pitfall lies in managing the
sources of the data. In most big data projects, data flows in from a range of applications, both internal and external, and
the enterprise often does not have control
of all the sources. One approach to this
problem is creating a so-called "data firewall." Not unlike an internet firewall, a data
firewall applies the data quality rules and
standards developed for the initial data
load to all the data coming into the system.
Some enterprises also reach out to external
data providers to encourage them to provide data at the necessary quality.
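A data firewall of this kind can be as simple as a set of predicate rules, developed for the initial load, applied to every inbound record. The sketch below is a minimal illustration under assumed rule names and fields, not a reference implementation:

```python
# Minimal "data firewall" sketch: each rule is a named predicate; records
# failing any rule are flagged with the rules they broke. Rules are hypothetical.
import re

RULES = [
    ("has_id", lambda r: bool(r.get("customer_id"))),
    ("valid_email", lambda r: r.get("email") is None
                    or re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", r["email"])),
]

def firewall(record):
    """Return (accepted, failed_rule_names) for one inbound record."""
    failures = [name for name, check in RULES if not check(record)]
    return (not failures, failures)

accepted, why = firewall({"customer_id": "C7", "email": "bad-address"})
# accepted is False; why == ["valid_email"] -> route record to quarantine
```

In practice the same rule set would be applied at the initial load and at every external feed, so quality standards do not drift between sources the enterprise controls and those it does not.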
The final challenge and pitfall for data
quality in a big data environment is data
maintenance. In too many cases, once data
has been loaded, the rules and standards
for data are no longer updated. But not
infrequently, as companies merge with or
acquire other companies, they will have to
bulk-load data again.
In addition to preserving the initial data quality structure, companies must implement
What's Ahead
While companies are just starting to
adjust their data quality programs to meet
the demands of big data, the next data
shock to the system is already apparent: the Internet of Things. To return to the earlier metaphor: if the mainframe and
mini-computing were the earth, personal
computing a solar system, and the internet
and social media a galaxy, the Internet of
Things will be like a new universe of data.
It will be so vast it is hard even to imagine
its limits. And it will always be expanding.
Research Center, the next decade will witness an explosion of machines communicating with other machines, wearable
technologies and the widespread embedding of sensors in virtually everything.
The unprecedented explosion of data will
be accompanied by a demand to analyze
this data to gain insight and advantage.
The ultimate goal of Fitbit is to improve
health. Data generated by road sensors, for
example, could be used to manage traffic
flows. And sensors attached to inventory
can form the basis to restructure purchasing decisions.
For those goals to be realized, new ways
of aggregating data and ensuring its quality will have to be developed. Data lakes
are one of the new approaches to assembling huge amounts of data. The idea is
to gather data together in their raw formats without going through an extract,
transform, and load process. The raw data
would be called on and transformed as
needed. But as analysts from Gartner have
pointed out, the data lake approach poses
significant problems for data quality. By
definition, data lakes accept data without
oversight or governance, two central elements of a sound data quality approach.
Ensuring data quality has always been
a struggle. Under the best conditions,
data decays, errors are introduced, and, in
general, data quality degrades over time.
However, the new waves of data growth
represented by big data and the Internet
of Things promise to make it even more
challenging to ensure data quality. Ultimately, additional strategies, tactics, and
techniques will have to be developed to
enhance the existing approaches.
on IT for 30 years. He is
the chair of the Department of Communication
at Loyola University Maryland, where he is a
founder of an M.A. program in Emerging Media. He has written
six books and hundreds of articles about
new technologies. Follow him on Twitter
@joyofjournalism. He blogs at emergingmedia360.org.
industry updates
BUILDING THE UNSTRUCTURED BIG DATA/DATA WAREHOUSE INTERFACE
By W. H. Inmon
If any two records happen to have similar structure and/or content in nonrepetitive data, it is
an accident. Some examples of nonrepetitive
data in big data include email, corporate call
center records, corporate contracts, warranty
claims, insurance claims, medical records,
and so forth.
There are (at least!) two good reasons for
making this fundamental division of data in
big data. The first reason is that the technology required to process the different types of
big data is very different. Repetitive data can
be read and treated in a very simple manner.
You can load, read, and write repetitive data
in big data very simply. But nonrepetitive
data in big data requires the technology of
textual disambiguation in order to be transformed into a usable state.
Textual disambiguation is the technology
that reads nonrepetitive data and turns it
into structured data that can be managed in
a simple manner. Textual disambiguation of
nonrepetitive data is as fundamental to doing
business analysis on nonrepetitive data as the
basic storage method of data, Hadoop.
A second reason why there is such a distinction between the two types of data found
in big data is that when it comes to business value, the vast preponderance of business value is in nonrepetitive data. Stated
differently, while there may be a lot of
repetitive data in big data, there is limited
business value there. Far and away the preponderance of data that contains business
value is nonrepetitive data.
Because of these two large differences,
it is necessary to treat big data as if it were
two different kinds of data. There are very
divergent paths of processing that apply to
repetitive data and nonrepetitive data.
Nonrepetitive Data
and Textual Disambiguation
Nonrepetitive data is read by textual disambiguation (also known as textual ETL)
and is prepared for analytical processing. A
simplistic perspective of textual disambiguation is that textual disambiguation parses
the nonrepetitive data into a state where the
data can be easily analyzed. However, thinking
functionality misses the point about what
textual disambiguation really does because
textual disambiguation does much more
than parse nonrepetitive data.
Textual disambiguation does such
important activities as standardize data,
edit data, transform homographs, resolve
acronyms, correct spelling errors, and
so forth.
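A toy example makes the flavor of these activities concrete. The dictionaries below (acronyms, spelling corrections, standard terms) are hypothetical stand-ins; real textual disambiguation must also infer context, which is far harder than table lookups:

```python
# Toy sketch of a few textual-ETL steps Inmon lists: correcting spellings,
# resolving acronyms, and standardizing terminology. All mappings are invented.
ACRONYMS = {"ha": "heart attack", "bp": "blood pressure"}
SPELLING = {"pateint": "patient"}
STANDARD = {"myocardial infarction": "heart attack"}

def disambiguate(text: str) -> str:
    """Normalize raw text toward a structured, analyzable vocabulary."""
    words = [SPELLING.get(w, w) for w in text.lower().split()]   # fix spellings
    words = [ACRONYMS.get(w, w) for w in words]                  # expand acronyms
    out = " ".join(words)
    for phrase, std in STANDARD.items():                         # standardize terms
        out = out.replace(phrase, std)
    return out

print(disambiguate("Pateint presented with HA"))
# -> "patient presented with heart attack"
```

Even this toy shows why context matters: "ha" should expand to "heart attack" in a medical record but not in a chat transcript, and resolving that ambiguity is the hard part of the technology.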
Textual disambiguation accomplishes
many things, but the single most important thing done by textual disambiguation
is that textual disambiguation identifies
and establishes context for the nonrepetitive data. Without context, nonrepetitive
data is almost useless.
In fact, textual disambiguation does
much more than establish context for
nonrepetitive data. But identifying and
establishing context is the most important
thing done by textual disambiguation.
Analytical Processing
It is a simple and time-tested proposition for SQL to access the two types of data together, allowing the analyst to do query and analytical processing against structured data and unstructured data at the same time, in the same query.
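As a minimal illustration: once textual ETL has landed nonrepetitive data (say, call center sentiment) in a table, ordinary SQL can join it to structured sales data in one query. The schema and rows here are invented, and sqlite3 is used only for brevity:

```python
# Joining structured data (sales) with textual-ETL output (feedback) in one query.
# Tables and rows are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales    (customer TEXT, amount REAL);
    CREATE TABLE feedback (customer TEXT, sentiment TEXT);  -- from textual ETL
    INSERT INTO sales    VALUES ('C1', 120.0), ('C2', 40.0);
    INSERT INTO feedback VALUES ('C1', 'negative'), ('C2', 'positive');
""")
rows = db.execute("""
    SELECT s.customer, s.amount, f.sentiment
    FROM sales s JOIN feedback f ON s.customer = f.customer
    WHERE f.sentiment = 'negative'
""").fetchall()
print(rows)  # [('C1', 120.0, 'negative')]
```

The query answers exactly the kind of question posed below: which high-value customers are also unhappy customers.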
What's Ahead
The possibilities for analytical processing are endless. Now, the analyst can look
at data in a perspective that before has not
been possible. Now, the analyst can do
such things as:
Analyze customer sales data along
with customer feedback data
W. H. Inmon, the father of data warehousing, has written 52 books published in nine languages. His latest book is Data Architecture: A Primer for the Data Scientist (Elsevier Press). Inmon speaks at conferences regularly. His latest adventure is the building of Textual ETL (textual disambiguation), technology that reads raw text and allows it to be analyzed. Textual disambiguation is used to create business value from big data. Inmon was named by ComputerWorld as one of the 10 most influential people in the history of the computer profession. He lives in Castle Rock, Colo., where he also founded Forest Rim Technology (www.forestrimtech.com).
sponsored content
The Modern
Database Landscape
SIMPLIFIED COMPUTING
ARCHITECTURE
Because of limitations in legacy database technology, organizations have turned to in-memory computing tools such as data grids, stream processing engines, and other distributed computing frameworks to process data in a real-time window. While these tools have their uses, they introduce additional complexity to an organization's computing infrastructure. They can also be misused when companies don't fully understand their intended use case; often they are deployed to compensate for database latency. When possible, however, it is preferable to use a more powerful database rather than separate data processing tools, in order to preserve simplicity.
An HTAP system can dramatically simplify an organization's data processing infrastructure. For many companies, an HTAP-capable database becomes the core of their data processing infrastructure and handles most of their day-to-day operational workload. It serves as the database of record, but it is also capable of analytics.
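A minimal sketch of the HTAP idea, with sqlite3 standing in for an HTAP-capable database purely for illustration (the table and data are invented): the same database of record accepts transactional writes and immediately answers an analytical query, with no separate processing tier in between.

```python
import sqlite3

# One database of record for both workloads.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Operational workload: transactional inserts (committed as one transaction).
with db:
    db.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                   [("east", 10.0), ("east", 15.0), ("west", 7.5)])

# Analytical workload against the same, current data: no copy, no grid,
# no stream engine sitting between the writes and the analysis.
totals = dict(db.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))
```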
There are many advantages to
maintaining a simple computing
infrastructure: increased uptime, reduced
latency, and faster development cycles, to
name a few.
In addition to the generic benefits of
simple infrastructure, HTAP systems in
particular provide some unique benefits:
1. Save development time and prevent
disasters by eliminating the need for
Data Summit 2015
New York Hilton Midtown
May 12–13, 2015
Preconference Workshops: Monday, May 11
dbta.com/DataSummit
Connect: #BigDataNY
This article is based on a comprehensive report published by Faulkner Information Services, a division of
Information Today, Inc., that provides a wide range of reports in the IT, telecommunications, and security fields.
For more information, visit www.faulkner.com. Copyright 2014, Faulkner Information Services. All Rights Reserved
industry directory
Attunity
www.attunity.com
see our ad on PAGE 35

CA Technologies
www.ca.com
see our ad on PAGE 7

CodeFutures Corporation
www.codefutures.com

Cloudant, an IBM company
IBM CLOUDANT is the world's first globally distributed database-as-a-service (DBaaS) for loading, storing, analyzing, and distributing operational application data for developers of large and/or fast-growing web and mobile applications. Cloudant technology accelerates time-to-market and time-to-innovation because it frees developers from the mechanics of data management so they can focus exclusively on creating great applications. It also offers high availability, elastic scalability, and innovative mobile device synchronization.
Sara Strope
857-206-6018
sbstrope@us.ibm.com
https://cloudant.com
see our ad on COVER 2
Continuent
www.continuent.com
Follow us on Twitter @Continuent

Couchbase
www.couchbase.com
see our ad on COVER 4
Embarcadero
www.embarcadero.com

EnterpriseDB
www.enterprisedb.com
see our ad on PAGE 11
Melissa Data
Melissa Data offers data quality and enrichment tools that support Big Data insight. Our tools can be used to extract relevant contact information from unstructured data, as well as link and merge duplicate information into a single customer view. With clean, consolidated data, you can then utilize our enrichment solutions to blend in authoritative customer data like demographics and geographics to drive Big Data analytics and reporting. For 30 years, Melissa Data has led the way in data quality for contact data management. Our tools work with Pentaho, Talend, SSIS, and other leading Big Data integration tools. Free trials available.
info@melissadata.com
www.melissadata.com
see our ad on COVER 3

IRI
1-800-333-SORT
www.iri.com/solutions/big-data
Objectivity, Inc.
3099 North First Street, Suite 200
info@objectivity.com
www.objectivity.com
see our ad on PAGE 27

MemSQL
www.memsql.com
see our ad on PAGE 31