Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast track your journey to embrace Big Data and capitalize on the fourth V: Value.
Awakening. Data integration tasks are mostly performed using custom coded approaches, often using SQL to
transform and integrate data inside a database.
Advancing. Organizations realize the value of data and start standardizing data integration processes on a common platform such as Informatica or DataStage, leading to greater efficiencies and economies of scale.
Plateauing. Initial successes with an enterprise data warehouse spark the need for more insights. However,
increasing data volumes and changing business requirements push the limits of traditional data integration and
data warehousing architectures. Stopgap measures trigger a transition from ETL (Extract, Transform, Load) to
ELT (Extract, Load, Transform), shifting heavy data transformation workloads into the enterprise data warehouse.
The IT backlog grows despite standards and best practices. Initial success is replaced by unsustainable costs
and user frustration.
Dynamic. Organizations start to look for alternative solutions to meet these challenges in less time, with less
effort, and at lower cost. They experiment with Big Data frameworks like Hadoop to address architectural
limitations of traditional platforms and look for ways to leverage the accumulated expertise within their
organizations.
Evolved. Companies at this stage are scaling Hadoop across the entire enterprise, using it as an integral
component of their production data management infrastructure. Big Data platforms become a new standard
within these organizations, augmenting traditional architectures at significantly lower costs.
The rest of this paper examines the Big Data Continuum in more detail and provides specific
readiness strategies to help your organization address the challenges and opportunities
of each stage.
Low Productivity: Developing, maintaining, and extending custom software code is a productivity drain and quickly becomes unsustainable. It is particularly challenging to tune, maintain, and extend existing code when the original developers are no longer in the same roles or have left the company. Custom code also makes it difficult to perform impact analysis or trace data lineage to understand dependencies and data flows.
Poor Performance: SQL was not designed for ETL processing. Instead, it is a special-purpose programming
language designed for querying and managing data stored in relational databases. Using SQL for ETL
tasks is inefficient, creating performance bottlenecks and jeopardizing service level agreements (SLAs) for
ETL processing windows.
High Cost: Pushing intensive data transformations down to the database steals expensive database cycles from the tasks for which the database was intended, resulting in added infrastructure costs and jeopardizing performance SLAs for processing database queries.
All of these issues can make it difficult for organizations to extract information and deliver business value from data, especially as data-driven information and decision making become a self-reinforcing cycle, creating demand for even more data-driven information. Custom coding will often solve problems at the outset, but as the need for more and faster information grows, these approaches simply can't keep pace with the demands of the business.
READINESS STRATEGIES
Migrate SQL scripts to a high-performance ETL tool. ETL tools have
become the de facto solution to SQL scripting, maintenance and
performance issues. When choosing an ETL tool, beware of complex
engines and code-generators that push SQL down to the database.
Analyze and document complex code and SQL scripts used in data
integration processes and create graphical flow charts to depict SQL logic.
Identify the top 20%. Typically, 20% of SQL scripts consume up to 80% of the time and cost, due to hardware, tuning, and maintenance. Usual suspects include SQL with merge/upsert, joins, materialized views, cursors, and union operations.
Migrate SQL scripts using the 80/20 rule. When planning and evaluating
the benefits of SQL migration, it is important to realize that a complete
migration of all SQL code is not necessary to achieve significant benefits.
Instead, focus on the top 20% to deliver quick results and significant
savings.
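One way to find that top 20% is to scan your SQL scripts for the usual suspects listed above. Below is a minimal Python sketch, assuming the scripts sit in a single directory; the patterns and weights are rough heuristics for illustration, not a real cost model.

```python
import re
from pathlib import Path

# Constructs that typically dominate ETL cost (illustrative weights only)
COSTLY_PATTERNS = {
    r"\bMERGE\b": 3,                 # merge/upsert logic
    r"\bJOIN\b": 2,                  # multi-table joins
    r"\bMATERIALIZED\s+VIEW\b": 3,   # materialized views
    r"\bCURSOR\b": 3,                # row-by-row processing
    r"\bUNION\b": 2,                 # union operations
}

def score_script(sql_text: str) -> int:
    """Return a rough 'cost' score based on how often costly constructs appear."""
    score = 0
    for pattern, weight in COSTLY_PATTERNS.items():
        score += weight * len(re.findall(pattern, sql_text, re.IGNORECASE))
    return score

def rank_scripts(directory: str):
    """Rank all .sql files so the top 20% can be targeted for migration first."""
    scripts = sorted(
        ((p.name, score_script(p.read_text(errors="ignore")))
         for p in Path(directory).glob("*.sql")),
        key=lambda item: item[1],
        reverse=True,
    )
    top_n = max(1, len(scripts) // 5)  # the 80/20 rule: focus on the top 20%
    return scripts[:top_n]

if __name__ == "__main__":
    for name, score in rank_scripts("./sql_scripts"):
        print(f"{score:4d}  {name}")
```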
[Sidebar: migrate SQL scripts to a high-performance ETL tool, starting with very complex scripts, including those that are unstable or error-prone.]
More Data. The number and type of data sources users need to leverage increases, often including
dissimilar data in different formats (e.g. text, mainframe, web logs, and CRM).
More End Users. The range of end users that must be satisfied increases, including executives, managers,
and field and operations staff, for example.
More Queries. As the number and roles of end users grow, so do the number, variety, and complexity of
queries that must be performed on the data.
Companies at this stage come to realize that continuing to use point solutions and hand-coded approaches will
hold them back. As a result, they will begin to evaluate, adopt and standardize on ETL tools and data integration
platforms. In addition to investments in IT infrastructure, organizations start to develop and enforce best practices,
and accumulate technical expertise that can prove critical to progress along the Big Data Continuum.
When surveyed, more organizations identified their data integration readiness at these first two stages of the Big
Data Continuum than at any of the others.
READINESS STRATEGIES
Beware of code-generators and push-down optimizations. Some organizations have adopted tools that generate SQL or offer so-called push-down optimizations as a means to achieve faster performance at scale. Unfortunately, most of these tools, including Talend and Informatica, require significant skills and ongoing manual tuning to achieve and sustain acceptable performance, creating challenges similar to hand coding and maintaining SQL-based data integration logic.
Improve staff productivity. Select an ETL tool with Windows-based paradigms that don't require a long learning curve or specialized skills. Data integration tools should allow users to focus on business rules and workflows rather than complex tuning parameters to achieve and maintain high performance. Look for ease of use as well as ease of re-use, with impact analysis and data lineage capabilities that make it easy to revise and extend existing applications as business requirements change.
Choose a tool that maximizes run-time performance and efficiency. A tool that delivers superior run-time processing performance and efficiency will maximize resource utilization, minimize costs, and provide superior throughput. Look for a solution that performs all transformation processing outside of the database (see the sketch following this list), minimizing performance bottlenecks and inefficient utilization of expensive database resources. Doing so can keep costs under control and allow you to build a solid foundation for the future, avoiding potential issues often encountered in the subsequent stages.
Leverage all your data. Having the right data source and target connectivity is
critical for leveraging all your data, to help make the best business decisions and
discover new business opportunities.
Establish a Big Data Center of Excellence (COE). A center of excellence is key to developing and retaining Big Data expertise within the organization. The COE should also set and enforce standards for the data management architecture, define the strategic roadmap, establish best practices, and provide training and support to the organization.
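This is the sketch referenced in the run-time performance item above: a minimal illustration of performing the transformation outside the database, using Python's built-in sqlite3 module as a stand-in for a real warehouse connection. The sales table and its columns are hypothetical.

```python
import sqlite3
from collections import defaultdict

def aggregate_outside_db(conn):
    """Aggregate in the ETL layer instead of pushing a heavy GROUP BY to the database."""
    totals = defaultdict(float)
    # The database does what it is good at (a simple scan), while the
    # expensive transformation runs in the ETL process, off the database server.
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    return dict(totals)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse connection
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("east", 100.0), ("west", 250.0), ("east", 50.0)])
    print(aggregate_outside_db(conn))  # {'east': 150.0, 'west': 250.0}
```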
Comprehensive connectivity to leverage all your data. The high-performance ETL solution provides out-of-the-box connectivity to relational sources, flat files, mainframes, Hadoop, and everything in between.
Over time, increasing demands for information often prove to be too much for traditional architectures to handle. As data volumes grow and business users demand fresher data, popular data integration tools such as Informatica and DataStage force organizations to push data transformations down to the enterprise data warehouse, effectively causing a transition from ETL to ELT. Unfortunately, SQL is almost never the best approach for data integration tasks. Relational database management systems (RDBMS) were specifically designed to solve problems that involve a big question with a small answer (i.e., user queries). However, when dealing with data transformations, the "T" in ETL, the answer is generally as big as, if not bigger than, the question.
Moreover, organizations can face unacceptable bottlenecks and delays, not only for data transformations but also for user queries. The RDBMS is optimized for query performance, not for the big data movements that ETL involves.
READINESS STRATEGIES
Offload transformations from the data warehouse. Inefficient and
underperforming ETL tools have forced many IT developers to push
transformations down to the database, adding complexity and requiring massive
investments in additional database capacity. This approach will actually move
you backward along the Big Data Continuum, increasing database costs and
the effort to maintain and tune scripts. Look for approaches that shift intensive
transformations out of the database.
Leverage acceleration technologies to extend your existing data integration infrastructure. Most organizations have spent considerable time and money building their existing data integration infrastructure, so "rip and replace" approaches aren't practical. Rather than buying extra hardware and database capacity, you can identify where the bottlenecks occur and bring in specialized data integration technology to accelerate these processes. For example, technology now exists that can efficiently handle sorts, merges, and aggregations, and that integrates seamlessly with your existing architecture. Accelerating technologies increase an organization's Big Data readiness by removing performance bottlenecks while allowing it to leverage its existing architecture. These plug-and-play technologies typically result in significant savings that can be used to fund initiatives to move into the Dynamic stage.
Start with the top 20% of data transformations. Usually 20% of the transformations incur 80% of the processing problems. Offloading and accelerating these transformations will provide the best bang for the buck (see the sketch following this list).
Consider using Hadoop to offload all ETL processes from the data warehouse.
Hadoop is emerging as the de facto operating system for Big Data. Thanks to its
massively scalable and fault-tolerant architecture, Hadoop can be much more
effective from a performance and cost perspective than the data warehouse in
processing ETL workloads. In addition, shifting ETL workloads to Hadoop
can free up valuable database capacity to accelerate user queries.
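Following the "top 20%" rule above, a small amount of instrumentation can rank ETL steps by their share of total runtime and reveal the best offload candidates. A minimal Python sketch; the step names and sleeps are stand-ins for real transformations.

```python
import time
from contextlib import contextmanager

durations = {}

@contextmanager
def timed(step_name):
    """Record the wall-clock time of each ETL step so hotspots can be ranked."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[step_name] = durations.get(step_name, 0.0) + time.perf_counter() - start

def report_hotspots():
    """Print steps by share of total runtime; the top few are offload candidates."""
    total = sum(durations.values()) or 1.0
    for step, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{step:30s} {secs:8.2f}s  {100 * secs / total:5.1f}%")

# Example usage (step names are illustrative):
with timed("customer_merge"):
    time.sleep(0.2)   # stand-in for a heavy merge/upsert
with timed("weblog_aggregation"):
    time.sleep(0.1)   # stand-in for an aggregation
report_hotspots()
```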
Syncsort DMX-h offers high-performance data integration software with everything you need to deploy enterprise-grade ETL capabilities on Hadoop. DMX-h offers a unique approach to Hadoop ETL that lowers the barriers to adoption, helping your organization unleash the full potential of Hadoop. Thanks to a library of Use Case Accelerators, it's easy for organizations to get started with Hadoop by implementing common ETL tasks such as joins, change data capture (CDC), web log aggregations, mainframe data access, and more.
Hadoop is helping organizations in all industries gain greater insights, processing more data in less time and at a lower cost. According to organizations surveyed, the top benefits from their use of Hadoop are finding previously undiscovered insights and reducing overall data costs.
Two of the most common approaches include data warehouse optimization and mainframe offload. By shifting transformations, the "T" in ETL, out of the data warehouse and into Hadoop, organizations can quickly realize significant value, including freeing warehouse capacity for mission-critical applications.
It is important to recognize, however, that Hadoop is not a complete ETL solution. Hadoop is an operating system
that provides the underlying services to create Big Data applications. While it offers powerful utilities and massive
horizontal scalability, it does not provide the full set of capabilities that users need to deliver enterprise ETL
applications and functionality. If not addressed correctly, the gaps between the operating-level services that
Hadoop offers and the functionality that enterprise-grade ETL requires can slow Hadoop adoption and frustrate
organizations eager to deliver results, jeopardizing subsequent investments.
MapReduce spreads out processing tasks across large numbers of nodes by handling the complicated aspects of creating, managing, and executing a set of parallel processes over a cluster of low-cost computers.
ETL, the process of collecting, processing, and distributing data, has emerged as one of the most common use cases for Hadoop.3 In fact, industry analyst Gartner predicts that most organizations will adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse.4
Use of Hadoop can become a game changer for organizations, dramatically
improving the cost structure for gaining new insights, for analyzing larger data sets
and new data types, and for quickly and flexibly bringing new services to market.
[Figure: MapReduce data flow. On each node: Input Formatter → MAP → Optional Partitioner → SORT → Local Disk → Optional Combiner → SORT; then REDUCE → Output Formatter → HDFS.]
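To make the figure's flow concrete, here is a minimal Hadoop Streaming sketch in Python that counts hits per URL in web logs: the mapper emits URL/count pairs, Hadoop sorts and shuffles them, and the reducer aggregates. The log layout, script name, and launch command are illustrative assumptions, not a prescription.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming sketch: count hits per URL in web logs.
# Illustrative launch (flags vary by distribution):
#   hadoop jar hadoop-streaming.jar -files loghits.py \
#     -mapper "loghits.py map" -reducer "loghits.py reduce" \
#     -input /logs -output /hits
import sys

def mapper():
    # Assumes the requested URL is the 7th whitespace-separated field,
    # as in common/combined log format; adjust for your own logs.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print(f"{fields[6]}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal URLs arrive adjacent.
    current_url, count = None, 0
    for line in sys.stdin:
        url, _, value = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                print(f"{current_url}\t{count}")
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```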
3 http://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/
4 Mark A. Beyer and Ted Friedman. Big Data Adoption in the Logical Data Warehouse. Gartner Research, February 2013.
READINESS STRATEGIES
During experimentation and early stages of Hadoop, the main objective is to prove the
value that Hadoop can bring to organizations by augmenting or extending existing data
integration and data warehouse architectures. Therefore, data connectivity and quick
development of common ETL use cases are critical for organizations at the Dynamic
stage. Connectivity to the right data sources can maximize the value of the framework
and avoid having Hadoop become yet another silo within the enterprise. In addition,
quickly ramping productivity with Hadoop allows IT to deliver quantifiable successes that
pave the way for more widespread adoption. Success at this stage enables companies
to move to the Evolved stage, where Hadoop becomes an integral component of the
production data management architecture.
Select a tool with a wide variety of connectors to source and target systems.
Simplify importing data from various sources into Hadoop, as well as exporting
data from Hadoop to other systems.
Leverage mainframe data. Mainframe data can be the critical reference point for
new data sources, such as web logs and sensor data. Therefore, make sure the
tool provides connectivity and data translation capabilities for the mainframe.
Ensure the tool offers a comprehensive library of pre-built, out-of-the-box data transformations. The most common data flows include joins, aggregations, and change data capture (a CDC sketch follows this list). Reusable templates can accelerate development of prototype applications and proof of value.
Avoid tools that generate code. These tools will burden your organization with
heavy tuning and maintenance.
Test and break your system. As you build your proof-of-concept, stress testing
your system will help you assess the reliability of your implementation and will
teach your staff critical skills to maintain and support it down the road.
Identify and prioritize use cases. Identify one (or a small number of) proof-of-concept use cases for Hadoop. Candidate use cases often involve recurring ETL processes that place a heavy burden on the existing data warehouse.
Smarter connectivity to all your data. With DMX-h, you only need one tool to connect all sources and targets to Hadoop, including relational databases, appliances, files, XML, and even cloud. No coding or scripting is needed. DMX-h can also be used to pre-process data (cleanse, sort, partition, and compress) prior to loading it into Hadoop, resulting in enhanced performance and significant storage savings.
While most organizations at this stage are not looking to replace their existing data warehousing infrastructure with Hadoop, ETL is a different story. Hadoop is poised to completely change the way organizations collect, process, and distribute their data. ETL is shifting to Hadoop ETL, and Big Data is becoming the new standard architecture, providing greater value to the organization at a cost structure that is radically lower than traditional architectures. That's why the ability to cost-effectively utilize Big Data is quickly becoming a requirement for companies to survive.
For example, an organization can store aggregated web log data in their relational database, while keeping the complete raw data in Hadoop.
As organizations begin to standardize on Hadoop as the new Big Data platform, they must keep hardware and resource costs under control. Although Hadoop leverages commodity hardware, the total cost for system resources can still be significant. When dealing with large numbers of nodes, hardware costs add up. Programming resources (e.g., HiveQL, Pig, Java, MapReduce) can also prove expensive. Using Hadoop for ETL processing requires specialized and expensive developers who can be hard to find and hire. For example, the Wall Street Journal recently reported that a Hadoop programmer can now earn as much as $300,000 per year.
Today, the reality is that very few organizations have reached the Evolved stage. Less than 2% of organizations surveyed are using Hadoop as an integral component of their data management platform. But many organizations are working towards this goal, and almost 11% expect to be at this stage within the next twelve months. Those who get there faster will have a definite competitive edge.
READINESS STRATEGIES
Organizations at this stage need to focus on approaches that will allow them to efficiently scale adoption of Big Data technologies across the entire enterprise. As companies move from proof-of-value solutions to full-scale adoption, it is critical to understand that what worked in the earlier stages may not always work in the Evolved stage.
Ensure code does not become a maintenance nightmare. While learning and developing Pig, HiveQL, and Java code might be fun at the beginning, highly repetitive tasks such as joins, change data capture (CDC), and aggregations can quickly become a nightmare to troubleshoot and maintain. Using tools with a template-driven approach (see the join sketch following this list) can make you more productive by letting you focus on more value-added activities.
Choose a Hadoop ETL tool with a user-friendly graphical interface. Easily build ETL
jobs without the need to develop, debug, and maintain complex Java, Pig, HiveQL, and
other specialized code for MapReduce. Using common ETL paradigms will allow you
to leverage existing ETL skills within your organization, minimizing barriers for wider
Hadoop adoption.
Consider an ETL tool with native Hadoop integration. Beware of ETL tools that claim
integration with Hadoop but simply generate code such as HiveQL, Pig, or Java. These
approaches can create additional performance overhead and maintenance hurdles down
the road.
Leverage a metadata repository. This will facilitate reusability, data lineage, and impact
analysis capabilities.
Rationalize your data warehouse. Identify the top 20% of ETL workflows causing
problems within your existing enterprise data warehouse. Start by shifting these
processes into Hadoop. Operational savings and additional database capacity can then
be used to fund more strategic initiatives.
Secure your Hadoop data. Any viable approach to Hadoop ETL must provide ironclad security that meets your organization's and industry's data security requirements. Seamless support for Kerberos and LDAP is key.
Augment your Center of Excellence (COE) with Hadoop best practices and guidelines. Enhance your organization's COE to provide expertise in Hadoop and related tools, and to define and standardize guidelines to identify and align the appropriate IT resources with the appropriate use cases throughout your organization.
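As the first item in this list suggests, a template-driven approach replaces many hand-coded variants of the same task with one tested routine. Here is a minimal Python sketch of a reusable join "template"; the record layouts and field names are hypothetical, and the point is the reuse pattern rather than any particular tool's implementation.

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Reusable join 'template': one tested implementation instead of
    re-coding the same join logic in every Pig/HiveQL/Java job."""
    index = {}
    # Build a hash index on the right-hand side, then probe it with each left row.
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    for row in left_rows:
        for match in index.get(row[left_key], []):
            yield {**row, **match}

# Illustrative usage: enrich web log records with customer attributes.
logs = [{"cust_id": "1", "url": "/home"}, {"cust_id": "2", "url": "/buy"}]
customers = [{"id": "1", "segment": "gold"}, {"id": "2", "segment": "silver"}]
for joined in hash_join(logs, customers, "cust_id", "id"):
    print(joined)
```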
[Sidebar: pluggable functionality allows you to run and optimize existing MapReduce operations within Hadoop.]
Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast track your journey to embrace Big Data and capitalize on the fourth V: Value.
Evolved. Standardizing on Hadoop as the operating system for Big Data across the entire enterprise.
Syncsort provides data-intensive organizations across the Big Data continuum with a smarter way to collect and process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer resources, and lower TCO. For more information visit www.syncsort.com.
© 2013 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names used herein may be the trademarks of their respective companies.