Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast track your journey to embrace Big Data and capitalize on the fourth V: Value.
Awakening. Data integration tasks are mostly performed using custom coded approaches, often using SQL to
transform and integrate data inside a database.
Advancing. Organizations realize the value of data and start standardizing data integration processes on a common platform such as Informatica or DataStage, leading to greater efficiencies and economies of scale.
Plateauing. Initial successes with an enterprise data warehouse spark the need for more insights. However,
increasing data volumes and changing business requirements push the limits of traditional data integration and
data warehousing architectures. Stopgap measures trigger a transition from ETL (Extract, Transform, Load) to
ELT (Extract, Load, Transform), shifting heavy data transformation workloads into the enterprise data warehouse.
The IT backlog grows despite standards and best practices. Initial success is replaced by unsustainable costs
and user frustration.
Dynamic. Organizations start to look for alternative solutions to meet these challenges in less time, with less
effort, and at lower cost. They experiment with Big Data frameworks like Hadoop to address architectural
limitations of traditional platforms and look for ways to leverage the accumulated expertise within their
organizations.
Evolved. Companies at this stage are scaling Hadoop across the entire enterprise, using it as an integral
component of their production data management infrastructure. Big Data platforms become a new standard
within these organizations, augmenting traditional architectures at significantly lower costs.
The rest of this paper examines the Big Data Continuum in more detail and provides specific
readiness strategies to help your organization address the challenges and opportunities
of each stage.
Low Productivity: Developing, maintaining, and extending custom software code is a productivity drain and quickly becomes unsustainable. It is particularly challenging to tune, maintain, and extend existing code when the original developers are no longer in the same roles or have left the company. Custom code also makes it difficult to perform impact analysis or trace data lineage to understand dependencies and data flows.
Poor Performance: SQL was not designed for ETL processing. Instead, it is a special-purpose programming
language designed for querying and managing data stored in relational databases. Using SQL for ETL
tasks is inefficient, creating performance bottlenecks and jeopardizing service level agreements (SLAs) for
ETL processing windows.
High Cost: Pushing intensive data transformations down to the database steals expensive database cycles from the tasks for which the database was intended, resulting in added infrastructure costs and jeopardizing performance SLAs for processing database queries.
All of these issues can make it difficult for organizations to extract information and deliver business value from data, especially as data-driven information and decision making become a self-reinforcing cycle, creating demand for even more data-driven information. Custom coding will often solve problems at the outset, but as the need for more and faster information grows, these approaches simply can't keep pace with the demands of the business.
READINESS STRATEGIES
Migrate SQL scripts to a high-performance ETL tool. ETL tools have
become the de facto solution to SQL scripting, maintenance and
performance issues. When choosing an ETL tool, beware of complex
engines and code-generators that push SQL down to the database.
Analyze and document complex code and SQL scripts used in data
integration processes and create graphical flow charts to depict SQL logic.
Identify the top 20%. Typically, 20% of SQL scripts consume up to 80% of the time and cost, due to hardware, tuning, and maintenance. Usual suspects include SQL with merge/upsert, joins, materialized views, cursors, and union operations.
Migrate SQL scripts using the 80/20 rule. When planning and evaluating
the benefits of SQL migration, it is important to realize that a complete
migration of all SQL code is not necessary to achieve significant benefits.
Instead, focus on the top 20% to deliver quick results and significant
savings.
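One way to find that top 20% is to scan your SQL scripts for the usual suspects listed above. Below is a minimal Python sketch, assuming the scripts sit in a single directory; the patterns and weights are rough heuristics for illustration, not a real cost model.

```python
import re
from pathlib import Path

# Constructs that typically dominate ETL cost (illustrative weights only)
COSTLY_PATTERNS = {
    r"\bMERGE\b": 3,                 # merge/upsert logic
    r"\bJOIN\b": 2,                  # multi-table joins
    r"\bMATERIALIZED\s+VIEW\b": 3,   # materialized views
    r"\bCURSOR\b": 3,                # row-by-row processing
    r"\bUNION\b": 2,                 # union operations
}

def score_script(sql_text: str) -> int:
    """Return a rough 'cost' score based on how often costly constructs appear."""
    score = 0
    for pattern, weight in COSTLY_PATTERNS.items():
        score += weight * len(re.findall(pattern, sql_text, re.IGNORECASE))
    return score

def rank_scripts(directory: str):
    """Rank all .sql files so the top 20% can be targeted for migration first."""
    scripts = sorted(
        ((p.name, score_script(p.read_text(errors="ignore")))
         for p in Path(directory).glob("*.sql")),
        key=lambda item: item[1],
        reverse=True,
    )
    top_n = max(1, len(scripts) // 5)  # the 80/20 rule: focus on the top 20%
    return scripts[:top_n]

if __name__ == "__main__":
    for name, score in rank_scripts("./sql_scripts"):
        print(f"{score:4d}  {name}")
```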
[Sidebar: migrate SQL scripts to a high-performance ETL tool, starting with very complex scripts, including those that are unstable or error-prone.]
More Data. The number and type of data sources users need to leverage increases, often including
dissimilar data in different formats (e.g. text, mainframe, web logs, and CRM).
More End Users. The range of end users that must be satisfied increases, including executives, managers,
and field and operations staff, for example.
More Queries. As the number and roles of end users grow, so do the number, variety, and complexity of
queries that must be performed on the data.
Companies at this stage come to realize that continuing to use point solutions and hand-coded approaches will
hold them back. As a result, they will begin to evaluate, adopt and standardize on ETL tools and data integration
platforms. In addition to investments in IT infrastructure, organizations start to develop and enforce best practices,
and accumulate technical expertise that can prove critical to progress along the Big Data Continuum.
When surveyed, more organizations identified their data integration readiness at these first two stages of the Big
Data Continuum than at any of the others.
READINESS STRATEGIES
Beware of code-generators and push-down optimizations. Some organizations have adopted tools that generate SQL or offer so-called push-down optimizations as a means to achieve faster performance at scale. Unfortunately, most of these tools, including Talend and Informatica, require significant skills and ongoing manual tuning to achieve and sustain acceptable performance, creating challenges similar to hand coding and maintaining SQL-based data integration logic.
Improve staff productivity. Select an ETL tool with Windows-based paradigms that don't require a long learning curve or specialized skills. Data integration tools should allow users to focus on business rules and workflows rather than complex tuning parameters to achieve and maintain high performance. Look for ease of use as well as ease of re-use, with impact analysis and data lineage capabilities that make it easy to revise and extend existing applications as business requirements change.
Choose a tool that maximizes run-time performance and efficiency. A tool that delivers superior run-time processing performance and efficiency will maximize resource utilization, minimize costs, and provide superior throughput. Look for a solution that performs all transformation processing outside of the database (see the sketch following this list), minimizing performance bottlenecks and inefficient utilization of expensive database resources. Doing so can keep costs under control and allow you to build a solid foundation for the future, avoiding potential issues often encountered in the subsequent stages.
Leverage all your data. Having the right data source and target connectivity is
critical for leveraging all your data, to help make the best business decisions and
discover new business opportunities.
Establish a Big Data Center of Excellence (COE). A center of excellence is key to developing and retaining Big Data expertise within the organization. The COE should also set and enforce standards for the data management architecture, define the strategic roadmap, establish best practices, and provide training and support to the organization.
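This is the sketch referenced in the run-time performance item above: a minimal illustration of performing the transformation outside the database, using Python's built-in sqlite3 module as a stand-in for a real warehouse connection. The sales table and its columns are hypothetical.

```python
import sqlite3
from collections import defaultdict

def aggregate_outside_db(conn):
    """Aggregate in the ETL layer instead of pushing a heavy GROUP BY to the database."""
    totals = defaultdict(float)
    # The database does what it is good at (a simple scan), while the
    # expensive transformation runs in the ETL process, off the database server.
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    return dict(totals)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse connection
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("east", 100.0), ("west", 250.0), ("east", 50.0)])
    print(aggregate_outside_db(conn))  # {'east': 150.0, 'west': 250.0}
```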
Comprehensive connectivity to leverage all your data. The high-performance ETL solution provides out-of-the-box connectivity to relational sources, flat files, mainframes, Hadoop, and everything in between.
Over time, increasing demands for information often prove to be too much for traditional architectures to handle. As data volumes grow and business users demand fresher data, popular data integration tools such as Informatica and DataStage force organizations to push data transformations down to the enterprise data warehouse, effectively causing a transition from ETL to ELT. Unfortunately, SQL is almost never the best approach for data integration tasks. Relational database management systems (RDBMS) were specifically designed to solve problems that involve a big question with a small answer (i.e., user queries). However, when dealing with data transformations, the "T" in ETL, the answer is generally as big as, if not bigger than, the question.
Moreover, organizations can face unacceptable bottlenecks and delays, not only for data transformations but also for user queries. The RDBMS is optimized for query performance, not for the big data movements that ETL involves.
READINESS STRATEGIES
Offload transformations from the data warehouse. Inefficient and
underperforming ETL tools have forced many IT developers to push
transformations down to the database, adding complexity and requiring massive
investments in additional database capacity. This approach will actually move
you backward along the Big Data Continuum, increasing database costs and
the effort to maintain and tune scripts. Look for approaches that shift intensive
transformations out of the database.
Leverage acceleration technologies to extend your existing data integration infrastructure. Most organizations have spent considerable time and money building their existing data integration infrastructure, so "rip and replace" approaches aren't practical. Rather than buying extra hardware and database capacity, you can identify where the bottlenecks occur and bring in specialized data integration technology to accelerate these processes. For example, technology now exists that can efficiently handle sorts, merges, and aggregations, and that integrates seamlessly with your existing architecture. Accelerating technologies increase an organization's Big Data readiness by removing performance bottlenecks while allowing it to leverage its existing architecture. These plug-and-play technologies typically result in significant savings that can be used to fund initiatives to move into the Dynamic stage.
Start with the top 20% of data transformations. Usually 20% of the transformations incur 80% of the processing problems. Offloading and accelerating these transformations will provide the best bang for the buck (see the sketch following this list).
Consider using Hadoop to offload all ETL processes from the data warehouse.
Hadoop is emerging as the de facto operating system for Big Data. Thanks to its
massively scalable and fault-tolerant architecture, Hadoop can be much more
effective from a performance and cost perspective than the data warehouse in
processing ETL workloads. In addition, shifting ETL workloads to Hadoop
can free up valuable database capacity to accelerate user queries.
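Following the "top 20%" rule above, a small amount of instrumentation can rank ETL steps by their share of total runtime and reveal the best offload candidates. A minimal Python sketch; the step names and sleeps are stand-ins for real transformations.

```python
import time
from contextlib import contextmanager

durations = {}

@contextmanager
def timed(step_name):
    """Record the wall-clock time of each ETL step so hotspots can be ranked."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[step_name] = durations.get(step_name, 0.0) + time.perf_counter() - start

def report_hotspots():
    """Print steps by share of total runtime; the top few are offload candidates."""
    total = sum(durations.values()) or 1.0
    for step, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{step:30s} {secs:8.2f}s  {100 * secs / total:5.1f}%")

# Example usage (step names are illustrative):
with timed("customer_merge"):
    time.sleep(0.2)   # stand-in for a heavy merge/upsert
with timed("weblog_aggregation"):
    time.sleep(0.1)   # stand-in for an aggregation
report_hotspots()
```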
Syncsort DMX-h offers high-performance data integration software with everything you need to deploy enterprise-grade ETL capabilities on Hadoop. DMX-h offers a unique approach to Hadoop ETL that lowers the barriers to adoption, helping your organization unleash the full potential of Hadoop. Thanks to a library of Use Case Accelerators, it's easy for organizations to get started with Hadoop by implementing common ETL tasks such as joins, change data capture (CDC), web log aggregations, mainframe data access, and more.
Hadoop is helping organizations in all industries gain greater insights, processing more data in less time and at a lower cost. According to organizations surveyed, the top benefits from their use of Hadoop are finding previously undiscovered insights and reducing overall data costs.
Two of the most common approaches include data warehouse optimization and mainframe offload. By shifting transformations, the "T" in ETL, out of the data warehouse and into Hadoop, organizations can quickly realize significant value, including freeing warehouse capacity for mission-critical applications.
It is important to recognize, however, that Hadoop is not a complete ETL solution. Hadoop is an operating system
that provides the underlying services to create Big Data applications. While it offers powerful utilities and massive
horizontal scalability, it does not provide the full set of capabilities that users need to deliver enterprise ETL
applications and functionality. If not addressed correctly, the gaps between the operating-level services that
Hadoop offers and the functionality that enterprise-grade ETL requires can slow Hadoop adoption and frustrate
organizations eager to deliver results, jeopardizing subsequent investments.
MapReduce spreads out processing tasks across large numbers of nodes by handling the complicated aspects of creating, managing, and executing a set of parallel processes over a cluster of low-cost computers.
ETL, the process of collecting, processing, and distributing data, has emerged as one of the most common use cases for Hadoop.3 In fact, industry analyst Gartner predicts that most organizations will adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse.4
Use of Hadoop can become a game changer for organizations, dramatically
improving the cost structure for gaining new insights, for analyzing larger data sets
and new data types, and for quickly and flexibly bringing new services to market.
[Figure: MapReduce data flow. On each node: Input Formatter → MAP → Optional Partitioner → SORT → Local Disk → Optional Combiner → SORT; then REDUCE → Output Formatter → HDFS.]
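To make the figure's flow concrete, here is a minimal Hadoop Streaming sketch in Python that counts hits per URL in web logs: the mapper emits URL/count pairs, Hadoop sorts and shuffles them, and the reducer aggregates. The log layout, script name, and launch command are illustrative assumptions, not a prescription.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming sketch: count hits per URL in web logs.
# Illustrative launch (flags vary by distribution):
#   hadoop jar hadoop-streaming.jar -files loghits.py \
#     -mapper "loghits.py map" -reducer "loghits.py reduce" \
#     -input /logs -output /hits
import sys

def mapper():
    # Assumes the requested URL is the 7th whitespace-separated field,
    # as in common/combined log format; adjust for your own logs.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print(f"{fields[6]}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal URLs arrive adjacent.
    current_url, count = None, 0
    for line in sys.stdin:
        url, _, value = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                print(f"{current_url}\t{count}")
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```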
3 http://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/
4 Mark A. Beyer and Ted Friedman. Big Data Adoption in the Logical Data Warehouse. Gartner Research, February 2013.
READINESS STRATEGIES
During experimentation and early stages of Hadoop, the main objective is to prove the
value that Hadoop can bring to organizations by augmenting or extending existing data
integration and data warehouse architectures. Therefore, data connectivity and quick
development of common ETL use cases are critical for organizations at the Dynamic
stage. Connectivity to the right data sources can maximize the value of the framework
and avoid having Hadoop become yet another silo within the enterprise. In addition,
quickly ramping productivity with Hadoop allows IT to deliver quantifiable successes that
pave the way for more widespread adoption. Success at this stage enables companies
to move to the Evolved stage, where Hadoop becomes an integral component of the
production data management architecture.
Select a tool with a wide variety of connectors to source and target systems.
Simplify importing data from various sources into Hadoop, as well as exporting
data from Hadoop to other systems.
Leverage mainframe data. Mainframe data can be the critical reference point for
new data sources, such as web logs and sensor data. Therefore, make sure the
tool provides connectivity and data translation capabilities for the mainframe.
Ensure the tool offers a comprehensive library of pre-built, out-of-the-box data transformations. The most common data flows include joins, aggregations, and change data capture (a CDC sketch follows this list). Reusable templates can accelerate development of prototype applications and proof of value.
Avoid tools that generate code. These tools will burden your organization with
heavy tuning and maintenance.
Test and break your system. As you build your proof-of-concept, stress testing
your system will help you assess the reliability of your implementation and will
teach your staff critical skills to maintain and support it down the road.
Identify and prioritize use cases. Identify one (or a small number of) proof-of-concept use cases for Hadoop. Candidate use cases often involve recurring ETL processes that place a heavy burden on the existing data warehouse.
Smarter connectivity to all your data. With DMX-h, you only need one tool to connect all sources and targets to Hadoop, including relational databases, appliances, files, XML, and even cloud. No coding or scripting is needed. DMX-h can also be used to pre-process data (cleanse, sort, partition, and compress) prior to loading it into Hadoop, resulting in enhanced performance and significant storage savings.
While most organizations at this stage are not looking to replace their existing data warehousing infrastructure with Hadoop, ETL is a different story. Hadoop is poised to completely change the way organizations collect, process, and distribute their data. ETL is shifting to Hadoop ETL, and Big Data is becoming the new standard architecture, providing greater value to the organization at a cost structure that is radically lower than traditional architectures. That's why the ability to cost-effectively utilize Big Data is quickly becoming a requirement for companies to survive.
For example, an organization can store aggregated web log data in their relational database, while keeping the complete raw data in Hadoop.
As organizations begin to standardize on Hadoop as the new Big Data platform, they must keep hardware and resource costs under control. Although Hadoop leverages commodity hardware, the total cost for system resources can still be significant. When dealing with large numbers of nodes, hardware costs add up. Programming resources (e.g., HiveQL, Pig, Java, MapReduce) can also prove expensive. Using Hadoop for ETL processing requires specialized and expensive developers who can be hard to find and hire. For example, the Wall Street Journal recently reported that a Hadoop programmer can now earn as much as $300,000 per year.
Today, the reality is that very few organizations have reached the Evolved stage. Less than 2% of organizations surveyed are using Hadoop as an integral component of their data management platform. But many organizations are working towards this goal, and almost 11% expect to be at this stage within the next twelve months. Those who get there faster will have a definite competitive edge.
READINESS STRATEGIES
Organizations at this stage need to focus on approaches that will allow them to efficiently scale adoption of Big Data technologies across the entire enterprise. As companies move from proof-of-value solutions to full-scale adoption, it is critical to understand that what worked in the earlier stages may not always work in the Evolved stage.
Ensure code does not become a maintenance nightmare. While learning and developing Pig, HiveQL, and Java code might be fun at the beginning, highly repetitive tasks such as joins, change data capture (CDC), and aggregations can quickly become a nightmare to troubleshoot and maintain. Using tools with a template-driven approach (see the join sketch following this list) can make you more productive by letting you focus on more value-added activities.
Choose a Hadoop ETL tool with a user-friendly graphical interface. Easily build ETL
jobs without the need to develop, debug, and maintain complex Java, Pig, HiveQL, and
other specialized code for MapReduce. Using common ETL paradigms will allow you
to leverage existing ETL skills within your organization, minimizing barriers for wider
Hadoop adoption.
Consider an ETL tool with native Hadoop integration. Beware of ETL tools that claim
integration with Hadoop but simply generate code such as HiveQL, Pig, or Java. These
approaches can create additional performance overhead and maintenance hurdles down
the road.
Leverage a metadata repository. This will facilitate reusability, data lineage, and impact
analysis capabilities.
Rationalize your data warehouse. Identify the top 20% of ETL workflows causing
problems within your existing enterprise data warehouse. Start by shifting these
processes into Hadoop. Operational savings and additional database capacity can then
be used to fund more strategic initiatives.
Secure your Hadoop data. Any viable approach to Hadoop ETL must provide ironclad security that meets your organization's and industry's data security requirements. Seamless support for Kerberos and LDAP is key.
Augment your Center of Excellence (COE) with Hadoop best practices and guidelines. Enhance your organization's COE to provide expertise in Hadoop and related tools, and to define and standardize guidelines to identify and align the appropriate IT resources with the appropriate use cases throughout your organization.
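As the first item in this list suggests, a template-driven approach replaces many hand-coded variants of the same task with one tested routine. Here is a minimal Python sketch of a reusable join "template"; the record layouts and field names are hypothetical, and the point is the reuse pattern rather than any particular tool's implementation.

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Reusable join 'template': one tested implementation instead of
    re-coding the same join logic in every Pig/HiveQL/Java job."""
    index = {}
    # Build a hash index on the right-hand side, then probe it with each left row.
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    for row in left_rows:
        for match in index.get(row[left_key], []):
            yield {**row, **match}

# Illustrative usage: enrich web log records with customer attributes.
logs = [{"cust_id": "1", "url": "/home"}, {"cust_id": "2", "url": "/buy"}]
customers = [{"id": "1", "segment": "gold"}, {"id": "2", "segment": "silver"}]
for joined in hash_join(logs, customers, "cust_id", "id"):
    print(joined)
```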
[Sidebar: pluggable functionality allows you to run and optimize existing MapReduce operations within Hadoop.]
Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast track your journey to embrace Big Data and capitalize on the fourth V: Value.
Evolved. Standardizing on Hadoop as the operating system for Big Data across the entire enterprise.
Syncsort provides data-intensive organizations across the Big Data continuum with a smarter way to collect and process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer resources, and lower TCO. For more information visit www.syncsort.com.
© 2013 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names used herein may be the trademarks of their respective companies.