Table of contents
Executive summary ...................................................................................................................................................................... 2
Introduction .................................................................................................................................................................................... 2
Breaking the mold .................................................................................................................................................................... 3
Industry trends .......................................................................................................................................................................... 3
What else does HP see in the future? ................................................................................................................................... 4
HP approach ................................................................................................................................................................................... 5
Reference architecture ............................................................................................................................................................ 5
Performance .............................................................................................................................................................................. 9
Proprietary implementations ................................................................................................................................................. 9
Benefits of the HP BDRA solution .............................................................................................................................................. 9
Importance of flexibility in big data architecture ............................................................................................................. 10
HP BDRA proof points................................................................................................................................................................. 11
Summary: the HP BDRA vision ............................................................................................................................... 11
Implementation overview ..................................................................................................................................................... 12
For more information ................................................................................................................................................................. 13
Technical white paper | HP Big Data Reference Architecture: A Modern Approach
Executive summary
This paper describes the HP Big Data Reference Architecture (BDRA) solution and outlines how a modern architectural
approach to Hadoop provides the basis for consolidating multiple big data projects while, at the same time, enhancing
price/performance, density, and agility.
HP BDRA is a modern, flexible architecture for the deployment of big data solutions; it is designed to improve access to big
data, rapidly deploy big data solutions, and provide the flexibility needed to optimize the infrastructure in response to the
ever-changing requirements in a Hadoop ecosystem.
This reference architecture challenges the conventional wisdom used in current big data solutions, where compute
resources are typically co-located with storage resources on a single server node. Hosting big data on a single node can be
an issue: if a server goes down, the time required to redistribute its data to the remaining nodes can be significant. Instead, HP BDRA
utilizes a much more flexible approach that leverages an asymmetric cluster featuring de-coupled tiers of workload-
optimized compute and storage nodes in conjunction with modern networking. The result is a big data platform that makes
it easier to deploy, use, and scale data management applications.
As customers start to retain the dropped calls, mouse-clicks and other information that was previously discarded, they are
adopting the use of single data repositories (also known as data lakes) to capture and store big data for later analysis. Thus,
there is a growing need for a scalable, modern architecture for the consolidation, storage, access, and processing of big
data. HP BDRA meets this need by offering an extremely flexible platform for the deployment of data lakes, while also
providing access to a broad range of applications. Moreover, the HP BDRA tiered architecture allows organizations to grow
specific elements of the platform without the chore of having to redistribute data; this architecture allows bringing together
different classes of workload-specific compute and storage nodes.
Clearly, big data solutions are evolving from a simple model where each application was deployed on a dedicated cluster of
identical nodes. By integrating the significant changes that have occurred in fabrics, storage, container-based resource
management, and workload-optimized servers, HP has created the next-generation, data center-friendly, big data cluster.
Target audience: This document is intended for HP customers that are investigating the deployment of big data solutions,
or those that already have big data deployments in operation and are looking for a modern architecture to consolidate
current and future solutions while offering the widest support for big data tools.
Document purpose: This white paper outlines the HP Big Data Reference Architecture (BDRA) and introduces new
developments in compute and storage node tiering to enhance system flexibility and performance.
Introduction
Today's data centers are typically based on a converged infrastructure featuring a number of servers and server blades
accessing shared storage over a storage area network (SAN). In such environments, the need to build dedicated servers to
run particular applications has disappeared. However, with the explosive growth of big data, the Hadoop platform is
becoming widely adopted as the standard solution for the distributed storage and processing of big data. Since it is often
cost-prohibitive to deploy shared storage that is fast enough for big data, organizations are often forced to build a
dedicated Hadoop cluster at the edge of the data center; then another; then perhaps an HBase cluster, a Spark cluster, and
so on. Each cluster utilizes different technologies and different storage types; each typically has three copies of big data.
The current paradigm for big data products like Hadoop, Cassandra, and Spark insists that performance in a Hadoop cluster
is optimal when work is taken to the data, resulting in nodes featuring Direct-Attached Storage (DAS), where compute and
storage resources are co-located on the same node. In this environment, data sharing between clusters is constrained by
network bandwidth, while individual cluster growth is limited by the necessity to repartition data and distribute it to new
disks.
Thus, many of the benefits of a traditional converged architecture are lost in today's Hadoop cluster implementations. You
cannot scale compute and storage resources separately; and it is no longer practical for multiple systems to share the same
data.
Typical business challenges are outlined in Figure 1.
In essence, Hadoop creates a batch-processing environment for big data that runs in parallel across hundreds of servers,
with associated (federated) products restructuring raw data to create case-sets for further analysis. Thus, a conventional
data center might require clusters for tools such as HP Vertica, Apache Cassandra, Apache HBase, SAP HANA, or SAS in
addition to one or more Hadoop clusters. Rather than a neat, tidy converged infrastructure, organizations are cobbling
together multiple silos of big data, resulting in a relatively inflexible environment where it may take days and
repartitioning to move data from one silo to another. Since multiple tools are required in the analytic cycle, the inability to
share the same copy of the data or move data in a timely manner becomes an issue, while manageability is also a significant
concern.
Furthermore, there is no one optimal server configuration for big data processing; each product in the Hadoop federation of
big data solutions has a different server profile and different workload needs. In practice, optimal compute/storage ratios
typically vary from the accepted one spindle-per-core; indeed, many Hadoop workloads perform better with additional
compute power and fewer spindles. Other workloads require a large amount of memory, while still others would benefit
from accelerators such as Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs).
As such, it is difficult to build a single node configuration that can accommodate all of these workloads optimally.
Thus, HP has challenged entrenched assumptions and determined that the ideal Hadoop cluster is asymmetric; that is,
compute and storage nodes have been de-coupled and are connected via high-speed Ethernet. Storage is directly attached,
with all storage nodes exposed to the same view of the data; compute nodes are workload-optimized.
Industry trends
HP's efforts to challenge convention and re-imagine the Hadoop cluster did not happen in a vacuum. Industry trends have
been pointing in the same direction. Apache Hadoop has evolved consistently with client needs. Beginning with Hadoop 2.6,
there were significant internal changes to the storage layer, collectively called Heterogeneous Storage; the terms "tiered"
and "hierarchical" were not considered descriptive enough for the changes that were made.
With Heterogeneous Storage, the DataNode exposes the types and usage statistics for each individual storage to the
NameNode. This change allows NameNode to choose not just a target DataNode when placing replicas, but also the specific
storage type on each target DataNode. Since different storage types have different performance metrics in terms of
throughput, latency and cost, this new capability benefits the total optimization and cost efficiency of the system.
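As a hedged illustration of this idea (a toy model, not the actual HDFS NameNode code), replica placement with Heterogeneous Storage can be sketched as choosing both a DataNode and a storage type on that node; the node names, capacities, and selection heuristic below are invented for the example:

```python
# Toy model of storage-type-aware replica placement (illustrative only;
# not the real HDFS NameNode logic). Each DataNode reports the storage
# types it offers and their free capacity, as with Heterogeneous Storage.

# Storage policies map to the storage types preferred for each replica.
POLICIES = {
    "HOT":     ["DISK", "DISK", "DISK"],
    "ONE_SSD": ["SSD", "DISK", "DISK"],
    "ALL_SSD": ["SSD", "SSD", "SSD"],
    "COLD":    ["ARCHIVE", "ARCHIVE", "ARCHIVE"],
}

def place_replicas(datanodes, policy):
    """Pick (node, storage_type) pairs for each replica of a block.

    datanodes: dict of node name -> {storage_type: free_bytes}.
    Falls back to DISK when the preferred type is unavailable.
    """
    placements = []
    used = set()
    for wanted in POLICIES[policy]:
        candidates = [
            (name, wanted) for name, types in datanodes.items()
            if name not in used and types.get(wanted, 0) > 0
        ]
        if not candidates:  # fall back to DISK on a remaining node
            candidates = [
                (name, "DISK") for name, types in datanodes.items()
                if name not in used and types.get("DISK", 0) > 0
            ]
        # Prefer the candidate with the most free space of that type
        name, stype = max(candidates, key=lambda c: datanodes[c[0]][c[1]])
        placements.append((name, stype))
        used.add(name)
    return placements

cluster = {
    "dn1": {"SSD": 100, "DISK": 500},
    "dn2": {"DISK": 800},
    "dn3": {"SSD": 300, "DISK": 200},
}
print(place_replicas(cluster, "ONE_SSD"))
# → [('dn3', 'SSD'), ('dn2', 'DISK'), ('dn1', 'DISK')]
```

The real NameNode additionally weighs rack topology and load; the point of the sketch is only that the storage type is now part of the placement decision.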
Tiered storage
Rather than relying on proprietary SAN storage, Hadoop has been built around the concept of using Software-Defined
Storage (SDS), whereby lightweight distributed file systems support commoditized physical storage. For example, file-
based transfers typically utilize Hadoop Distributed File System (HDFS), while object-based transfers could utilize Ceph.
In another key advance, Hadoop now supports storage-tiering based on workload requirements. Thus, depending on the
particular workload, storage can be implemented using SSDs, HDDs, in-memory storage, or, for archival purposes, object
stores.
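In stock Hadoop (2.6 and later), a DataNode volume is assigned a storage type by prefixing its entry in hdfs-site.xml; the mount paths below are placeholders:

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- Prefix each volume with its storage type; paths are illustrative -->
  <value>[SSD]/data/ssd1,[DISK]/data/disk1,[ARCHIVE]/data/archive1</value>
</property>
```

A storage policy such as HOT, ONE_SSD, or COLD can then be applied per directory with the `hdfs storagepolicies -setStoragePolicy` admin command, directing hot data to SSD volumes and archival data to high-capacity drives.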
Deep Learning
There is increased interest in Deep Learning, whereby algorithms are used to make sense of data such as text, images,
and sound. In conjunction with artificial neural networks (that is, networks inspired by the central nervous systems of the
animal world), Deep Learning should enhance the analysis of big data. Systems will be able to recognize items of interest
and identify relationships without the need for specific models or programming.
Big data and supercomputing will come together. For example, no-frills hardware solutions designed for high-performance
computing (HPC), such as the HP Apollo 2000 System, can be deployed in the same rack as other compute nodes to serve
purely as number-crunchers for workloads such as Apache Spark.
HP approach
Many of the design choices for a traditional Hadoop infrastructure (an architecture where compute is very closely tied to
storage, as shown in Figure 2) were driven by technologies that were available at the time. Now, however, these choices
are limiting the capabilities of Hadoop and inhibiting potential performance gains from newer hardware technologies and
enhancements to Hadoop itself.
Figure 2. The conventional wisdom is that compute and storage must be co-located
HP has attacked those paradigms, demonstrating that decoupling the compute and storage components of the Hadoop
cluster creates an extremely fast, asymmetric big data solution, as shown in Figure 3.
This solution is further enhanced by deploying workload-optimized servers and utilizing the latest features in Hadoop, such
as YARN, storage-tiering, and the ability to assign jobs to specific workload-optimized resources.
Reference architecture
HP has created HP BDRA, a reference architecture for big data, which is shown in Figure 4. This dense, converged solution
can facilitate the consolidation of multiple pools of data, allowing Hadoop, Vertica, Spark and other big data technologies to
share a common pool of data. The flexibility to adapt to future workloads has been built into the design.
This converged design features an asymmetric cluster where compute and storage resources are deployed on separate
tiers. Storage is direct-attached; specialized SAN technologies are not used. Workloads and storage can be directed to
optimized nodes. Interconnects are standard Ethernet; protocols between compute and storage are native Hadoop, such as
HDFS and HBase.
HP BDRA serves as a proof of concept and has been benchmarked to demonstrate substantially improved
price/performance along with significantly increased density compared with a traditional Hadoop architecture. For
example, HP has demonstrated that storage nodes in HP BDRA actually perform better now that they have been decoupled
and are dedicated to running HDFS, without any Java or MapReduce overhead. Moreover, because modern Ethernet fabrics
are capable of delivering more bandwidth than a server's storage subsystem, network traffic between tiers does not create
a bottleneck. Indeed, testing indicated that read I/Os increased by as much as 30% in an HP BDRA configuration compared
with a conventional Hadoop cluster.
Figure 4. HP Big Data Reference Architecture, changing the economics of work distribution in big data
Compute nodes
HP Moonshot System with HP ProLiant m710/m710p Server Cartridges delivers a scalable, high-density layer for
compute tasks and provides a framework for workload-optimization.
HP Apollo 2000 System with HP ProLiant XL170r Gen9 Servers delivers dual-processor performance, while taking
advantage of the Apollo 2000 System's density and configuration flexibility.
Notes
HP Apollo 4200 Gen9 Servers or HP ProLiant SL4540 Gen8 Servers may be used as storage nodes in the HP BDRA.
HP ProLiant m710/m710p Server Cartridges or HP ProLiant XL170r Gen9 Servers may be used as compute nodes in the HP
BDRA.
High-speed networking separates compute nodes and storage nodes, creating an asymmetric architecture that allows each
tier to be scaled individually; there is no commitment to a particular CPU/storage ratio. Since big data is no longer co-located
with storage, Hadoop no longer needs to achieve node locality. However, rack locality works in exactly the same way as in a
traditional converged infrastructure; that is, as long as you scale within a rack, overall scalability is not affected.
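The locality distinction can be illustrated with a short sketch (a toy classifier, not Hadoop scheduler code; node and rack names are invented): with compute and storage de-coupled, node-local reads disappear, but rack-local reads remain as long as scaling happens within a rack.

```python
# Toy locality classification (illustrative only). In a de-coupled tier,
# a compute task never runs on the node holding the block, so the best
# achievable locality for a read is rack-local rather than node-local.

RACKS = {
    "compute1": "rack1", "compute2": "rack1",
    "storage1": "rack1", "storage2": "rack2",
}

def locality(task_node, block_node):
    """Classify a read as node-local, rack-local, or off-rack."""
    if task_node == block_node:
        return "node-local"
    if RACKS[task_node] == RACKS[block_node]:
        return "rack-local"
    return "off-rack"

# A compute-tier task reading from a storage node in the same rack
print(locality("compute1", "storage1"))  # rack-local
print(locality("compute1", "storage2"))  # off-rack
```

In a real deployment, rack membership is reported to Hadoop through its standard rack-awareness topology script, so rack-local scheduling behaves just as it does in a converged cluster.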
With compute and storage de-coupled, you can again enjoy many of the advantages of a traditional converged system. For
example, you can scale compute and storage independently, simply by adding compute nodes or storage nodes. Testing
carried out by HP indicates that most workloads respond almost linearly to additional compute resources.
Depending on its specific needs, an organization can choose from a range of possible solutions (as shown in Figure 5),
from a "hot" configuration featuring a large number of compute nodes and relatively little storage, to a "cold"
configuration with a large amount of storage and relatively few compute resources.
Within the HP BDRA solution, YARN-compliant tools such as HBase and Apache Spark can directly utilize HDFS storage.
Other tools (such as SAP HANA, which requires a four-socket server) are not good candidates for YARN; however, such
applications can achieve high-performance access to the same data via appropriate connectors.
More information on key components of the HP BDRA solution is provided below.
Storage nodes
The solution utilizes HP Apollo 4200 Gen9 servers for storage and recommends the HP Apollo 4510 System as an optional
server for backup and archiving.
Storage nodes deliver enterprise-level features and functionality, with an open source approach.
HP Apollo 4510
The ultra-dense HP Apollo 4510 System stores up to 544TB per system. The Apollo 4510 is recommended for
backup/archival because, in the unlikely event of a server failure, up to 544TB of data per server would need to be
redistributed across the cluster. Configuring Apollo 4510 systems in a separate archival HDFS storage tier contains the
redistribution process within that tier, resulting in minimal impact on the default HDFS storage tier.
Using an HP Apollo 4510 system to back up and archive the Hadoop system delivers the following key features:
Optimized Big Data workloads
Hyperscale storage capacity and performance
Fits up to 68 Large Form Factor (LFF) hot plug drives in a 4U chassis
Up to 544TB per system
Ultra-dense HP Apollo 4510 delivers hyperscale storage capacity at a lower cost compared to traditional Hadoop 2U
servers.
Compute nodes
The compute tier of the HP BDRA will be very dense. HP has validated solutions using the Moonshot 1500 Chassis with the
newest Moonshot server cartridges, as well as other solutions using the Apollo 2000 System with the latest server trays.
Both the innovations of the Moonshot System and the newest developments of the Apollo 2000 System are recommended
for handling the high density of the compute tier of the HP BDRA.
Moonshot 1500 Chassis
The features of the Moonshot 1500 Chassis include:
Key components (power, cooling, and fabric management) shared by all cartridges, reducing requirements for energy,
cabling, and space, while reducing complexity
Redundant design hosting up to 45 individually-serviceable hot-plug cartridges and two network switches, all within a
4.3U footprint
Chassis connected to network switches via six Direct Attach Copper (DAC) cables, each at 40GbE
ProLiant m710 or m710p server cartridges can be selected to provide an ultra-dense compute layer.
Networking
While HP BDRA is a reference architecture rather than a rigidly-configured appliance, HP recommends the networking
products described in this section, which were utilized in the proof-of-concept.
Networking in the HP BDRA solution is provided by a pair of Top of Rack (ToR) HP FlexFabric 5930 Switches configured via
HP Intelligent Resilient Framework (IRF), which extends the switches and provides a layer of networking redundancy. An
additional HP 5900 Switch provides connectivity to HP Integrated Lights-Out (HP iLO) management ports, which run at
1GbE or less.
Moonshot System chassis are each equipped with dual 45-port switches for 10GbE internal networking, as well as four
40GbE uplinks.
The HP Apollo 4200 System, HP SL4540 and Apollo 4510 System are connected to the high-performance ToR switches via
a pair of 40GbE connections.
All cabling between storage and compute tiers is DAC.
YARN
YARN is a key Hadoop feature that may be characterized as a large-scale, distributed operating system for big data
applications. It decouples MapReduce's resource management and scheduling capabilities from the data processing
components, allowing Hadoop to support more varied processing approaches and a broader array of data management
applications.
The concept of labels was a later addition to YARN, allowing compute nodes to be grouped. Now, when you submit a job
through YARN, you can flag the job to run on a particular group by including the appropriate label name.
Benefits of the label feature include the following:
Direct your job to nodes that are optimized for the particular workload
Accommodate applications that do not scale by providing dedicated resources
Utilize dedicated nodes, rather than relying on container isolation
Isolate resources for individual departments within the organization
Isolate compute resources that may be required at short notice for a high-priority job
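The label mechanism can be sketched as follows (a toy model, not the YARN scheduler; the node and label names are invented for illustration):

```python
# Toy model of YARN node labels (illustrative only): a job carrying a
# label expression is only eligible for nodes that carry that label;
# a job with no label targets the default (unlabeled) partition.

NODE_LABELS = {
    "moonshot-01": {"highmem"},
    "moonshot-02": {"highmem"},
    "apollo-01": {"gpu"},
    "apollo-02": set(),          # unlabeled: default partition
}

def eligible_nodes(label_expression):
    """Return the nodes a job with the given label expression may use."""
    if not label_expression:
        return sorted(n for n, labels in NODE_LABELS.items() if not labels)
    return sorted(n for n, labels in NODE_LABELS.items()
                  if label_expression in labels)

print(eligible_nodes("gpu"))      # ['apollo-01']
print(eligible_nodes(""))         # ['apollo-02']
```

In a real cluster, labels are defined and attached to nodes with the `yarn rmadmin -addToClusterNodeLabels` and `yarn rmadmin -replaceLabelsOnNode` admin commands, and a job requests a partition by setting its node-label expression at submission time.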
Performance
HP has conducted a broad range of tests to characterize the performance of an HP BDRA solution. Figure 6 shows the
results of DFSIOe, a worst-case Hadoop job that writes a large amount of data to storage.
In Figure 6, baseline write performance for three servers with local storage is shown in red. When the same three servers
were deployed as storage nodes in an HP BDRA configuration, write performance (shown in blue) scaled up significantly as
the number of compute nodes increased, demonstrating the benefits of de-coupling storage from compute resources.
Proprietary implementations
Competitors have created proprietary big data architecture solutions. However, these typically use traditional shared SAN
storage, which tends to be expensive and ties you to the particular vendor. Moreover, tools like Apache HBase, Parquet, and
OCFile are tightly coupled to HDFS and may not be well-hosted on a proprietary array.
Faster time-to-solution
Processing big data typically requires the use of multiple data management tools. If these tools are deployed on their
own Hadoop clusters, each with its own, often fragmented, copy of the data, time-to-solution can be lengthy. With HP
BDRA, data is unfragmented and consolidated in a single data lake, and tools access the same data via YARN or a
connector. Thus, more time is spent on analysis and less on shipping data; time-to-solution is typically faster.
Power consumption
As expected, using the latest compute and storage nodes in HP BDRA reduces power consumption and, thus, Total Cost of
Ownership (TCO). Reduced power consumption also provides a significant advantage in cost/compute and cost/storage
compared with the traditional architecture.
Overall footprint
By using the latest ultra-dense server technology from HP, the HP BDRA solution delivers equivalent Hadoop performance
in significantly less rack space, reducing floor-space requirements in the data center.
Implementation overview
HP is releasing HP BDRA as a reference architecture; however, to assist the customer, HP has created Bills of Materials
(BOMs), available in distro-specific reference architecture implementation documents, that allow solutions based on this
architecture to be built on-site by HP Technical Services, a VAR (value-added reseller), or the customer. HP BDRA is not an
appliance; thus there are many opportunities for customizing a solution to meet your particular needs. For example, you can
install the Hadoop distribution of your choice, specify additional head nodes, or vary the ratio of compute to storage.
If desired, you can take advantage of an HP Service to image, configure, and test the system on your premises. HP has spent
many man-months optimizing and testing configurations for HP BDRA and has captured the information needed to facilitate
a rapid deployment.
The first step is to build the cluster, then execute a set of benchmarks to ensure that compute, storage, and networking are
all performing as expected. At this point, you can deploy the Hadoop distribution of your choice onto the cluster according to
your organization's standards. Finally, you run additional tests to validate the Hadoop installation.
HP offers services with a range of resources to help you deploy an HP BDRA solution, including the following:
Intellectual property (library of golden images, scripts, benchmarks and key settings)
HP Data Services Manager (HP DSM), which provides single-pane management and helps you transition to big data
HP Insight Cluster Management Utility, for the provisioning and image management of clusters
HP Cluster Test, for testing and deployment
HP Analytics & Data Management services, for managing Hadoop solutions
Partner websites
Cloudera and HP: cloudera.com/content/cloudera/en/solutions/partner/HP.html
Hortonworks and HP: hortonworks.com/partner/hp/
MapR and HP: mapr.com/partners/partner/hp-vertica-delivers-powerful-ansi-sql-analytics-hadoop
Copyright 2014-2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only
warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should
be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.