Developing big data solutions on

Microsoft Azure HDInsight


June 2014
Microsoft Azure HDInsight is a big data solution based on the open-source Apache Hadoop framework,
and is an integral part of the Microsoft Business Intelligence (BI) and Analytics product range. This guide
explores the use of HDInsight in a range of use cases and scenarios such as iterative exploration, as a
data warehouse, for ETL processes, and integration into existing BI systems. It also includes guidance on
understanding the concepts of big data, planning and designing big data solutions, and implementing
these solutions.

The guide is divided into three sections:

Section 1, Understanding Microsoft big data solutions, provides an overview of the principles
and benefits of big data solutions, and the differences between these and the more traditional
database systems. It includes general guidance for planning and designing big data solutions by
exploring in more depth topics such as defining the goals, locating data sources, and more. It
will help you decide where, when, and how you might benefit from adopting a big data solution.
This section also discusses Azure HDInsight, and its place within the comprehensive Microsoft
data platform.

Section 2, Designing big data solutions using HDInsight, contains guidance for designing
solutions to meet the typical batch processing use cases inherent in big data processing. Even if
you choose not to use HDInsight as the platform for your own solution, you will find the
information in this section useful.

Section 3, Implementing big data solutions using HDInsight, explores a range of topics such as
the options and techniques for loading data into an HDInsight cluster, the tools you can use in
HDInsight to process data in a cluster, and the ways you can transfer the results from HDInsight
into analytical and visualization tools to generate reports and charts, or export the results into
existing data stores such as databases, data warehouses, and enterprise BI systems. This section
also contains useful information to help you automate all or part of the process, and to manage
and monitor your solutions.

The guide concentrates on the Azure HDInsight service, but much of the information is equally
applicable to big data solutions built on any platform, and with any Hadoop-based framework.

What this guide is, and what it is not


Big data is not a new concept. Distributed processing has been the mainstay of supercomputers and
high performance data storage clusters for a long time. What's new is standardization around a set of
open source technologies that make distributed processing systems easier to build, combined with the
growing need to store, manage, and get information from the ever increasing volume of data that
modern society generates. However, as with most new technologies, big data is surrounded by a great
deal of hype that often gives rise to unrealistic expectations.
Like all of the releases from Microsoft patterns & practices, this guide avoids the hype by concentrating
on the whys and the hows. In terms of the whys, the guide explains the concepts of big data
solutions based on Hadoop, gives a focused view on what you can expect to achieve with such a
solution, and explores the capabilities of these types of solutions in detail so that you can decide for
yourself if it is an appropriate technology for your own scenarios. The guide does this by explaining what
a Hadoop-based big data solution can do, how it does it, and the types of problems it is designed to
solve.
In terms of the hows, the guide continues with a detailed examination of the typical use cases and
models for Hadoop-based big data batch processing solutions, and the ways that these models integrate
with the wider data management and business information environment, so that you can quickly
understand how you might apply a big data solution in your own environment.
This guide is not a technical reference manual for big data. It concentrates on the complete end-to-end
process of designing and building useful solutions on Hadoop-based systems for batch processing, with
the focus mainly on Azure HDInsight. It doesn't attempt to cover every nuance of implementing a big
data solution, or writing complicated code, or pushing the boundaries of what the technology is
designed to do. Neither does it cover all of the myriad details of the underlying mechanism; there is a
multitude of books, websites, and blogs that explain all these things. For example, you won't find in this
guide a list of all of the configuration settings, the way that memory is allocated, the syntax of every
type of query, and the many patterns for writing map and reduce components.
What you will see is guidance that will help you understand and get started using HDInsight to build
realistic solutions capable of answering real world questions.

This guide is based on the version 3.0 (March 2014) release of HDInsight on Azure, but also includes
some of the preview features that are available in later versions. Earlier and later releases of HDInsight
may differ from the version described in this guide. For more information, see What's new in the
Hadoop cluster versions provided by HDInsight? To sign up for the Azure service, go to HDInsight
service home page.

Who this guide is for


The three sections of this guide target specific audiences:

Executives, information officers, and technology managers. The discussion of the principles
and benefits of Hadoop-based big data solutions, defining the goals for solutions, and
identifying analysis requirements in section 1, Understanding Microsoft big data solutions, of
this guide demonstrates where, when, and how a big data solution would benefit the
organization.

Architects and system designers. The exploration of the typical use cases and scenarios for big
data batch processing solutions in section 2, Designing big data solutions using HDInsight, of
this guide provides valuable assistance in designing systems that will produce the desired
results.

Developers and database administrators. The explanation of topics such as loading, querying
and manipulating data; transferring the results into analytical and visualization tools; exporting
the results into existing data stores and enterprise BI systems; and automating solutions in
section 3, Implementing big data solutions using HDInsight, of this guide will help developers
and DBAs to get started implementing and working with big data solutions.

Why this guide is pertinent now


Businesses and organizations are increasingly collecting huge volumes of data that may be useful now or
in the future, and they need to know how to store and query it to extract the hidden information it
contains. This might be web server log files, click-through data, financial information, medical data, user
feedback, location and sensor information from mobile devices, or a range of social sentiment data such
as tweets or comments to blog posts.
Big data technologies such as HDInsight provide a mechanism to efficiently store this
data, analyze it to extract useful information, and export the results to tools and applications that can
visualize the information in a range of ways. It is, realistically, the only way to handle the volume and the
inherently unstructured nature of this data.
No matter what type of service you provide, what industry or market sector you are in, or even if you
only run a blog or website forum, you are highly likely to benefit from collecting and analyzing data that
is easily available, is often collected automatically (such as server log files), or can be obtained from
other sources and combined with your own data to help you better understand your customers, your
users, and your business; and to help you plan for the future.

The Team Who Brought You This Guide


This guide from the Microsoft patterns & practices group was produced with the help of many people
within the developer community.
Vision/Program Management: Masashi Narumoto
Authors: Alex Homer, Graeme Malcolm, and Masashi Narumoto
Development: Andrew Oakley, Alejandro Jezierski (Southworks), Leo Tilli (Nippur LLC), and Pablo
Zaidenvoren (Nippur LLC)
Testing: Rohit Sharma, Larry Brader, Hanz Zhang, Mariano Sanchez (Lagash Systems SA), and Luis Ariel
Kahrs (Lagash Systems SA)
Performance Testing: Carlos Farre and Veerapat Sriarunrungrueang (Adecco)
Documentation and illustrations: Alex Homer and Graeme Malcolm (Content Master Ltd)
Graphic design: Chris Burns (Linda Werner & Associates Inc)
Editor: RoAnn Corbisier
Production: Nelly Delgado
Reviewers: Cindy Gross, Carl Nolan, Matt Winkler, Maxim Lukiyanov, Rafael Godinho, Simon Gurevich,
Scott Shaw (Hortonworks), Wenming Ye (Microsoft Research), Sherman Wang, Kuninobu Sasaki, Philip
Reilly, Andre Magni, Simon Lidberg, Michael Hlobil, Chris Douglas, Min Wei, Cale Teeter, Buck Woody,
Emilio D'Angelo Yofre, Mandar Inamdar, Daniel Vaughan, Sunil Sabat, Paul Glavich, Christopher Maneu,
Carlos dos Santos, Nishant Thacker, Paweł Wilkosz, and Ofer Ashkenazi.
Thanks: Special thanks to Fred Pace, Kate Baroni, and the members of Microsoft Data Insights COE for
supporting this project.
Thank you all for bringing this guide to life!

Community and Feedback


Questions? Comments? Suggestions? To provide feedback about this guide, or to get help with any
problems, please visit our Community site at http://wag.codeplex.com. The message board on the
community site is the preferred feedback and support channel because it allows you to share your ideas,
questions, and solutions with the entire community.

Table of Contents
Understanding Microsoft big data solutions
    What is big data?
    What is Microsoft HDInsight?
    Planning a big data solution
Designing big data solutions using HDInsight
    Use case 1: Iterative exploration
    Use case 2: Data warehouse on demand
    Use case 3: ETL automation
    Use case 4: BI integration
    Scenario 1: Iterative exploration
    Scenario 2: Data warehouse on demand
    Scenario 3: ETL automation
    Scenario 4: BI integration
Implementing big data solutions using HDInsight
    Collecting and loading data into HDInsight
    Processing, querying, and transforming data using HDInsight
    Consuming and visualizing data from HDInsight
    Building end-to-end solutions using HDInsight
Appendix A - Tools and technologies reference

Understanding Microsoft big data solutions


This section of the guide explores two aspects of Hadoop-based big data systems such as HDInsight:
what they are (and why you should care), and how Microsoft is embracing open source technologies as
part of its big data roadmap. It will help you to understand the core concepts of a big data solution, the
technologies they typically use, and the advantages they offer in terms of managing huge volumes of
data and gaining insights into the information it contains.

Big data is not a stand-alone technology, or just a new type of data querying mechanism. It is a significant
part of the Microsoft Business Intelligence (BI) and Analytics product range, and a vital component of
the Microsoft data platform. Figure 1 shows an overview of the Microsoft data platform and enterprise
BI product range, and the roles big data and HDInsight play within this.

Figure 1 - The role of big data and HDInsight within the Microsoft Data Platform
The figure does not include all of Microsoft's data-related products, and it doesn't attempt to show
physical data flows. For example, data can be ingested into HDInsight without going through an
integration process, and a data store could be the data source for another process. Instead, the figure
illustrates as layers the applications, services, tools, and frameworks that work together to allow you to
capture data, store it, process it, and visualize the information it contains. Notice that the big data
technologies span both the Integration and Data stores layers.

Microsoft implements Hadoop-based big data solutions using the Hortonworks Data Platform (HDP),
which is built on open source components in conjunction with Hortonworks. The HDP is 100%
compatible with Apache Hadoop, and is compatible with open source community distributions. All of
the components are tested in typical scenarios to ensure that they work together correctly, and that
there are no versioning or compatibility issues. Developments are fed back into the community through
Hortonworks to maintain compatibility and to support the open source effort.
Microsoft and Hortonworks offer three distinct solutions based on HDP:

HDInsight. This is a cloud-hosted service available to Azure subscribers that uses Azure clusters
to run HDP, and integrates with Azure storage. For more information about HDInsight see What
is Microsoft HDInsight? and the HDInsight page on the Azure website.

Hortonworks Data Platform (HDP) for Windows. This is a complete package that you can install
on Windows Server to build your own fully-configurable big data clusters based on Hadoop. It
can be installed on physical on-premises hardware, or in virtual machines in the cloud. For more
information see Microsoft Server and Cloud Platform on the Microsoft website and Hortonworks
Data Platform.

Microsoft Analytics Platform System. This is a combination of the massively parallel processing
(MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data
technologies. It uses the HDP to provide an on-premises solution that contains a region for
Hadoop-based processing, together with PolyBase, a connectivity mechanism that integrates
the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It
allows data in Hadoop to be queried and combined with on-premises relational data, and data
to be moved into and out of Hadoop. For more information see Microsoft Analytics Platform
System.

A single-node local development environment for Hadoop-based solutions is available from
Hortonworks. This is useful for initial development, proof of concept, and testing. For more details, see
Hortonworks Sandbox.

Examples of big data solutions


In Figure 1, data typically flows upward from data sources, through data stores such as SQL Server and
HDInsight, to reporting and analysis tools such as Excel, Office 365, and SQL Server Reporting Services
(SSRS). Note that the data does not necessarily need to flow through every layer shown in Figure 1. In
some scenarios, operations such as extract-transform-load (ETL) data integration and data validation
may be carried out within HDInsight so that use of a separate ETL service such as Data Quality Services is
not required. In addition, if the data is not being incorporated into a BI system but just passed directly to
reporting and analysis tools, it will not be exposed through a corporate data model.
As an example of how Microsoft big data tools, and specifically HDInsight, integrate with other tools and
frameworks, consider the following typical use cases:

Simple iterative querying and visualization. You may simply want to load some unstructured
data into HDInsight, combine it with data from external sources such as Azure Marketplace, and
then analyze and visualize the results using Microsoft Excel and Power View. In this case, data
from the data source will flow into HDInsight where queries and transformations generate the
required result. This result flows through an ODBC connector or directly from Azure blob
storage into a visualization tool such as Excel, where it is combined with data loaded directly by
Excel from Azure Marketplace.

Handling streaming data and exposing it through SharePoint. In this case streaming data
collected from device sensors is fed through Microsoft StreamInsight or Azure Intelligent
Systems Service for categorization and filtering, and can be used to display real-time values on a
dashboard or to trigger changes in a process. The data is then transferred into an Azure
HDInsight cluster for use in historical analysis. The output from queries that are run as periodic
batch jobs in HDInsight is integrated at the corporate data model level with a data warehouse,
and ultimately delivered to users through SharePoint libraries and web parts, making it
available for use in reports, and in data analysis and visualization tools such as Excel.

Exposing data as a business data source for an existing data warehouse system. This might be
to produce a specific set of management reports on a regular basis. Semi-structured or
unstructured data is loaded into HDInsight, queried and transformed within HDInsight,
validated and cleansed using Data Quality Services, and stored in your data warehouse tables
ready for use in reports. You may also use Master Data Services to ensure consistency between
data representations of business elements across your organization.

These are just three examples of the countless permutations and capabilities of the Microsoft data
platform and HDInsight. Your own requirements will differ, but the combination of services and tools
makes it possible to implement almost any kind of big data solution using the elements of the platform.
You will see many examples of the way that these applications, tools, and services work together in this
guide.

The background to big data


Hadoop-based big data solutions provide a mechanism for storing vast quantities of structured,
semi-structured, and unstructured data. They also deal with the issue of variable data formats by allowing you
to store the data in its native form, and then apply a schema to it later when you need to query it. This
means that you don't inadvertently lose any information by forcing the data into a format that may later
prove to be too restrictive. The topics What is big data? and Why should I care about big data? provide
more details.
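The idea of storing data in its native form and applying a schema only when you query it (often called schema-on-read) can be sketched in a few lines. This is a minimal illustration in plain Python rather than HDInsight code; the sample log lines, the field names, and the helper functions are all hypothetical.

```python
from collections import Counter

# Raw records are kept exactly as they arrived; nothing is discarded at load time.
# These sample lines are invented for illustration.
RAW_LINES = [
    "2014-03-01 10:15:02 GET /products/42 200",
    "2014-03-01 10:15:07 GET /cart 500",
    "2014-03-01 10:16:11 POST /checkout 200",
]

# A schema is a list of (field name, converter) pairs chosen at query time.
# Two different (hypothetical) schemas are applied to the same stored lines.
STATUS_SCHEMA = [("date", str), ("time", str), ("method", str),
                 ("path", str), ("status", int)]
METHOD_SCHEMA = [("date", str), ("time", str), ("method", str)]


def apply_schema(line, schema):
    """Split one raw line on whitespace and convert the leading fields
    named by the schema; any trailing fields are simply ignored."""
    values = line.split()
    return {name: convert(value)
            for (name, convert), value in zip(schema, values)}


def failed_requests(lines):
    """Query 1: paths of requests whose status code signals a server error."""
    rows = (apply_schema(line, STATUS_SCHEMA) for line in lines)
    return [row["path"] for row in rows if row["status"] >= 500]


def requests_per_method(lines):
    """Query 2: count requests per HTTP method using a narrower schema."""
    rows = (apply_schema(line, METHOD_SCHEMA) for line in lines)
    return dict(Counter(row["method"] for row in rows))


if __name__ == "__main__":
    print(failed_requests(RAW_LINES))
    print(requests_per_method(RAW_LINES))
```

Because the raw lines are never reshaped on the way in, a third query with a different schema could be added later without reloading or converting the stored data.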
Big data solutions also provide a framework for efficiently executing distributed queries across these
huge volumes of data, often multiple terabytes or petabytes in size. It also means that you can simply
store the data now, even if you don't know how, when, or even whether it will be useful, safe in the
knowledge that, should the need arise in the future, you can extract any useful information it contains.

The topic How do big data solutions work? explores the mechanisms that Hadoop-based solutions can
use to analyze data.
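As a rough illustration of the map and reduce style of processing that Hadoop is built around, the sketch below mimics the map, shuffle, and reduce phases on a single machine in plain Python. It is a teaching aid under stated assumptions, not Hadoop or HDInsight API code: the function names and the sample records are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would
    do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in sequence and return {word: count}."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Invented sample feedback comments.
    comments = ["great product", "great service", "slow delivery"]
    print(run_job(comments))
```

The map and reduce functions never share state, which is what allows a framework to distribute them across machines; only the shuffle step needs to bring related values together.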
Big data solutions can help you to discover information that you didn't know existed, complement your
existing knowledge about your business and your customers, and boost competitiveness. By using the
cloud as the data store and HDInsight as the query mechanism you benefit from very affordable storage
costs (at the time of writing, 1 TB of Azure storage costs less than $40 per month), and the flexibility and
elasticity of the pay-as-you-go model where you only pay for the resources you use.

Planning a big data solution


You may choose to use a big data solution simply as an experimental platform for investigating data, or
you may want to build a more comprehensive solution that integrates with your existing data
management and BI systems. While there is no formal set of steps for designing and implementing big
data solutions, there are several points that you should consider before you start. Ensuring that you
think about these will help you to more quickly achieve the results you require, and can save
considerable waste of time and effort. For details of the typical planning considerations for big data
solutions, see Planning a big data solution.

More information
For an overview and description of Microsoft big data see Microsoft Server and Cloud Platform.
For more information about HDInsight see the HDInsight page on the Azure website.
Documentation for HDInsight is available on the Tutorials and Guides page.
To sign up for Azure services go to the HDInsight Service page.
The page Get started using Azure HDInsight will help you begin working with HDInsight.
The official site for the Apache Hadoop framework and tools is the Apache Hadoop website.
You can download the free eBook Introducing Microsoft Azure HDInsight from the Microsoft Press
Blog.
There are also many popular blogs that cover big data and HDInsight topics:

Alexei Khalyako: http://alexeikh.wordpress.com/category/bigdata/

Benjamin Guinebertière: http://blogs.msdn.com/benjguin

Brian Mitchell: http://brianwmitchell.com/

Brian Swan: http://blogs.msdn.com/brian_swan

Carl Nolan: http://blogs.msdn.com/b/carlnol/

Cindy Gross: http://blogs.msdn.com/b/cindygross/archive/tags/hadoop/

Denny Lee: http://dennyglee.com/

Lara Rubbelke: http://sqlblog.com/blogs/lara_rubbelke/default.aspx

Matt Winkler: http://blogs.msdn.com/b/mwinkle/

Murshed Zaman: http://murshedsqlcat.wordpress.com

Teo Lachev: http://prologika.com/CS/blogs/blog/archive/tags/Hadoop/default.aspx

Microsoft Support for HDInsight: http://blogs.msdn.com/b/bigdatasupport/

Hortonworks: http://hortonworks.com/blog/

What is big data?


Do you know what visitors to your website really think about your carefully crafted content? Or, if you
run a business, can you tell what your customers actually think about your products or services? Did you
realize that your latest promotional campaign had the biggest effect on people aged between 40 and 50
living in Wisconsin, USA (and, more importantly, why)?
Being able to get answers to these kinds of questions is increasingly vital in today's competitive
environment, but the source data that can provide these answers is often hidden away; and when you
can find it, it's often very difficult to analyze. It might be distributed across many different databases or
files, be in a format that is hard to process, or may even have been discarded because it didn't seem
useful at the time.
To resolve these issues, data analysts and business managers are fast adopting techniques that were
commonly at the core of data processing in the past, but have been sidelined in the rush to modern
relational database systems and structured data storage. The new buzzword is "big data" and the
associated solutions encompass a range of technologies and techniques that allow you to extract real,
useful, and previously hidden information from the often very large quantities of data that previously
may have been left dormant and, ultimately, thrown away because storage was too costly.
The term "big data" is being used to describe an increasing range of technologies and techniques. In
essence, big data is data that is valuable but, traditionally, it was not practical to store or analyze it due
to limitations of cost or the absence of suitable mechanisms. Big data typically refers to collections of
datasets that, due to size and complexity, are difficult to store, query, and manage using existing data
management tools or data processing applications.
You can also think of big data as data, often produced at "fire hose" rate, that you don't know how to
analyze at the moment, but which may provide valuable information in the future. Big data solutions
aim to provide data storage and querying functionality for situations such as this. They offer a
mechanism for organizations to extract meaningful, useful, and often vital information from the vast
stores of data they are collecting.

Big data is often described as a solution to the three V's problem:

Volume: Big data solutions typically store and query hundreds of terabytes of data, and the
total volume is probably growing by ten times every five years. Storage must be able to manage
this volume, be easily expandable, and work efficiently across distributed systems. Processing
systems must be scalable to handle increasing volumes of data, typically by scaling out across
multiple machines.

Variety: It's not uncommon for new data to not match any existing data schema. It may also be
semi-structured or unstructured data. This means that applying schemas to the data before or
during storage is no longer a practical proposition.

Velocity: Data is being collected at an increasing rate from many new types of devices, from a
fast-growing number of users, and from an increasing number of devices and applications per
user. The design and implementation of storage must be able to manage this efficiently, and
processing systems must be able to return results within an acceptable timeframe.
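To put the Volume figure in perspective, tenfold growth over five years corresponds to a steady annual growth factor of 10^(1/5), or roughly 58 percent per year. The short Python sketch below works through that arithmetic; the function names are illustrative, and the 100 TB starting size is a hypothetical value chosen only for the example.

```python
def annual_growth_factor(total_factor, years):
    """Return the constant yearly multiplier that compounds to
    total_factor over the given number of years."""
    return total_factor ** (1.0 / years)


def projected_size(start_tb, total_factor, years, elapsed_years):
    """Project a data volume forward, assuming steady compound growth."""
    yearly = annual_growth_factor(total_factor, years)
    return start_tb * yearly ** elapsed_years


if __name__ == "__main__":
    yearly = annual_growth_factor(10, 5)
    print(f"Annual growth factor: {yearly:.3f}")
    # Hypothetical starting volume of 100 TB, projected year by year.
    for year in range(6):
        print(f"Year {year}: {projected_size(100, 10, 5, year):.0f} TB")
```

Planning storage and processing capacity around the compound yearly rate, rather than the five-year headline figure, makes it easier to see how quickly a scale-out design needs to add nodes.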

The quintessential aspect of big data is not the data itself; it's the ability to discover useful information
hidden in the data. Big data is not just Hadoop; solutions may use traditional data management
systems such as relational databases and other types of data store. It's really all about the analytics
that a big data solution can empower.
This section of the guide explores some of the basic features of big data solutions. If you are not familiar
with the concepts of big data, when it is useful, and how it works, you will find the following topics
helpful:

Why should I care about big data?

How do big data solutions work?

What is Microsoft HDInsight?

Why should I care about big data?


Many people consider big data solutions to be a new way to do data warehousing when the volume of
data exceeds the capacity or cost limitations for relational database systems. However, it can be difficult
to fully grasp what big data solutions really involve, what hardware and software they use, and how and
when they are useful. There are some basic questions, the answers to which will help you understand
where big data solutions are useful, and how you might approach the topic in terms of implementing
your own solutions:

Why do I need a big data solution?

What problems do big data solutions solve?

How is a big data solution different from traditional database systems?

Will a big data solution replace my relational databases?

Why do I need a big data solution?


In the most simplistic terms, organizations need a big data solution to enable them to survive in a rapidly
expanding and increasingly competitive market where the sources and the requirements to store data
are growing at an exponential rate. Big data solutions are typically used for:

Storing huge volumes of unstructured or semi-structured data. Many organizations need to
handle vast quantities of data as part of their daily operations. Examples include financial and
auditing data, and medical data such as patients' notes. Processing, backing up, and querying all
of this data becomes more complex and time consuming as the volume increases. Big data
solutions are designed to store vast quantities of data (typically on distributed servers with
automatic generation of replicas to guard against data loss), together with mechanisms for
performing queries on the data to extract the information the organization requires.

Finding hidden insights in large stores of data. For example, organizations want to know how
their products and services are perceived in the market, what customers think of the
organization, whether advertising campaigns are working, and which facets of the organization
are (or are not) achieving their aims. Organizations typically collect data that is useful for
generating business intelligence (BI) reports, and to provide input for management decisions.
However, they are increasingly implementing mechanisms that collect other types of data such
as sentiment data (emails, comments from web site feedback mechanisms, and tweets that
are related to the organization's products and services), click-through data, information from
sensors in users' devices (such as location data), and website log files.

Extracting vital management information. The vast repositories of data often contain useful,
and even vital information that can be used for product and service planning, coordinating
advertising campaigns, improving customer service, or as an input to reporting systems. This
information is also very useful for predictive analysis such as estimating future profitability in a
financial scenario, or for an insurance company to predict the possibility of accidents and
claims. Big data solutions allow you to store and extract all this information, even if you don't
know when or how you will use the data at the time you are collecting it.

Successful organizations typically measure performance by discovering the customer value that each
part of their operation generates. Big data solutions provide a way to help you discover value, which
often cannot be measured just through traditional business methods such as cost and revenue
analysis.

What problems do big data solutions solve?


Big data solutions were initially seen to be primarily a way to resolve the limitations of traditional
database systems due to:

Volume: Big data solutions are designed and built to store and process hundreds of terabytes,
or even petabytes of data in a way that can dramatically reduce storage cost, while still being
able to generate BI and comprehensive reports.

Variety: Organizations often collect unstructured data, which is not in a format that suits
relational database systems. Some data, such as web server logs and responses to
questionnaires may be preprocessed into the traditional row and column format. However,
data such as emails, tweets, and web site comments or feedback, are semi-structured or even
unstructured data. Deciding how to store this data using traditional database systems is
problematic, and may result in loss of useful information if the data must be constricted to a
specific schema when it is stored.
Big data solutions typically target scenarios where there is a huge volume of unstructured or
semi-structured data that must be stored and queried to extract business intelligence.
Typically, the majority of data currently stored in big data solutions is unstructured or semi-structured.

Velocity: The rate at which data arrives may make storage in an enterprise data warehouse
problematic, especially where formal preparation processes such as examining, conforming,
cleansing, and transforming the data must be accomplished before it is loaded into the data
warehouse tables.

The combination of all these factors means that, in some circumstances, a big data batch processing
solution may be a more practical proposition than a traditional relational database system. However, as
big data solutions have continued to evolve it has become clear that they can also be used in a
fundamentally different context: to quickly get insights into data, and to provide a platform for further
investigation in a way that just isn't possible with traditional data storage, management, and querying
tools.
Figure 1 demonstrates how you might go from a semi-intuitive guess at the kind of information that
might be hidden in the data, to a process that incorporates that information into your business domain.

Figure 1 - Big data solutions as an experimental data investigation platform

As an example, you may want to explore the postings by users of a social website to discover what they
are saying about your company and its products or services. Using a traditional BI system would mean
waiting for the database architect and administrator to update the schemas and models, cleanse and
import the data, and design suitable reports. But it's unlikely that you'll know beforehand if the data is
actually capable of providing any useful information, or how you might go about discovering it. By using
a big data solution you can investigate the data by asking any questions that may seem relevant. If you
find one or more that provide the information you need you can refine the queries, automate the
process, and incorporate it into your existing BI systems.
Big data solutions aren't all about business topics such as customer sentiment or web server log file
analysis. They have many diverse uses around the world and across all types of applications. Police
forces are using big data techniques to predict crime patterns, researchers are using them to explore
the human genome, particle physicists are using them to search for information about the structure of
matter, and astronomers are using them to plot the entire universe. Perhaps the last of these really is
a "big" big data solution!

How is a big data solution different from traditional database systems?


Traditional database systems typically use a relational model where all the data is stored using
predetermined schemas, and linked using the values in specific columns of each table. Requiring a
schema to be applied when data is written may mean that some information hidden in the data is lost.
There are some more flexible mechanisms, such as the ability to store XML documents and binary data,
but the capabilities for handling these types of data are usually quite limited.
Big data solutions do not force a schema onto the stored data. Instead, you can store almost any type of
structured, semi-structured, or unstructured data and then apply a suitable schema when you query this
data. Big data solutions store the data in its raw format and apply a schema only when the data is read,
which preserves all of the information within the data.
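The idea of applying a schema only when the data is read can be illustrated with a minimal Python sketch. The raw lines, the field names, and the `read_with_schema` helper are all hypothetical and are not part of any Hadoop or HDInsight API; the point is only that the stored text never changes while different schemas are projected over it.

```python
import csv
import io

# Hypothetical raw data, stored exactly as it arrived (no schema enforced on write).
# Note that the lines do not all have the same number of fields.
RAW_DATA = """2014-06-01,widget,3,web
2014-06-01,gadget,1
2014-06-02,widget,2,store,promo-code-17
"""

def read_with_schema(raw_text, schema):
    """Apply a schema, given as (name, converter) pairs, to raw text at read time."""
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        # zip stops at the shorter sequence, so missing or extra fields are tolerated.
        for (name, convert), value in zip(schema, fields):
            row[name] = convert(value)
        yield row

# Two different schemas projected over the same unchanged raw data.
sales_schema = [("date", str), ("product", str), ("quantity", int)]
channel_schema = sales_schema + [("channel", str)]

sales_rows = list(read_with_schema(RAW_DATA, sales_schema))
channel_rows = list(read_with_schema(RAW_DATA, channel_schema))
```

Because the raw text is untouched, a later query can still reach the fifth field on the last line even though neither schema here uses it.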
Traditional database systems typically consist of a central node where all processing takes place, which
means that all the data from storage must be moved to the central location for processing. The capacity
of this central node can be increased only by scaling up, and there is a physical limit on the number
of CPUs and the amount of memory, depending on the chosen hardware platform. The consequence is
limited processing capacity, as well as network latency when the data is moved to the central node.
In contrast, big data solutions are optimized for storing vast quantities of data using simple file formats
and highly distributed storage mechanisms, and the initial processing of the data occurs at each storage
node. This means that, assuming you have already loaded the data into the cluster storage, the bulk of
the data does not need to be moved over the network for processing.
Figure 2 shows some of the basic differences between a relational database system and a big data
solution in terms of storing and querying data. Notice how both relational databases and big data
solutions use a cluster of servers; the main difference is where query processing takes place and how
the data is moved across the network.

Figure 2 - Some differences between relational databases and big data batch processing solutions
Modern data warehouse systems typically use high speed fiber networks, in-memory caching, and
indexes to minimize data transfer delays. However, in a big data solution only the results of the
distributed query processing are passed across the cluster network to the node that will assemble them
into a final results set. Under ideal conditions, performance during the initial stages of the query is
limited only by the speed and capacity of connectivity to the co-located disk subsystem, and this initial
processing occurs in parallel across all of the cluster nodes.
The servers in a cluster are typically co-located in the same datacenter and connected over a low-latency,
high-bandwidth network. However, big data solutions can work well even without a high
capacity network, and the servers can be more widely distributed, because the volume of data moved
over the network is much less than in a traditional relational database cluster.
The ability to work with highly distributed data and simple file formats also opens up opportunities for
more efficient and more comprehensive data collection. For example, services and applications can
store data in any of the predefined distributed locations without needing to preprocess it or execute
queries that can absorb processing capacity. Data is simply appended to the files in the data store. Any
processing required on the data is done when it is queried, without affecting the original data or
risking the loss of valuable information.
Queries to extract information in a big data solution are typically batch operations that, depending on
the data volume and query complexity, may take some time to return a final result. However, when you
consider the volumes of data that big data solutions can handle, the fact that queries run as multiple
tasks on distributed servers does offer a level of performance that may not be achievable by other
methods. While it is possible to perform real-time queries, typically you will run the query and store the
results for use within your existing BI tools and analytics systems. This means that, unlike most SQL
queries used with relational databases, big data queries are typically not executed repeatedly as part of
an application's execution, and so batch operation is not a major disadvantage.
Big data systems are also designed to be highly resilient against failure of storage, networks, and
processing. The distributed processing and replicated storage model is fault-tolerant, and allows easy
re-execution of individual stages of the process. The capability for easy scaling of resources also helps to
resolve operational and performance issues.
The following table summarizes the major differences between a big data solution and existing
relational database systems.
Feature | Relational database systems | Big data solutions
Data types and formats | Structured | Semi-structured and unstructured
Data integrity | High; transactional updates | Depends on the technology used; often follows an eventually consistent model
Schema | Static; required on write | Dynamic; optional on read and write
Read and write pattern | Fully repeatable read/write | Write once, repeatable read
Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond
Scalability | Scale up with more powerful hardware | Scale out with additional servers
Data processing distribution | Limited or none | Distributed across the cluster
Economics | Expensive hardware and software | Commodity hardware and open source software

Will a big data solution replace my relational databases?


Big data batch processing solutions offer a way to avoid storage limitations, or to reduce the cost of
storage and processing, for huge and growing volumes of data; especially where this data might not be
part of a vital business function. But this isn't to say that relational databases have had their day.
Continual development of the hardware and software for this core business function provides
capabilities for storing very large amounts of data. For example, Microsoft Analytics Platform System
(APS) can store hundreds of terabytes of data.
In fact, the relational database systems we use today and the more recent big data batch processing
solutions are complementary mechanisms. Big data batch processing solutions are extremely unlikely
ever to replace the existing relational database; in the majority of cases they complement and augment
the capabilities for managing data and generating BI. For example, it's common to use a big data query
to create a result set that is then stored in a relational database for use in the generation of BI, or as
input to another process.
Big data is also a valuable tool when you need to handle data that is arriving very quickly, and which you
can process later. You can dump the data into the storage cluster in its original format, and then process
it when required using a query that extracts the required result set and stores it in a relational database,
or makes it available for reporting. Figure 3 shows this approach in schematic form.

Figure 3 - Combining big data batch processing with a relational database
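The pattern of dumping raw data first and later storing only a query's result set in a relational database can be sketched as follows. This is an illustration only: the event lines, the `PageHits` table, and both helper functions are hypothetical, an in-memory SQLite database stands in for the relational database, and a simple Python function stands in for the big data batch query.

```python
import sqlite3
from collections import Counter

# Hypothetical raw event lines, kept in storage in their original format.
RAW_EVENTS = [
    "2014-06-01T10:00:00 page=/home user=alice",
    "2014-06-01T10:00:05 page=/products user=bob",
    "2014-06-01T10:00:09 page=/home user=carol",
]

def summarize_page_hits(lines):
    """Stand-in for the batch query: count hits per page from the raw lines."""
    hits = Counter()
    for line in lines:
        # Skip the timestamp, then parse the remaining name=value tokens.
        fields = dict(part.split("=", 1) for part in line.split()[1:])
        hits[fields["page"]] += 1
    return hits

def store_result_set(connection, hits):
    """Store the (much smaller) result set in a relational table for BI tools."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS PageHits (Page TEXT PRIMARY KEY, Hits INTEGER)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO PageHits (Page, Hits) VALUES (?, ?)", hits.items()
    )
    connection.commit()

connection = sqlite3.connect(":memory:")
store_result_set(connection, summarize_page_hits(RAW_EVENTS))
report = connection.execute("SELECT Page, Hits FROM PageHits ORDER BY Page").fetchall()
```

Only the aggregated rows reach the relational table; the raw events remain available in the storage cluster for any later query.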


In this kind of environment, additional capabilities are enabled. Big data batch processing systems work
with almost any type of data. It is quite feasible to implement a bidirectional data management solution
where data held in a relational database or BI system can be processed by the big data batch processing
mechanism, and fed back into the relational database or used for analysis and reporting. This is exactly
the type of environment that Microsoft Analytics Platform System (APS) provides.
The topic Use case 4: BI integration in this guide explores and demonstrates the integration of a big
data solution with existing data analysis tools and enterprise BI systems. It describes the three main
stages of integration, and explains the benefits of each approach in terms of maximizing the usefulness
of your big data implementations.

How do big data solutions work?


In the days before Structured Query Language (SQL) and relational databases, data was typically stored
in flat files, often in a simple text format, with fixed width columns. Application code would open and read
one or more files sequentially, or jump to specific locations based on the known line width of a row, to
read the text and parse it into columns to extract the values. Results would be written back by creating a
new copy of the file or a separate results file.
Modern relational databases put an end to all this, giving us greater power and additional capabilities
for extracting information simply by writing queries in a standard format such as SQL. The database
system hides all the complexity of the underlying storage mechanism and the logic for assembling the
query that extracts and formats the information. However, as the volume of data that we collect
continues to increase, and the native structure of this information is less clearly defined, we are moving
beyond the capabilities of even enterprise-level relational database systems.
Big data batch processing solutions are essentially a simple process that breaks up the source files into
multiple blocks and replicates the blocks on a distributed cluster of commodity nodes. Data processing
runs in parallel on each node, and the outputs of the parallel processes are then combined into an
aggregated result set.
At the core of many big data implementations is an open source technology named Apache Hadoop.
Hadoop grew out of the open source Apache Nutch project, and much of its early development was carried
out at Yahoo; it is now maintained by the Apache Software Foundation. The most recent versions of Hadoop
are commonly understood to contain the following main assets:

The Hadoop kernel, or core package, containing the Hadoop distributed file system (HDFS), the
map/reduce framework, and common routines and utilities.

A runtime resource manager that allocates tasks, and executes queries (such as map/reduce
jobs) and other applications. This is usually implemented through the YARN framework,
although other resource managers such as Mesos are available.

Other resources, tools, and utilities that run under the control of the resource manager to
support tasks such as managing data and running queries or other jobs on the data.

Notice that map/reduce is just one application that you can run on a Hadoop cluster. Several query,
management, and other types of applications are available or under development. Examples are:

Accumulo: A key/value NoSQL database that runs on HDFS.

Giraph: An iterative graph processing system designed to offer high scalability.

Lasr: An in-memory analytics processor for tasks that are not well suited to map/reduce
processing.

Reef: A query mechanism designed to implement iterative algorithms for graph analytics and
machine learning.

Storm: A distributed real-time computation system for processing fast, large streams of data.

In addition there are many other open source components and tools that can be used with Hadoop. The
Apache Hadoop website lists the following:

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.

Avro: A data serialization system.

Cassandra: A scalable multi-master database with no single points of failure.

Chukwa: A data collection system for managing large distributed systems.

HBase: A scalable, distributed database that supports structured data storage for large tables.

Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: A scalable machine learning and data mining library.

Pig: A high-level data-flow language and execution framework for parallel computation.

Spark: A fast, general-use compute engine with a simple and expressive programming model.

Tez: A generalized data-flow programming framework for executing both batch and interactive
tasks.

ZooKeeper: A high-performance coordination service for distributed applications.

A list of commonly used tools and frameworks for big data projects based on Hadoop can be found in
Appendix A - Tools and technologies reference. Some of these tools are not supported on HDInsight;
for more details see What is Microsoft HDInsight?
Figure 1 shows an overview of a typical Hadoop-based big data mechanism.

Figure 1 - The main assets of Apache Hadoop (version 2 onwards)


We'll be exploring and using some of these components in the scenarios described in subsequent
sections of this guide. For more information about all of them, see the Apache Hadoop website.

The cluster
In Hadoop, a cluster of servers stores the data using HDFS, and processes it. Each member server in the
cluster is called a data node, and contains an HDFS data store and a query execution engine. The cluster
is managed by a server called the name node that has knowledge of all the cluster servers and the parts
of the data files stored on each one. The name node server does not store any of the data to be
processed, but is responsible for storing vital metadata about the cluster and the location of each block
of the source data, directing clients to the other cluster members, and keeping track of the state of each
one by communicating with a software agent running on each server.
To store incoming data, the name node server directs the client to the appropriate data node server.
The name node also manages replication of data files across all the other cluster members that
communicate with each other to replicate the data. The data is divided into blocks and, by default, three
copies of each block are stored across the cluster servers in order to provide resilience against failure and data
loss (the block size and the number of replicated copies are configurable for the cluster).
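The block and replica bookkeeping described above can be sketched in a few lines of Python. This is a deliberately simplified model, not the HDFS placement algorithm: the node names are hypothetical, the block size is tiny for illustration, and replicas are assigned round-robin, whereas a real name node also takes factors such as rack topology and available space into account.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Divide the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_count: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    placement = {}
    for block_id in range(block_count):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement

def surviving_blocks(placement, failed_node):
    """Return the blocks that still have at least one copy after a node fails."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }

data_nodes = ["node-a", "node-b", "node-c", "node-d"]    # hypothetical names
blocks = split_into_blocks(b"x" * 1000, block_size=256)  # tiny blocks for illustration
block_map = place_replicas(len(blocks), data_nodes)       # the name node's metadata
```

With three replicas on distinct nodes, `surviving_blocks(block_map, "node-a")` still contains every block, which is the resilience property the text describes.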

The data store


The data store running on each server in a cluster is a suitable distributed storage service such as HDFS
or a compatible equivalent. Hadoop implementations may use a cluster where all the servers are
co-located in the same datacenter in order to minimize bandwidth use and maximize query performance.
HDInsight does not completely follow this principle. Instead it provides an HDFS-compatible service
over blob storage, which means that the data is stored in the datacenter storage cluster that is
co-located with the virtualized servers that run the Hadoop framework. For more details, see What is
Microsoft HDInsight?
The data store in a Hadoop implementation is usually referred to as a NoSQL store, although this is not
technically accurate because some implementations do support a structured query language (such as
SQL). In fact, some people prefer to use the term Not Only SQL for just this reason. There are more
than 120 NoSQL data store implementations available at the time of writing, but they can be divided
into the following basic categories:

Key/value stores. These are data stores that hold data as a series of key/value pairs. The value
may be a single data item or a complex data structure. There is no fixed schema for the data,
and so these types of data store are ideal for unstructured data. An example of a key/value
store is Azure table storage, where each row has a key and a property bag containing one or
more values. Key/value stores can be persistent or volatile.

Document stores. These are data stores optimized to hold structured, semi-structured, and
unstructured data items such as JSON objects, XML documents, and binary data. They are
usually indexed stores.

Block stores. These are typically non-indexed stores that hold binary data, which can be a
representation of data in any format. For example, the data could represent JSON objects or it
could just be a binary data stream. An example of a block store is Azure blob storage, where
each item is identified by a blob name within a virtual container structure.

Wide column or column family data stores. These are data stores that do use a schema, but the
schema can contain families of columns rather than just single columns. They are ideally suited
to storing semi-structured data, where some columns can be predefined but others are capable
of storing differing elements of unstructured data. HBase running on HDFS is an example. HBase
is discussed in more detail in the topic Specifying the infrastructure in this guide.

Graph data stores. These are data stores that hold the relationships between objects. They are
less common than the other types of data store, many still being experimental, and they tend to
have specialist uses.

NoSQL storage is typically much cheaper than relational storage, and usually supports a write once
capability that allows only for data to be appended. To update data in these stores you must drop and
recreate the relevant file, or maintain delta files and implement mechanisms to conflate the data. This
limitation maximizes throughput; storage implementations are usually measured by throughput rather
than capacity because this is usually the most significant factor for both storage and query efficiency.
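One way to picture the delta file approach is the following Python sketch, in which a base data set is never modified and later changes are appended as separate batches that are merged when the data is read. The record layout, the `conflate` function, and the convention that a value of `None` marks a deletion are assumptions made for this illustration only.

```python
def conflate(base_records, delta_batches):
    """Merge append-only delta batches over a base data set at read time.

    Each record is a (key, value) pair. Later batches take precedence over
    earlier ones, and a value of None marks a deletion (an assumed convention).
    """
    current = dict(base_records)
    for batch in delta_batches:
        for key, value in batch:
            if value is None:
                current.pop(key, None)
            else:
                current[key] = value
    return current

# The base file is written once; changes are only ever appended as new delta files.
base_file = [("cust-1", "Seattle"), ("cust-2", "London")]
delta_files = [
    [("cust-2", "Paris")],                    # first delta: an update
    [("cust-3", "Tokyo"), ("cust-1", None)],  # second delta: an insert and a deletion
]

current_view = conflate(base_file, delta_files)
```

The stored files keep the full history of changes, while `current_view` reflects only the latest state.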
Modern data management patterns such as Event Sourcing and Command Query Responsibility
Segregation (CQRS) do not encourage updates to data. Instead, new data is added
and milestone records are used to fix the current state of the data at intervals. This approach provides
better performance and maintains the history of changes to the data. For more information about
CQRS and Event Sourcing see the patterns & practices guide CQRS Journey.

The query mechanism


Big data batch processing queries are commonly based on a distributed processing mechanism called
map/reduce (often written as MapReduce) that provides optimum performance across the servers in a
cluster. Map and reduce are mathematical operations. A map operation applies a function to a list of
data items, returning a list of transformed items. A reduce operation applies a combining function that
takes multiple lists and recursively generates an output.
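As a small illustration of the two operations in isolation, the following Python sketch uses the built-in `map` function and `functools.reduce`. The sample values are arbitrary, and this shows only the general concept rather than how Hadoop implements it.

```python
from functools import reduce

quantities = [3, 1, 2]

# Map: apply a function to every item, producing a list of transformed items.
doubled = list(map(lambda quantity: quantity * 2, quantities))

# Reduce: repeatedly combine items with a function to produce a single output.
total = reduce(lambda running_total, quantity: running_total + quantity, quantities, 0)
```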
In a big data framework such as Hadoop, a map/reduce query uses two components, usually written in
Java, which implement the algorithms that perform a two-stage data extraction and rollup process. The
Map component runs on each data node server in the cluster extracting data that matches the query,
and optionally applying some processing or transformation to the data to acquire the required result set
from the files on that server. The Reduce component runs on one or more of the data node servers, and
combines the results from all of the Map components into the final results set.
For a detailed description of the MapReduce framework and programming model, see
MapReduce.org.
As a simplified example of a map/reduce query, assume that the input data contains a list of the detail
lines from customer orders. Each detail line contains a reference (foreign key) that links it to the main
order record, the name of the item ordered, and the quantity ordered. If this data was stored in a
relational database, a query of the following form could be used to generate a summary of the total
number of each item sold:
SQL
SELECT ProductName, SUM(Quantity) FROM OrderDetails GROUP BY ProductName

The equivalent using a big data solution requires a Map and a Reduce component. The Map component
running on each node operates on a subset, or chunk, of the data. It transforms each order line into a
name/value pair where the name is the product name, and the value is the quantity from that order
line. Note that in this example the Map component does not sum the quantity for each product; it
simply transforms the data into a list.
Next, the framework shuffles and sorts all of the lists generated by the Map component instances into a
single list, and executes the Reduce component with this list as the input. The Reduce component sums
the totals for each product, and outputs the results. Figure 2 shows a schematic overview of the process.

Figure 2 - A high level view of the map/reduce process for storing data and extracting information
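The order summary example can be simulated in a single Python process to make the three stages concrete. This is a sketch of the logic only, not Hadoop code: the order data is hypothetical, the two chunks stand in for data held on two nodes, and the function names are invented for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical order detail lines: (order id, product name, quantity).
ORDER_LINES = [
    (1, "Widget", 2), (1, "Gadget", 1), (2, "Widget", 5),
    (3, "Gizmo", 4), (3, "Gadget", 3),
]

def map_component(chunk):
    """Transform each order line into a (product name, quantity) pair; no summing."""
    return [(name, quantity) for _order_id, name, quantity in chunk]

def shuffle_and_sort(mapped_lists):
    """Merge the lists from every Map instance into one list sorted by product name."""
    merged = [pair for pairs in mapped_lists for pair in pairs]
    return sorted(merged, key=itemgetter(0))

def reduce_component(sorted_pairs):
    """Sum the quantities for each product name."""
    return {
        name: sum(quantity for _name, quantity in group)
        for name, group in groupby(sorted_pairs, key=itemgetter(0))
    }

# Each chunk represents the subset of data processed by one Map instance.
chunks = [ORDER_LINES[:3], ORDER_LINES[3:]]
totals = reduce_component(shuffle_and_sort([map_component(chunk) for chunk in chunks]))
```

The result matches what the SQL query with `GROUP BY ProductName` would return for the same data.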

Depending on the configuration of the query job, there may be more than one Reduce component
instance running. The output from each Map component instance is stored in a buffer on disk, and the
component exits. The content of the buffer is then sorted, and passed to one or more Reduce
component instances. Intermediate results are stored in the buffer until the final Reduce component
instance combines them all.
In some cases the process might include an additional component called a Combiner that runs on each
data node as part of the Map process, and performs a reduce type of operation on this part of the
data each time the map process runs. It may also run as part of the reduce phase, and again when large
datasets are being merged.
In the example shown here, a Combiner could sum the values for each product so that the output is
smaller, which can reduce network load and memory requirements, with a subsequent increase in
overall query efficiency. Often, as in this example, the Combiner and the Reduce components would be
identical.
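The effect of a Combiner on the order example can be sketched as follows. The data and the `combine` function are hypothetical; the sketch shows only that summing locally shrinks the output of a node, and that the same function can then serve as the final reduce step.

```python
from collections import Counter

def combine(pairs):
    """Sum the quantities per product; usable as both Combiner and Reduce logic here."""
    totals = Counter()
    for name, quantity in pairs:
        totals[name] += quantity
    return sorted(totals.items())

# Hypothetical Map output on two nodes, as (product name, quantity) pairs.
node_1_mapped = [("Widget", 2), ("Gadget", 1), ("Widget", 5), ("Widget", 1)]
node_2_mapped = [("Gadget", 3), ("Gizmo", 4)]

# The Combiner runs on each node, so fewer pairs cross the network.
node_1_combined = combine(node_1_mapped)
node_2_combined = combine(node_2_mapped)

# The Reduce step applies the same summing logic to the combined outputs.
final_totals = combine(node_1_combined + node_2_combined)
```

Reusing the function works here because addition gives the same total however the values are grouped; an operation such as an average would need a different Combiner.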
Performing a map/reduce operation involves several stages such as partitioning the input data, reading
and writing data, and shuffling and sorting the intermediate results. Some of these operations are
quite complex. However, they are typically the same every time, irrespective of the actual data and
the query. The great thing with a map/reduce framework such as Hadoop is that you usually need to
create only the Map and Reduce components. The framework does the rest.
Although the core Hadoop engine requires the Map and Reduce components it executes to be written in
Java, you can use other techniques to create them in the background without writing Java code. For
example you can use tools named Hive and Pig that are included in most big data frameworks to write
queries in a SQL-like or a high-level language. You can also use the Hadoop streaming API to execute
components written in other languages; see Hadoop Streaming on the Apache website for more
details.
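As a rough sketch of what streaming components look like, the following Python functions implement the order summary in the line-oriented style that Hadoop streaming uses, where each component reads text lines and writes tab-separated key and value lines. The input layout (order id, product name, and quantity separated by tabs) is an assumption for this illustration. In a real job each function would sit in its own script, reading from standard input and printing to standard output; here they are driven from lists so that the logic can be run locally.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'product<TAB>quantity' for each well-formed input line."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines rather than failing the whole job
        _order_id, product, quantity = parts
        yield f"{product}\t{quantity}"

def reducer(sorted_lines):
    """Sum quantities per product; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for product, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{product}\t{sum(int(quantity) for _product, quantity in group)}"

# Local simulation: sorted() stands in for the framework's shuffle and sort stage.
input_lines = ["1\tWidget\t2\n", "1\tGadget\t1\n", "2\tWidget\t5\n", "bad line\n"]
mapped_lines = sorted(mapper(input_lines))
result_lines = list(reducer(mapped_lines))
```

The reducer relies on the framework delivering lines grouped by key, which is why the local simulation sorts the mapper output first.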

More information
The official site for Apache big data solutions and tools is the Apache Hadoop website.
For a detailed description of the MapReduce framework and programming model, see MapReduce.org.

What is Microsoft HDInsight?


Microsoft Azure HDInsight provides a pay-as-you-go solution for Hadoop-based big data batch
processing that is cost-effective because you do not need to commit to installing and configuring
on-premises infrastructure. You can instantiate and configure a Hadoop cluster in HDInsight when required,
and remove it when it is not required. HDInsight uses a cluster of Azure virtual machines running the
Hortonworks Data Platform (HDP), and it integrates with Azure blob storage.
This guide is based on the version 3.0 (March 2014) release of HDInsight on Azure, but also includes
some of the preview features that are available in later versions. Earlier and later releases of HDInsight
may differ from the version described in this guide. For more information, see What's new in the
Hadoop cluster versions provided by HDInsight? To sign up for the Azure service, go to HDInsight
service home page.

Data storage
Big data solutions typically store data as a series of files located within a folder structure on disk.
However, in HDInsight these files are stored in Azure blob storage. HDInsight supports the standard
Hadoop file system commands and processes by using a fully HDFS-compliant layer over Azure blob
storage. As far as Hadoop is concerned, storage operates in exactly the same way as when using a
physical HDFS implementation. The advantages are that you can access storage using standard Azure
blob storage techniques as well as through the HDFS layer, and the data can be persisted when the
cluster is decommissioned.
HDInsight also offers the option to create a cluster that hosts the HBase open source data management
system. HBase is a NoSQL wide-column data store implemented as a distributed system that provides data
processing and storage over multiple nodes in a Hadoop cluster. It provides a random, real-time,
read/write data store designed to host tables that can contain billions of rows and millions of columns.
For more information about how HDInsight uses blob storage, and the optional use of HBase, see
Data storage in the topic Specifying the infrastructure.

Data processing
HDInsight supports many of the Hadoop query, transformation, and analysis tools, and you can install
some additional tools and utilities on an HDInsight cluster if required. Examples of the tools and utilities
commonly used with Hadoop-based solutions such as HDInsight are:

Hive, which allows you to overlay a schema onto the data when you need to run a query, and
use a SQL-like language called HiveQL for these queries. For example, you can use the CREATE
TABLE command to build a table by splitting the text strings in the data using delimiters or at
specific character locations, and then execute SELECT statements to extract the required data.

Pig, which allows you to create schemas and execute queries by writing scripts in a high level
language called Pig Latin. Pig Latin is a procedural language that processes relations by
performing multiple interrelated data transformations that are explicitly encoded as data flow
sequences.

Map/reduce using components written in Java, and executed directly by the Hadoop
framework. As an alternative you can use the Hadoop streaming interface to execute map and
reduce components written in other languages such as C# and F#.

Mahout is a machine learning library, which allows you to perform data mining queries that
examine data files to extract specific types of information. For example, it supports
recommendation mining (finding users' preferences from their behavior), clustering (grouping
documents with similar topic content), and classification (assigning new documents to a
category based on existing categorization).

Storm is a distributed real-time computation system for processing fast, large streams of data. It
allows you to build trees and directed acyclic graphs (DAGs) that asynchronously process data
items using a user-defined number of parallel tasks. It can be used for real-time analytics, online
machine learning, continuous computation, distributed RPC, ETL, and more.

At the time of writing, Mahout and Storm were not supported on HDInsight. For more information
about the query and analysis tools in HDInsight see Processing, querying, and transforming data using
HDInsight.

Data access and workflow


The tools and utilities installed in HDInsight, and available in the HDInsight and Azure SDKs, can help you
to build a wide range of solutions. They include:

An ODBC driver that can be used to connect any ODBC-enabled consumer (such as a database,
or visualization tools such as Excel) with the data in Hive tables.

A LINQ to Hive implementation that allows LINQ queries to be executed over the data in
HDInsight.

HCatalog, which is used in conjunction with queries, such as those that use Hive and Pig, to
abstract the physical paths to storage and make it easier to manage data and queries as a
solution evolves.

Sqoop, which can be used to import and export relational data to and from HDInsight.

Oozie, which provides a mechanism for automating workflows and operations. It supports
sequential and parallel workflow processes, and is extremely flexible.

More information about these and other tools and utilities is available in subsequent sections of this
guide, and in Appendix A - Tools and technologies reference.

Administration, automation, and monitoring


HDInsight contains a dashboard that provides rudimentary monitoring for clusters, a Hive editor where
you can test your Hive queries, and some administration capabilities. It's also possible to open a remote
desktop connection to a cluster. However, the majority of administration, management, deployment,
and query execution tasks are typically carried out by using the tools and utilities installed with
HDInsight, the Azure and HDInsight PowerShell cmdlets, the classes in the HDInsight SDKs, and custom
or third-party utilities.
The tools and utilities provided with HDInsight, or available for download, allow you to carry out two distinct sets
of tasks:

Cluster management. This includes tasks such as creating and deleting clusters, and obtaining
runtime monitoring information.

Job execution. This includes uploading data and jobs, executing jobs, and downloading or
accessing the results.

Cluster management makes use of Apache ZooKeeper (which is used internally to manage some aspects
of HDInsight) and some features of the Ambari cluster monitoring framework.
The PowerShell cmdlets for Azure can be used to access blob storage to upload data to an HDInsight
cluster, as well as performing administrative tasks related to managing your subscription and services.
The PowerShell cmdlets for HDInsight allow full access to and management of almost all features of
HDInsight.
SDKs are available for use in creating applications that perform management and job submission for
HDInsight. The SDKs contain APIs that include classes for accessing storage, using HCatalog, automating
tasks with Oozie, and accessing monitoring information through Ambari. The .NET SDK also contains a
map/reduce implementation that uses the streaming interface to allow you to write queries in .NET
languages.
In addition, there is a cross-platform command-line interface available that allows you to access
HDInsight from different client platforms, and a management pack for Microsoft System Center.
For more information about administration tools and techniques for HDInsight see Building end-to-end
solutions using HDInsight and Appendix A - Tools and technologies reference.

More information
For an overview and description of HDInsight see Microsoft Big Data.
To sign up for the Azure HDInsight service, go to the Azure HDInsight Service page.
For more information about using HDInsight, a good place to start is the TechNet library. You can see a
list of articles related to HDInsight by searching the library using this URL:
http://social.technet.microsoft.com/Search/en-US?query=hadoop.

The official support forum for HDInsight is at http://social.msdn.microsoft.com/Forums/en-US/hdinsight/threads.

Planning a big data solution


Big data solutions such as Microsoft Azure HDInsight can help you discover vital information that may
otherwise have remained hidden in your data, or even been lost forever. This information can help you
to evaluate your organization's historical performance, discover new opportunities, identify operational
efficiencies, increase customer satisfaction, and even predict likely outcomes for the future. It's not
surprising that big data is generating so much interest and excitement in what some may see as the
rather boring world of data management.
This section of the guide focuses on the practicalities of planning your big data solutions. This means you
need to think about what you hope to achieve from them, even if your aim is just to explore the kinds of
information you might be able to extract from available data, and how your solution will fit into your
existing business infrastructure. It may be that you just want to use it alongside your existing business
intelligence (BI) systems, or you may want to deeply integrate it with these systems. The important
point is that, irrespective of how you choose to use it, the end result is the same: some kind of analysis
of the source data and meaningful visualization of the results.
Many organizations already use data to improve decision making through existing BI solutions that
analyze data generated by business activities and applications, and create reports based on this analysis.
Rather than seeking to replace traditional BI solutions, big data provides a way to extend the value of
your investment in BI by enabling you to incorporate a much wider variety of data sources that
complement and integrate with existing data warehouse, analytical data models, and business reporting
solutions.
The topics in this section provide an overview of the typical stages of planning, designing, implementing,
and using a Hadoop-based big data batch processing mechanism such as HDInsight. For each stage you'll
find more details of the common concerns you must address, and pointers to help you make the
appropriate choices.

An overview of the big data process


Designing and implementing a big data batch processing solution typically involves a common collection
of stages, irrespective of the type of source data and the ultimate aims for obtaining information from
that data. You may not carry out every one of these stages, or execute them in a specific order, but you
should consider all of these aspects as you design and implement your solutions.

In more detail, the stages are:

Decide if big data is the appropriate solution. There are some tasks and scenarios for which big
data batch-processing solutions based on Hadoop are ideally suited, while other scenarios may
be better accomplished using a more traditional data management mechanism such as a
relational database. For more details, see Is big data the right solution?

Determine the analytical goals and source data. Before you start any data analysis project, it is
useful to be clear about what you hope to achieve from it. You may have a specific question
that you need to answer in order to make a critical business decision; in which case you must
identify data that may help you determine the answer, where it can be obtained from, and if
there are any costs associated with procuring it. Alternatively, you may already have some data
that you want to explore to try to discern useful trends and patterns. Either way, understanding
your goals will help you design and implement a solution that best supports those goals. For
more details, see Determining analytical goals and Identifying source data.

Design the architecture. While every data analysis scenario is different, and your requirements
will vary, there are some basic use cases and models that are best suited to specific scenarios.
For example, your requirements may involve a data analysis process followed by data cleansing
and validation, perhaps as a workflow of tasks, before transferring the results to another
system. This may form the basis for a mechanism that, for example, changes the behavior of an
application based on user preferences and patterns of behavior collected as they use the
application. For more details of the core use cases and models, see Designing big data solutions
using HDInsight.

Specify the infrastructure and cluster configuration. This involves choosing the appropriate big
data software, or subscribing to an online service such as HDInsight. You will also need to
determine the appropriate cluster size and storage requirements, consider whether you will need to delete
and recreate the cluster as part of your management process, and ensure that your chosen
solution will meet SLAs and business operational requirements. For more details, see Specifying
the infrastructure.

Obtain the data and submit it to the cluster. During this stage you decide how you will collect
the data you have identified as the source, and how you will load it into your big data solution
for processing. Often you will store the data in its raw format to avoid losing any useful
contextual information it contains, though you may choose to do some pre-processing before
storing it to remove duplication or to simplify it in some other way. For more details, see
Collecting and loading data into HDInsight.

Process the data. After you have started to collect and store the data, the next stage is to
develop the processing solutions you will use to extract the information you need. You can
usually use Hive and Pig queries, or other processing tools, for even quite complex data
extraction. In a few rare circumstances you may need to create custom map/reduce
components to perform more complex queries against the data. For more details, see
Processing, querying, and transforming data using HDInsight.

Evaluate the results. Probably the most important step of all is to ensure that you are getting
the results you expected, and that these results make sense. Complex queries can be hard to
write, and difficult to get right the first time. It's easy to make assumptions or miss edge cases
that can skew the results quite considerably. Of course, it may be that you don't know what the
expected result actually is (after all, the whole point of big data is to discover hidden
information from the data) but you should make every effort to validate the results before
making business decisions based on them. In many cases, a business user who is familiar
enough with the business context can perform the role of a data steward and review the results
to verify that they are meaningful, accurate, and useful.

Tune the solution. At this stage, if the solution you have created is working correctly and the
results are valuable, you should decide whether you will repeat it in the future; perhaps with
new data you collect over time. If so, you should tune the solution by reviewing the log files it
creates, the processing techniques you use, and the implementation of the queries to ensure
that they are executing in the most efficient way. It's possible to fine-tune big data solutions to
improve performance, reduce network load, and minimize the processing time by adjusting
some parameters of the query and the execution platform, or by compressing the data that is
transferred over the network.

Visualize and analyze the results. Once you are satisfied that the solution is working correctly
and efficiently, you can plan and implement the analysis and visualization approach you require.
This may be loading the data directly into an application such as Microsoft Excel, or exporting it
into a database or enterprise BI system for further analysis, reporting, charting, and more. For
more details, see Consuming and visualizing data from HDInsight.

Automate and manage the solution. At this point it will be clear if the solution should become
part of your organization's business management infrastructure, complementing the other
sources of information that you use to plan and monitor business performance and strategy. If
this is the case, you should consider how you might automate and manage some or all of the
solution to provide predictable behavior, and perhaps so that it is executed on a schedule. For
more details, see Building end-to-end solutions using HDInsight.
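
As a small illustration of the compression point made in the tuning stage above, the following Python sketch compresses a block of repetitive log-style text and reports the size before and after. It is a generic example only: the function names and the sample text are assumptions made for illustration, and it does not show how compression is configured in HDInsight itself.

```python
import gzip


def compress_for_transfer(text: str) -> bytes:
    """Compress text with gzip before it is sent over the network."""
    return gzip.compress(text.encode("utf-8"))


def restore_after_transfer(payload: bytes) -> str:
    """Reverse compress_for_transfer on the receiving side."""
    return gzip.decompress(payload).decode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample: many near-identical log lines, as is common in source data.
    sample = "2014-06-01 10:00:00 GET /index.html 200\n" * 1000
    packed = compress_for_transfer(sample)
    print(len(sample.encode("utf-8")), "bytes before,", len(packed), "bytes after")
```

Highly repetitive text of this kind usually compresses well; the benefit for your own data depends on its content and on the processing cost of compressing and decompressing it.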

Note that, in many ways, data analysis is an iterative process; and you should take this approach when
building a big data batch processing solution. In particular, given the large volumes of data and
correspondingly long processing times typically involved in big data analysis, it can be useful to start by
implementing a proof-of-concept iteration in which a small subset of the source data is used to validate
the processing steps and results before proceeding with a full analysis. This enables you to test your big
data processing design on a small cluster, or even on a single-node on-premises cluster, before scaling
out to accommodate production-level data volumes.
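
One way to obtain the small subset for a proof-of-concept iteration is to sample the source records at random before uploading them. The following Python sketch is illustrative only: the function name sample_records and the choice of reservoir sampling are assumptions, not part of HDInsight or its tools.

```python
import random
from typing import Iterable, List


def sample_records(records: Iterable[str], sample_size: int, seed: int = 0) -> List[str]:
    """Return up to sample_size records chosen uniformly at random.

    Uses reservoir sampling, so the input is read only once and can be a
    file object that is too large to hold in memory.
    """
    rng = random.Random(seed)  # fixed seed so the proof-of-concept run is repeatable
    reservoir: List[str] = []
    for index, record in enumerate(records):
        if index < sample_size:
            reservoir.append(record)
        else:
            # Keep the new record with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = record
    return reservoir


if __name__ == "__main__":
    # Stand-in for the lines of a large source file.
    source = (f"row-{n}" for n in range(100_000))
    subset = sample_records(source, sample_size=1_000)
    print(len(subset))
```

A random sample is more likely than the first few thousand rows of a file to contain the edge cases that affect your queries, although very rare cases may still be missed.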
It's easy to run queries that extract data, but it's vitally important that you make every effort to validate
the results before using them as the basis for business decisions. If possible you should try to cross-reference
the results with other sources of similar information.
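
A simple form of cross-referencing is to compare the totals your queries produce with figures from an independent source and flag any that differ by more than an agreed margin. The sketch below assumes both sets of figures are available as dictionaries keyed by the same labels; the function name, the sample figures, and the 5% default tolerance are hypothetical.

```python
from typing import Dict, List, Tuple


def cross_reference(
    results: Dict[str, float],
    reference: Dict[str, float],
    tolerance: float = 0.05,
) -> List[Tuple[str, float, float]]:
    """Return (key, result, reference) for every shared key whose values
    differ by more than the given relative tolerance."""
    mismatches = []
    for key in sorted(results.keys() & reference.keys()):
        expected = reference[key]
        actual = results[key]
        # Relative difference; fall back to absolute difference when expected is zero.
        scale = abs(expected) if expected != 0 else 1.0
        if abs(actual - expected) / scale > tolerance:
            mismatches.append((key, actual, expected))
    return mismatches


if __name__ == "__main__":
    query_totals = {"North": 1020.0, "South": 800.0, "East": 500.0}
    finance_totals = {"North": 1000.0, "South": 1000.0}
    for key, actual, expected in cross_reference(query_totals, finance_totals):
        print(f"{key}: query gave {actual}, reference says {expected}")
```

Keys that appear in only one source are ignored here; in practice you may also want to report them, because missing categories can themselves indicate a problem in the query.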

Is big data the right solution?


The first step in evaluating and implementing any business policy, whether it's related to computer
hardware, software, replacement office furniture, or the contract for cleaning the windows, is to
determine the results that you hope to achieve. Deciding whether to adopt a Hadoop-based big data
batch processing approach is no different.
The result you want from your solution will typically be better information that helps you to make
data-driven decisions for your organization. However, to be able to get this information, you must evaluate
several factors such as:

Where will the source data come from? Perhaps you already have the data that contains the
information you need, but you can't analyze it with your existing tools. Or is there a source of
data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?

What is the format of the data? Is it highly structured, in which case you may be able to load it
into your existing database or data warehouse and process it there? Or is it semi-structured or
unstructured, in which case a Hadoop-based mechanism such as HDInsight that is optimized for
textual discovery, categorization, and predictive analysis will be more suitable?

What are the delivery and quality characteristics of the data? Is there a huge volume? Does it
arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of
data cleansing and validation of the content?

Do you want to combine the results with data from other sources? If so, do you know where
this data will come from, how much it will cost if you have to purchase it, and how reliable this
data is?

Do you want to integrate with an existing BI system? Will you need to load the data into an
existing database or data warehouse, or will you just analyze it and visualize the results
separately?

The answers to these questions will help you decide whether a Hadoop-based big data solution such as
HDInsight is appropriate, but keep in mind that modern data management systems such as Microsoft
SQL Server and the Microsoft Analytics Platform System (APS) are designed to offer high performance
for huge volumes of data; your decision should not focus solely on data volume.
As you saw earlier in this guide, Hadoop-based solutions are primarily suited to situations where:

You have very large volumes of data to store and process, and these volumes are beyond the
capabilities of traditional relational database systems.

The data is in a semi-structured or unstructured format, often as text files or binary files.

The data is not well categorized; for example, similar items are described using different
terminology such as a variation in city, country, or region names, and there is no obvious key
value.

The data arrives rapidly as a stream, or in large batches that cannot be processed in real time,
and so must be stored efficiently for processing later as a batch operation.

The data contains a lot of redundancy or duplication.

The data cannot easily be processed into a format that suits existing database schemas without
risking loss of information.

You need to execute complex batch jobs on a very large scale, so that running the queries in
parallel is necessary.

You want to be able to easily scale the system up or down on demand, or have it running only
when required for specific processing tasks and close it down altogether at other times.

You don't actually know how the data might be useful, but you suspect that it will be, either
now or in the future.

In general you should consider adopting a Hadoop-based solution such as HDInsight only when your
requirements match several of the points listed above, and not just one or two. Existing database
systems can achieve many of the tasks in the list, but a batch processing solution based on Hadoop may
be a better choice when several of the factors are relevant to your own requirements.

Determining analytical goals


Before embarking on a big data project it is generally useful to think about what you hope to achieve,
and clearly define the analytical goals of the project. In some projects there may be a specific question
that a business wants to answer, such as "where should we open our new store?" In other projects the
goal may be more open-ended; for example, to examine website traffic and try to detect patterns and
trends in visitor numbers. Understanding the goals of the analysis can help you make decisions about
the design and implementation of the solution, including the specific technologies to use and the level
of integration with existing BI infrastructure.

Historical and predictive data analysis


Most organizations already collect data from multiple sources. These might include line of business
applications such as websites, accounting systems, and office productivity applications. Other data may
come from interaction with customers, such as sales transactions, feedback, and reports from company
sales staff. The data is typically held in one or more data stores, and is often consolidated into a specially
designed data warehouse system.
Having collected this data, organizations typically use it to perform the following types of analysis and
reporting:

Historical analysis and reporting, which is concerned with summarizing data to make sense of
what happened in the past. For example, a business might summarize sales transactions by
fiscal quarter and sales region, and use the results to create a report for shareholders.
Additionally, business analysts within the organization might explore the aggregated data by
drilling down into individual months to determine periods of high and low sales revenue, or
drilling down into cities to find out if there are marked differences in sales volumes across
geographic locations. The results of this analysis can help to inform business decisions, such as
when to conduct sales promotions or where to open a new store.

Predictive analysis and reporting, which is concerned with detecting data patterns and trends
to determine what's likely to happen in the future. For example, a business might use statistics
from historical sales data and apply it to known customer profile information to predict which
customers are most likely to respond to a direct-mail campaign, or which products a particular
customer is likely to want to purchase. This analysis can help improve the cost-effectiveness of a
direct-mail campaign, or increase sales while building closer customer relationships through
relevant targeted recommendations.

Both kinds of analysis and reporting involve taking source data, applying an analytical model to that
data, and using the output to inform business decision making. In the case of historical analysis and
reporting, the model is usually designed to summarize and aggregate a large volume of data to
determine meaningful business measures; for example, the total sales revenue aggregated by various
aspects of the business, such as fiscal period and sales region.
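
To make the historical case concrete, the following Python sketch aggregates sales revenue by fiscal quarter and sales region. It is a minimal illustration, assuming each sale is a (date, region, revenue) tuple; the function names and the rule that a fiscal year is named after the calendar year in which it starts are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Sale = Tuple[date, str, float]  # (transaction date, sales region, revenue)


def fiscal_quarter(day: date, fiscal_year_start_month: int = 1) -> str:
    """Label a date with its fiscal year and quarter, for example 'FY2014-Q2'."""
    months_into_year = (day.month - fiscal_year_start_month) % 12
    quarter = months_into_year // 3 + 1
    # Assumption: the fiscal year takes the calendar year in which it starts.
    fiscal_year = day.year if day.month >= fiscal_year_start_month else day.year - 1
    return f"FY{fiscal_year}-Q{quarter}"


def revenue_by_quarter_and_region(sales: Iterable[Sale]) -> Dict[Tuple[str, str], float]:
    """Sum revenue for each (fiscal quarter, region) combination."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for day, region, revenue in sales:
        totals[(fiscal_quarter(day), region)] += revenue
    return dict(totals)


if __name__ == "__main__":
    sales = [
        (date(2014, 1, 15), "North", 100.0),
        (date(2014, 2, 1), "North", 50.0),
        (date(2014, 4, 3), "South", 75.0),
    ]
    for key, total in sorted(revenue_by_quarter_and_region(sales).items()):
        print(key, total)
```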

For predictive analysis the model is usually based on a statistical algorithm that categorizes clusters of
similar data, or that correlates data attributes (which may influence one another) to the related cause
trends; for example, classifying customers based on demographic attributes, or identifying a
relationship between customer age and the purchase of specific products.
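
As a minimal illustration of correlating two data attributes, the sketch below computes the Pearson correlation coefficient between customer age and the number of purchases of a product. Pearson correlation is used here only as a simple, well-known example of a statistical measure; the guide does not prescribe a particular algorithm, and the sample figures are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    ages = [22, 31, 38, 45, 53, 60]   # hypothetical customer ages
    purchases = [1, 2, 2, 4, 5, 5]    # hypothetical purchases of one product
    print(round(pearson_correlation(ages, purchases), 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship, but it does not show that one attribute causes the other.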
Databases are the core of most organizations' data processing, and in most cases the purpose is simply
to run the operation by, for example, storing and manipulating data to manage stock and create
invoices. However, analytics and reporting is one of the fastest growing sectors in business IT as
managers strive to learn more about their organization.

Analytical goals
Although every project has its own specific requirements, big data projects generally fall into one of the
following categories:

One-time analysis for a specific business decision. For example, a company planning to expand
by opening a new physical store might use big data techniques to analyze demographic data for
a shortlist of proposed store sites in order to determine the location that is likely to result in the
highest revenue for the store. Alternatively, a charity planning to build water supply
infrastructure in a drought-stricken area might use a combination of geographic, geological,
health, and demographic statistics to identify the best locations.

Open "blue sky" exploration of interesting data. Sometimes the goal of big data analysis is
simply to find out what you don't already know from the available data. For example, a business
might be aware that customers are using Twitter to discuss its products and services, and want
to explore the tweets to determine if any patterns or trends can be found that relate to brand
visibility or customer sentiment. There may be no specific business decision that needs to be
made based on the data, but gaining a better understanding of how customers perceive the
business might inform decision-making in the future.

Ongoing reporting and BI. In some cases a big data solution will be used to support ongoing
reporting and analytics, either in isolation or integrated with an existing enterprise BI solution.
For example, a real estate business that already has a BI solution, which enables analysis and
reporting of its own property transactions across time periods, property types, and locations,
might extend it to include demographic and population statistics data from external sources.

In many respects, data analysis is an iterative process. It is not uncommon for an initial project based on
open exploration of data to uncover trends or patterns that form the basis for a new project to support
a specific business decision, or to extend an existing BI solution.
The results of the analysis are typically consumed and visualized in the following ways:

Custom application interfaces. For example, a custom application might display the data as a
chart, or generate a set of product recommendations for a customer.

Business performance dashboards. For example, you could use the PerformancePoint Services
component of SharePoint Server to display key performance indicators (KPIs) as scorecards, and
display summarized business metrics in a SharePoint Server site.

Reporting solutions such as SQL Server Reporting Services. For example, business reports can
be generated in a variety of formats and distributed automatically by email, or viewed on
demand through a web browser.

Analytical tools such as Excel. Information workers can explore analytical data models through
PivotTables and charts. Business analysts can use advanced Excel capabilities such as Power
Query, Power Pivot, Power View, and Power Map to create their own personal data models and
visualizations, or use add-ins to apply predictive models to data and view the results in Excel.

Identifying source data


In addition to determining the analytical goals of the project, you must identify sources of data that can
be used to meet these goals. Often you will already know which data sources need to be included in the
analysis. For example, if the goal is to analyze trends in sales for the past three years you can use historic
sales data from internal business applications or a data warehouse. However, in some cases you may
need to search for data to support the analysis you want to perform. For example, if the goal is to
determine the best location to open a new store you may need to search for useful demographic data
that covers the locations under consideration.
It is common in big data projects to combine data from multiple sources and create a "mashup" that
enables you to analyze many different aspects of the problem within a single solution. For example, you
might combine internal historic sales data with geographic data obtained from an external source to
plot sales volumes on a map. You may then overlay the map with demographic data to try to correlate
sales volume with particular geo-demographic attributes.
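
The sales and geographic example above amounts to a join on a shared key. The following Python sketch joins sales totals by city to coordinates from a second source, producing points that a mapping tool could plot. The data structures, function names, and the simple name normalization are assumptions made for illustration; real city names often need more careful matching.

```python
from typing import Dict, List, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def normalize_city(name: str) -> str:
    """Reduce simple variations in city names (case and surrounding spaces)."""
    return name.strip().casefold()


def build_map_points(
    sales_by_city: Dict[str, float],
    city_locations: Dict[str, Coordinates],
) -> Tuple[List[Tuple[float, float, float]], List[str]]:
    """Join sales totals to coordinates.

    Returns (points, unmatched): points are (latitude, longitude, sales volume)
    tuples ready for plotting; unmatched lists cities with no known location.
    """
    lookup = {normalize_city(city): coords for city, coords in city_locations.items()}
    points = []
    unmatched = []
    for city, volume in sales_by_city.items():
        coords = lookup.get(normalize_city(city))
        if coords is None:
            unmatched.append(city)
        else:
            points.append((coords[0], coords[1], volume))
    return points, unmatched


if __name__ == "__main__":
    sales = {"Seattle": 1200.0, " portland ": 800.0, "Springfield": 50.0}
    locations = {"seattle": (47.61, -122.33), "Portland": (45.52, -122.68)}
    points, unmatched = build_map_points(sales, locations)
    print(points)
    print("No location found for:", unmatched)
```

Reporting the unmatched cities matters because silently dropped rows would understate sales volumes on the map.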
Common types of data source used in a big data solution include:

Internal business data from existing applications or BI solutions. Often this data is historic in
nature or includes demographic profile information that the business gathered from its
customers. For example, you might use historic sales records to correlate customer attributes
with purchasing patterns, and then use this information to support targeted advertising or
predictive modeling of future product plans.

Log files. Applications or infrastructure services often generate log data that can be useful for
analysis and decision making with regard to managing IT reliability and scalability. Additionally,
in some cases, combining log data with business data can reveal useful insights into how IT
services support the business. For example, you might use log files generated by Internet
Information Services (IIS) to assess network bandwidth utilization, or to correlate web site
traffic with sales transactions in an ecommerce application.

Sensors. Increased automation in almost every aspect of life has led to a growth in the amount
of data recorded by electronic sensors (often referred to as the Internet of Things). For
example, RFID tags in smart cards are now routinely used to track passenger progress through
mass transit infrastructure, sensors in plant machinery generate huge quantities of data in
production lines, and smart metering provides detailed views of energy usage. This type of data
is often well suited to highly dynamic analysis and real-time reporting.

Social media. The massive popularity of social media services such as Facebook, Twitter, and
others is a major factor in the growth of data volumes on the Internet. Many social media
services provide application programming interfaces (APIs) that you can use to query the data
shared by users of these services, and consume this data for analysis. For example, a business
might use Twitter's query API to find tweets that mention the name of the company or its
products, and analyze the data to determine how customers feel about the company's brand.

Data feeds. Many web sites and services provide data as a feed that can be consumed by client
applications and analytical solutions. Common feed formats include RSS, ATOM, and
industry-defined XML formats; and the data sources themselves include blogs, news services, weather
forecasts, and financial markets data.

Governments and special interest groups. Many government organizations and special interest
groups publish data that can be used for analysis. For example, the UK government publishes
over 9000 downloadable datasets including statistics on population, crime, government
spending, health, and more, in a variety of formats. Similarly, the US government provides
census data and other statistics as downloadable datasets or in dBASE format on CD-ROM.
Additionally, many international organizations provide data free of charge. For example, the
United Nations makes statistical data available through its own website and in Azure
Marketplace.

Commercial data providers. There are many organizations that sell data commercially,
including geographical data, historical weather data, economic indicators, and others. Azure
Marketplace provides a central service through which you can locate and purchase
subscriptions to many of these data sources.
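
As an illustration of the log file example in the list above, the following Python sketch totals the bytes sent for each requested URI from web server log lines in the W3C extended log format that IIS can produce. It assumes the log has been configured to record the cs-uri-stem and sc-bytes fields, and it reads the field order from the #Fields: directive; the function name and sample lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def bytes_sent_by_uri(log_lines: Iterable[str]) -> Dict[str, int]:
    """Sum the sc-bytes value for each cs-uri-stem in W3C extended format logs."""
    fields: List[str] = []
    totals: Dict[str, int] = defaultdict(int)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive lists the space-separated field names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or data seen before any #Fields: line
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed rows rather than guess at their meaning
        entry = dict(zip(fields, values))
        if "cs-uri-stem" in entry and entry.get("sc-bytes", "").isdigit():
            totals[entry["cs-uri-stem"]] += int(entry["sc-bytes"])
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "#Fields: date time cs-method cs-uri-stem sc-status sc-bytes",
        "2014-06-01 10:00:00 GET /index.html 200 5120",
        "2014-06-01 10:00:05 GET /products.html 200 2048",
        "2014-06-01 10:00:09 GET /index.html 200 5120",
    ]
    print(bytes_sent_by_uri(sample))
```

On a cluster the same aggregation would more typically be expressed as a Hive or Pig query over the uploaded log files; the sketch only shows the shape of the calculation.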

Just because data is available doesn't mean it is useful, or that the effort of using it will be viable. Think
about the value the analysis can add to your business before you devote inordinate time and effort to
collecting and analyzing data.
When planning data sources to use in your big data solution, consider the following factors:

Availability. How easy is it to find and obtain the data? You may have a specific analytical goal
in mind, but if the data required to support the analysis is difficult (or impossible) to find you
may waste valuable time trying to obtain it. When planning a big data project it can be useful to
define a schedule that allows sufficient time to research what data is available. If the data
cannot be found after an agreed deadline you may need to revise the analytical goals.

Format. In what format is the data available, and how can it be consumed? Some data is
available in standard formats and can be downloaded over a network or Internet API. In other
cases the data may be available only as a real-time stream that you must capture and structure
for analysis. Later in the process you will consider tools and techniques for consuming the data
from its source and ingesting it into your cluster, but even during this early stage you should
identify the format and connectivity options for the data sources you want to use.

Relevance. Is the data relevant to the analytical goals? You may have identified a potential data
source and already be planning how you will consume it and ingest it into the analytical process.
However, you should first examine the data source carefully to ensure the data it contains is
relevant to the analysis you intend to perform.

Cost. You may determine the availability of a relevant dataset, only to discover that the cost of
obtaining the data outweighs the potential business benefit of using it. This can be particularly
true if the analytical goal is to augment an enterprise BI solution with external data on an
ongoing basis, and the external data is only available through a commercial data provider.

Specifying the infrastructure


Considerations for planning an infrastructure for Hadoop-based big data analysis include deciding
whether to use an on-premises or a cloud-based implementation, whether a managed service is a better choice
than deploying the chosen framework yourself, and the type of data store and cluster configuration that
you require. This guide concentrates on HDInsight. However, many of the considerations for choosing an
infrastructure option are typically relevant to other solutions and platforms.
The topics discussed here are:

Service type and location

Data storage

Cluster size and storage configuration

SLAs and business requirements

Service type and location


Hadoop-based big data solutions are available as both cloud-hosted managed services and as
self-installed solutions. If you decide to use a self-installed solution, you can choose whether to deploy this
locally in your own datacenter, or in a virtual machine in the cloud. For example, HDInsight is available
as an Azure service, but you can install the Hortonworks Hadoop distribution called the Hortonworks
Data Platform (HDP) as an on-premises or a cloud-hosted solution on Windows Server. Figure 1 shows
the key options for Hadoop-based big data solutions on the Microsoft platform.

Figure 1 - Location options for a Hadoop-based big data solution deployment


You can install big data frameworks such as Hadoop on other operating systems, such as Linux, or you
can subscribe to Hadoop-based services that are hosted on these types of operating systems.
Consider the following guidelines when choosing between these options:

A managed service such as HDInsight running on Azure is a good choice over a self-installed
framework when:

You want a solution that is easy to initialize and configure, and where you do not need to
install any software yourself.

You want to get started quickly by avoiding the time it takes to set up the servers and
deploy the framework components to each one.

You want to be able to quickly and easily decommission a cluster and then initialize it again
without paying for the intermediate time when you don't need to use it.

A cloud-hosted mechanism such as Hortonworks Data Platform running on a virtual machine in
Azure is a good choice when:

The majority of the data is stored in or obtained through the cloud.

You require the solution to be running for only a specific period of time.

You require the solution to be available for ongoing analysis, but the workload will vary,
sometimes requiring a cluster with many nodes and sometimes not requiring any service at
all.

You want to avoid the cost in terms of capital expenditure, skills development, and the time
it takes to provision, configure, and manage an on-premises solution.

An on-premises solution such as Hortonworks Data Platform on Windows Server is a good
choice when:

The majority of the data is stored or generated within your on-premises network.

You require ongoing services with a predictable and constant level of scalability.

You have the necessary technical capability and budget to provision, configure, and manage
your own cluster.

The data you plan to analyze must remain on your own servers for compliance or
confidentiality reasons.

A pre-configured hardware appliance that supports big data connectivity, such as Microsoft
Analytics Platform System with PolyBase, is a good choice when:

You want a solution that provides predictable scalability, easy implementation, technical
support, and that can be deployed on-premises without requiring deep knowledge of big
data systems in order to set it up.

You want existing database administrators and developers to be able to seamlessly work
with big data without needing to learn new languages and techniques.

You want to be able to grow into affordable data storage space, and provide opportunities
for bursting by expanding into the cloud when required, while still maintaining corporate
services on a traditional relational system.

Choosing a big data platform that is hosted in the cloud allows you to change the number of servers in
the cluster (effectively scaling out or scaling in your solution) without incurring the cost of new
hardware or having existing hardware underused.

Data storage
When you create an HDInsight cluster, you have the option to create one of two types:

Hadoop cluster. This type of cluster combines an HDFS-compatible storage mechanism with the
Hadoop core engine and a range of additional tools and utilities. It is designed for performing
the usual Hadoop operations such as executing queries and transformations on data. This is the
type of cluster that you will see in use throughout this guide.

HBase cluster. This type of cluster, which was in preview at the time this guide was written,
contains a fully configured installation of the HBase database system. It is designed for use as
either a standalone cloud-hosted NoSQL database or, more typically, for use in conjunction with
a Hadoop cluster.

The primary data store used by HDInsight for both types of cluster is Azure blob storage, which provides
scalable, durable, and highly available storage (for more information see Introduction to Microsoft Azure
Storage). Using Azure blob storage means that both types of cluster can offer high scalability for storing
vast amounts of data, and high performance for reading and writing data, including the capture of
streaming data. For more details see Azure Storage Scalability and Performance Targets.

When to use an HBase cluster?


HBase is an open-source wide column (or column family) data store. It uses a schema to define the data
structures, but the schema can contain families of columns rather than just single columns, making it
ideal for storing semi-structured data where some columns can be predefined but others are capable of
storing differing elements of unstructured data.
HBase is a good choice if your scenario demands:

Support for real-time querying.

Strictly consistent reads and writes.

Automatic and configurable sharding of tables.

High reliability with automatic failover.

Support for bulk loading data.

Support for SQL-like languages through interfaces such as Phoenix.

HBase provides close integration with Hadoop through base classes for connecting Hadoop map/reduce
jobs with data in HBase tables; an easy to use Java API for client access; adapters for popular
frameworks such as map/reduce, Hive, and Pig; access through a REST interface; and integration with
the Hadoop metrics subsystem.
HBase can be accessed directly by client programs and utilities to upload and access data. It can also be
accessed using storage drivers, or in discrete code, from within the queries and transformations you
execute on a Hadoop cluster. There is also a Thrift API available that provides a lightweight REST
interface for HBase.
HBase is resource-intensive and will attempt to use as much memory as is available on the cluster. You
should not use an HBase cluster for processing data and running queries, with the possible exception of
minor tasks where low latency is not a requirement. For this reason, HBase is typically installed on a
separate cluster, and queried from the cluster containing your Hadoop-based big data solution.
For more information about HBase see the official Apache HBase project website and HBase: Bigtable-like
structured storage for Hadoop HDFS on the Hadoop wiki site.
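
As a rough illustration of what a client sees when it reads rows through the HBase REST interface, the following Python sketch decodes a JSON row set. It assumes the response uses the base64-encoded Row/Cell layout returned by the HBase REST server; the sample payload, row key, and column name are made up for this example, and no request is sent to a real cluster.

```python
import base64
import json


def decode_rows(payload):
    """Decode a JSON row set into {row_key: {column: value}} with UTF-8 strings."""
    def b64(text):
        return base64.b64decode(text).decode("utf-8")

    rows = {}
    for row in json.loads(payload).get("Row", []):
        # Each cell carries a base64 column name ("family:qualifier") and value ("$").
        cells = {b64(cell["column"]): b64(cell["$"]) for cell in row.get("Cell", [])}
        rows[b64(row["key"])] = cells
    return rows


# Hypothetical payload built locally; a real one would come from the REST endpoint.
sample = json.dumps({
    "Row": [{
        "key": base64.b64encode(b"sensor-001").decode("ascii"),
        "Cell": [{
            "column": base64.b64encode(b"reading:temp").decode("ascii"),
            "timestamp": 1400000000000,
            "$": base64.b64encode(b"21.5").decode("ascii"),
        }],
    }]
})
print(decode_rows(sample))
```

Keys, column names, and values all arrive encoded, so a client typically decodes them once at the boundary and works with plain strings afterwards.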
Why Azure blob storage?
HDInsight is designed to transfer data very quickly between blob storage and the cluster, for both
Hadoop and HBase clusters. Azure datacenters provide extremely fast, high bandwidth connectivity
between storage and the virtual machines that make up an HDInsight cluster.
Using Azure blob storage provides several advantages:

Running costs are minimized because you can decommission a Hadoop cluster when not
performing queries. Data in Azure blob storage is persisted when the cluster is deleted, and you
can build a new cluster on the existing source data in blob storage. You do not have to upload
the data again over the Internet when you recreate a cluster that uses the same data. However,
although it is possible, deleting and recreating an HBase cluster is not typically a recommended
strategy.

Data storage costs can be minimized because Azure blob storage is considerably cheaper than
many other types of data store (1 TB of locally-redundant storage currently costs around $25
per month). Blob storage can be used to store large volumes of data (up to 500 TB at the time
this guide was written) without being concerned about scaling out storage in a cluster, or
changing the scaling in response to changes in storage requirements.

Data in Azure blob storage is replicated across three locations in the datacenter, so it provides a
similar level of redundancy to protect against data loss as an HDFS cluster. Storage can be
locally-redundant (replicas are in the same datacenter), globally-redundant (replicated locally
and in a different region), or read-only globally-redundant. See Introduction to Microsoft Azure
Storage for more details.

Data stored in Azure blob storage can be accessed by and shared with other applications and
services, whereas data stored in HDFS can only be accessed by HDFS-aware applications that
have access to the cluster storage. Azure storage offers import/export features that are useful
for quickly and easily transferring data in and out of Azure blob storage.

The high speed flat network in the datacenter provides fast access between the virtual machines
in the cluster and blob storage, so data movement is very efficient. Tests carried out by the
Azure team indicate that blob storage provides near identical performance to HDFS when
reading data, and equal or better write performance.
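
The storage cost point in the list above can be turned into simple arithmetic. The following Python sketch uses the approximate price quoted in this guide (around $25 per TB per month for locally-redundant storage at the time of writing); the constant and function names are illustrative, and current pricing should be checked before relying on the result.

```python
# Approximate locally-redundant price quoted in this guide (2014); an assumption, not current pricing.
PRICE_PER_TB_MONTH_USD = 25.0


def monthly_storage_cost(data_tb, price_per_tb=PRICE_PER_TB_MONTH_USD):
    """Return the approximate monthly cost in USD of keeping data_tb terabytes in blob storage."""
    if data_tb < 0:
        raise ValueError("data_tb must not be negative")
    return data_tb * price_per_tb


# 40 TB of source data at the quoted price.
print(monthly_storage_cost(40))  # prints 1000.0
```

Because the data persists independently of the cluster, this storage charge is the only ongoing cost while the cluster is decommissioned.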

Azure blob storage may throttle data transfers if the workload reaches the bandwidth limits of the
storage service or exceeds the scalability targets. One solution is to use additional storage accounts. For
more information, see the blog post Maximizing HDInsight throughput to Azure Blob Storage on MSDN.
For more information about the use of blob storage instead of HDFS for data storage see Use Azure Blob
storage with HDInsight.
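
Jobs running on an HDInsight cluster refer to blob storage through the HDFS-compatible driver using wasb URIs of the form wasb://container@account.blob.core.windows.net/path. The following Python sketch builds such a URI; the helper name and the container, account, and path values are hypothetical.

```python
def wasb_uri(container, account, path="", secure=False):
    """Build a wasb:// (or wasbs://) URI for a path in an Azure blob storage container."""
    scheme = "wasbs" if secure else "wasb"
    # Strip any leading slash so the path is not doubled after the host name.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container and storage account names.
print(wasb_uri("data", "myaccount", "/logs/2014/"))
```

Naming the account explicitly in the URI is what allows a single cluster to read from more than one storage account.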
Combining Hadoop and HBase clusters
For most of your solutions you will use an HDInsight Hadoop-based cluster. However, there are
circumstances where you might combine both a Hadoop and an HBase cluster in the same solution, or
use an HBase cluster on its own. Some of the common configurations are:

Use just a Hadoop cluster. Source data can be loaded directly into Azure blob storage or stored
using the HDFS-compatible storage drivers in Hadoop. Data processing, such as queries and
transformations, execute on this cluster and access the data in Azure blob storage using the
HDFS-compatible storage drivers.

Use a combination of a Hadoop and an HBase cluster (or more than one HBase cluster if
required). Data is stored using HBase, and optionally through the HDFS driver in the Hadoop
cluster as well. Source data, especially high volumes of streaming data such as that from sensors
or devices, can be loaded directly into HBase. Data processing takes place on the Hadoop
cluster, but the processes can access the data stored in the HBase cluster and store results
there.

Use just an HBase cluster. This is typically the choice if you require only a high capacity, high
performance storage and retrieval mechanism that will be accessed directly from client
applications, and you do not require Hadoop-based processing to take place.

Figure 2 shows these options in schematic form.

Figure 2 - Data storage options for an Azure HDInsight solution


In Figure 2, the numbered data flows are as follows:
1. Client applications can perform big data processing on the Hadoop cluster. This includes
typical operations such as executing Pig, Hive, and map/reduce processing and storing the
data in blob storage.
2. Client applications can access the Azure blob storage where the Hadoop cluster stores its
data to upload source data and to download the results. This is possible both when the
Hadoop cluster is running and when it is not running or has been deleted. The Hadoop
cluster can be recreated over the existing blob storage.
3. The processes running on the Hadoop cluster can access the HBase cluster using the storage
handlers in Hive and Pig queries, the Java API, interfaces such as Thrift and Phoenix, or the
REST interface to store and retrieve data in HBase.
4. Clients can access the HBase cluster directly using the REST interface and APIs to upload
data, perform queries, and download the results. This is possible both when the Hadoop
cluster is running and when it is not running or has been deleted. However, the HBase cluster
must be running.

Cluster size and storage configuration


Having specified the type of service you will use and the location, the final tasks in terms of planning
your solution center on the configuration you require for the cluster. Some of the decisions you must
make are described in the following sections.
Cluster size
The amount you are billed for the cluster when using HDInsight (or other cloud-hosted Hadoop services)
is based on the cluster size. For an on-premises solution, the cluster size has an impact on internal costs
for infrastructure and maintenance. Choosing an appropriate size helps to minimize runtime costs.
When choosing a cluster size, consider the following points:

Hadoop automatically partitions the data and allocates the jobs to the data nodes in the cluster.
Some queries may not take advantage of all the nodes in the cluster. This may be the case with
smaller volumes of data, or where the data format prevents partitioning (as is the case for some
types of compressed data).

Operations such as Hive queries that must sort the results may limit the number of nodes that
Hadoop uses for the reduce phase, meaning that adding more nodes will not reduce query
execution time.

If the volume of data you will process is increasing, ensure that the cluster size you choose can
cope with this. Alternatively, plan to increase the cluster size at specific intervals to manage the
growth. Typically you will need to delete and recreate the cluster to change the number of
nodes, but you can do this for a Hadoop cluster without the need to upload the data again
because it is held in Azure blob storage.

Use the performance data exposed by the cluster to determine if increasing the size is likely to
improve query execution speed. Use historical data on performance for similar types of jobs to
estimate the required cluster size for new jobs. For more information about monitoring jobs,
see Building end-to-end solutions using HDInsight.
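
The effect of data volume and format on node usage can be estimated with a short calculation. The following Python sketch is a simplification that assumes one map task per input split, and that a file in a non-splittable compressed format (gzip is a common example) is read by a single task. The split size is a parameter you must supply from your own cluster configuration; the function name is illustrative.

```python
import math


def estimate_map_tasks(file_sizes_mb, split_size_mb, splittable=True):
    """Estimate how many map tasks a job over the given input files would use."""
    if split_size_mb <= 0:
        raise ValueError("split_size_mb must be positive")
    if not splittable:
        # A non-splittable file cannot be divided, so each file gets exactly one task.
        return len(file_sizes_mb)
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)


nodes = 16
tasks = estimate_map_tasks([1024], split_size_mb=256)
print(tasks, "tasks; at most", min(nodes, tasks), "of", nodes, "nodes do map work")
```

If the estimate is well below the number of nodes, adding nodes is unlikely to shorten the map phase for that job.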

Storage requirements
By default HDInsight creates a new storage container in Azure blob storage when you create a new
cluster. However, it's possible to use a combination of different storage accounts with an HDInsight
cluster. You might want to use more than one storage account in the following circumstances:

When the amount of data is likely to exceed the storage capacity of a single blob storage
container.

When the rate of access to the blob container might exceed the threshold where throttling will
occur.

When you want to make data you have already uploaded to a blob container available to the
cluster.

When you want to isolate different parts of the storage for reasons of security, or to simplify
administration.

For details of the different approaches for using storage accounts with HDInsight see Cluster and storage
initialization in the section Collecting and loading data into HDInsight of this guide. For details of storage
capacity and bandwidth limits see Azure Storage Scalability and Performance Targets.
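
The first two circumstances in the list above amount to comparing your requirements against the capacity and bandwidth limits of the storage service. The following Python sketch shows one way to estimate the number of storage accounts needed. The limits are parameters that you should take from the published scalability targets; the function name and the values in the example are illustrative assumptions.

```python
import math


def storage_accounts_needed(data_tb, peak_throughput, capacity_limit_tb, throughput_limit):
    """Estimate the storage accounts needed to stay within capacity and throughput limits.

    peak_throughput and throughput_limit must use the same unit (for example Gbps).
    """
    if capacity_limit_tb <= 0 or throughput_limit <= 0:
        raise ValueError("limits must be positive")
    by_capacity = math.ceil(data_tb / capacity_limit_tb)
    by_throughput = math.ceil(peak_throughput / throughput_limit)
    # Always at least one account, even for very small workloads.
    return max(1, by_capacity, by_throughput)


# Example values only: 1,200 TB of data against the 500 TB figure quoted in this guide,
# with a made-up throughput limit of 20 units per account.
print(storage_accounts_needed(1200, 10, capacity_limit_tb=500, throughput_limit=20))
```

Whichever limit is reached first determines the result, so a modest volume of data can still call for several accounts if it is read at a very high rate.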
Maintaining cluster data
There may be cases where you want to be able to decommission and delete a cluster, and then recreate
it later with exactly the same configuration and data. HDInsight stores the cluster data in blob storage,
and you can create a new cluster over existing blob containers so the data they contain is available to
the cluster. Typically you will delete and recreate a cluster in the following circumstances:

You want to minimize runtime costs by deploying a cluster only when it is required. This may be
because the jobs it executes run only at specific scheduled times, or run on demand when
specific requirements arise.

You want to change the size of the cluster, but retain the data and metadata so that it is
available in the new cluster.

This applies only with a Hadoop-based HDInsight cluster. See Using Hadoop and HBase clusters for
information about using an HBase cluster. For details of how you can maintain the data when recreating
a cluster in HDInsight see Cluster and storage initialization in the section Collecting and loading data into
HDInsight of this guide.
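
The saving from deploying a cluster only when it is required can be estimated as follows. This Python sketch assumes billing is proportional to the number of nodes and the hours the cluster exists, and ignores the time taken to provision the cluster and any billing granularity. The hourly rate is a hypothetical figure, not an actual HDInsight price.

```python
def cluster_cost(nodes, hours, rate_per_node_hour):
    """Cost of running a cluster of the given size for the given number of hours."""
    return nodes * hours * rate_per_node_hour


def monthly_saving(nodes, job_hours_per_day, rate_per_node_hour, days=30):
    """Difference between an always-on cluster and one that exists only while jobs run."""
    always_on = cluster_cost(nodes, 24 * days, rate_per_node_hour)
    on_demand = cluster_cost(nodes, job_hours_per_day * days, rate_per_node_hour)
    return always_on - on_demand


# Hypothetical rate of 0.5 per node-hour; 8 nodes running jobs 4 hours a day.
print(monthly_saving(nodes=8, job_hours_per_day=4, rate_per_node_hour=0.5))  # prints 2400.0
```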

SLAs and business requirements


Your big data process may be an exercise of experimentation and investigation, or it may be a part of a
business process. In the latter case, especially where you offer the solution to your customers as a
service, you must ensure that you meet Service Level Agreements (SLAs) and business requirements.
Keep in mind the following considerations:

Investigate the SLAs offered by your big data solution provider because these will ultimately
limit the level of availability and reliability you can offer.

Consider if a cluster should be used for a single process, for one or a subset of customers, or for
a specific limited workload in order to maintain performance and availability. Sharing a cluster
across multiple different workloads can make it more difficult to predict and control demand,
and may affect your ability to meet SLAs.

Consider how you will manage backing up the data and the cluster information to protect
against loss in the event of a failure.

Choose an operating location, cluster size, and other aspects of the cluster so that sufficient
infrastructure and network resources are available to meet requirements.

Implement robust management and monitoring strategies to ensure you maintain the required
SLAs and meet business requirements.

Designing big data solutions using HDInsight


This section of the guide explores the common use cases for batch processing in Hadoop-based big data
solutions. These include iterative exploration, data warehouse on demand, ETL automation, and BI
integration. The guide focuses primarily on Microsoft Azure HDInsight, but the models described here
can be easily adapted to many other big data frameworks.

Hadoop-based big data solutions open up new opportunities for converting data into information. They
can also be used to extend existing information systems to provide additional insights through analytics
and data visualization. Every organization is different, and so there is no definitive list of the ways you
can use these types of solution as part of your own business processes.
However, there are four general use cases and corresponding models, described below, that are
appropriate for the typical batch processing workloads on an HDInsight cluster. Understanding these use
cases will help you to start making decisions on how best to integrate HDInsight with your organization,
and with your existing BI systems and tools.
By incorporating additional applications that run under the YARN resource manager, HDInsight can be
used to perform real-time processing of streaming data. However, this topic is outside the scope of the
guide.

Use case 1: Iterative exploration

Figure 1 - The iterative exploration model


This model is typically chosen for experimenting with data sources to discover if they can provide useful
information, and for handling data that you cannot process using existing systems. For example, you
might collect feedback from customers through email, web pages, or external sources such as social
media sites, then analyze it to get a picture of user sentiment for your products. You might be able to
combine this information with other data, such as demographic data that indicates population density
and characteristics in each city where your products are sold. For more details, see the Use case 1:
Iterative exploration use case and batch processing model. For an example of using this model see
Scenario 1: Iterative exploration.

Use case 2: Data warehouse on demand

Figure 2 - The data warehouse on demand model


Hadoop-based big data systems such as HDInsight allow you to store both the source data and the
results of queries executed over this data. You can also store schemas (or, to be precise, metadata) for
tables that are populated by the queries you execute. These tables can be indexed, although there is no
formal mechanism for managing key-based relationships between them. However, you can create data
repositories that are robust and reasonably low cost to maintain, which is especially useful if you need
to store and manage huge volumes of data. For more details, see the Use case 2: Data warehouse on
demand use case and batch processing model. For an example of using this model see Scenario 2: Data
warehouse on demand.

Use case 3: ETL automation

Figure 3 - The ETL automation model


Hadoop-based big data systems such as HDInsight can be used to extract and transform data before you
load it into your existing databases or data visualization tools. Such solutions are well suited to
performing categorization and normalization of data, and for extracting summary results to remove
duplication and redundancy. This is typically referred to as an Extract, Transform, and Load (ETL)
process. For more details, see the Use case 3: ETL automation use case and batch processing model. For
an example of using this model see Scenario 3: ETL automation.

Use case 4: BI integration

Figure 4 - The BI integration model


Enterprise-level data warehouses have some special characteristics that differentiate them from on-line
transaction processing (OLTP) database systems, and so there are additional considerations for
integrating with batch processing big data systems such as HDInsight. For example, you can integrate at
different levels, depending on the way that you intend to use the data obtained from your big data
solution. For more details, see the Use case 4: BI integration use case and batch processing model. For
an example of using this model see Scenario 4: BI integration.

Use case 1: Iterative exploration


Traditional data storage and management systems such as data warehouses, data models, and reporting
and analytical tools provide a wealth of information on which to base business decisions. However,
while traditional BI works well for business data that can easily be structured, managed, and processed
in a dimensional analysis model, some kinds of analysis require a more flexible solution that can derive
meaning from less obvious sources of data such as log files, email messages, tweets, and more.
There's a great deal of useful information to be found in these less structured data sources, which often
contain huge volumes of data that must be processed to reveal key data points. This kind of data
processing is what big data solutions such as HDInsight were designed to handle. It provides a way to
process extremely large volumes of unstructured or semi-structured data, often by performing complex
computation and transformation batch processing of the data, to produce an output that can be
visualized directly or combined with other datasets.
If you do not intend to reuse the information from the analysis, but just want to explore the data, you
may choose to consume it directly in an analysis or visualization tool such as Microsoft Excel.
The following sections of this topic provide more information:

Use case and model overview

When to choose this model

Data sources

Output targets

Considerations

Use case and model overview


Figure 1 shows an overview of the use case and model for a standalone iterative data exploration and
visualization solution using HDInsight. The source data files are loaded into the cluster, processed by one
or more queries within HDInsight, and the output is consumed by the chosen reporting and visualization
tools. The cycle repeats using the same data until useful insights have been found, or it becomes clear
that there is no useful information available from the data, in which case you might choose to restart
the process with a different source dataset.

Figure 1 - High-level view of the iterative exploration model


Text files and compressed binary files can be loaded directly into the cluster storage, while stream data
will usually need to be collected and handled by a suitable stream capture mechanism (see Collecting
and loading data into HDInsight for more information). The output data may be combined with other
datasets within your visualization and reporting tools to augment the information and to provide
comparisons, as you will see later in this topic.
When using HDInsight for iterative data exploration, you will often do so as an interactive process. For
example, you might use the Power Query add-in for Excel to submit a query to an HDInsight cluster and
wait for the results to be returned, usually within a few seconds or even a few minutes. You can then
modify and experiment with the query to optimize the information it returns.
However, keep in mind that these are batch operations that are submitted to all of the servers in the
cluster for parallel processing, and queries can often take minutes or hours to complete when there are
very large volumes of source data. For example, you might use Pig to process an input file, with the
results returned in an output file some time later, at which point you can perform the analysis by
importing this file into your chosen visualization tool.
An example of applying this use case and model can be found in Scenario 1: Iterative exploration.

When to choose this model


The iterative exploration model is typically suited to the following scenarios:

Handling data that you cannot process using existing systems, perhaps by performing complex
calculations and transformations that are beyond the capabilities of existing systems to
complete in a reasonable time.

Collecting feedback from customers through email, web pages, or external sources such as
social media sites, then analyzing it to get a picture of customer sentiment for your products.

Combining information with other data, such as demographic data that indicates population
density and characteristics in each city where your products are sold.

Dumping data from your existing information systems into HDInsight so that you can work with
it without interrupting other business processes or risking corruption of the original data.

Trying out new ideas and validating processes before implementing them within the live
system.

Combining your data with datasets available from Azure Marketplace or other commercial data sources
can reveal useful information that might otherwise remain hidden in your data.

Data sources
The input data for this model typically includes the following:

Social data, log files, sensors, and applications that generate data files.

Datasets obtained from Azure Marketplace and other commercial data providers.

Internal data extracted from databases or data warehouses for experimentation and one-off
analysis.

Streaming data that is captured, filtered, and pre-processed through a suitable tool or
framework (see Collecting and loading data into HDInsight).

Notice that, as well as externally obtained data, you might process data from within your organization's
existing database or data warehouse. HDInsight is an ideal solution when you want to perform offline
exploration of existing data in a sandbox. For example, you may join several datasets from your data
warehouse to create large datasets that act as the source for some experimental investigation, or to test
new analysis techniques. This avoids the risk of interrupting existing systems, affecting performance of
your data warehouse system, or accidentally corrupting the core data.
The capability to store schema-less data, and apply a schema only when processing the data, may also
simplify the task of combining information from different systems because you do not need to apply a
schema beforehand, as you would in a traditional data warehouse.
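
The idea of applying a schema only when the data is processed can be illustrated outside Hadoop. In the following Python sketch, the raw lines are stored unchanged and a schema (a list of column names and converter functions) is applied at read time. Fields that are missing or cannot be converted become None, loosely mirroring the way Hive returns NULL for such values. The data, column names, and function names are made up for this example.

```python
def to_value(convert, raw):
    """Convert one raw field, returning None when it is missing or malformed."""
    if raw is None:
        return None
    try:
        return convert(raw)
    except ValueError:
        return None


def apply_schema(lines, schema, delimiter="\t"):
    """Yield one dict per raw line, interpreting the fields according to the schema."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        fields += [None] * (len(schema) - len(fields))  # pad rows that are too short
        yield {name: to_value(convert, raw) for (name, convert), raw in zip(schema, fields)}


raw_lines = ["2014-06-01\tSeattle\t42", "2014-06-02\tPortland\tn/a"]
sales_schema = [("date", str), ("city", str), ("count", int)]
for row in apply_schema(raw_lines, sales_schema):
    print(row)
```

The same raw lines can be read again later with a different schema, for example one that uses only the first two columns, without changing the stored data.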
Often you need to perform more than one query on the data to get the results into the form you need.
It's not unusual to base queries on the results of a preceding query; for example, using one query to
select and transform the required data and remove redundancy, a second query to summarize the data
returned from the first query, and a third query to format the output as required. This iterative
approach enables you to start with a large volume of complex and difficult to analyze data, and get it
into a structure that you can consume directly from an analytical tool such as Excel, or use as input to a
managed BI solution.
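
The three-query pattern described above can be sketched in plain Python, with each function standing in for a query that you would normally write in Hive or Pig. The record fields and function names are assumptions chosen for this illustration.

```python
from collections import Counter


def select_and_clean(records):
    """Query 1: keep the required fields, normalize them, and remove duplicates."""
    seen = set()
    for rec in records:
        row = (rec["user"].strip().lower(), rec["product"].strip().lower())
        if row not in seen:
            seen.add(row)
            yield row


def summarize(rows):
    """Query 2: count the distinct users that mentioned each product."""
    return Counter(product for _user, product in rows)


def format_output(summary):
    """Query 3: produce tab-delimited lines, highest count first."""
    ordered = sorted(summary.items(), key=lambda item: (-item[1], item[0]))
    return [f"{product}\t{count}" for product, count in ordered]


records = [
    {"user": "Ann ", "product": "Widget"},
    {"user": "ann", "product": "widget"},   # duplicate after normalization
    {"user": "Bob", "product": "Widget"},
    {"user": "Bob", "product": "Gadget"},
]
for line in format_output(summarize(select_and_clean(records))):
    print(line)
```

Each stage consumes only the output of the previous one, so any stage can be rerun or replaced while you experiment without repeating the earlier work.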

Output targets
The results from your exploration processes can be visualized using any of the wide range of tools that
are available for analyzing data, combining it with other datasets, and generating reports. Typical
examples for the iterative exploration model are:

Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power
Map.

SQL Server Reporting Services using Report Builder.

Custom or third party analysis and visualization tools.

You will see more details of these tools in Consuming and visualizing data from HDInsight.

Considerations
There are some important points to consider when choosing the iterative exploration model:

This model is typically used when you want to:

Experiment with new types or sources of data.

Generate one-off reports or visualizations of external or internal data.

Monitor a data source using visualizations to detect changes or to predict behavior.

Combine the output with other data to generate comparisons or to augment the
information.

You will usually choose this model when you do not want to persist the results of the query
after analysis, or after the required reports have been generated. It is typically used for one-off
analysis tasks where the results are discarded after use; and so differs from the other models
described in this guide in which the results are stored and reused.

Very large datasets are likely to preclude the use of an interactive approach due to the time
taken for the queries to run. However, after the queries are complete you can connect to the
cluster and work interactively with the data to perform different types of analysis or
visualization.

Data arriving as a stream, such as the output from sensors on an automated production line or
the data generated by GPS sensors in mobile devices, requires additional considerations. A
typical technique is to capture the data using a stream processing technology such as Storm or
StreamInsight and persist it, then process it in batches or at regular intervals. The stream
capture technology may perform some pre-processing, and might also power a real time
visualization or rudimentary analysis tool, as well as feeding it into an HDInsight cluster. A
common technique is micro-batch processing, where incoming data is persisted in small
increments, allowing near real-time processing by the big data solution.

You are not limited to running a single query on the source data. You can follow an iterative
pattern in which the data is passed through the cluster multiple times, each pass refining the
data until it is suitably prepared for use in your analytical tool. For example, a large
unstructured file might be processed using a Pig script to generate a smaller, more structured
output file. This output could then be used as the input for a Hive query that returns aggregated
data in tabular form.
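
The micro-batch technique mentioned in the considerations above can be sketched as follows. This minimal Python example groups incoming events into small batches by count; a real stream capture tool would typically also flush a batch after a time interval and persist each batch to storage. The function name and batch size are illustrative.

```python
def micro_batches(events, batch_size):
    """Group a stream of events into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch so no events are lost.
        yield batch


for batch in micro_batches(range(7), batch_size=3):
    print(batch)
```

Smaller batches reduce the delay before data is available for processing, at the cost of creating more, smaller files for the cluster to handle.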

Use case 2: Data warehouse on demand


Hadoop-based big data solutions such as HDInsight can provide a robust, high performance, and
cost-effective data storage and parallel job processing mechanism. Data is replicated in the storage system,
and jobs are distributed across the nodes for fast parallel processing. In the case of HDInsight, the data is
saved in Azure blob storage, which is also replicated three times.
This combination of capabilities means that you can use HDInsight as a basic data warehouse. The low
cost of storage when compared to most relational database mechanisms that have the same level of
reliability also means that you can use it simply as a commodity storage mechanism for huge volumes of
data, even if you decide not to transform the data into Hive tables.
The following sections of this topic provide more information:

Use case and model overview

When to choose this model

Data sources

Output targets

Considerations

If you need to store vast amounts of data, irrespective of the format of that data, an on-premises
Hadoop-based solution can reduce administration overhead and save money by minimizing the need for
the high performance database servers and storage clusters used by traditional relational database
systems. Alternatively, you may choose to use a cloud hosted Hadoop-based solution such as HDInsight
in order to reduce the administration overhead and running costs compared to on-premises
deployment.

Use case and model overview


Figure 1 shows an overview of the use case and model for a data warehouse on demand solution using
HDInsight. The source data files may be obtained from external sources, but are just as likely to be
internal data generated by your business processes and applications. For example, you might decide to
use this model instead of using a locally installed data warehouse based on the traditional relational
database model.
In this scenario, you can store the data both as the raw source data and as Hive tables. Hive provides
access to the data in a familiar row and column format that is easy to consume and visualize in most BI
tools. The data for the tables is stored in Azure blob storage, and the table definitions can be maintained
by HCatalog (a feature of Hive). For more information about Hive and HCatalog, see Data processing
tools and techniques.

Figure 1 - High-level view of the data warehouse on demand model


In this model, when you need to process the stored data, you create an HDInsight Hadoop-based cluster
that uses the Azure blob storage container holding that data. When you finish processing the data you
can tear down the cluster without losing the original archived data (see Cluster and storage initialization
for information about preserving or recreating metadata such as Hive table definitions when you tear
down and then recreate an HDInsight cluster).
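As a rough illustration of recreating a table definition over archived data, the following Python sketch builds a HiveQL CREATE EXTERNAL TABLE statement that points at a blob storage folder. Dropping an external table removes only its metadata, so the files remain in storage. The table name, column names, container, and storage account shown here are hypothetical, and the helper function is an assumption made for this example rather than part of any HDInsight tool.

```python
def build_external_table_ddl(table, columns, container, account, path,
                             field_delimiter="\\t"):
    """Return a HiveQL CREATE EXTERNAL TABLE statement for files in blob storage.

    All argument values are supplied by the caller; nothing here contacts a cluster.
    """
    column_list = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    # wasb:// is the URI scheme HDInsight uses to address Azure blob storage.
    location = (f"wasb://{container}@{account}.blob.core.windows.net/"
                f"{path.strip('/')}")
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_list}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{field_delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical names, used only to show the shape of the generated statement.
ddl = build_external_table_ddl(
    table="vehicle_positions",
    columns=[("vehicle_id", "STRING"), ("recorded_at", "TIMESTAMP"),
             ("latitude", "DOUBLE"), ("longitude", "DOUBLE")],
    container="archive",
    account="examplestore",
    path="/fleet/positions/",
)
print(ddl)
```

Running a statement like this each time a cluster is created is one possible way to restore table definitions. HCatalog metadata can also be preserved by other means, as discussed in Cluster and storage initialization.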
You might also consider storing partly processed data where you have performed some translation or
summary of the data, but it is still in a relatively raw form that you want to keep in case it is useful in the
future. For example, you might use a stream capture tool to allocate incoming positional data from a
fleet of vehicles into separate categories or areas, add some reference keys to each item, and then store
the results ready for processing at a later date. Stream data may arrive in rapid bursts, and typically
generates very large files, so using HDInsight to capture and store the data helps to minimize the load on
your existing data management systems.

This model is also suitable for use as a data store where you do not need to implement the typical data
warehouse capabilities. For example, you may just want to minimize storage cost when saving large
tabular format data files for use in the future, large text files such as email archives or data that you
must keep for legal or regulatory reasons but you do not need to process, or for storing large quantities
of binary data such as images or documents. In this case you simply load the data into the storage
associated with the cluster, without creating Hive tables for it.
You might, as an alternative, choose to use just an HBase cluster in this model. HBase can be accessed
directly from client applications through the Java APIs and the REST interface. You can load data directly
into HBase and query it using the built-in mechanisms. For information about HBase see Data storage
in the topic Specifying the infrastructure.
An example of applying this use case and model can be found in Scenario 2: Data warehouse on
demand.

When to choose this model


The data warehouse on demand model is typically suited to the following scenarios:

Storing data in a way that allows you to minimize storage cost by taking advantage of cloud-based
storage systems, and minimizing runtime cost by initiating a cluster to perform
processing only when required.

Exposing both the source data in raw form, and the results of queries executed over this data in
the familiar row and column format, to a wide range of data analysis tools. The processed
results can use a range of data types that includes both primitive types (including timestamps)
and complex types such as arrays, maps, and structures.

Storing schemas (or, to be precise, metadata) for tables that are populated by the queries you
execute, and partitioning the data in tables on one or more partition columns so that each
partition has a separate metadata definition and can be handled separately.

Creating views based on tables, and creating functions for use in both tables and queries.

Creating a robust data repository for very large quantities of data that is relatively low cost to
maintain compared to traditional relational database systems and appliances, where you do not
need the additional capabilities of these types of systems.

Consuming the results directly in business applications through interactive analytical tools such
as Excel, or in corporate reporting platforms such as SQL Server Reporting Services.

Data sources
Data sources for this model are typically data collected from internal and external business processes.
However, it may also include reference data and datasets obtained from other sources that can be
matched on a key to existing data in your data store so that it can be used to augment the results of
analysis and reporting processes. Some examples are:

Data generated by internal business processes, websites, and applications.

Reference data and data definitions used by business processes.

Datasets obtained from Azure Marketplace and other commercial data providers.

If you adopt this model simply as a commodity data store rather than a data warehouse, you might also
load data from other sources such as social media data, log files, and sensors; or streaming data that is
captured, filtered, and processed through a suitable tool or framework (see Collecting and loading data
into HDInsight).

Output targets
The main intention of this model is to provide the equivalent to a data warehouse system based on the
traditional relational database model, and expose it as Hive tables. You can use these tables in a variety
of ways, such as:

Combining the datasets for analysis, and using the result to generate reports and business
information.

Generating ancillary information such as related items or recommendation lists for use in
applications and websites.

Providing external access to the results through web applications, web services, and other
services.

Powering information systems such as SharePoint Server through web parts and the Business
Data Connector (BDC).

If you adopt this model simply as a commodity data store rather than a data warehouse, you might use
the data you store as an input for any of the models described in this section of the guide.
The data in an HDInsight data warehouse can be analyzed and visualized directly using any tools that can
consume Hive tables. Typical examples are:

SQL Server Reporting Services

SQL Server Analysis Services

Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power
Map

Custom or third party analysis and visualization tools

You can find more details of these tools in the topic Consuming and visualizing data from HDInsight. For
a discussion of using SQL Server Analysis Services see Corporate Data Model Level Integration in the
topic Use case 4: BI integration. You can also download a case study that describes using SQL Server
Analysis Services with Hive.

Considerations
There are some important points to consider when choosing the data warehouse on demand model:

This model is typically used when you want to:

Create a central point for analysis and reporting by multiple users and tools.

Store multiple datasets for use by internal applications and tools.

Host your data in the cloud to benefit from reliability and elasticity, to minimize cost, and to
reduce administration overhead.

Store both externally collected data and data generated by internal tools and processes.

Refresh the data at scheduled intervals or on demand.

You can use Hive to:

Define tables that have the familiar row and column format, with a range of data types for
the columns that includes both primitive types (including timestamps) and complex types
such as arrays, maps, and structures.

Load data from storage into tables, save data to storage from tables, and populate tables
from the results of running a query.

Create indexes for tables, and partition tables on one or more partition columns so that each
partition has a separate metadata definition and can be handled separately.

Rename, alter and drop tables, and modify columns in a table as required.

Create views based on tables, and create functions for use in both tables and queries.
The main limitation of Hive tables is that you cannot create constraints such as foreign key
relationships that are automatically managed. For more details of how to work with Hive
tables, see Hive Data Definition Language on the Apache Hive website.

You can store the Hive queries and views within HDInsight so that they can be used to extract
data on demand in much the same way as the stored procedures in a relational database.
However, to minimize response times you will probably need to pre-process the data where
possible using queries within your solution, and store these intermediate results in order to
reduce the time-consuming overhead of complex queries. Incoming data may be processed by
any type of query, not just Hive, to cleanse and validate the data before converting it to table
format.

You can use the Hive ODBC connector in SQL Server with HDInsight to create linked servers. This
allows you to write Transact-SQL queries that join tables in a SQL Server database to tables
stored in an HDInsight data warehouse.

If you want to be able to delete and restore the cluster, as is typically the case for this model,
there are additional considerations when creating a cluster. See Cluster and storage
initialization for more information.

Use case 3: ETL automation


In a traditional business environment, the data to power your reporting mechanism will usually come
from tables in a database. However, it's increasingly necessary to supplement this with data obtained
from outside your organization. This may be commercially available datasets, such as those available
from Azure Marketplace and elsewhere, or it may be data from less structured sources such as social
media, emails, log files, and more.
You will, in most cases, need to cleanse, validate, and transform this data before loading it into an
existing database. Extract, Transform, and Load (ETL) operations can use Hadoop-based systems such as
HDInsight to perform pattern matching, data categorization, de-duplication, and summary operations on
unstructured or semi-structured data to generate data in the familiar rows and columns format that can
be imported into a database table or a data warehouse. ETL is also the primary way to ensure that the
data is valid and contains correct values, while data cleansing is the gatekeeper that protects tables from
invalid, duplicated, or incorrect values.
The following sections of this topic provide more information:

Use case and model overview

When to choose this model

Data sources

Output targets

Considerations

There is often some confusion between the terms ETL and ELT. ETL, as used here, is generally the
better-known term, and describes performing a transformation on incoming data before loading it into a data
warehouse. ELT is the process of loading it into the data warehouse in raw form and then transforming it
afterwards. Because the Azure blob storage used by HDInsight can store schema-less data, storing the
raw data is not an issue (it might be when the target is a relational data store). The data is then
extracted from blob storage, transformed, and the results are loaded back into blob storage. See ETL or
ELT or both? on the Microsoft OLAP blog for a more complete discussion of this topic.

Use case and model overview


Figure 1 shows an overview of the use case and model for ETL automation. Input data is transformed to
generate the appropriate output format and data content, and then imported into the target data store,
application, or reporting solution. Analysis and reporting can then be done against the data store, often
by combining the imported data with existing data in the data store. Applications such as reporting tools
and services can then consume this data in an appropriate format, and use it for a variety of purposes.

Figure 1 - High-level view of the ETL automation model


The transformation process may involve just a single query, but it is more likely to require a multi-step
process. For example, it might use custom map/reduce components or (more likely) Pig scripts, followed
by a Hive query in the final stage to generate a tabular format. However, the final format may be
something other than a table. For example, it may be a tab delimited file or some other format suitable
for import into the target application.
An example of applying this use case and model can be found in Scenario 3: ETL automation.
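The multi-step shape of such a transformation can be sketched in a few lines of Python. This is only a local, in-memory illustration of the cleanse, de-duplicate, summarize, and export stages; in a real solution each stage would more likely be a Pig script, a map/reduce job, or a Hive query running in the cluster. The sample log lines and function names are invented for this example.

```python
import csv
import io

# Invented sample input: timestamp, severity, message separated by tabs.
RAW_LINES = [
    "2014-06-01 10:00:01\tERROR\tdisk full",
    "2014-06-01 10:00:01\tERROR\tdisk full",       # exact duplicate
    "2014-06-01 10:05:12\twarning\tslow response",  # inconsistent case
    "malformed line without tabs",                  # invalid row
    "2014-06-01 10:07:45\tERROR\ttimeout",
]


def cleanse(lines):
    """Step 1: drop malformed rows and normalize the severity field."""
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        timestamp, severity, message = (p.strip() for p in parts)
        yield timestamp, severity.upper(), message


def deduplicate(rows):
    """Step 2: remove exact duplicates while keeping the original order."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


def summarize(rows):
    """Step 3: count rows per severity, giving a small tabular result."""
    counts = {}
    for _, severity, _ in rows:
        counts[severity] = counts.get(severity, 0) + 1
    return sorted(counts.items())


def to_tab_delimited(rows):
    """Step 4: write the result as tab delimited text for import elsewhere."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


summary = summarize(deduplicate(cleanse(RAW_LINES)))
output = to_tab_delimited(summary)
print(output)
```

Keeping each stage as a separate function mirrors the way the stages would be separate jobs in the cluster, so that any one of them can be replaced or rerun independently.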

When to choose this model


The ETL automation model is typically suited to the following scenarios:

Extracting and transforming data before you load it into your existing databases or analytical
tools.

Performing categorization and restructuring of data, and for extracting summary results to
remove duplication and redundancy.

Preparing data so that it is in the appropriate format and has appropriate content to power
other applications or services.

Data sources
Data sources for this model are typically external data that can be matched on a key to existing data in
your data store so that it can be used to augment the results of analysis and reporting processes. Some
examples are:

Social media data, log files, sensors, and applications that generate data files.

Datasets obtained from Azure Marketplace and other commercial data providers.

Streaming data captured, filtered, and processed through a suitable tool or framework (see
Collecting and loading data into HDInsight).

Output targets
This model is designed to generate output that is in the appropriate format for the target data store.
Common types of data store are:

A database such as SQL Server or Azure SQL Database.

A document or file sharing mechanism such as SharePoint Server or other information
management systems.

A local or remote data repository in a custom format, such as JSON objects.

Cloud data stores such as Azure table and blob storage.

Applications or services that require data to be processed into specific formats, or as files that
contain specific types of information structure.

You may decide to use this model even when you don't actually want to keep the results of the big data
query. You can load it into your database, generate the reports and analyses you require, and then
delete the data from the database. You may need to do this every time if the source data changes
between each reporting cycle in a way that means just adding new data is not appropriate.

Considerations
There are some important points to consider when choosing the ETL automation model:

This model is typically used when you want to:

Load stream data or large volumes of semi-structured or unstructured data from external
sources into an existing database or information system.

Cleanse, transform, and validate the data before loading it; perhaps by using more than one
transformation pass through the cluster.

Generate reports and visualizations that are regularly updated.

Power other applications that require specific types of data, such as using an analysis of
previous behavioral information to apply personalization to an application or service.

When the output is in tabular format, such as that generated by Hive, the data import process
can use the Hive ODBC driver or LINQ to Hive. Alternatively, you can use Sqoop (which is
included in the Hadoop distribution installed by HDInsight) to connect a relational database
such as SQL Server or Azure SQL Database to your HDInsight data store and export the results of
a query into your database. If you are using Microsoft Analytical Platform System (APS) you can
access the data in HDInsight using PolyBase, which acts as a bridge between APS and HDInsight
so that it becomes just another data source available for use in queries and processes in APS.
Some other connectors for accessing Hive data are available from Couchbase, Jaspersoft, and
Tableau Software.

If the target for the data is not a database, you can generate a file in the appropriate format
within the query. This might be tab delimited format, fixed width columns, some other format
for loading into Excel or a third-party application, or even for loading into Azure storage through
a custom data access layer that you create. Azure table storage can be used to store table
formatted data using a key to identify each row. Azure blob storage is more suitable for storing
compressed or binary data generated from the HDInsight query if you want to store it for reuse.

If the intention is to regularly update the target table or data store as the source data changes
you will probably choose to use an automated mechanism to execute the query and data
import processes. However, if it is a one-off operation you may decide to execute it interactively
only when required.

If you need to execute several operations on the data as part of the ETL process you should
consider how you manage these. If they are controlled by an external program, rather than as a
workflow within the solution, you will need to decide whether some can be executed in parallel,
and you must be able to detect when each job has completed. Using a workflow mechanism
such as Oozie within Hadoop may be easier than trying to orchestrate several operations using
external scripts or custom programs. See Workflow and job orchestration for more information
about Oozie.
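To illustrate the orchestration concern described in the last point above, the following Python sketch shows what an external controller has to do: run independent jobs in parallel, detect when each one completes, and start the next stage only after the previous stage has finished. The job names are hypothetical, and run_job is a stand-in for whatever mechanism actually submits a job to the cluster and waits for it. It is not an HDInsight or Oozie API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(name):
    """Stand-in for submitting a job to the cluster and waiting for it to finish."""
    return f"{name}: done"


def run_workflow(stages):
    """Run the stages in order; jobs within one stage run in parallel."""
    completed = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for stage in stages:
            futures = {pool.submit(run_job, name): name for name in stage}
            # as_completed yields each future as soon as its job finishes.
            for future in as_completed(futures):
                future.result()  # re-raises the exception if the job failed
                completed.append(futures[future])
    return completed


# Hypothetical ETL process: two independent cleansing jobs, then a summary,
# then the export of the results.
stages = [
    ["cleanse_logs", "cleanse_sensor_data"],
    ["summarize"],
    ["export_results"],
]
order = run_workflow(stages)
print(order)
```

Even this small controller has to handle parallelism, completion detection, and failure propagation itself, which is the kind of work a workflow mechanism such as Oozie is intended to take over.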

Use case 4: BI integration


Microsoft offers a set of applications and services to support end-to-end solutions for working with data
(described in Understanding Microsoft big data solutions). The Microsoft data platform includes all of
the elements that make up an enterprise level business intelligence (BI) system for organizations of any
size. Organizations that already use an enterprise BI solution for business analytics and reporting can
extend their analytical capabilities by using a big data solution such as HDInsight to add new sources of
data to their decision making processes.
The following sections of this topic provide more information:

Use case and model overview

When to choose this model

Data sources

Output targets

Considerations

Summary of integration level scenarios and considerations

The information in this section will help you to understand how you can integrate HDInsight with an
enterprise BI system. However, a complete discussion of enterprise BI systems is beyond the scope of
this guide.

Use case and model overview


Enterprise BI is a topic in itself, and there are several factors that require special consideration when
integrating a big data solution such as HDInsight with an enterprise BI system. Figure 1 shows an
overview of a typical enterprise data warehouse and BI solution.

Figure 1 - Overview of a typical enterprise data warehouse and BI implementation


Data flows from business applications and other external sources into a data warehouse through an ETL
process. Corporate data models are then used to provide shared analytical structures such as online
analytical processing (OLAP) cubes or data mining models, which can be consumed by business users
through analytical tools and reports. Some BI implementations also enable analysis and reporting
directly from the data warehouse, enabling advanced users such as business analysts and data scientists
to create their own personal analytical data models and reports.
Many technologies and techniques can be used to integrate HDInsight with existing BI technologies.
However, as a general guide there are three primary levels of integration, as shown in Figure 2.

Figure 2 - Three levels of integration for big data with an enterprise BI system
The integration levels shown in Figure 2 are:

Report level integration. Data from HDInsight is used in reporting and analytical tools to
augment data from corporate BI sources, enabling the creation of reports that include data
from corporate BI sources as well as from HDInsight, and also enabling individual users to
combine data from both solutions into consolidated analyses. This level of integration is
typically used for creating mashups, exploring datasets to discover possible queries that can find
hidden information, and for generating one-off reports and visualizations.

Corporate data model level integration. HDInsight is used to process data that is not present in
the corporate data warehouse, and the results of this processing are then added to corporate
data models where they can be combined with data from the data warehouse and used in
multiple corporate reports and analysis tools. This level of integration is typically used for
exposing the data in specific formats to information systems, and for use in reporting and
visualization tools.

Data warehouse level integration. HDInsight is used to prepare data for inclusion in the
corporate data warehouse. The data that has been loaded is then available throughout the
entire enterprise BI solution. This level of integration is typically used to create standalone
tables on the same database hardware as the enterprise data warehouse, which provides a
single source of enterprise data for analysis, or to incorporate the data into a dimensional
schema and populate dimension and fact tables for full integration into the BI solution.

An example of applying this use case and model can be found in Scenario 4: BI integration.
The following sections describe the three integration levels in more detail to help you understand the
implications of your choice. They also contain guidelines for implementing each one. However, keep in
mind that you don't have to use the same integration level for all of your processes. You can use a
different approach for each dataset that you extract from HDInsight, depending on the scenario and the
requirements for that dataset.

Report level integration


Using a big data solution in a standalone iterative and experimental way, as described in Use case 1:
Iterative exploration, can unlock information from data sources that have not yet been analyzed.
However, much greater business value can be obtained by integrating the results from a big data
solution with data and BI activities that are already present in the organization.
In contrast to the rigidly defined reports created by BI developers, a growing trend is a self-service
approach. In this approach the data warehouse provides some datasets based on data models and
queries defined there, but the user selects the datasets and builds a custom report. The ability to
combine multiple data sources in a personal data model enables a more flexible approach to data
exploration that goes beyond the constraints of a formally managed corporate data warehouse. Users
can augment reports and analysis of data from the corporate BI solution with additional data from a big
data solution to create mashups that bring together data from both sources into a single, consolidated
report.
You can use the following techniques to integrate HDInsight with enterprise BI data at the report level:

Use the Power Query add-in to download the output files generated in the cluster and open
them in Excel, or import them into a database for reporting.

Create Hive tables in the cluster and consume them directly from Excel (including using Power
Query, Power Pivot, Power View, and Power Map) or from SQL Server Reporting Services (SSRS)
by using the Hive ODBC driver.

Download the required data as a delimited file from the cluster's Azure blob storage container,
perhaps by using PowerShell, and open it in Excel or another data analysis and visualization
tool.
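As a minimal sketch of the last technique, the following Python code parses tab delimited output of the kind that might be downloaded from the cluster and combines it with existing corporate data on a shared key to produce a simple mashup. The product identifiers, values, and function names are invented for illustration. In practice the text would be read from the downloaded file, and the corporate data would come from the BI system, rather than being defined inline.

```python
import csv
import io

# Invented sample of downloaded query output: product_id <TAB> mention_count.
HDINSIGHT_OUTPUT = "P100\t42\nP200\t7\nP999\t3\n"

# Invented sample of corporate data: sales amount keyed by product_id.
CORPORATE_SALES = {"P100": 1250.0, "P200": 310.5, "P300": 99.0}


def load_delimited(text, delimiter="\t"):
    """Parse two-column delimited text into a {key: count} dictionary."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return {row[0]: int(row[1]) for row in reader if row}


def mashup(mentions, sales):
    """Keep every corporate product and attach its mention count, or 0 if none."""
    return [
        {"product_id": pid, "sales": amount, "mentions": mentions.get(pid, 0)}
        for pid, amount in sorted(sales.items())
    ]


report = mashup(load_delimited(HDINSIGHT_OUTPUT), CORPORATE_SALES)
for row in report:
    print(row)
```

The corporate data drives the result here, so products that appear only in the big data output (P999 in this sample) are left out of the report. Whether that is the right choice depends on the purpose of the report.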

Corporate data model level integration


Integration at the report level makes it possible for advanced users to combine data from the cluster
with existing corporate data sources to create data mashups. However, there may be some scenarios
where you need to deliver combined information to a wider audience, or to users who do not have the
time or ability to create their own complex data analysis solutions. Alternatively, you might want to take
advantage of the functionality available in corporate data modeling platforms to add value to the
insights you have gained from the information contained within the source data.

By integrating data from your big data solution into corporate data models you can accomplish both of
these aims, and use the data as the basis for enterprise reporting and analytics. Integrating the output
from HDInsight with your corporate data models allows you to use tools such as SQL Server Analysis
Services (SSAS) to analyze the data and present it in a format that is easy to use in reports, or for
performing deeper analysis.
You can use the following techniques to integrate the results into a corporate data model:

Create Hive tables in the cluster and consume them directly from a SSAS tabular model by using
the Hive ODBC driver. SSAS in tabular mode supports the creation of data models from multiple
data sources and includes an OLE DB provider for ODBC, which can be used as a wrapper
around the Hive ODBC driver.

Create Hive tables in the cluster and then create a linked server in the instance of the SQL
Server database source used by an SSAS multidimensional data model so that the Hive tables
can be queried through the linked server and imported into the data model. SSAS in
multidimensional mode can only use a single OLE DB data source, and the OLE DB provider for
ODBC is not supported.

Use Sqoop or SQL Server Integration Services (SSIS) to copy the data from the cluster to a SQL
Server database engine instance that can then be used as a source for an SSAS tabular or
multidimensional data model.

Note that you must choose between the multidimensional and the tabular data mode when you install
SQL Server Analysis Services, though you can install two instances if you need both modes.
When installed in tabular mode, SSAS supports the creation of data models that include data from
multiple diverse sources, including ODBC-based data sources such as Hive tables.
When installed in multidimensional mode, SSAS data models cannot be based on ODBC sources due to
some restrictions in the designers for multidimensional database objects. To use Hive tables as a source
for a multidimensional SSAS model, you must either extract the data from Hive into a suitable source for
the multidimensional model (such as a SQL Server database), or use the Hive ODBC driver to define a
linked server in a SQL Server instance that provides pass through access to the Hive tables, and then
use the SQL Server instance as the data source for the multidimensional model. You can download a
case study that describes how to create a multidimensional SSAS model that uses a linked server in a
SQL Server instance to access data in Hive tables.

Data warehouse level integration


Integration at the corporate data model level enables you to include data from a big data solution in a
data model that is used for analysis and reporting by multiple users, without changing the schema of the
enterprise data warehouse. However, in some cases you might want to use HDInsight as a component of
the overall ETL solution that is used to populate the data warehouse, in effect using a source of data
queried through HDInsight just like any other business data source, and consolidating the data from all
sources into an enterprise dimensional model.
In addition, as with other data sources, it's likely that the data import process from the cluster into the
database tables will occur on a schedule so that the data is as up to date as possible. The schedule will
depend on the time taken to execute the queries and perform ETL tasks prior to loading the results into
the database tables.
You can use the following techniques to integrate data from HDInsight into an enterprise data
warehouse:

Use Sqoop to copy data directly into database tables. These might be tables in a SQL Server data
warehouse that you do not want to integrate into the dimensional model of the data
warehouse. Alternatively, they may be tables in a staging database where the data can be
validated, cleansed, and conformed to the dimensional model of the data warehouse before
being loaded into the fact and dimension tables. Any firewalls located between the cluster and
the target database must be configured to allow the database protocols that Sqoop uses.

Use PolyBase for SQL Server to copy data directly into database tables. PolyBase is a component
of Microsoft Analytical Platform System (APS) and is available only on APS appliances (see
PolyBase on the SQL Server website). These might be tables in a SQL Server data warehouse that
you do not want to integrate into the dimensional model of the data warehouse.
Alternatively, they may be tables in a staging database where the data can be validated,
cleansed, and conformed to the dimensional model of the data warehouse before being loaded
into the fact and dimension tables. Any firewalls located between the cluster and the target
database must be configured to allow the database protocols that PolyBase uses.

Create an SSIS package that reads the output file from the cluster, or uses the Hive ODBC driver
to extract the data, and then validates, cleanses, and transforms it before loading it into the fact
and dimension tables in the data warehouse.

Create a Linked Server in SQL Server that links to Hive tables in HDInsight through the Hive
ODBC Driver. You can then execute SQL queries that extract the data from HDInsight. However,
you must be aware of some issues such as compatible data types and some language syntax
limitations. For more information see How to create a SQL Server Linked Server to HDInsight
HIVE using Microsoft Hive ODBC Driver.

When reading data from HDInsight you must open port 1000 on the cluster. You can do this using the
management portal. For more information see Configure the Windows Firewall to Allow SQL Server
Access.

When to choose this model


This model is typically suited to the following scenarios:

You have an existing enterprise data warehouse and BI system that you want to augment with
data from outside your organization.

You want to explore new ways to combine data in order to provide better insight into history
and to predict future trends.

You want to give users more opportunities for self-service reporting and analysis that combines
managed business data and big data from other sources.

Data sources
The input data can be almost anything, but for the BI integration model it typically includes the
following:

Social media data, log files, sensor data, and the output from applications that generate data
files.

Datasets obtained from Azure Marketplace and other commercial data providers.

Streaming data captured, filtered, and processed through a suitable tool or framework (see
Collecting and loading data into HDInsight).

Output targets
The results from your HDInsight queries can be visualized using any of the wide range of tools that are
available for analyzing data, combining it with other datasets, and generating reports. Typical examples
for the BI integration model are:

SQL Server Reporting Services (SSRS).

SharePoint Server or other information management systems.

Business performance dashboards such as PerformancePoint Services in SharePoint Server.

Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power
Map.

Custom or third party analysis and visualization tools.

You will see more details of these tools in Consuming and visualizing data from HDInsight.

Considerations
There are some important points to consider when choosing the BI integration model:

This model is typically used when you want to:

Integrate external data sources with your enterprise data warehouse.

Augment the data in your data warehouse with external data.

Update the data at scheduled intervals or on demand.

ETL processes in a data warehouse usually execute on a scheduled basis to add new data to the
warehouse. If you intend to integrate the results from HDInsight into your data warehouse so
that the information stored there is updated, you must consider how you will automate and
schedule the tasks of executing the query and importing the results.

You must ensure that data imported from your HDInsight solution contains valid values,
especially where there are typically multiple common possibilities (such as in street addresses
and city names). You may need to use a data cleansing mechanism such as Data Quality Services
to force such values to the correct leading value.

Most data warehouse implementations use slowly changing dimensions to manage the history
of values that change over time. Different versions of the same dimension member have the
same alternate key but unique surrogate keys, and so you must ensure that data imported into
the data warehouse tables uses the correct surrogate key value. This means that you must
either:

Use some complex logic to match the business key in the source data (which will typically
be the alternate key) with the correct surrogate key in the data model when you join the
tables. If you simply join on the alternate key, some loss of data accuracy may occur
because the alternate key is not guaranteed to be unique.

Load the data into the data warehouse and conform it to the dimensional data model,
including setting the correct surrogate key values.

One of the difficult tasks in full data warehouse integration is matching rows imported from a big data
solution to the correct dimension members in the data warehouse dimension tables. You must use a
combination of the alternate key and the point in time to which the imported row relates in order to
look up the correct surrogate key. This key can differ based on the date when changes were made to the
original entity. For example, a product may have more than one surrogate key over its lifetime, and the
imported data must match the correct version of this key.
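The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of HDInsight or SQL Server: the dimension rows, the `valid_from` and `valid_to` column names, and the `find_surrogate_key` helper are all hypothetical.

```python
from datetime import date

# Hypothetical product dimension with slowly changing history.
# Every version of a product shares the alternate (business) key but has
# its own surrogate key and validity period. A valid_to of None means "current".
PRODUCT_DIM = [
    {"surrogate_key": 1, "alternate_key": "P-100",
     "valid_from": date(2012, 1, 1), "valid_to": date(2013, 3, 31)},
    {"surrogate_key": 7, "alternate_key": "P-100",
     "valid_from": date(2013, 4, 1), "valid_to": None},
    {"surrogate_key": 2, "alternate_key": "P-200",
     "valid_from": date(2012, 1, 1), "valid_to": None},
]


def find_surrogate_key(dimension, alternate_key, as_of):
    """Return the surrogate key of the member version valid on the as_of date."""
    for row in dimension:
        if row["alternate_key"] != alternate_key:
            continue
        ends = row["valid_to"]
        if row["valid_from"] <= as_of and (ends is None or as_of <= ends):
            return row["surrogate_key"]
    return None  # no matching dimension member; the imported row needs attention


if __name__ == "__main__":
    # The same product resolves to different surrogate keys on different dates.
    print(find_surrogate_key(PRODUCT_DIM, "P-100", date(2013, 1, 15)))
    print(find_surrogate_key(PRODUCT_DIM, "P-100", date(2013, 6, 1)))
```

Joining on the alternate key alone would match both versions of product P-100, which is the loss of accuracy that the considerations above warn about.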

Summary of integration level scenarios and considerations


The following table summarizes the primary considerations for deciding whether to adopt one of the
three integration approaches described in this topic. Each approach has its own benefits and challenges.


Integration level: None

Typical scenarios:

No enterprise BI solution currently exists.

The organization wants to evaluate HDInsight and big data analysis without affecting current
BI and business operations.

Business analysts in the organization want to explore external or unmanaged data sources
that are not included in the managed enterprise BI solution.

Considerations:

Open-ended exploration of unmanaged data could produce a lot of business information, but
without rigorous validation and cleansing the data might not be completely accurate.

Continued experimental analysis may become a distraction from the day-to-day running of the
business, particularly if the data being analyzed is not related to core business data.

Integration level: Report Level

Typical scenarios:

A small number of individuals with advanced self-service reporting and data analysis skills
need to augment corporate data that is available from the enterprise BI solution with
big data from other sources.

A single report combining corporate data and some external data is required for one-time
analysis.

Considerations:

It can be difficult to find common data values on which to join data from multiple sources to gain
meaningful comparisons.

Integrated data in a report can be difficult to share and reuse in different ways, though the ability to
share Power Pivot workbooks in SharePoint Server may provide a solution for small to medium
sized groups of users.

Integration level: Corporate Data Model Level

Typical scenarios:

A wide audience of business users must consume multiple reports that rely on data
from both the corporate data warehouse and big data. These users may not have the skills
necessary to create their own reports and data models.

Data from a big data solution is required for specific business logic in a corporate data
model that is used for reports and dashboards.

Considerations:

It can be difficult to find common data values on which to join data from multiple sources.

Integrating data generated by HDInsight into tabular SSAS models can be accomplished easily
through the Hive ODBC driver. Integration with multidimensional models is more challenging.

Integration level: Data Warehouse Level

Typical scenarios:

The organization wants a complete managed BI platform for reporting and analysis that
includes business application data sources as well as big data sources.

The desired level of integration between data from HDInsight and existing business data,
and the required tolerance for data integrity and accuracy across all data sources,
necessitates a formal dimensional model with conformed dimensions.

Considerations:

Data warehouses typically have demanding data validation requirements.

Loading data into a data warehouse that includes slowly changing dimensions with surrogate keys
can require complex ETL processes.


Scenario 1: Iterative exploration


A common big data analysis scenario is to explore data iteratively, refining the processing techniques
used until you discover something interesting or find the answers you seek. In this example of the
iterative exploration use case and model, HDInsight is used to perform some basic analysis of data from
Twitter.
Social media analysis is a common big data use case, and this example demonstrates how to extract
information from semi-structured data in the form of tweets. However, the goal of this scenario is not to
provide a comprehensive guide to analyzing data from Twitter, but rather to show an example of
iterative data exploration with HDInsight and demonstrate some of the techniques discussed in this
guide. Specifically, this scenario describes:

Finding insights in data

Introduction to Blue Yonder Airlines

Analytical goals and data sources

Phase 1: Initial data exploration

Phase 2: Refining the solution

Phase 3: Stabilizing the solution

Consuming the results

Finding insights in data


Previously in this guide you saw that one of the typical uses of a big data solution such as HDInsight is to
explore data that you already have, or data you collect speculatively, to see if it can provide insights into
information that you can use within your organization. The decision flow shown in Figure 1 is an
example of how you might start with a guess based on intuition, and progress towards a repeatable
solution that you can incorporate into your existing BI systems. Alternatively, you may discover that
there is no interesting information in the data, in which case the cost of finding this out is minimized
by using a pay-for-what-you-use service that you can set up and then tear down again quickly and easily.


Figure 1 - The iterative exploration cycle for finding insights in data

Introduction to Blue Yonder Airlines


This example is based on a fictitious company named Blue Yonder Airlines. The company is an airline
serving passengers in the USA, and operates flights from its home hub at JFK airport in New York to
Sea-Tac airport in Seattle and LAX in Los Angeles. The company has a customer loyalty program named
Blue Yonder Points.
Some months ago, the CEO of Blue Yonder Airlines was talking to a colleague who mentioned that he
had seen tweets that were critical of the company's seating policy. The CEO decided that the company
should investigate this possible source of valuable customer feedback, and instructed her customer
service department to start using Twitter as a means of communicating with its customers.
Customers send tweets to @BlueYonderAirlines and use the standard Twitter convention of including
hashtags to denote key terms in their messages. In order to provide a basis for analyzing these tweets,
the CEO also asked the BI team to start collecting any that mention @BlueYonderAirlines.

Analytical goals and data sources


Initially the plan is simply to collect enough data to begin exploring the information it contains. To
determine if the results are both useful and valid, the team must collect enough data to generate a
statistically valid result. However, they do not want to invest significant resources and time at this point,
and so use a manual interactive process for collecting the data from Twitter using the public API.
Even though they don't know what they are looking for at this stage, the managers still have an
analytical goal: to explore the data and discover if, as they are led to believe, it really does contain
useful information that could provide a benefit for their business.
The Twitter data that has been captured consists of tab-delimited text files in the following format:
Source Data
4/16/2013 http://twitter.com/CameronWhite/statuses/123456789 CameronWhite (Cameron White) terrible trip @blueyonderairlines missed my connection because of a delay :( #SEATAC
4/16/2013 http://twitter.com/AmelieWilkins/statuses/123456790 AmelieWilkins (Amelie Wilkins) terrific journey @blueyonderairlines - favorite movie on in-flight entertainment! #JFK_Airport
4/16/2013 http://twitter.com/EllaWilliamson/statuses/123456791 EllaWilliamson (Ella Williamson) lousy experience @blueyonderairlines - 2 hour delay! #SEATAC_Airport
4/16/2013 http://twitter.com/BarbaraWilson/statuses/123456792 BarbaraWilson (Barbara Wilson) fantastic time @blueyonderairlines - great film on my seat back screen :) #blueyonderpoints
4/16/2013 http://twitter.com/WilliamWoodward/statuses/123456793 WilliamWoodward (William Woodward) dreadful voyage @blueyonderairlines - entertainment system and onboard wifi not working! #IHateFlying
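To make the record layout concrete, the following Python sketch splits one source line into its fields. It assumes that each record holds four tab-separated fields (publication date, status URL, author, and tweet text); the `parse_tweet` helper and the dictionary keys are hypothetical names used only for illustration.

```python
import re
from datetime import datetime

# Assumed layout: four tab-separated fields (date, status URL, author, text).
SAMPLE = ("4/16/2013\thttp://twitter.com/CameronWhite/statuses/123456789\t"
          "CameronWhite (Cameron White)\t"
          "terrible trip @blueyonderairlines missed my connection "
          "because of a delay :( #SEATAC")


def parse_tweet(line):
    """Split one source line into a dictionary of typed fields."""
    pub_date, url, author, text = line.rstrip("\n").split("\t", 3)
    return {
        "pub_date": datetime.strptime(pub_date, "%m/%d/%Y").date(),
        "url": url,
        "author": author,
        "text": text,
        # Hashtags mark the key terms that customers chose themselves.
        "hashtags": re.findall(r"#(\w+)", text),
    }


if __name__ == "__main__":
    print(parse_tweet(SAMPLE))
```

A quick check like this on a handful of lines can confirm that the delimiter and field order are what the later processing steps expect.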


Although there is no specific business decision under consideration, the customer services managers
believe that some analysis of the tweets sent by customers may reveal important information about
how they perceive the airline and the issues that matter to customers. The kinds of question the team
expects to answer are:

Are people talking about Blue Yonder Airlines on Twitter?

If so, are there any specific topics that regularly arise?

Of these topics, if any, is it possible to get a realistic view of which are the most important?

Does the process provide valid and useful information? If not, can it be refined to produce
more accurate and useful results?

If the results are valid and useful, can the process be made repeatable?

Collecting and uploading the source data


The data analysts at Blue Yonder Airlines have been collecting data from Twitter and uploading it to
Azure blob storage ready to begin the investigation. They have gathered sufficient data over a period of
weeks to ensure a suitable sample for analysis. To upload the source data files to Azure storage, the
data analysts used the following Windows PowerShell script:
Windows PowerShell
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$localFolder = "D:\Data\Tweets"
$destfolder = "tweets"
# Get the storage account key and create a storage context.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
# Upload each file in the local folder to the destination folder in the container.
$files = Get-ChildItem $localFolder
foreach($file in $files){
  $fileName = "$localFolder\$($file.Name)"
  $blobName = "$destfolder/$($file.Name)"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"

The source data and results will be retained in Azure blob storage for visualization and further
exploration in Excel after the investigation is complete.


If you are just experimenting with data to see if it is useful, you probably won't want to spend inordinate
amounts of time and resources building a complex or automated data ingestion mechanism. Often it is
easier and quicker to just use a simple PowerShell script. For details of other options for ingesting data
see Collecting and loading data into HDInsight.

The HDInsight infrastructure


Blue Yonder Airlines does not have the capital expenditure budget to purchase new servers for the
project, especially as the hardware may only be required for the duration of the project. Therefore, the
most appropriate infrastructure approach is to provision an HDInsight cluster on Azure that can be
released when no longer required, minimizing the overall cost of the project.
To further minimize cost the team began with a single node HDInsight cluster, rather than accepting the
default of four nodes when initially provisioning the cluster. However, if necessary the team can
provision a larger cluster should the processing load indicate that more nodes are required. At the end
of the investigation, if the results indicate that valuable information can be extracted from the data, the
team can fine tune the solution to use a cluster with the appropriate number of nodes.
For details of the cost of running an HDInsight cluster, see HDInsight Pricing Details. In addition to the
cost of the cluster you must pay for a storage account to hold the data for the cluster.

Iterative data processing


After capturing a suitable volume of data and uploading it to Azure blob storage the analysts can
configure an HDInsight cluster associated with the blob container holding the data, and then begin
processing it. In the absence of a specific analytical goal, the data processing follows an iterative pattern
in which the analysts explore the data to see if they find anything that indicates specific topics of
interest for customers, and then build on what they find to refine the analysis.
At a high level, the iterative process breaks down into three phases:

Explore: the analysts explore the data to determine what potentially useful information it
contains.

Refine: when some potentially useful data is found, the data processing steps used to query
the data are refined to maximize the analytical value of the results.

Stabilize: when a data processing solution that produces useful analytical results has been
identified, it is stabilized to make it robust and repeatable.

Although many big data solutions will be developed using the stages described here, it's not
mandatory. You may know exactly what information you want from the data, and how to extract it.
Alternatively, if you don't intend to repeat the process, there's no point in refining or stabilizing it.


Phase 1: Initial data exploration


In the first phase of the analysis the team at Blue Yonder Airlines decided to explore the data
speculatively by coming up with a hypothesis about what information the data might be able to reveal,
and using HDInsight to process the data and generate results that validate the hypothesis. The goal for
this phase is open-ended; the exploration might result in a specific avenue of investigation that merits
further refinement of a particular data processing solution, or it might simply prove (or disprove) an
assumption about the data.
An alternative technique from that described here might be to use the more recent capabilities of Hive
to estimate frequency distribution. For more details see Statistics and Data Mining in Hive.

Using Hive to explore the volume of tweets


In most cases the simplest way to start exploring data with HDInsight is to create a Hive table, and then
query it with HiveQL statements. The analysts at Blue Yonder Airlines created a Hive table based on the
source data obtained from Twitter. The following HiveQL code was used to define the table and load the
source data into it.
HiveQL
CREATE EXTERNAL TABLE Tweets (PubDate DATE, TweetID STRING, Author STRING, Tweet STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/tweets';
LOAD DATA INPATH '/tweets' INTO TABLE Tweets;

An external table was used so that the table can be dropped without deleting the data, and recreated as
the analysis continues.
The analysts hypothesized that the use of Twitter to communicate with the company is significant, and
that the volume of tweets that mention the company is growing. They therefore used the following
query to determine the daily volume and trend of tweets.
HiveQL
SELECT PubDate, COUNT(*) TweetCount FROM Tweets GROUP BY PubDate SORT BY PubDate;

The results of this query are shown in the following table.


PubDate      TweetCount
4/16/2013    1964
4/17/2013    2009
4/18/2013    2058
4/19/2013    2107
4/20/2013    2160
4/21/2013    2215
4/22/2013    2274

These results seem to validate the hypothesis that the volume of tweets is growing. It may be worth
refining this query to include a larger set of source data that spans a longer time period, and potentially
include other aggregations in the results such as the number of distinct authors that tweeted each day.
However, while this analytical approach might reveal some information about the importance of Twitter
as a channel for customer communication, it doesn't provide any information about the specific topics
that concern customers. To determine what's important to the airline's customers, the analysts must
look more closely at the actual contents of the tweets.
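The refinement suggested above, counting distinct authors alongside the daily tweet volume, can be illustrated with a small Python sketch. The sample rows and the `daily_summary` helper are hypothetical; in HiveQL the same idea would most likely be expressed by adding a COUNT(DISTINCT Author) column to the query.

```python
from collections import defaultdict

# Hypothetical (pub_date, author) pairs standing in for rows of the Tweets table.
ROWS = [
    ("4/16/2013", "CameronWhite"),
    ("4/16/2013", "AmelieWilkins"),
    ("4/16/2013", "CameronWhite"),
    ("4/17/2013", "EllaWilliamson"),
]


def daily_summary(rows):
    """Return {date: (tweet_count, distinct_author_count)}."""
    counts = defaultdict(int)
    authors = defaultdict(set)
    for pub_date, author in rows:
        counts[pub_date] += 1          # every row is one tweet
        authors[pub_date].add(author)  # a set keeps each author once
    return {day: (counts[day], len(authors[day])) for day in counts}


if __name__ == "__main__":
    for day, (tweets, distinct_authors) in daily_summary(ROWS).items():
        print(day, tweets, distinct_authors)
```

Comparing the two numbers helps to show whether growth in volume comes from more customers or from a few very active ones.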

Using map/reduce code to identify common terms


The analysts at Blue Yonder Airlines hypothesized that analysis of the individual words used in tweets
addressed to the airline's account might reveal topics that are important to customers. The team
decided to examine the individual words in the data, count the number of occurrences of each word,
and from this determine the main topics of interest.
Parsing the unstructured tweet text and identifying discrete words can be accomplished by
implementing a custom map/reduce solution. However, HDInsight includes a sample map/reduce
solution called WordCount that counts the number of occurrences of each word in a text source. The
sample is provided in various forms including Java, JavaScript, and C#.
There are many ways to create and execute map/reduce functions, though often you can use Hive or
Pig directly instead of resorting to writing map/reduce code.
The code consists of a map function that runs on each cluster node in parallel and parses the text input
to create a key/value pair with the value of 1 for every word found. These key/value pairs are passed to
the reduce function, which counts the total number of instances of each key. Therefore, the result is the
number of times each word was mentioned in the source data. An extract of the Java source code is
shown below.
Java (WordCount.java)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every token in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the values emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      result.set(sum);
      context.write(key, result);
    }
  }
  ...
}

For information about writing Java map/reduce code for HDInsight see Develop Java MapReduce
programs for HDInsight. For more details of the Java classes used when creating map/reduce functions
see Understanding MapReduce.
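The map and reduce steps described above can also be followed in a short, single-process Python sketch. This is an illustration of the logic only, not how Hadoop executes it: there is no distribution across nodes, and the function names `map_phase`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def reduce_phase(pairs):
    """Sum the values for each key, as the reducer does for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step on every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))
    # Hadoop sorts by key between map and reduce, so results are ordered by word.
    return sorted(reduce_phase(pairs).items())


if __name__ == "__main__":
    for word, total in word_count(["great flight", "great crew"]):
        print(word, total)
```

Because every token in a line is counted, any other fields stored on the same line (such as a URL or author name) are counted as words too, and the output is ordered by key rather than by frequency.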
The Java code, compiled to a .jar file, can be executed using the following PowerShell script.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$jarFile = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/example/jars/hadoop-mapreduce-examples.jar"
# Named $inputPath and $outputPath to avoid the automatic $input variable.
$inputPath = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/twitterdata/tweets"
$outputPath = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/twitterdata/words"
$jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName "wordcount" -Arguments $inputPath, $outputPath
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Map/Reduce job submitted..."
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError


For more information about running map/reduce jobs in HDInsight see Building custom clients in the
topic Processing, querying, and transforming data using HDInsight.
The job generates a file named part-r-00000 containing the total number of instances of each word in
the source data. An extract from the results is shown here.
Partial output from map/reduce job
http://twitter.com/<user_name>/statuses/12347297    1
http://twitter.com/<user_name>/statuses/12347149    1
in            1408
in-flight     1057
incredible    541
is            704
it            352
it!           352
job           1056
journey       1057
just          352
later         352
lost          704
lots          352
lousy         515
love          1408
lugage?       352
luggage       352
made          1056
...

Unfortunately, these results are not particularly useful in trying to identify the most common topics
discussed in the tweets because the words are not ordered by frequency, and the list includes words
derived from Twitter names and other fields that are not actually a part of the tweeted messages.

Using Pig to group and summarize word counts


It would be possible to modify the Java code to overcome the limitations identified. However, it may be
simpler (and quicker) to use a higher-level abstraction such as Pig to filter, count, and sort the words. Pig
provides a workflow-based approach to data processing that is ideal for restructuring and summarizing
data. A Pig Latin script that performs the aggregation is syntactically much easier to create than
implementing the equivalent custom map/reduce code.
Pig makes it easy to write procedural workflow-style code that builds a result by repeated operations
on a dataset. It also makes it easier to debug queries and transformations because you can dump the
intermediate results of each operation in the script to a file.
The Pig Latin code created by the analysts is shown below. It uses the source data files in the
/twitterdata/tweets folder to count the number of occurrences of each word, and then sorts the results
in descending order of occurrences and stores the first 100 results in the /twitterdata/wordcounts
folder.
Pig Latin (WordCount.pig)
-- load tweets.
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);
-- split tweet into words.
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;
-- clean words by removing punctuation.
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) AS word;
-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';
-- group by word.
GroupedWords = GROUP FilteredWords BY (word);
-- count mentions per group.
CountedWords = FOREACH GroupedWords GENERATE group, COUNT(FilteredWords) as count;
-- sort by count.
SortedWords = ORDER CountedWords BY count DESC;
-- limit results to the top 100.
Top100Words = LIMIT SortedWords 100;
-- store the results as a file.
STORE Top100Words INTO '/twitterdata/wordcounts';

This script is saved as WordCount.pig, uploaded to Azure storage, and executed in HDInsight using the
following Windows PowerShell script.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$localfolder = "D:\Data\Scripts"
$destfolder = "twitterdata/scripts"
$scriptFile = "WordCount.pig"
$outputFolder = "twitterdata/wordcounts"
$outputFile = "part-r-00000"
# Upload Pig Latin script to Azure.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$scriptFile"
$filename = "$localfolder\$scriptFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
write-host "$scriptFile uploaded to $containerName!"
# Run the Pig Latin script.
$jobDef = New-AzureHDInsightPigJobDefinition -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/$destfolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Pig job submitted..."
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError
# Get the job output. The blob path is relative to the container root,
# so it is built from the output folder rather than the scripts folder.
$remoteblob = "$outputFolder/$outputFile"
write-host "Downloading $remoteBlob..."
Get-AzureStorageBlobContent -Container $containerName -Blob $remoteblob -Context $blobContext -Destination $localFolder
cat $localFolder\$outputFolder\$outputFile

When the script has completed successfully, the results are stored in a file named part-r-00000 in the
/twitterdata/wordcounts folder. This can be downloaded and viewed using the Hadoop cat command.
The following is an extract of the results.
Extract from /twitterdata/wordcounts/part-r-00000
my               3437
delayed          2749
flight           2749
to               2407
entertainment    2064
the              2063
a                2061
delay            1720
of               1719
bags             1718

These results show that the word count approach has the potential to reveal some insights. For
example, the high number of occurrences of delayed and delay are likely to be relevant in determining
common customer concerns. However, the solution needs to be modified to restrict the output to
include only significant words, which will improve its usefulness. To accomplish this the analysts decided
to refine it to produce accurate and meaningful insights into the most common words used by
customers when communicating with the airline by Twitter. This is described in Phase 2: Refining the
solution.
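For readers who are more familiar with general-purpose languages, the steps in WordCount.pig can be approximated in Python as shown below. This is a sketch under stated assumptions rather than an exact equivalent: it splits on whitespace only (the Pig TOKENIZE function also splits on some punctuation), it assumes the regular expression is applied from the start of each token, and the `top_words` helper is a hypothetical name.

```python
import re
from collections import Counter


def top_words(tweets, limit=100):
    """Tokenize, clean, filter, count, sort, and limit, like the Pig script."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.split():
            # Keep the leading run of letters, in lower case.
            word = re.match(r"[a-zA-Z]*", token).group(0).lower()
            if word:  # drop empty strings left by tokens such as "@name" or ":("
                counts[word] += 1
    # most_common returns (word, count) pairs in descending order of count.
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = ["delayed flight delayed", "Delay! my bags @blueyonderairlines"]
    for word, count in top_words(sample, limit=3):
        print(word, count)
```

Note that tokens beginning with a non-letter character, such as mentions and hashtags, produce an empty string and are discarded, which may or may not be desirable depending on the analysis.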


Phase 2: Refining the solution


After you find a question that does provide some useful insight from your data, it's usual to attempt to
refine it to maximize the relevance of the results, minimize the processing overhead, and therefore
improve its value. This is generally the second phase of an analysis process.
Typically you will only refine solutions that you intend to repeat on a reasonably regular basis. Be sure
that they are still correct and produce valid results after refinement. For example, compare the
outputs of the original and the refined solutions.
Having determined that counting the number of occurrences of each word has the potential to reveal
common topics of customer tweets, the analysts at Blue Yonder Airlines decided to refine the data
processing solution to improve its accuracy and usefulness. The starting point for this refinement is the
WordCount.pig Pig Latin script described in Phase 1: Initial data exploration.

Enhancing the Pig Latin code to exclude noise words


The results generated by the WordCount.pig script contain a large volume of everyday words such as a,
the, of, and so on. These occur frequently in every tweet but provide little or no semantic value when
trying to identify the most common topics discussed in the tweets. Additionally, there are some
domain-specific words such as flight and airport that, within the context of discussions around airlines,
are so common that their usefulness in identifying topics of conversation is minimal. These noise words
(sometimes called stop words) can be disregarded when analyzing the tweet content.
The analysts decided to create a file containing a list of words to be excluded from the analysis, and then
use it to filter the results generated by the Pig Latin script. This is a common way to filter words out of
searches and queries. An extract from the noise word file created by the analysts at Blue Yonder Airlines
is shown below. This file was saved as noisewords.txt in the /twitterdata folder in the HDInsight cluster.
Extract from /twitterdata/noisewords.txt
a
i
as
do
go
in
is
it
my
the
that
what
flight
airline

You can obtain lists of noise words from various sources such as TextFixer and Armand Brahaj's Blog. If
you have installed SQL Server you can start with the list of noise words that are included in the
Resource database. For more information, see Configure and Manage Stopwords and Stoplists for
Full-Text Search. In addition, you may find the N-gram datasets available from Google useful. These
contain lists of words and phrases with their observed frequency counts.
With the noise words file in place the analysts modified the WordCount.pig script to use a LEFT OUTER
JOIN matching the words in the tweets with the words in the noise list, and to store the result in a folder
named noisewordcounts. Only words with no matching entry in the noise words file are now included in
the aggregated results. The modified section of the script is shown below.
Pig Latin (FilterNoiseWords.pig)
...
-- load the noise words file.
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;
-- join the noise words file using a left outer join.
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING 'replicated';
-- filter the joined words so that only words with
-- no matching noise word remain.
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;
...

Partial results from this version of the script are shown below.
Extract from /twitterdata/noisewordcounts/part-r-00000
delayed          2749
entertainment    2064
delay            1720
bags             1718
service          1718
time             1375
vacation         1031
food             1030
wifi             1030
connection       688
seattle          688
bag              687

These results are more useful than the previous output, which included the noise words. However, the
analysts have noticed that semantically equivalent words are counted separately. For example, in the
results shown above, delayed and delay both indicate that customers are concerned about delays, while
bags and bag both indicate concerns about baggage.
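The join-and-filter step in FilterNoiseWords.pig can be sketched in Python as follows. The function names are hypothetical, and holding the noise list in an in-memory set is only loosely analogous to the replicated join used in the Pig script.

```python
def left_outer_join(words, noise_words):
    """Pair every word with its matching noise word, or None if there is no match."""
    noise = set(noise_words)  # the small list is held in memory for fast lookups
    return [(word, word if word in noise else None) for word in words]


def useful_words(words, noise_words):
    """Keep only the words that have no matching entry in the noise list."""
    joined = left_outer_join(words, noise_words)
    return [word for word, noise_word in joined if noise_word is None]


if __name__ == "__main__":
    tweet_words = ["my", "flight", "delayed", "bags", "the", "delay"]
    noise_list = ["a", "my", "the", "flight", "airline"]
    print(useful_words(tweet_words, noise_list))
```

The output still contains both delayed and delay as separate entries, which is the limitation described above.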


Using a user-defined function to find similar words


After some consideration, the analysts realized that multiple words with the same stem (for example
delayed and delay), different words with the same meaning (for example baggage and luggage), and
mistyped or misspelled words (for example bagage) will skew the results, potentially causing topics
that are important to customers to be obscured because many differing terms are used as synonyms. To
address this problem, the analysts initially decided to try to find similar words by calculating the Jaro
distance between every word combination in the source data.
The Jaro distance is a value between 0 and 1 used in statistical analysis to indicate the level of
similarity between two words, with higher values indicating greater similarity. For more information
about the algorithm used to calculate Jaro distance, see Jaro-Winkler distance on Wikipedia.
There is no built-in Pig function to calculate Jaro distance, so the analysts recruited the help of a Java
developer to create a user-defined function. The code for the user-defined function is shown below.
Java (WordDistance.java)
package WordDistanceUDF;
...
public class WordDistance extends EvalFunc<Double>
{
  public Double exec(final Tuple input) throws IOException {
    // Two words are required to calculate a similarity score.
    if (input == null || input.size() < 2) return null;
    try
    {
      final String firstWord = (String)input.get(0);
      final String secondWord = (String)input.get(1);
      return getJaroSimilarity(firstWord, secondWord);
    }
    catch(Exception e)
    {
      throw new IOException("Caught exception processing input row ", e);
    }
  }

  private Double getJaroSimilarity(final String firstWord, final String secondWord)
  {
    double defMismatchScore = 0.0;
    if (firstWord != null && secondWord != null)
    {
      // get half the length of the shorter string, plus one
      // (this is the distance used for acceptable transpositions)
      final int halflen = Math.min(firstWord.length(), secondWord.length()) / 2 + 1;
      // get common characters
      final StringBuilder comChars = getCommonCharacters(firstWord, secondWord, halflen);
      final int commonMatches = comChars.length();
      // check for zero in common
      if (commonMatches == 0) { return defMismatchScore; }
      final StringBuilder comCharsRev = getCommonCharacters(secondWord, firstWord, halflen);
      // check for same length common strings, returning 0.0 if not the same
      if (commonMatches != comCharsRev.length()) { return defMismatchScore; }
      // get the number of transpositions
      int transpositions = 0;
      for (int i = 0; i < commonMatches; i++)
      {
        if (comChars.charAt(i) != comCharsRev.charAt(i)) { transpositions++; }
      }
      // calculate jaro metric
      transpositions = transpositions / 2;
      defMismatchScore = commonMatches / (3.0 * firstWord.length()) +
                         commonMatches / (3.0 * secondWord.length()) +
                         (commonMatches - transpositions) / (3.0 * commonMatches);
    }
    return defMismatchScore;
  }

  private static StringBuilder getCommonCharacters(final String firstWord,
      final String secondWord, final int distanceSep)
  {
    if (firstWord != null && secondWord != null)
    {
      final StringBuilder returnCommons = new StringBuilder();
      final StringBuilder copy = new StringBuilder(secondWord);
      for (int i = 0; i < firstWord.length(); i++)
      {
        final char character = firstWord.charAt(i);
        boolean foundIt = false;
        // search a window around position i for an unused matching character
        for (int j = Math.max(0, i - distanceSep);
             !foundIt && j < Math.min(i + distanceSep, secondWord.length());
             j++)
        {
          if (copy.charAt(j) == character)
          {
            foundIt = true;
            returnCommons.append(character);
            // mark the character as used so that it is not matched again
            copy.setCharAt(j, '#');
          }
        }
      }
      return returnCommons;
    }
    return null;
  }
}

For information about creating UDFs for use in HDInsight scripts see "User-defined functions."
This function was compiled, packaged as WordDistanceUDF.jar, and saved on the HDInsight cluster.
Next, the analysts modified the Pig Latin script that generates a list of all non-noise word combinations
in the tweet source data to use the function to calculate the Jaro distance between each combination of
words generated by the script. This modified section of the script is shown here.
Pig Latin (MatchWords.pig)
-- register custom jar.
REGISTER WordDistanceUDF.jar;
...
...
-- sort by word.
SortedWords = ORDER WordList BY word;
-- create a duplicate set.
SortedWords2 = FOREACH SortedWords GENERATE word AS word:chararray;
-- cross join to create every combination of pairs.
CrossWords = CROSS SortedWords, SortedWords2;
-- find the Jaro distance.
WordDistances = FOREACH CrossWords GENERATE
    SortedWords::word as word1:chararray,
    SortedWords2::word as word2:chararray,
    WordDistanceUDF.WordDistance(SortedWords::word, SortedWords2::word) AS
    jarodistance:double;
-- keep only word pairs with a Jaro distance of 0.9 or higher.
MatchedWords = FILTER WordDistances BY jarodistance >= 0.9;
-- store the results as a file.
STORE MatchedWords INTO '/twitterdata/matchedwords';

Notice that the script filters the results to include only word combinations with a Jaro distance value of
0.9 or higher.
The results include a row for every word matched to itself with a Jaro score of 1.0, and two rows for
each combination of words with a score of 0.9 or above (one row for each possible word order). Some of
the results in the output file are shown below.

Extract from /twitterdata/matchedwords/part-r-00000
bag       bag       1.0
bag       bags      0.9166666666666665
baggage   baggage   1.0
baggage   bagage    1.0
bags      bag       0.9166666666666665
bags      bags      1.0
delay     delay     1.0
delay     delayed   0.9047619047619047
delay     delays    0.9444444444444444
delayed   delay     0.9047619047619047
delayed   delayed   1.0
delays    delay     0.9444444444444444
delays    delays    1.0
seat      seat      1.0
seat      seats     0.9333333333333333
seated    seated    1.0
seats     seat      0.9333333333333333
seats     seat      1.0

Close examination of these results reveals that, while the code has successfully matched some words
appropriately (for example, bag/bags, delay/delays, delay/delayed, and seat/seats), it has failed to
match some others (for example, delays/delayed, bag/baggage, bagage/baggage, and seat/seated).
The analysts experimented with the Jaro value used to filter the results, lowering it to achieve more
matches. However, in doing so they found that the number of false positives increased. For example,
lowering the filter score to 0.85 matched delays to delayed and seats to seated, but also matched
seated to seattle.
The results obtained, and the attempts to improve them by adjusting the matching algorithm, reveal
just how difficult it is to infer semantics and sentiment from free-form text. In the end the analysts
realized that it would require some type of human intervention, in the form of a manually maintained
synonyms list.
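
The trade-off between missed matches and false positives can be explored without a cluster. The following is a minimal sketch that repeats the calculation from the WordDistance listing as a plain static method, so that it runs with no Pig dependencies. The class name JaroThresholdSketch and the chosen word pairs are assumptions made for this illustration. With this calculation, delays/delayed and seated/seattle produce the same score of roughly 0.85, so a threshold that admits one pair also admits the other.

```java
public class JaroThresholdSketch {

    // Same calculation as the WordDistance UDF, without the Pig wrapper.
    public static double jaro(String first, String second) {
        int window = Math.min(first.length(), second.length()) / 2 + 1;
        String common = commonChars(first, second, window);
        String commonRev = commonChars(second, first, window);
        int matches = common.length();
        if (matches == 0 || matches != commonRev.length()) {
            return 0.0;
        }
        int transpositions = 0;
        for (int i = 0; i < matches; i++) {
            if (common.charAt(i) != commonRev.charAt(i)) {
                transpositions++;
            }
        }
        transpositions /= 2;
        return matches / (3.0 * first.length())
             + matches / (3.0 * second.length())
             + (matches - transpositions) / (3.0 * matches);
    }

    // Characters of 'first' that also occur in 'second' within the search window.
    static String commonChars(String first, String second, int window) {
        StringBuilder commons = new StringBuilder();
        StringBuilder copy = new StringBuilder(second);
        for (int i = 0; i < first.length(); i++) {
            char c = first.charAt(i);
            int end = Math.min(i + window, second.length());
            for (int j = Math.max(0, i - window); j < end; j++) {
                if (copy.charAt(j) == c) {
                    commons.append(c);
                    copy.setCharAt(j, '#');   // mark the character as used
                    break;
                }
            }
        }
        return commons.toString();
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"delay", "delays"}, {"delays", "delayed"},
            {"seated", "seattle"}, {"bag", "baggage"}
        };
        for (String[] p : pairs) {
            System.out.printf("%-8s %-8s %.4f%n", p[0], p[1], jaro(p[0], p[1]));
        }
    }
}
```

Running the class prints one score per pair, which makes it easy to see which pairs a candidate threshold would keep before changing the FILTER statement in the Pig Latin script.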

Using a lookup file to combine matched words


The results generated by the user-defined function were not sufficiently accurate to rely on in a fully
automated process for matching words, but they did provide a useful starting point for creating a
custom lookup file for words that should be considered synonyms when analyzing the tweet contents.
The analysts reviewed and extended the results produced by the function and created a tab-delimited
text file, which was saved as synonyms.txt in the /twitterdata folder on the HDInsight cluster. The
following is an extract from this file.
Extract from /twitterdata/synonyms.txt
bag      bag
bag      bags
bag      bagage
bag      baggage
bag      luggage
bag      lugage
delay    delay
delay    delayed
delay    delays
drink    drink
drink    drinks
drink    drinking
drink    beverage
drink    beverages
...

The first column in this file contains the list of leading values that should be used to aggregate the
results. The second column contains synonyms that should be converted to the leading values for
aggregation.
With this synonyms file in place, the Pig Latin script used to count the words in the tweet contents was
modified to use a LEFT OUTER JOIN between the words in the source tweets (after filtering out the
noise words) and the words in the synonyms file to find the leading values for each matched word. A
UNION clause is then used to combine the matched words with words that are not present in the
synonyms file, and the results are saved into a file named synonymcounts. The modified section of the
Pig Latin script is shown here.
Pig Latin (CountSynonymns.pig)
...
-- Match synonyms.
Synonyms = LOAD '/twitterdata/synonyms.txt' AS (leadingvalue:chararray,
    synonym:chararray);
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER, Synonyms BY synonym USING
    'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue as word;
AllWords = UNION MatchedWordList, UnmatchedWordList;
-- group by word.
GroupedWords = GROUP AllWords BY (word);
-- count mentions per group.
CountedWords = FOREACH GroupedWords GENERATE group as word, COUNT(AllWords) as count;
...
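
The effect of the outer join and union can be pictured as a simple lookup: a word that appears in the synonyms file is replaced by its leading value, and any other word is kept unchanged before counting. The following Java sketch illustrates that logic on a small in-memory list. It is not how Pig executes the script, and the class name SynonymCountSketch, the method name, and the sample words are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SynonymCountSketch {

    // Map each word to its leading value when one exists (the matched branch),
    // otherwise keep the word itself (the unmatched branch), then count.
    public static Map<String, Integer> countByLeadingValue(List<String> words,
                                                           Map<String, String> synonyms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            String leading = synonyms.getOrDefault(word, word);
            counts.merge(leading, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical extract of synonyms.txt, keyed by synonym.
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("bag", "bag");
        synonyms.put("bags", "bag");
        synonyms.put("luggage", "bag");
        synonyms.put("delayed", "delay");
        synonyms.put("delays", "delay");

        // Hypothetical words left after noise-word filtering.
        List<String> words = Arrays.asList(
            "bags", "luggage", "delayed", "wifi", "bag", "delays");

        System.out.println(countByLeadingValue(words, synonyms));
    }
}
```

In the Pig Latin script the synonyms relation is joined using the 'replicated' option, which plays a similar role to the in-memory map used here.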

Partial results from this script are shown below.


Extract from /twitterdata/synonymcounts/part-r-00000
delay            4812
bag              3777
seat             2404
service          1718
movie            1376
vacation         1375
time             1375
entertainment    1032
food             1030
wifi             1030
connection       688
seattle          688
drink            687

In these results, semantically equivalent words are combined into a single leading value for aggregation.
For example the counts for seat, seats, and seated are combined as a single count for seat. This makes
the results more useful in terms of identifying topics that are important to customers. For example, it is
apparent from these results that the top three subjects that customers have tweeted about are delays,
bags, and seats.

Moving from topic lists to sentiment analysis


At this point the team has a valuable and seemingly valid list of the most important topics that
customers are discussing. However, the results do not include any indication of sentiment. While it's
reasonably safe to assume that the majority of people discussing delays (the top term) will not be happy
customers, it's more difficult to detect the sentiment for a top item such as bag or seat. It's not possible
to tell from this data if most customers are happy or are complaining.
Sentiment analysis is difficult. As an extreme example, tweets containing text such as "No wifi, thanks
for that Blue Yonder!" or "Still in Chicago airport due to flight delays - what a great trip this is turning out
to be!" are obviously negative to a human reader who appreciates sentiment modifiers such as sarcasm.
However, a simple analysis based on individual words that indicate positive and negative sentiment,
such as thanks and great, would indicate overall positive sentiment.
There are solutions available that are specially designed to accomplish sentiment analysis, and the team
might also investigate using the rudimentary sentiment analysis capabilities of the Twitter search API.
However, they decided that the list of topics provided by the analysis they've done so far would
probably be better employed in driving more focused customer research, such as targeted
questionnaires and online surveys, perhaps through external agencies that specialize in this field.
Instead, the team decided to move forward by improving the stability of the current solution as
described in the next section, "Phase 3: Stabilizing the solution."

Phase 3: Stabilizing the solution


With the data processing solution refined to produce useful insights, the analysts at Blue Yonder Airlines
decided to stabilize the solution to make it more robust and repeatable. Key concerns about the solution
include the dependencies on hard-coded file paths, the data schemas in the Pig Latin scripts, and the
data ingestion process.
With the current solution, any changes to the location or format of the tweets.txt, noisewords.txt, or
synonyms.txt files would break the current scripts. Such changes are particularly likely to occur if the
solution gains acceptance among users and Hive tables are created on top of the files to provide a more
convenient query interface.

Using HCatalog to create Hive tables


HCatalog provides an abstraction layer between data processing tools such as Hive, Pig, and custom
map/reduce code and the physical storage of the data. By using HCatalog to manage data storage the
analysts hoped to remove location and format dependencies in Pig Latin scripts, while making it easy for
users to access the analytical data directly through Hive.
The analysts had already created a table named Tweets for the source data, and now decided to create
tables for the noise words file and the synonyms file. Additionally, they decided to create a table for the
results generated by the WordCount.pig script to make it easier to perform further queries against the
aggregated word counts. To accomplish this the analysts created the following script file containing
HiveQL statements, which they saved as CreateTables.hcatalog on the HDInsight server.
HiveQL (CreateTables.hcatalog)
CREATE EXTERNAL TABLE NoiseWords (noiseword STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/twitterdata/noisewords';
CREATE EXTERNAL TABLE Synonyms (leadingvalue STRING, synonym STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/twitterdata/synonyms';
CREATE EXTERNAL TABLE TopWords (word STRING, count BIGINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/twitterdata/topwords';

To execute this script with HCatalog, the following command was executed in the Hadoop command
window on the cluster.
Command Line
%HCATALOG_HOME%\bin\hcat.py -f C:\Scripts\CreateTables.hcatalog

After the script had successfully created the new tables, the noisewords.txt and synonyms.txt files were
moved into the folders used by the corresponding tables.

Using HCatalog in a Pig Latin script


Moving the noisewords.txt and synonyms.txt files made the data available by querying the
corresponding Hive tables, but broke the WordCount.pig script, which includes hard-coded paths to
these files. The script also uses hard-coded paths to load the tweets.txt file from the folder used by the
Tweets table, and to save the results.
This dependency on hard-coded paths makes the data processing solution vulnerable to changes in the
way that data is stored. One of the major advantages of using HCatalog is that you can relocate and
redefine data as required without breaking all the scripts and code that accesses the files. For example,
if at a later date an administrator modifies the Tweets table to partition the data, the code to load the
source data would no longer work. Additionally, if the source data was modified to use a different
format or schema, the script would need to be modified accordingly.
To eliminate the dependency, the analysts modified the WordCount.pig script to use HCatalog classes to
load and save data in Hive tables instead of accessing the source files directly. The modified sections of
the script are shown below.
Pig Latin (GetTopWords.pig)
-- load tweets using HCatalog.
Tweets = LOAD 'Tweets' USING org.apache.hcatalog.pig.HCatLoader();
...
-- load the noise words file using HCatalog.
NoiseWords = LOAD 'NoiseWords' USING org.apache.hcatalog.pig.HCatLoader();
...
-- Match synonyms using data loaded through HCatalog
Synonyms = LOAD 'Synonyms' USING org.apache.hcatalog.pig.HCatLoader();
...
-- store the results as a file using HCatalog
STORE Top100Words INTO 'TopWords' USING org.apache.hcatalog.pig.HCatStorer();

The script no longer includes any hard-coded paths to data files or schemas for the data as it is loaded or
stored, and instead uses HCatalog to reference the Hive tables created previously. The results of the
script are stored in the TopWords table, and can be viewed by executing a HiveQL query such as the
following example.
HiveQL
SELECT * FROM TopWords;

Finalizing the data ingestion process


Having stabilized the solution, the team finally examined ways to improve and automate the way that
data is extracted from Twitter and loaded into Azure blob storage. The solution they adopted uses SQL
Server Integration Services with a package that connects to the Twitter streaming API every day to
download new tweets, and uploads these as a file to blob storage.
The team will also review the columns used in the query; for example, by using only the Tweet column
rather than storing other columns that are not used in the current process. This might be possible by
adding a Hive staging table and preprocessing the data. However, the team needs to consider that the
data in other columns may be useful in the future.
To complete the examination of the data the team next explored how the results would be used, as
described in the next section, "Consuming the results."

Consuming the results


Now that the solution has generated some useful information, the team can decide how to consume the
results for analysis or reporting. Storing the results of the Pig data processing jobs generated by the
WordCount.pig script in a Hive table makes it easy for analysts to consume the data from Excel.
To enable access to Hive from Excel, the ODBC driver for Hive has been installed on all of the analysts'
workstations and a data source name (DSN) has been created for the HDInsight cluster. The analysts can
use the Data Connection Wizard in Excel to connect to HDInsight using the DSN, and import data from
the TopWords table, as shown in Figure 1.

Figure 1 - Using the Data Connection Wizard to access a Hive table from Excel

For details of how to consume the output from HDInsight jobs in Excel, see "Built-in data connectivity" in
the topic "Consuming and visualizing data from HDInsight."
After the data has been imported into a worksheet, the analysts can use the full capabilities of Excel to
explore and visualize it, as shown in Figure 2.

Figure 2 - Visualizing results in Excel


If you want to rerun the process with additional data after the results have been extracted into a
visualization tool, you will need to refresh the results. This typically means rerunning the scripts that
perform the analysis in HDInsight and then refreshing the view of the results. For details of how this can
be done, depending on the tools you are using, see the section "Scheduling data refresh in consumers"
in "Scheduling solution and task execution."

Scenario 2: Data warehouse on demand


In a data warehousing scenario, HDInsight is used as a data source for big data analysis and reporting.
This scenario also discusses how you can minimize running costs by shutting down the cluster when it's
not in use. Specifically, the scenario describes:

Introduction to A. Datum

Analytical goals and data sources

Creating the data warehouse

Loading data into the data warehouse

Analyzing data from the data warehouse

Introduction to A. Datum
This scenario is based on a fictional company named A. Datum, which conducts research into tornadoes
and other weather-related phenomena in the United States. In the scenario, data analysts at A. Datum
want to use HDInsight as a central repository for historical tornado data in order to analyze and visualize
previous tornadoes, and to try to identify trends in terms of geographical locations and times.

Analytical goals and data sources


The data analysts at A. Datum have obtained historical data about tornadoes in the United States since
1934. The data includes the date and time, geographical start and end point, category, size, number of
fatalities and casualties, and damage costs of each tornado. The goal of the data warehouse is to enable
analysts to slice and dice this data and visualize aggregated values on a map. This will act as a resource
for research into severe weather patterns, and will support further investigation in the future.
The data analysts also implemented a mechanism to continuously collect new data as it is published,
and this data is added to the existing data by being uploaded to Azure blob storage. However, as a
research organization, A. Datum does not monitor the data on a daily basis; this is the job of the storm
warning agencies in the United States. Instead, A. Datum's primary aim is to analyze the data only after
significant tornado events have occurred, or on a quarterly basis to provide updated results for use in
their scientific publications. Therefore, to minimize running costs, the analysts want to be able to shut
down the cluster when it is not being used for analysis, and recreate it when required.
This scenario demonstrates:

How you can define and create a data warehouse containing a database and Hive tables in
HDInsight.

How you can automate the loading of data into the tables in the data warehouse.

How you can define queries to extract the data from the data warehouse.

How you can view and analyze the data, and generate compelling visualizations using a range of
tools.

Creating the data warehouse


A data warehouse is fundamentally a database optimized for queries that support data analysis and
reporting. Data warehouses are often implemented as relational databases with a schema that
optimizes query performance, and aggregation of important numerical measures, at the expense of
some data duplication in denormalized tables. When creating a data warehouse for big data you can use
a relational database engine that is designed to handle huge volumes of data (such as Microsoft
Analytics Platform System), or you can load the data into a Hadoop cluster and use Hive tables to project
a schema onto the data.
In this scenario the data analysts at A. Datum have decided to use a Hadoop cluster provided by the
HDInsight service. This enables them to build a Hive-based data warehouse schema that can be used as
a source for analysis and reporting in traditional tools such as Excel, but which also will enable them to
apply big data analysis tools and techniques.
Figure 1 shows a schematic view of the use case and model for using HDInsight to implement a data
warehouse.

Figure 1 - Using HDInsight as a data warehouse for analysis, reporting, and as a business data source
Unlike a traditional relational database, HDInsight allows you to manage the lifetime and storage of
tables and indexes (metadata) separately from the data that populates the tables. A Hive table is simply
a definition that is applied over a folder containing data, and this separation of schema and data is what
enables one of the primary differences between Hadoop-based big data batch processing solutions and
relational databases: you apply a schema when the data is read, rather than when it is written.
In this scenario you'll see how the capability to use a "schema on read" approach provides an advantage
for organizations that need a data warehousing capability where data can be continuously collected, but
analysis and reporting is carried out only occasionally.
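
The schema on read idea can be illustrated outside Hive with a short Java sketch. The raw line is stored once with no schema attached, and column names are applied only when the line is read, so the same data can be projected in more than one way. This is an analogy rather than a description of Hive internals, and the class name SchemaOnReadSketch, the project method, and the alternative column names are assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaOnReadSketch {

    // Apply a list of column names to a raw tab-delimited line at read time.
    // Columns with no corresponding field are returned as null.
    public static Map<String, String> project(String rawLine, String... columns) {
        String[] fields = rawLine.split("\t", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], i < fields.length ? fields[i] : null);
        }
        return row;
    }

    public static void main(String[] args) {
        // The stored data: a single line with no schema attached.
        String raw = "AL\tAlabama";

        // Two different schemas applied to the same stored line.
        System.out.println(project(raw, "StateCode", "StateName"));
        System.out.println(project(raw, "Code", "Name", "Region"));
    }
}
```

Because the stored line never changes, dropping or redefining the schema in this sketch has no effect on the data, which is the property that allows table definitions to be recreated over existing files.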

Creating a database
When planning the HDInsight data warehouse, the data analysts at A. Datum needed to consider ways
to ensure that the data and Hive tables can be easily recreated in the event of the HDInsight cluster
being released and re-provisioned. This might happen for a number of reasons, including temporarily
decommissioning the cluster to save costs during periods of non-use, and releasing the cluster in order
to create a new one with more nodes in order to scale out the data warehouse.
A new cluster can be created over one or more existing Azure blob storage containers that hold the
data, but the Hive (and other) metadata is stored separately in an Azure SQL Database instance. To be
able to recreate this metadata, the analysts identified two possible approaches:

Save a HiveQL script that can be used to recreate EXTERNAL tables based on the data persisted
in Azure blob storage.

Specify an existing Azure SQL Database instance to host the Hive metadata store when the
cluster is created.

Using a HiveQL script to recreate tables after releasing and re-provisioning a cluster is an effective
approach when the data warehouse will contain only a few tables and other objects. The script can be
executed to recreate the tables over the existing data when the cluster is re-provisioned. However,
selecting an existing SQL Database instance (which you maintain separately from the cluster) to be used
as the Hive metadata store is also very easy, and removes the need to rerun the scripts. You can back up
this database using the built-in tools, or export the data so that you can recreate the database if
required.
See "Cluster and storage initialization" for more details of using existing storage accounts and a separate
Azure SQL Database instance to restore a cluster.
Creating a logical database in HDInsight is a useful way to provide separation between the contents of
the database and other items located in the same cluster; for example, to ensure a logical separation
from Hive tables used for other analytical processes. To do this the data analysts created a dedicated
database for the data warehouse by using the following HiveQL statement.
HiveQL
CREATE DATABASE DW LOCATION '/DW/database';

This statement creates a folder named /DW/database as the default folder for all objects created in the
DW database.

Creating tables
The tornado data includes the code for the state where the tornado occurred, as well as the date and
time of the tornado. The data analysts want to be able to display the full state name in reports, and so
created a table for state names using the following HiveQL statement.
HiveQL
CREATE EXTERNAL TABLE DW.States (StateCode STRING, StateName STRING)
STORED AS SEQUENCEFILE;

Notice that the table is stored in the default location. For the DW database this is the /DW/database
folder, and so this is where a new folder named States is created. The table is formatted as a Sequence
File. Tables in this format typically provide faster performance than tables in which data is stored as text.
You must use EXTERNAL tables if you want the data to be persisted when you delete a table definition
or when you recreate a cluster. Storing the data in SEQUENCEFILE format is also a good idea as it can
improve performance. You might also consider using the ORC file format, which provides a highly
efficient way to store Hive data and can improve performance when reading, writing, and processing
data. See "ORC File Format" for more information.
The data analysts also want to be able to aggregate data by temporal hierarchies (year, month, and day)
and create reports that show month and day names. While many client applications that are used to
analyze and report data support this kind of functionality, the analysts want to be able to generate
reports without relying on specific client application capabilities.
To support date-based hierarchies and reporting, the analysts created a table containing various date
attributes that can be used as a lookup table for date codes in the tornado data. The creation of a date
table like this is a common pattern in relational data warehouses.
HiveQL
CREATE EXTERNAL TABLE DW.Dates
  (DateCode STRING, CalendarDate STRING, DayOfMonth INT, MonthOfYear INT,
   Year INT, DayOfWeek INT, WeekDay STRING, Month STRING)
  STORED AS SEQUENCEFILE;

Finally, the data analysts created a table for the tornado data itself. Since this table is likely to be large,
and many queries will filter by year, they decided to partition the table on a Year column, as shown in
the following HiveQL statement.
HiveQL
CREATE EXTERNAL TABLE DW.Tornadoes
  (DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
   Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
   StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
   LengthMiles DOUBLE, WidthYards DOUBLE)
  PARTITIONED BY (Year INT) STORED AS SEQUENCEFILE;

Next, the analysts needed to upload the data for the data warehouse. This is described in the next
section, "Loading data into the data warehouse."

Loading data into the data warehouse


The tornado source data is in tab-delimited text format, and the data analysts have used Excel to create
lookup tables for dates and states in the same format. To simplify loading this data into the Sequence
File formatted tables created previously, they decided to create staging tables in Text File format to
which the source data will be uploaded. They can then use a HiveQL INSERT statement to load data from
the staging tables into the data warehouse tables, implicitly converting the data to Sequence File format
and generating partition key values for the Tornadoes table. This approach also makes it possible to
compress the data as it is extracted from the staging table, reducing storage requirements and
improving query performance in the data warehouse tables.
The format and content of the data is shown in the following tables.
Date table
1932-01-01  01/01/1932  1932  Friday  January
1932-01-02  01/02/1932  1932  Saturday  January
...

State table
AL  Alabama
AK  Alaska
AZ  Arizona
...

Tornadoes table
1934-01-18  OK  01/18/1934 02:20  4000000  35.4  96.67  35.45  -96.6  5.1  30
1934-01-18  AR  01/18/1934 08:50  1600000  35.2  93.18  0.1  10
1934-01-18  MO  01/18/1934 13:55  4000000  36.68  90.83  36.72  90.77  4.3  100


Creating scripts for the data load process


The data warehouse will be updated with new data periodically, so the load process should be easily
repeatable. To accomplish this the analysts created HiveQL scripts to drop and recreate each staging
table. For example, the following script is used to drop and recreate a staging table named
StagedTornadoes for the tornadoes data.
HiveQL (CreateStagedTornadoes.q)
DROP TABLE DW.StagedTornadoes;
CREATE TABLE DW.StagedTornadoes
  (DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
   Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
   StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
   LengthMiles DOUBLE, WidthYards DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/staging/tornadoes';

Notice that the staging table is an INTERNAL table. When it is dropped, any staged data left over from a
previous load operation is deleted and a new, empty /staging/tornadoes folder is created ready for new
data files, which can simply be copied into the folder. Similar scripts were created for the StagedDates
and StagedStates tables.
In addition to the scripts used to create the staging tables, the data load process requires scripts to
insert the staged data into the data warehouse tables. For example, the following script is used to load
the staged tornadoes data.
HiveQL (StagingScripts\LoadStagedTornadoes.q)
SET mapreduce.map.output.compress=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM DW.StagedTornadoes s
INSERT INTO TABLE DW.Tornadoes PARTITION (Year)
  SELECT s.DateCode, s.StateCode, s.EventTime, s.Category, s.Injuries, s.Fatalities,
    s.PropertyLoss, s.CropLoss, s.StartLatitude, s.StartLongitude, s.EndLatitude,
    s.EndLongitude, s.LengthMiles, s.WidthYards, SUBSTR(s.DateCode, 1, 4) Year;

Notice that the script includes some configuration settings to enable compression of the query output
(which will be inserted into the data warehouse table). Additionally, the script for the tornadoes data
includes an option to enable dynamic partitions and a function to generate the appropriate partitioning
key value for Year. Similar scripts, without the partitioning functionality, were created for the states and
dates data.
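
The way the partitioning key is derived can be shown with a small Java sketch. It takes the first four characters of each DateCode, as the SUBSTR(s.DateCode, 1, 4) expression does, and groups rows by the resulting year in the way that dynamic partitioning routes each row to a partition. The sketch only mimics the grouping and does not use Hive. The class name PartitionKeySketch, its method names, and the 1935 sample row are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionKeySketch {

    // Mirrors SUBSTR(s.DateCode, 1, 4). HiveQL SUBSTR positions start at 1,
    // whereas Java substring indexes start at 0.
    public static String yearOf(String dateCode) {
        return dateCode.substring(0, 4);
    }

    // Group tab-delimited rows by the year taken from the first field.
    public static Map<String, List<String>> groupByYear(List<String> rows) {
        Map<String, List<String>> partitions = new TreeMap<>();
        for (String row : rows) {
            String dateCode = row.split("\t", 2)[0];
            partitions.computeIfAbsent(yearOf(dateCode), k -> new ArrayList<>()).add(row);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // The first two rows follow the sample data; the third is hypothetical.
        List<String> rows = Arrays.asList(
            "1934-01-18\tOK", "1934-01-18\tAR", "1935-02-01\tTX");

        for (Map.Entry<String, List<String>> e : groupByYear(rows).entrySet()) {
            System.out.println("Year=" + e.getKey() + " -> " + e.getValue().size() + " row(s)");
        }
    }
}
```

Because the key is computed from the data in each row, new years that appear in later loads create additional partitions without any change to the load script.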
The scripts to create staging tables and load staged data were then uploaded to the /staging/scripts
folder so that they can be used whenever new data is available for loading into the data warehouse.

Loading data
With the Hive table definition scripts in place, the data analysts could now implement a solution to
automate the data load process. It is possible to create a custom application to load the data using the
.NET SDK for HDInsight, but a simple approach using Windows PowerShell scripts was chosen for this
scenario. A PowerShell script was created for each staging table, including the following script that is
used to stage and load tornadoes data.

Windows PowerShell (LoadTornadoes.ps1)


# Azure subscription-specific variables.
$subscriptionName = "subscription-name"
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
Select-AzureSubscription $subscriptionName
# Find the local Data folder.
$thisfolder = Split-Path -parent $MyInvocation.MyCommand.Definition
$localfolder = "$thisfolder\Data"
# Run Hive script to drop and recreate staging table.
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/staging/scripts/CreateStagedTornadoes.q"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError
# Upload data to staging table.
$destfolder = "staging/tornadoes"
$dataFile = "Tornadoes.txt"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
  -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$dataFile"
$filename = "$localfolder\$dataFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName `
  -Context $blobContext -Force
# Run Hive script to load staged data to DW table.
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/staging/scripts/LoadStagedTornadoes.q"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError
# All done!
Write-Host "Data in $dataFile has been loaded!"

After setting some initial variables to identify the cluster, storage account, blob container, and the local
folder where the source data is stored, the script performs the following three tasks:


1. Runs the HiveQL script to drop and recreate the staging table.
2. Uploads the source data file to the staging table folder.
3. Runs the HiveQL script to load the data from the staging table into the data warehouse table.
Two similar scripts, LoadDates.ps1 and LoadStates.ps1, are run to load the dates and states into the data
warehouse. Whenever new data is available for any of the data warehouse tables, the data analysts can
run the appropriate PowerShell script to automate the data load process for that table.
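The three tasks run strictly in sequence, and a later task is pointless if an earlier one fails. The Python sketch below shows only that control flow; the step functions are hypothetical stand-ins for the Hive jobs and the blob upload, and do not call any Azure library.

```python
def run_load(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns the list of step names that completed successfully.
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as err:  # report the failing step and stop
            print(f"Step '{name}' failed: {err}")
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins for the work done by LoadTornadoes.ps1.
def recreate_staging_table():
    print("Dropping and recreating staging table")


def upload_source_file():
    print("Uploading Tornadoes.txt to staging/tornadoes")


def load_warehouse_table():
    print("Loading staged data into the data warehouse table")


tornado_steps = [
    ("recreate staging table", recreate_staging_table),
    ("upload source file", upload_source_file),
    ("load warehouse table", load_warehouse_table),
]

done = run_load(tornado_steps)
print(f"Completed {len(done)} of {len(tornado_steps)} steps")
```

The same runner could be given a different list of steps for the dates and states tables, which reflects how the three PowerShell scripts differ only in the files and Hive scripts they name.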
Now that the data warehouse is complete, the analysts can explore how to analyze the data. This is
discussed in the next section, "Analyzing data from the data warehouse."

Analyzing data from the data warehouse


Having built and populated the data warehouse, the data analysts were ready to start analyzing the data
in the data warehouse tables. In this example scenario, Excel is used to examine all of the data.
However, as the volume of data increases it will be necessary to preselect parts of the data or an
aggregated result by applying some processing within HDInsight to create a Hive table of manageable
size.

Analyzing data in Excel


The data analysts typically use Excel for analysis, so the first task they undertook was to create a
PowerPivot data model in an Excel workbook. The data model uses the Hive ODBC driver to import data
from the data warehouse tables in the HDInsight cluster, and defines relationships and hierarchies that
can be used to aggregate the data, as shown in Figure 1.


Figure 1 - A PowerPivot data model based on a Hive data warehouse


For information about using PowerPivot see PowerPivot. For information about using the Hive ODBC
driver see Built-in data connectivity.
The business analysts could then use the data model to explore the data by creating a PivotTable
showing the total cost of property and crop damage by state and time period, as shown in Figure 2.


Figure 2 - Analyzing the data with a PivotTable


For information about using a PivotTable see Visualizing data in Excel.

Analyzing data with Power Map


The data includes geographic locations as well as the date and time of each tornado, making it ideal for
visualization using Power Map. Power Map enables analysts to create interactive tours that show
summarized data on a map, and animate it based on a timeline.
The analysts created a Power Map tour that consists of a single scene with the following two layers:

Layer 1: Accumulating property and crop damage costs by state, shown as a stacked column
chart.

Layer 2: Average tornado category by latitude and longitude, shown as a heat map.


The Power Map designer for the tour is shown in Figure 3.

Figure 3 - Designing a Power Map tour


Power Map is a great tool for generating interactive and highly immersive presentations of data,
especially where the data has a date or time component that allows you to generate an animation of
changes over time. For more information see Visualizing data in Excel.
The scene in the Power Map tour was then configured to animate movement across the map as the
timeline progresses. While the tour runs, the property and crop losses are displayed as columns that
accumulate each month, and the tornadoes are displayed as heat maps based on the average tornado
category (defined in the Enhanced Fujita scale), as shown in Figure 4.


Figure 4 - Viewing a Power Map tour


After the tour has finished the analysts can explore the final costs incurred in each state by zooming and
moving around the map viewer.


Scenario 3: ETL automation


In this scenario, HDInsight is used to perform an Extract, Transform, and Load (ETL) process that filters
and shapes the source data, and then uses it to populate a database table. Specifically, this scenario
describes:

Introduction to racecar telemetry

ETL process goals and data sources

The ETL workflow

Encapsulating the ETL tasks in an Oozie workflow

Automating the ETL workflow

Analyzing the loaded data

The scenario demonstrates how you can:

Use the .NET Library for Avro to serialize data for processing in HDInsight.

Use the classes in the .NET API for Hadoop WebClient package to upload files to Azure storage.

Use an Oozie workflow to define an ETL process that includes Pig, Hive, and Sqoop tasks.

Use the classes in the .NET API for Hadoop WebClient package to automate execution of an
Oozie workflow.

Introduction to racecar telemetry


This scenario is based on a fictitious motor racing team that captures and monitors real-time telemetry
from sensors on a racecar as it is driven around a racetrack. To perform further analysis of the telemetry
data, the team plans to use HDInsight to filter and shape the data before loading it into Azure SQL
Database, from where it will be consumed and visualized in Excel. Loading the data into a database for
analysis enables the team to decommission the HDInsight cluster after the ETL process is complete, and
makes the data easily consumable from client applications that have the ability to connect to a SQL
Server data source.
In this simplified example, the racecar has three sensors that are used to capture telemetry readings at
one-second intervals: a global positioning system (GPS) sensor, an engine sensor, and a brake sensor.
The telemetry data captured from the sensors on the racecar includes:

From the GPS sensor:

The date and time the sensor reading was taken.

The geographical position of the car (its latitude and longitude coordinates).

The current speed of the car.


From the engine sensor:

The date and time the sensor reading was taken.

The revolutions per minute (RPM) of the crankshaft.

The oil temperature.

From the brake sensor:

The date and time the sensor reading was taken.

The temperature of the brakes.

The sensors used in this scenario are deliberately simplistic. Real racecars include hundreds of sensors
emitting thousands of telemetry readings at sub-second intervals.

ETL process goals and data sources


Motor racing is a highly technical sport, and analysis of how the critical components of a car are
performing is a major aspect of how teams refine the design of the car, and how drivers optimize their
driving style. The data captured over a single lap consists of many telemetry readings, which must be
analyzed to find correlations and patterns in the car's performance.
In this scenario, a console application is used to capture and display the sensor readings in real time. The
application is shown in Figure 1.

Figure 1 - A console application displaying racecar telemetry


To make the example repeatable if you want to experiment with the code yourself, the source data
is provided in a file named Lap.csv. The example console application reads this file to generate
the source data for analysis.
The application captures the sensor readings as objects based on the following classes. Note that the
Position property of the GpsReading class is based on the Location struct.
C# (Program.cs in RaceTracker project)
[DataContract]
internal struct Location
{
[DataMember]
public double lat { get; set; }
[DataMember]
public double lon { get; set; }
}
[DataContract(Name = "GpsReading", Namespace = "CarSensors")]
internal class GpsReading
{
[DataMember(Name = "Time")]
public string Time { get; set; }
[DataMember(Name = "Position")]
public Location Position { get; set; }
[DataMember(Name = "Speed")]
public double Speed { get; set; }
}
[DataContract(Name = "EngineReading", Namespace = "CarSensors")]
internal class EngineReading
{
[DataMember(Name = "Time")]
public string Time { get; set; }
[DataMember(Name = "Revs")]
public double Revs { get; set; }
[DataMember(Name="OilTemp")]
public double OilTemp { get; set; }
}
[DataContract(Name = "BrakeReading", Namespace = "CarSensors")]
internal class BrakeReading
{
[DataMember(Name = "Time")]
public string Time { get; set; }


[DataMember(Name = "BrakeTemp")]
public double BrakeTemp { get; set; }
}

As the application captures the telemetry data, each sensor reading object is added to a List as defined
in the following code.
C# (Program.cs in RaceTracker project)
static List<GpsReading> GpsReadings = new List<GpsReading>();
static List<EngineReading> EngineReadings = new List<EngineReading>();
static List<BrakeReading> BrakeReadings = new List<BrakeReading>();

As part of the ETL processing workflow in HDInsight, the captured readings must be filtered to remove
any null values caused by sensor transmission problems. At the end of the processing the data must be
restructured to a tabular format that matches the following Azure SQL Database table definition.
Transact-SQL (Create LapData Table.sql)
CREATE TABLE [LapData]
(
[LapTime] [varchar](25) NOT NULL PRIMARY KEY CLUSTERED,
[Lat] [float] NOT NULL,
[Lon] [float] NOT NULL,
[Speed] [float] NOT NULL,
[Revs] [float] NOT NULL,
[OilTemp] [float] NOT NULL,
[BrakeTemp] [float] NOT NULL
);

The workflow and its individual components are described in the next section, "The ETL workflow."

The ETL workflow


The key tasks that the ETL workflow for the racecar telemetry data must perform are:

Serializing the sensor reading objects as files, and uploading them to Azure storage.

Filtering the data to remove readings that contain null values, and restructuring it into tabular
format.

Combining the readings from each sensor into a single table.

Loading the combined sensor readings data into the table in Azure SQL Database.

Figure 1 shows this workflow.


Figure 1 - The ETL workflow required to load racecar telemetry data into Azure SQL Database
The team wants to integrate these tasks into the existing console application so that, after a test lap, the
telemetry data is loaded into the database for later analysis.

Serializing and uploading the sensor readings


The first challenge in implementing the ETL workflow is to serialize each list of captured sensor reading
objects into a file, and upload the files to Azure blob storage. There are numerous serialization formats
that can be used to achieve this objective, but the team decided to use the Avro serialization format in
order to include both the schema and the data in a single file. This enables a downstream data
processing task in HDInsight to successfully read and parse the data, regardless of the programming
language used to implement the data processing task.
Serializing the data using Avro
To use Avro serialization the application developer imported the Microsoft .NET Library for Avro
package from NuGet into the solution and added a using statement that references the
Microsoft.Hadoop.Avro.Container namespace. With this library in place the developer can use a
FileStream object from the System.IO namespace to write sensor readings to a file in Avro format,
applying a compression codec to minimize file size and optimize data load performance. The following
code shows how a SequentialWriter instance for the type GpsReading serializes objects in the
GpsReadings list to a file.


C# (Program.cs in RaceTracker project)


string gpsFile = new DirectoryInfo(".") + @"\gps.avro";
using (var buffer = new FileStream(gpsFile, FileMode.Create))
{
    // Serialize a sequence of GpsReading objects to stream.
    // Data will be compressed using Deflate codec.
    using (var w = AvroContainer.CreateWriter<GpsReading>(buffer, Codec.Deflate))
    {
        using (var writer = new SequentialWriter<GpsReading>(w, 24))
        {
            // Serialize the data to stream using the sequential writer.
            GpsReadings.ForEach(writer.Write);
        }
    }
    buffer.Close();
}

Similar code is used to serialize the engine and brake sensor data into files in the bin/debug folder of the
solution.
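To make the idea of a schema travelling with the data more concrete, the Python sketch below writes out a hand-written approximation of an Avro record schema for GpsReading and checks a sample reading against its field names. The schema shown is an assumption based on the data contract attributes; the schema that the .NET Library for Avro actually generates may differ in detail (for example in namespaces or nullability), and the sample reading is invented for illustration.

```python
import json

# Hand-written approximation of an Avro record schema for GpsReading.
# Assumption: the real generated schema may differ in detail.
GPS_SCHEMA = {
    "type": "record",
    "name": "GpsReading",
    "namespace": "CarSensors",
    "fields": [
        {"name": "Time", "type": "string"},
        {
            "name": "Position",
            "type": {
                "type": "record",
                "name": "Location",
                "fields": [
                    {"name": "lat", "type": "double"},
                    {"name": "lon", "type": "double"},
                ],
            },
        },
        {"name": "Speed", "type": "double"},
    ],
}


def field_names(schema):
    """Return the top-level field names declared by a record schema."""
    return [field["name"] for field in schema["fields"]]


def matches_schema(record, schema):
    """Check only that a record has exactly the fields the schema declares."""
    return sorted(record) == sorted(field_names(schema))


# Hypothetical reading used to exercise the check.
reading = {"Time": "12:00:01", "Position": {"lat": 51.5, "lon": -0.1}, "Speed": 87.3}

print(json.dumps(GPS_SCHEMA, indent=2))
print("Reading matches schema:", matches_schema(reading, GPS_SCHEMA))
```

Because a schema of this kind is stored in the header of the Avro file, a reader written in any language can discover the field names and types without access to the original C# classes.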
Uploading the files to Azure storage
After the data for each sensor has been serialized to a file, the program must upload the files to the
Azure blob storage container used by the HDInsight cluster. To accomplish this the developer imported
the Microsoft .NET API for Hadoop WebClient package and added using statements that reference the
Microsoft.Hadoop.WebHDFS and Microsoft.Hadoop.WebHDFS.Adapters namespaces. The developer
can then use the WebHDFSClient class to connect to Azure storage and upload the files. The following
code shows how this technique is used to upload the file containing the GPS sensor readings.
C# (Program.cs in RaceTracker project)
// Get Azure storage settings from App.Config.
var hdInsightUser = ConfigurationManager.AppSettings["HDInsightUser"];
var storageKey = ConfigurationManager.AppSettings["StorageKey"];
var storageName = ConfigurationManager.AppSettings["StorageName"];
var containerName = ConfigurationManager.AppSettings["ContainerName"];
var destFolder = ConfigurationManager.AppSettings["InputDir"];
// Upload GPS data.
var hdfsClient = new WebHDFSClient(
hdInsightUser,
new BlobStorageAdapter(storageName, storageKey, containerName, false));
Console.WriteLine("Uploading GPS data...");
await hdfsClient.CreateFile(gpsFile, destFolder + "gps.avro");

Notice that the settings used by the WebHDFSClient object are retrieved from the App.Config file. These
settings include the credentials required to connect to the Azure storage account used by HDInsight and
the path for the folder to which the files should be uploaded. In this scenario the InputDir configuration
setting has the value /racecar/source/, so the GPS data file will be saved as /racecar/source/gps.avro.

Filtering and restructuring the data


After the data has been uploaded to Azure storage, HDInsight can be used to process the data and
upload it to Azure SQL Database. In this scenario the first task is to filter the sensor data in the files to
remove any readings that contain null values, and then restructure the data into tabular format. In the
case of the engine and brake readings the data is already structured as a series of objects, each
containing simple properties. Converting these objects to rows with regular columns is relatively
straightforward.
However, the GPS data is a list of more complex GpsReading objects in which the Position property is a
Location object that has lat and lon properties representing latitude and longitude coordinates. The
presence of data values that are not easily translated into simple rows and columns led the developers
to choose Pig as the appropriate tool to perform the initial processing. The following Pig Latin script was
created to process the GPS data.
Pig Latin (gps.pig)
gps = LOAD '/racecar/source/gps.avro' USING AvroStorage();
gpsfilt = FILTER gps BY (Position IS NOT NULL) AND (Time IS NOT NULL);
gpstable = FOREACH gpsfilt GENERATE Time, FLATTEN(Position), Speed;
STORE gpstable INTO '/racecar/gps';

Note that the Pig Latin script uses the AvroStorage load function to load the data file. This load function
enables Pig to read the schema and data from the Avro file, with the result that the script can use the
properties of the serialized objects to refer to the data structures in the file. For example, the script
filters the data based on the Position and Time properties of the objects that were serialized. The script
then uses the FLATTEN function to extract the lat and lon values from the Position property, and stores
the resulting data (which now consists of regular rows and columns) in the /racecar/gps folder using the
default tab-delimited text file format.
Similar Pig Latin scripts named engine.pig and brake.pig were created to process the engine and brake
data files.
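For readers less familiar with Pig Latin, the following Python sketch mimics the filter and flatten steps of gps.pig on a few in-memory readings. It is an illustration of the data transformation only, not a replacement for the Pig job, and the sample readings are invented.

```python
def filter_and_flatten(readings):
    """Drop readings with a missing Position or Time, then flatten Position.

    Mirrors the FILTER and FOREACH ... FLATTEN steps of gps.pig.
    """
    rows = []
    for reading in readings:
        if reading.get("Position") is None or reading.get("Time") is None:
            continue
        position = reading["Position"]
        rows.append((reading["Time"], position["lat"], position["lon"], reading["Speed"]))
    return rows


def to_tab_delimited(rows):
    """Render rows as tab-delimited text lines."""
    return ["\t".join(str(value) for value in row) for row in rows]


# Hypothetical sample readings; two of them are incomplete.
gps = [
    {"Time": "12:00:01", "Position": {"lat": 51.5, "lon": -0.1}, "Speed": 87.3},
    {"Time": "12:00:02", "Position": None, "Speed": 88.0},
    {"Time": None, "Position": {"lat": 51.6, "lon": -0.2}, "Speed": 90.1},
]

for line in to_tab_delimited(filter_and_flatten(gps)):
    print(line)
```

Only the first reading survives the filter, and its nested position becomes two ordinary columns, which is the shape the Hive table in the next step expects.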
Combining the readings into a single table
The Pig scripts that process the three Avro-format source files restructure the data for each sensor and
store it in tab-delimited files. To combine the data in these files the developers decided to use Hive
because of the simplicity it provides when querying tabular data structures. The first stage in this
process was to create a script that builds Hive tables over the output files generated by Pig. For
example, the following HiveQL code defines a table over the filtered GPS data.
HiveQL (createtables.hql)
CREATE TABLE gps (laptime STRING, lat DOUBLE, lon DOUBLE, speed FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/racecar/gps';


Similar code was used for the tables that will hold the engine and brake data.
The script also defines the schema for a table named lap that will store the combined data. This script
contains the following HiveQL code, which references a currently empty folder.
HiveQL (createtables.hql)
CREATE TABLE lap
(laptime STRING, lat DOUBLE, lon DOUBLE, speed FLOAT, revs FLOAT, oiltemp FLOAT,
braketemp FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/racecar/lap';

To combine the data from the three sensors and load it into the lap table, the developers used the
following HiveQL statement.
HiveQL (loadlaptable.hql)
FROM gps LEFT OUTER JOIN engine
ON (gps.laptime = engine.laptime) LEFT OUTER JOIN brake
ON (gps.laptime = brake.laptime)
INSERT INTO TABLE lap
SELECT gps.*, engine.revs, engine.oiltemp, brake.braketemp;

This code joins the data in the three tables based on a common time value (so that each row contains all
of the readings for a specific time), and inserts all fields from the gps table, the revs and oiltemp fields
from the engine table, and the braketemp field from the brake table, into the lap table.
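The effect of the two LEFT OUTER JOIN clauses can be sketched in Python as follows. The dictionaries stand in for the gps, engine, and brake tables, keyed by laptime; the sample values are invented for illustration.

```python
def left_outer_join(gps, engine, brake):
    """Join three tables on laptime, keeping every gps row.

    gps maps laptime -> (lat, lon, speed), engine maps laptime -> (revs, oiltemp),
    and brake maps laptime -> braketemp. Missing matches become None, the
    equivalent of NULL in the Hive result.
    """
    lap = []
    for laptime, (lat, lon, speed) in gps.items():
        revs, oiltemp = engine.get(laptime, (None, None))
        braketemp = brake.get(laptime)
        lap.append((laptime, lat, lon, speed, revs, oiltemp, braketemp))
    return lap


# Hypothetical sample data; the engine reading for 12:00:02 is missing.
gps_rows = {"12:00:01": (51.5, -0.1, 87.3), "12:00:02": (51.6, -0.2, 90.1)}
engine_rows = {"12:00:01": (7200.0, 105.2)}
brake_rows = {"12:00:01": 310.0, "12:00:02": 325.5}

for row in left_outer_join(gps_rows, engine_rows, brake_rows):
    print(row)
```

Because the join is a left outer join from gps, a GPS reading with no matching engine or brake reading still produces a row, with NULL in the unmatched columns. Hive writes NULL to a text file as \N by default.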

Loading the combined data to SQL Database


After the ETL process has filtered and combined the data, it loads it into the LapData table in Azure SQL
Database. To accomplish this the developers used Sqoop to copy the data from the folder on which the
lap Hive table is based and transfer it to the database. The following command shows an example of
how Sqoop can be used to perform this task.
Command Line
Sqoop export --connect "jdbc:sqlserver://abcd1234.database.windows.net:1433;
database=MyDatabase;username=MyLogin@abcd1234;
password=Pa$$w0rd;logintimeout=30;"
--table LapData
--export-dir /racecar/lap
--input-fields-terminated-by \t
--input-null-non-string \\N

Now that each of the tasks for the workflow has been defined, they can be combined into a workflow
definition. This is described in Encapsulating the ETL tasks in an Oozie workflow.
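When a command like the Sqoop export above is launched from code rather than typed at a prompt, it is usually safer to pass each argument as a separate list item, which avoids quoting problems with characters such as ; and $ in the connection string. The Python sketch below only builds that argument list; the function name and the placeholder connection string are illustrative, and the options are limited to those used in the command above.

```python
def build_sqoop_export_args(connection_string, table, export_dir):
    """Return the arguments for a Sqoop export, one list item per argument."""
    return [
        "export",
        "--connect", connection_string,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", "\\t",   # the two characters \ and t
        "--input-null-non-string", "\\\\N",      # the three characters \, \ and N
    ]


# Placeholder connection string; substitute real server and database names.
args = build_sqoop_export_args(
    "jdbc:sqlserver://server-name.database.windows.net:1433;database=database-name",
    "LapData",
    "/racecar/lap",
)

for arg in args:
    print(arg)
```

The list has the same one-argument-per-item shape that an Oozie Sqoop action uses for its arg elements.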

Encapsulating the ETL tasks in an Oozie workflow


Having defined the individual tasks that make up the ETL process, the developers decided to encapsulate
the ETL process in a workflow. This makes it easier to integrate the tasks into the existing telemetry
application, and reuse them regularly. The workflow must execute these tasks in the correct order and,
where appropriate, wait until each one completes before starting the next one.
A workflow defined in Oozie can fulfil these requirements, and enable automation of the entire process.
Figure 1 shows an overview of the Oozie workflow that the developers implemented.

Figure 1 - The Oozie workflow for the ETL process


Note that, in addition to the tasks described earlier, a new task has been added that drops any existing
Hive tables before processing the data. Because the Hive tables are INTERNAL, dropping them cleans up
any data left by previous uploads. This task uses the following HiveQL code.
HiveQL (droptables.hql)
DROP TABLE gps;
DROP TABLE engine;
DROP TABLE brake;
DROP TABLE lap;

The workflow includes a fork, enabling the three Pig tasks that filter the individual data files to be
executed in parallel. A join is then used to ensure that the next phase of the workflow doesn't start until
all three Pig jobs have finished.
If any of the tasks should fail, the workflow executes the kill task. This generates a message containing
details of the error, abandons any subsequent tasks, and halts the workflow. As long as there are no
errors, the workflow ends after the Sqoop task that loads the data into Azure SQL Database has
completed.
When executed, the workflow currently exits with an error. This is due to a fault in Oozie and is not an
error in the scripts. For more information see A CoordActionUpdateXCommand gets queued for all
workflows even if they were not launched by a coordinator.

Defining the workflow


The Oozie workflow shown in Figure 1 is defined in a Hadoop Process Definition Language (hPDL) file
named workflow.xml. The file contains a <start> and an <end> element that define where the workflow
process starts and ends, and a <kill> element that defines the action to take when prematurely halting
the workflow if an error occurs.
The following code shows an outline of the workflow definition file with the contents of each action
removed for clarity. The <start> element specifies that the first action to execute is the one named
DropTables. Each <action> includes an <ok> element with a to attribute specifying the next action (or
fork) to be performed, and an <error> element with a to attribute directing the workflow to the <kill>
action in the event of a failure. Notice the <fork> and <join> elements that delineate the actions that
can be executed in parallel.
hPDL (workflow.xml)
<workflow-app xmlns="uri:oozie:workflow:0.2" name="ETLWorkflow">
<start to="DropTables"/>
<action name="DropTables">
...
<ok to="CleanseData"/>
<error to="fail"/>
</action>


<fork name="CleanseData">
<path start="FilterGps" />
<path start="FilterEngine" />
<path start="FilterBrake" />
</fork>
<action name="FilterGps">
...
<ok to="CombineData"/>
<error to="fail"/>
</action>
<action name="FilterEngine">
...
<ok to="CombineData"/>
<error to="fail"/>
</action>
<action name="FilterBrake">
...
<ok to="CombineData"/>
<error to="fail"/>
</action>
<join name="CombineData" to="CreateTables" />
<action name="CreateTables">
...
<ok to="LoadLapTable"/>
<error to="fail"/>
</action>
<action name="LoadLapTable">
...
<ok to="TransferData"/>
<error to="fail"/>
</action>
<action name="TransferData">
...
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>


<end name="end"/>
</workflow-app>
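A workflow outline like this is easy to break by mistyping a transition target. The Python sketch below checks that every to and start attribute refers to a node that is defined in the file. It is a simple consistency check written for this example, not part of Oozie, and the embedded workflow is a trimmed version of the outline above with the action bodies omitted (so it would not pass Oozie's own schema validation).

```python
import xml.etree.ElementTree as ET

# Trimmed version of the workflow outline; action bodies are omitted.
WORKFLOW_XML = """\
<workflow-app xmlns="uri:oozie:workflow:0.2" name="ETLWorkflow">
  <start to="DropTables"/>
  <action name="DropTables"><ok to="CleanseData"/><error to="fail"/></action>
  <fork name="CleanseData">
    <path start="FilterGps"/>
    <path start="FilterEngine"/>
  </fork>
  <action name="FilterGps"><ok to="CombineData"/><error to="fail"/></action>
  <action name="FilterEngine"><ok to="CombineData"/><error to="fail"/></action>
  <join name="CombineData" to="TransferData"/>
  <action name="TransferData"><ok to="end"/><error to="fail"/></action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
"""


def undefined_targets(xml_text):
    """Return transition targets that do not match any named node."""
    root = ET.fromstring(xml_text)
    # Named nodes are the direct children of workflow-app.
    names = {node.get("name") for node in root if node.get("name")}
    targets = set()
    for element in root.iter():
        for attribute in ("to", "start"):
            value = element.get(attribute)
            if value:
                targets.add(value)
    return sorted(targets - names)


missing = undefined_targets(WORKFLOW_XML)
print("Undefined targets:", missing if missing else "none")
```

Running a check of this kind before uploading the workflow files can catch a misspelled node name that would otherwise only surface when the job is submitted.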

Each action in the workflow is of a particular type, indicated by the first child element of the <action>
element. For example, the following code shows the DropTables action, which uses Hive.
hPDL (workflow.xml)
...
<action name="DropTables">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>droptables.hql</script>
</hive>
<ok to="CleanseData"/>
<error to="fail"/>
</action>
...

The DropTables action references the script droptables.hql, which contains the HiveQL code to drop any
existing Hive tables. All the script files are stored in the same folder as the workflow.xml file. This folder
also contains files used by the workflow to determine configuration settings for specific execution
environments; for example, the hive-default.xml file referenced by all Hive actions contains the
environment settings for Hive.
The FilterGps action, shown in the following code, is a Pig action that references the gps.pig script. This
script contains the Pig Latin code to process the GPS data.
hPDL (workflow.xml)
...
<action name="FilterGps">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<script>gps.pig</script>
</pig>
<ok to="CombineData"/>


<error to="fail"/>
</action>
...

The FilterEngine and FilterBrake actions are similar to the FilterGps action, but specify the appropriate
value for the <script> element.
After the three filter actions have completed, following the <join> element in the workflow file, the
CreateTables action generates the new internal Hive tables over the data, and the LoadLapTable action
combines the data into the lap table. These are both Hive actions, defined as shown in the following
code.
hPDL (workflow.xml)
...
<action name="CreateTables">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>createtables.hql</script>
</hive>
<ok to="LoadLapTable"/>
<error to="fail"/>
</action>
<action name="LoadLapTable">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>loadlaptable.hql</script>


</hive>
<ok to="TransferData"/>
<error to="fail"/>
</action>
...

The final action is the TransferData action. This is a Sqoop action, defined as shown in the following
code.
hPDL (workflow.xml)
...
<action name="TransferData">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<arg>export</arg>
<arg>--connect</arg>
<arg>${connectionString}</arg>
<arg>--table</arg>
<arg>${targetSqlTable}</arg>
<arg>--export-dir</arg>
<arg>${outputDir}</arg>
<arg>--input-fields-terminated-by</arg>
<arg>\t</arg>
<arg>--input-null-non-string</arg>
<arg>\\N</arg>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
...

Several of the values used by the actions in this workflow are parameters that are set in the job
configuration, and are populated when the workflow executes. The syntax ${...} denotes a parameter
that is populated at runtime. For example, the TransferData action includes an argument for the
connection string to be used when connecting to Azure SQL Database. The value for this argument is
passed to the workflow as a parameter named connectionString. When running the Oozie workflow
from a command line, the parameter values can be specified in a job.properties file as shown in the
following example.
job.properties file
nameNode=wasbs://container-name@mystore.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/racecar/oozieworkflow/
outputDir=/racecar/lap/
connectionString=jdbc:sqlserver://server-name.database.windows.net:1433;database=database-name;user=user-name@server-name;password=password;encrypt=true;trustServerCertificate=true;loginTimeout=30;
targetSqlTable=LapData

The ability to abstract settings in a separate file makes the ETL workflow more flexible. It can be easily
adapted to handle future changes in the environment, such as a requirement to use alternative folder
locations or a different Azure SQL Database instance.
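The way ${...} parameters are resolved from a properties file can be sketched in a few lines of Python. This is a simplified illustration written for this guide, not Oozie's implementation: it handles only plain ${name} references and deliberately leaves expression language functions such as ${wf:errorMessage(...)} untouched.

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def substitute(template, props):
    """Replace each ${name} with its value; leave unknown names untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: props.get(match.group(1), match.group(0)),
        template,
    )


JOB_PROPERTIES = """\
# Values taken from the example job.properties file.
queueName=default
outputDir=/racecar/lap/
targetSqlTable=LapData
"""

props = parse_properties(JOB_PROPERTIES)
print(substitute("<arg>${targetSqlTable}</arg>", props))
print(substitute("<arg>${outputDir}</arg>", props))
```

Changing the target table or output folder then means editing one line of the properties file, with no change to workflow.xml.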
The job.properties file may contain sensitive information such as database connection strings and
credentials (as in the example above). This file is uploaded to the cluster and so cannot easily be
encrypted. Ensure you properly protect this file when it is stored outside of the cluster, such as on
client machines that will initiate the workflow, by applying appropriate file permissions and computer
security practices.

Executing the Oozie workflow interactively


To execute the Oozie workflow the administrators upload the workflow files and the job.properties file
to the HDInsight cluster, and then run the following command on the cluster.
Command Line
oozie job -oozie http://localhost:11000/oozie/
-config c:\ETL\job.properties
-run

When the Oozie job starts, the command line interface displays the unique ID assigned to the job. The
administrators can then view the progress of the job by using the browser on the HDInsight cluster to
display job status at http://localhost:11000/oozie/v0/job/the_unique_job_id?show=log.
With the Oozie workflow definition complete, the next stage is to automate its execution. This is
described in the next section, "Automating the ETL workflow."

Automating the ETL workflow


With the workflow tasks encapsulated in an Oozie workflow, the developers can automate the process
by using the classes in the .NET API for Hadoop to upload the workflow files and initiate the Oozie job.

Uploading the workflow files


To implement code to upload the workflow files, the developer used the same WebHDFSClient class
that was previously used to upload the serialized data files. The workflow files are stored in a folder
named OozieWorkflow in the execution folder of the console application, and the following code in the
UploadWorkflowFiles method uploads them to Azure storage.
C# (Program.cs)
var workflowLocalDir = new DirectoryInfo(@".\OozieWorkflow");
var hdfsClient = new WebHDFSClient(hdInsightUser,
new BlobStorageAdapter(storageName, storageKey, containerName, false));


await hdfsClient.DeleteDirectory(workflowDir);
foreach (var file in workflowLocalDir.GetFiles())
{
await hdfsClient.CreateFile(file.FullName, workflowDir + file.Name);
}

Notice that the code begins by deleting the workflow directory if it already exists in Azure blob storage,
and then uploads each file from the local OozieWorkflow folder.
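The mapping from local files to destination paths can be sketched in Python as follows. The function name is hypothetical and the sketch only plans the uploads; it does not contact Azure storage. Note that, as in the C# code, the destination is formed by simple concatenation, so the remote folder must end with a slash.

```python
import tempfile
from pathlib import Path


def plan_uploads(local_dir, remote_dir):
    """Map each file in local_dir to its destination path under remote_dir.

    remote_dir must end with '/', because the paths are concatenated.
    """
    return [
        (str(path), remote_dir + path.name)
        for path in sorted(Path(local_dir).iterdir())
        if path.is_file()
    ]


# Build a throwaway folder with two empty files to demonstrate the mapping.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("workflow.xml", "gps.pig"):
        (Path(tmp) / name).write_text("")
    plan = plan_uploads(tmp, "/racecar/oozieworkflow/")

for local, remote in plan:
    print(f"{Path(local).name} -> {remote}")
```

Deleting the remote folder before uploading, as the C# code does, ensures that files removed from the local OozieWorkflow folder do not linger in Azure storage from an earlier run.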

Initiating the Oozie job


To initiate the Oozie job the developers added using statements to reference the
Microsoft.Hadoop.WebClient.OozieClient, Microsoft.Hadoop.WebClient.OozieClient.Contracts, and
Newtonsoft.Json namespaces, and then added the following code to the application.
C# (Program.cs)
var hdInsightUser = ConfigurationManager.AppSettings["HDInsightUser"];
var hdInsightPassword = ConfigurationManager.AppSettings["HDInsightPassword"];
var storageName = ConfigurationManager.AppSettings["StorageName"];
var containerName = ConfigurationManager.AppSettings["ContainerName"];
var nameNodeHost = "wasbs://" + containerName + "@" + storageName +
".blob.core.windows.net";
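var workflowDir = ConfigurationManager.AppSettings["WorkflowDir"];
// InputDir is the same setting used earlier when uploading the serialized data files.
var inputDir = ConfigurationManager.AppSettings["InputDir"];
var outputDir = ConfigurationManager.AppSettings["OutputDir"];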
var sqlConnectionString = ConfigurationManager.AppSettings["SqlConnectionString"];
var targetSqlTable = ConfigurationManager.AppSettings["TargetSqlTable"];
var clusterName = ConfigurationManager.AppSettings["ClusterName"];
var clusterAddress = "https://" + clusterName + ".azurehdinsight.net";
var clusterUri = new Uri(clusterAddress);
// Create an Oozie job and execute it.
// Note: inputDir is not declared in this extract; it is assumed to be defined elsewhere in the application.
Console.WriteLine("Starting Oozie workflow...");
var client = new OozieHttpClient(clusterUri, hdInsightUser, hdInsightPassword);
var prop = new OozieJobProperties(hdInsightUser, nameNodeHost, "jobtrackerhost:9010",
workflowDir, inputDir, outputDir);
var parameters = prop.ToDictionary();
parameters.Add("oozie.use.system.libpath", "true");
parameters.Add("exportDir", outputDir);
parameters.Add("targetSqlTable", targetSqlTable);
parameters.Add("connectionString", sqlConnectionString);
var newJob = await client.SubmitJob(parameters);
var content = await newJob.Content.ReadAsStringAsync();
var serializer = new JsonSerializer();
dynamic json = serializer.Deserialize(new JsonTextReader(new StringReader(content)));

string id = json.id;
await client.StartJob(id);
Console.WriteLine("Oozie job started");
Console.WriteLine("View workflow progress at " + clusterAddress + "/oozie/v0/job/" +
id + "?show=log");

This code retrieves the parameters for the Oozie job from the App.Config file for the application, and
initiates the job on the HDInsight cluster. When the job is submitted, its ID is retrieved and the
application displays a message such as:
View workflow progress at https://mycluster.azurehdinsight.net/oozie/v0/job/job_id?show=log
Users can then browse to the URL indicated by the application to view the progress of the Oozie job as it
performs the ETL workflow tasks.
The final stage is to explore how the data in SQL Database can be used. An example is shown in the next
section, "Analyzing the loaded data."

Analyzing the loaded data


After successful completion of the ETL process the LapData table in Azure SQL Database is populated
with the filtered and combined telemetry data. Engineers in the team can then use familiar tools such as
Excel to connect to the database, import the data, and analyze and visualize it. For example, team
engineers could use Power View to visually compare key sensor reading values at specific points during
the lap, as shown in Figure 1.

Figure 1 - Visualizing racecar telemetry data in Power View


For information about using Power View see Visualizing data in Excel.

Scenario 4: BI integration
This scenario explores ways in which big data batch processing with HDInsight can be integrated into a
business intelligence (BI) solution in a corporate environment. The emphasis in this scenario is on the
challenges and techniques associated with integrating data from HDInsight into a BI ecosystem based on
Microsoft SQL Server and Office technologies. This includes integration at the report, corporate data
model, and data warehouse levels of an enterprise BI solution, as well as how insights from big data
analysis in HDInsight can be shared in a self-service BI solution built on Office 365 and Power BI.
The scenario includes and demonstrates:

Introduction to Adventure Works

The analytical goals

The HDInsight solution

Report level integration

Corporate data model integration

Data warehouse integration

Collaborative self-service BI

The data ingestion and processing elements of the example used in this scenario have been deliberately
kept simple in order to focus on the integration techniques. In a real-world solution the challenges of
obtaining the source data, loading it to the HDInsight cluster, and using map/reduce code, Pig, or Hive to
process it before consuming it in a BI infrastructure are likely to be more complex than described in this
scenario.

Introduction to Adventure Works


Adventure Works is a fictional company that manufactures and sells bicycles and cycling equipment.
Adventure Works sells its products through an international network of resellers and also through its
own e-commerce site, which is hosted in an Internet Information Services (IIS)-based datacenter.
As a large-scale multinational company, Adventure Works has already made a considerable investment
in data and BI technologies to support formal corporate reporting, as well as for business analysis. In
addition to an enterprise BI solution that includes reporting and dashboards, Adventure Works has
empowered business analysts to perform self-service analysis and reporting using Excel, and has
recently added the Power BI service to the company's Office 365 subscription.

The existing enterprise BI solution


The BI solution at Adventure Works is built on an enterprise data warehouse hosted in SQL Server
Enterprise Edition. SQL Server Integration Services (SSIS) is used to refresh the data warehouse with new
and updated data from line of business systems. This includes sales transactions, financial accounts, and
customer profile data. The high-level architecture of the Adventure Works BI solution is shown in Figure 1.

Figure 1 - The Adventure Works enterprise BI solution


The enterprise data warehouse is based on a dimensional design in which multiple dimensions of the
business are conformed across aggregated measures. The dimensions are implemented as dimension
tables, and the measures are stored in fact tables at the lowest common level of grain (or granularity). A
partial schema of the data warehouse is shown in Figure 2.

Figure 2 - Partial data warehouse schema


Figure 2 shows only a subset of the data warehouse schema. Some tables and columns have been
omitted for clarity.
The data warehouse supports a corporate data model, which is implemented as a SQL Server Analysis
Services (SSAS) tabular database. This data model provides a cube for analysis in Excel and reporting in
SQL Server Reporting Services (SSRS), and is shown in Figure 3.

Figure 3 - An SSAS tabular data model


The SSAS data model supports corporate reporting, including a sales performance dashboard that shows
a scorecard of key performance indicators (KPIs), as shown in Figure 4. Additionally, the data model is
used by Adventure Works business users to create PivotTables and PivotCharts in Excel.

Figure 4 - An SSRS report based on the SSAS data model


In addition to the managed reporting and analysis supported by the corporate data model, Adventure
Works has empowered business analysts to engage in self-service BI activities, including the creation of
SSRS reports with Report Builder and the use of PowerPivot in Excel to create personal data models
directly from tables in the data warehouse, as shown in Figure 5.

Figure 5 - Creating a personal data model with PowerPivot

The analytical goals


The existing BI solution provides comprehensive reporting and analysis of important business data. The
IIS web server farm hosting the e-commerce site generates log files that are retained and used for
troubleshooting purposes, but until now these log files have never been considered a viable source of
business information. The web servers generate a new log file each day, and each log contains details of
every request received and processed by the web site. This provides a huge volume of data that could
potentially provide useful insights into customers' activity on the e-commerce site.
A small subset of the log file data is shown in the following example.

Log file contents


#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2008-01-01 09:15:23
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query
sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
2008-01-01 00:01:00 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
2008-01-01 00:04:00 198.51.100.5 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:05:00 198.51.100.6 - 192.0.0.1 80 GET /product.aspx productid=BL-2036
200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:06:00 198.51.100.7 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
/product.aspx productid=CN-6137 200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com

The ability to analyze the log data and summarize website activity over time would help the business to
measure the amount of data transferred during web requests, and potentially correlate web activity
with sales transactions to better understand trends and patterns in e-commerce sales. However, the
large volume of log data that must be processed in order to extract these insights has prevented the
company from attempting to include the log data in the enterprise data warehouse.
The company has recently decided to use HDInsight to process and summarize the log data so that it can
be reduced to a more manageable volume, and integrated into the enterprise BI ecosystem. The
developers will integrate the results of the processing at all three levels of their existing BI system, as
shown in Figure 6, and also enable self-service BI through Power BI for Office 365.

Figure 6 - The three levels for integration of the results into the existing BI system.

The HDInsight solution


The HDInsight solution for Scenario 4: BI integration is based on an HDInsight cluster, enabling
Adventure Works to dynamically increase or reduce cluster resources as required.

Creating the Hive tables


To support the goal of summarizing the data, the BI developer intends to create a Hive table that
can be queried from client applications such as Excel. However, the raw source data includes header
rows that must be excluded from the analysis, and the space-delimited text format of the source data will
not provide optimal query performance as the volume of data grows.
The developer therefore decided to create a temporary staging table, and use it as a source for a query
that loads the required data into a permanent table that is optimized for the typical analytical queries
that will be used. The following HiveQL statement has been used to define a staging table for the log
data.
HiveQL
DROP TABLE log_staging;
CREATE TABLE log_staging
(logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
cs_User_Agent STRING, cs_Referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE LOCATION '/data';

This Hive table defines a schema for the log file, making it possible to use a query that filters the rows in
order to load the required data into a permanent table for analysis. Notice that the staging table is
based on the /data folder but it is not defined as EXTERNAL, so dropping the staging table after the
required rows have been loaded into the permanent table will delete the source files that are no longer
required.
When designing the permanent table for analytical queries, the BI developer has decided to partition
the data by year and month to improve query performance when extracting data. To achieve this, a
second Hive statement is used to define a partitioned table. Notice the PARTITIONED BY clause near
the end of the following script. This instructs Hive to add two columns named year and month to the
table, and to partition the data loaded into the table based on the values inserted into these columns.
HiveQL
DROP TABLE iis_log;
CREATE TABLE iis_log
(logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
cs_User_Agent STRING, cs_Referrer STRING)
PARTITIONED BY (year INT, month INT)
STORED AS SEQUENCEFILE;

The Hive scripts to create the tables are saved as text files in a local folder named scripts.
Storing the data in SEQUENCEFILE format can improve performance. You might also consider using the
ORC file format, which provides a highly efficient way to store Hive data and can improve performance
when reading, writing, and processing data. See ORC File Format for more information.
Next, the following Hive script is created to load data from the log_staging table into the iis_log table.
This script takes the values from the columns in the log_staging Hive table, calculates the values for the
year and month of each row, and inserts these rows into the partitioned iis_log Hive table.
HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM log_staging s
INSERT INTO TABLE iis_log PARTITION (year, month)
SELECT s.logdate, s.logtime, s.c_ip, s.cs_username, s.s_ip, s.s_port, s.cs_method,
s.cs_uri_stem, s.cs_uri_query, s.sc_status, s.sc_bytes, s.cs_bytes,
s.time_taken, s.cs_User_Agent, s.cs_Referrer,
SUBSTR(s.logdate, 1, 4) year, SUBSTR(s.logdate, 6, 2) month
WHERE SUBSTR(s.logdate, 1, 1) <> '#';

The source log data includes a number of header rows that are prefixed with the # character, which
could cause errors or add unnecessary complexity to summarizing the data. To resolve this, the HiveQL
statement shown above includes a WHERE clause that ignores rows starting with "#" so that they are
not loaded into the permanent table.
To maximize performance the script includes statements that specify the output from the query should
be compressed. The code also sets the dynamic partition mode to nonstrict, enabling rows to be
dynamically inserted into the appropriate partitions based on the values of the partition columns.
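The filtering and partitioning logic of the load script can be illustrated outside Hive. The following Python sketch is only an analogy, not part of the Adventure Works solution: the load_rows helper and the second sample log line are hypothetical, and the slicing simply mirrors the SUBSTR expressions in the HiveQL statement.

```python
# Illustrative only: mirrors the HiveQL load logic in plain Python.
# The helper name and the sample lines are assumptions for this sketch.

def load_rows(lines):
    """Skip '#' header rows, split on spaces, and derive year and month."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # same effect as WHERE SUBSTR(logdate, 1, 1) <> '#'
        fields = line.split(" ")
        logdate = fields[0]
        rows.append({
            "logdate": logdate,
            "fields": fields,
            "year": int(logdate[0:4]),   # SUBSTR(logdate, 1, 4)
            "month": int(logdate[5:7]),  # SUBSTR(logdate, 6, 2)
        })
    return rows


sample = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Version: 1.0",
    "2008-01-01 00:01:00 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100",
    "2008-02-15 00:05:00 198.51.100.6 - 192.0.0.1 80 GET /product.aspx productid=BL-2036 200 1000 1000 100",
]

rows = load_rows(sample)

# Each distinct (year, month) pair corresponds to one dynamic partition.
partitions = sorted({(row["year"], row["month"]) for row in rows})
print(partitions)
```

Each distinct pair of year and month values corresponds to one partition that Hive creates dynamically when the partition mode is nonstrict.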

Indexing the Hive table


To further improve performance, the BI developer at Adventure Works decided to index the iis_log table
on the date because most queries will need to select and sort the results by date. The following HiveQL
statement creates an index on the logdate column of the table.
HiveQL
CREATE INDEX idx_logdate ON TABLE iis_log (logdate)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

When data is added to the iis_log table the index can be updated using the following HiveQL statement.
HiveQL
ALTER INDEX idx_logdate ON iis_log REBUILD;

The scripts to load the iis_log table and build the index are also saved in the local scripts folder.
Tests revealed that indexing the tables provided only a small improvement in performance of queries,
and that building the index took longer than the time saved when running the query. However, the
results depend on factors such as the volume of source data, and so you should experiment to see if
indexing is a useful optimization technique in your scenario.

Loading data into the Hive tables


To upload the IIS log files to HDInsight, and load the Hive tables, the team at Adventure Works created
the following PowerShell script:
Windows PowerShell
# Azure subscription-specific variables.
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
# Find the local folder where this script is stored.
$thisfolder = Split-Path -parent $MyInvocation.MyCommand.Definition
# Upload the scripts.
$localfolder = "$thisfolder\scripts"
$destfolder = "scripts"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
  -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach($file in $files)
{
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $filename -Container $containerName `
    -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"
# Run scripts to create Hive tables.
write-host "Creating Hive tables..."
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/scripts/CreateTables.txt"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError
# Upload data to staging table.
$localfolder = "$thisfolder\iislogs"
$destfolder = "data"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
  -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach($file in $files)
{
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $filename -Container $containerName `
    -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"
# Run scripts to load Hive tables.
write-host "Loading Hive table..."
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/scripts/LoadTables.txt"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError
# All done!
write-host "Finished!"

This script performs the following actions:

It uploads the contents of the local scripts folder to the /scripts folder in HDInsight.

It runs the CreateTables.txt Hive script to drop and recreate the log_staging and iis_log tables
(any previously uploaded data will be deleted because both are internal tables).

It uploads the contents of the local iislogs folder to the /data folder in HDInsight (thereby
loading the source data into the staging table).

It runs the LoadTables.txt Hive script to load the data from the log_staging table into the iis_log
table and create an index (note that the text data in the staging table is implicitly converted to
SEQUENCEFILE format as it is inserted into the iis_log table).

Returning the required data


After running the PowerShell script that uploads the data to Azure blob storage and inserts it into the
Hive tables, the data is now ready for use in the corporate BI system, and it can be consumed by
querying the iis_log table. For example, the following script returns all of the data for all years and
months.
HiveQL
SELECT * FROM iis_log

Where a more restricted dataset is required the BI developer or business user can use a script that
selects on the year and month columns, and transforms the data as required. For example, the following
script extracts just the data for the first quarter of 2012, aggregates the number of hits for each day (the
logdate column contains the date in the form yyyy-mm-dd), and returns a dataset with two columns: the
date and the total number of page hits.
HiveQL
SELECT logdate, COUNT(*) pagehits FROM iis_log
WHERE year = 2012 AND month < 4
GROUP BY logdate
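
As a rough illustration of what this query computes, the following Python sketch applies the same filter and grouping to a handful of in-memory rows. The rows and the q1_page_hits helper are hypothetical stand-ins for the iis_log table, not part of the solution.

```python
from collections import Counter

# Hypothetical rows standing in for the iis_log table: (logdate, year, month).
rows = [
    ("2012-01-01", 2012, 1),
    ("2012-01-01", 2012, 1),
    ("2012-03-31", 2012, 3),
    ("2012-04-01", 2012, 4),  # outside the first quarter
    ("2011-02-01", 2011, 2),  # different year
]


def q1_page_hits(rows, year):
    """Count rows per logdate where the year matches and month < 4."""
    hits = Counter(logdate for logdate, y, m in rows if y == year and m < 4)
    return sorted(hits.items())


result = q1_page_hits(rows, 2012)
print(result)
```

Because year and month are partition columns, Hive can restrict a query like this to the matching partitions instead of scanning the whole table, which a plain in-memory filter does not show.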

Report level integration


Now that the log data is encapsulated in a Hive table, HiveQL queries can be used to summarize the log
entries and extract aggregated values for reporting and analysis. Initially, business analysts at Adventure
Works want to try to find a relationship between the number of website hits (obtained from the web
server log data in HDInsight) and the number of items sold (available from the enterprise data
warehouse). Since only the business analysts require this combined data, there is no need at this stage
to integrate the web log data from HDInsight into the entire enterprise BI solution. Instead, a business
analyst can use PowerPivot to create a personal data model in Excel specifically for this mashup analysis.

Adding data from a Hive table to a PowerPivot model


The business analyst starts with the PowerPivot model shown in Figure 5 in the topic Scenario 4: BI
integration, which includes a Date table and an Internet Sales table based on the DimDate and
FactInternetSales tables in the data warehouse. To add IIS log data from HDInsight the business analyst
uses an ODBC connection based on the Microsoft Hive ODBC driver, as shown in Figure 1.

Figure 1 - Creating an ODBC connection to Hive on HDInsight

The ODBC connection to the HDInsight cluster is typically defined in a data source name (DSN) on the
local computer, which makes it easy to define a connection for programs that will access data in the
cluster. The DSN encapsulates a connection string such as this:
Connection string
DRIVER={Microsoft Hive ODBC Driver};Host=<cluster_name>.azurehdinsight.net;Port=443;
Schema=default;RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;
UID=UserName;PWD=Password;DefaultStringColumnLength=4000

After the connection has been defined and tested, the business analyst uses the following HiveQL query
to create a new table named Page Hits that contains aggregated log data from HDInsight.
HiveQL
SELECT logdate, COUNT(*) hits FROM iis_log GROUP BY logdate

This query returns a single row for each distinct date that has log entries, along with a count of the
number of page hits that were recorded on that date. The logdate values in the underlying Hive table
are defined as text, but the yyyy-mm-dd format of the text values means that the business analyst can
simply change the data type for the column in the PowerPivot table to Date, making it possible to create
a relationship that joins the logdate column in the Page Hits table to the Date column in the Date table,
as shown in Figure 2.

Figure 2 - Creating a relationship in PowerPivot


When you are designing Hive queries that return tables you want to integrate with your BI system, you
should plan ahead by considering the appropriate format for columns that contain key values you will
use in the relationships with existing tables.
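
The effect of a consistent key format can be sketched in a few lines of Python. This is only an analogy for the PowerPivot relationship: the page_hits and units_sold values are invented sample data, and to_date is a hypothetical helper.

```python
from datetime import date, datetime

# Hypothetical data: page hits keyed by text dates (as returned from Hive)
# and units sold keyed by real date values (as in a Date table).
page_hits = {"2008-01-01": 120, "2008-01-02": 95}
units_sold = {date(2008, 1, 1): 8, date(2008, 1, 2): 6}


def to_date(text):
    """Convert a yyyy-mm-dd string to a date; raises ValueError otherwise."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Join the two datasets on the converted date key.
combined = {}
for text, hits in page_hits.items():
    key = to_date(text)
    combined[key] = (hits, units_sold.get(key, 0))

for key in sorted(combined):
    print(key, combined[key])
```

If the Hive query returned dates in a format that could not be converted reliably, the join key would first need an explicit transformation, which is why it helps to settle the key format when the query is designed.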
The business analyst can now use Excel to analyze the data and try to correlate website activity in
the form of page hits with sales transactions in terms of the number of units sold. Figure 3 shows how
the business analyst can use Power View in Excel to visually compare page hits and sales over a
six-month period and by quarter, determine weekly patterns of website activity, and filter the visualizations
to show comparisons for a specific weekday.

Figure 3 - Using Power View in Excel to analyze data from HDInsight and the data warehouse

By integrating IIS log data from HDInsight with enterprise BI data at the report level, business analysts
can create mashup reports and analyses without impacting the BI infrastructure used for corporate
reporting. However, after using this report-level integration to explore the possibilities of using IIS log
data to increase understanding of the business, it has become apparent that the log data could be useful
to a wider audience of users than just business analysts, and for a wider range of business processes.
This can be achieved through corporate data model integration, discussed in the next section.

Corporate data model integration


Senior executives of the e-commerce division at Adventure Works want to expose the newly discovered
information from the web server log files to a wider audience, and use it in BI-focused business
processes. Specifically, they want to compare the ratio of web page hits to sales as a core metric that
can be included as a key performance indicator (KPI) in a business scorecard. The goal is to achieve a hit
to sale ratio of around 6.5% (in other words, between six and seven items are sold for every hundred
web page hits).

Scorecards and dashboards for Adventure Works are currently based on the SSAS corporate data model,
which is also used to support formal reports and analytical business processes. The corporate data
model is implemented as an SSAS database in tabular mode, so the process to add a table for the IIS log
data is similar to the one used to import the results of a HiveQL query into a PowerPivot model. A BI
developer uses SQL Server Data Tools to add an ODBC connection to the HDInsight cluster and create a
new table named Page Hits based on the following query.
HiveQL
SELECT logdate, SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) pagehits
FROM iis_log GROUP BY logdate

Notice that this query includes more columns than the one previously used in the personal data model,
making it useful for more kinds of analysis by a wider audience.
The fact that Adventure Works is using SSAS in tabular mode makes it possible to connect to an ODBC
source such as Hive. If SSAS had been installed in multidimensional mode, the developer would have
had to either extract the data from HDInsight into an OLE DB compliant data source, or base the data
model on a linked SQL Server database that has a remote server connection over ODBC to the Hive
tables.
After the Page Hits table has been created and the data imported into the model, the data type of the
logdate column is changed to Date and a relationship is created with the Date table in the same way as
in the PowerPivot data model discussed in Report level integration. However, one significant difference
between PowerPivot models and SSAS tabular models is that SSAS does not create implicit aggregated
measures from numeric columns in the same way as PowerPivot does.
The Page Hits table contains a row for each date, with the total bytes sent and received, and the total
number of page hits for that date. The BI developer created explicit measures to aggregate the sc_bytes,
cs_bytes, and pagehits values across multiple dates based on Data Analysis Expressions (DAX) formulas,
as shown in Figure 1.

Figure 1 - Creating measures in an SSAS tabular data model


These measures include DAX expressions that calculate the sum of page hits as a measure named Hits,
the sum of sc_bytes as a measure named Bytes Sent, and the sum of cs_bytes as a measure named
Bytes Received. Additionally, to support the requirement to track page hits to sales performance against
a target of 6.5%, the following measures have been added to the Internet Sales table.
DAX
Actual Units:=SUM([Order Quantity])
Target Units:=([Hits]*0.065)

A KPI is then defined on the Actual Units measure, as shown in Figure 2.

Figure 2 - Defining a KPI in a tabular data model
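
The arithmetic behind the Actual Units and Target Units measures can be sketched in Python as follows. The hits_to_sales_kpi function and the sample figures are hypothetical; only the 6.5% target and the two formulas come from the scenario, and the attainment ratio is one possible way to compare the two measures.

```python
TARGET_RATIO = 0.065  # the 6.5% hit-to-sale target from the scenario


def hits_to_sales_kpi(hits, order_quantities):
    """Return actual units, target units, and attainment against the target."""
    actual_units = sum(order_quantities)  # Actual Units := SUM([Order Quantity])
    target_units = hits * TARGET_RATIO    # Target Units := [Hits] * 0.065
    # With no hits there is no target, so attainment is undefined.
    attainment = actual_units / target_units if target_units else None
    return actual_units, target_units, attainment


# Invented sample values: 1000 page hits and three orders.
actual, target, attainment = hits_to_sales_kpi(1000, [20, 25, 15])
print(f"actual={actual} target={target:.1f} attainment={attainment:.0%}")
```

In the tabular model the KPI thresholds are configured in the designer rather than in code, so this sketch only shows how the two measures relate to each other.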


When the changes to the data model are complete, it is deployed to an SSAS server and fully processed
with the latest data. It can then be used as a data source for reports and analytical tools. For example,
Figure 3 shows that the dashboard report used to summarize business performance has been modified
to include a scorecard for the hits to sales ratio KPI that was added to the data model.

Figure 3 - A report showing a KPI from a corporate data model


Dashboards are a great way to present high-level views of information, and are often appreciated most
by business managers because they provide an easy way to keep track of performance of multiple
sectors across the organization. By including the ability to drill down into the information, you make
the dashboard even more valuable to these types of users.
In addition, because the data from HDInsight has now been integrated at the corporate data model
level, it is more readily available to a wider range of users who may not have the necessary skills and
experience to create their own data models or reports from an HDInsight source. For example, Figure 4
shows how a network administrator in the IT department can use the corporate data model as a source
for a PivotChart in Excel that shows data transfer information for the e-commerce site.

Figure 4 - Using a corporate data model in Excel

Data warehouse integration


Having explored Report level integration and Corporate data model integration for visualizing total sales
and total website hits, the BI developer at Adventure Works now needs to respond to requests from
business analysts who would like to use the web server log information to analyze page hits for
individual products, and include this information in managed and self-service reports.
The IIS log entries include the query string passed to the web page, which for pages that display
information about a product includes a productid parameter with a value that matches the relevant
product code. Details of products are already stored in the data warehouse but, because product is
implemented as a slowly changing dimension, the product records in the data warehouse are uniquely
identified by a surrogate key and not by the original product code. Although the product code
referenced in the query string is retained as an alternate key, it is not guaranteed to be unique if details
of the product have changed over time.

The requirement to match the product code to the surrogate key for the appropriate version of the
product makes integration at the report or corporate data model levels problematic. It is possible, at
any level, to perform complex lookups that find the surrogate key that applied to an alternate key at the
time a specific page hit occurred (assuming both the surrogate and alternate keys for the products are
included in the data model or report dataset). However, it is more practical to integrate the IIS log data
into the dimensional model of the data warehouse so that the relationship with the product dimension
(and the date dimension) is present throughout the entire enterprise BI stack.
The most problematic task for achieving integration with BI systems at data warehouse level is
typically matching the keys in the source data with the correct surrogate key in the data warehouse
tables where changes to the existing data over time prompt the use of an alternate key.
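
The kind of lookup involved can be sketched in Python. This is a minimal illustration that assumes each version of a product carries a validity period; the start_date and end_date fields, the key values, and the lookup_surrogate_key helper are all hypothetical, because the source does not show how the product dimension records its history.

```python
from datetime import date

# Hypothetical slowly changing dimension rows: each version of a product has
# its own surrogate key and a validity period (end_date None = current version).
dim_product = [
    {"product_key": 101, "product_code": "BL-2036",
     "start_date": date(2007, 1, 1), "end_date": date(2007, 12, 31)},
    {"product_key": 245, "product_code": "BL-2036",
     "start_date": date(2008, 1, 1), "end_date": None},
]


def lookup_surrogate_key(product_code, hit_date):
    """Return the surrogate key of the version valid on hit_date, or None."""
    for row in dim_product:
        if (row["product_code"] == product_code
                and row["start_date"] <= hit_date
                and (row["end_date"] is None or hit_date <= row["end_date"])):
            return row["product_key"]
    return None


print(lookup_surrogate_key("BL-2036", date(2007, 6, 1)))
print(lookup_surrogate_key("BL-2036", date(2008, 3, 1)))
```

The same product code resolves to different surrogate keys depending on the date of the page hit, which is why the lookup is easier to perform once, during the data warehouse load, than in every report or data model.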

Changes to the data warehouse schema


To integrate the IIS log data into the dimensional model of the data warehouse, the BI developer creates
a new fact table in the data warehouse for the IIS log data, as shown in the following Transact-SQL
statement.
Transact-SQL
CREATE TABLE [dbo].[FactIISLog](
[LogDateKey] [int] NOT NULL REFERENCES DimDate(DateKey),
[ProductKey] [int] NOT NULL REFERENCES DimProduct(ProductKey),
[BytesSent] decimal NULL, [BytesReceived] decimal NULL, [PageHits] integer NULL);

The table will be loaded with new log data on a regular schedule as part of the ETL process for the data
warehouse. In common with most data warehouse ETL processes, the solution at Adventure Works
makes use of staging tables as an interim store for new data, making it easier to coordinate data loads
into multiple tables and perform lookups for surrogate key values. A staging table is created in a
separate staging schema using the following Transact-SQL statement.
Transact-SQL
CREATE TABLE staging.IISLog([LogDate] nvarchar(50) NOT NULL,
[ProductID] nvarchar(50) NOT NULL, [BytesSent] decimal NULL,
[BytesReceived] decimal NULL, [PageHits] int NULL);

Good practice when regularly loading data into a data warehouse is to minimize the amount of data
extracted from each data source so that only data that has been inserted or modified since the last
refresh cycle is included. This minimizes extraction and load times, and reduces the impact of the ETL
process on network bandwidth and storage utilization. There are many common techniques you can use
to restrict extractions to only modified data, and some data sources support change tracking or change
data capture (CDC) capabilities to simplify this.
In the absence of support in Hive tables for restricting extractions to only modified data, the BI
developers at Adventure Works have decided to use a common pattern that is often referred to as a
high water mark technique. In this pattern the highest log date value that has been loaded into the data
warehouse is recorded, and used as a filter boundary for the next extraction. To facilitate this, the
following Transact-SQL statement is used to create an extraction log table and initialize it with a default
value.
Transact-SQL
CREATE TABLE staging.highwater([ExtractDate] datetime DEFAULT GETDATE(),
[HighValue] nvarchar(200));
INSERT INTO staging.highwater (HighValue) VALUES ('0000-00-00');

Extracting data from HDInsight


After the modifications to the data warehouse schema and staging area have been made, a solution to
extract the IIS log data from HDInsight can be designed. There are many ways to transfer data from
HDInsight to a SQL Server database, including:

Using Sqoop to export the data from HDInsight and push it to SQL Server.

Using PolyBase to combine the data in an HDInsight cluster with a Microsoft Analytics Platform
System (APS) database (PolyBase is available only in APS appliances).

Using SSIS to extract the data from HDInsight and load it into SQL Server.

In the case of the Adventure Works scenario, the data warehouse is hosted on an on-premises server
that cannot be accessed from outside the corporate firewall. Since the HDInsight cluster is hosted
externally in Azure, the use of Sqoop to push the data to SQL Server is not a viable option. In addition,
the SQL Server instance used to host the data warehouse is running SQL Server 2012 Enterprise Edition,
not APS, and PolyBase cannot be used in this scenario.
The most appropriate option, therefore, is to use SSIS to implement a package that transfers data from
HDInsight to the staging table. Since SSIS is already used as the ETL platform for loading data from other
business sources to the data warehouse, this option also reduces the development and management
challenges for implementing an ETL solution to extract data from HDInsight.
The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a
wealth of useful information about using SSIS with HDInsight.
The control flow for an SSIS package to extract the IIS log data from HDInsight is shown in Figure 1.

Figure 1 - SSIS control flow for HDInsight data extraction


This control flow performs the following sequence of tasks:
1. An Execute SQL task that extracts the HighValue value from the staging.highwater table and
stores it in a variable named HighWaterMark.
2. An Execute SQL task that truncates the staging.IISLog table, removing any rows left over from
the last time the extraction process was executed.
3. A Data Flow task that transfers the data from HDInsight to the staging table.
4. An Execute SQL task that updates the HighValue value in staging.highwater with the highest log
date value that has been extracted.
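
The four tasks above can be sketched in Python as a rough illustration of the high water mark cycle. This is not the SSIS package itself: an in-memory SQLite database stands in for the SQL Server staging area, a plain list stands in for the Hive table, and the function name, the simplified table definitions, and the sample rows are all assumptions made for the example.

```python
import sqlite3


def run_extraction_cycle(conn, source_rows):
    """Mimic the four control flow tasks against a local SQLite database."""
    cur = conn.cursor()
    # Task 1: read the current high water mark into a variable.
    cur.execute("SELECT HighValue FROM highwater")
    high_water_mark = cur.fetchone()[0]
    # Task 2: remove any rows left over from the previous extraction.
    cur.execute("DELETE FROM IISLog")
    # Task 3: stage only the rows that are newer than the high water mark.
    # YYYY-MM-DD strings sort in date order, so a string comparison is enough.
    new_rows = [row for row in source_rows if row[0] > high_water_mark]
    cur.executemany(
        "INSERT INTO IISLog (LogDate, ProductID, PageHits) VALUES (?, ?, ?)",
        new_rows,
    )
    # Task 4: record the highest staged log date, if any rows arrived.
    cur.execute("SELECT MAX(LogDate) FROM IISLog")
    highest = cur.fetchone()[0]
    if highest is not None:
        cur.execute("UPDATE highwater SET HighValue = ?", (highest,))
    conn.commit()
    return len(new_rows)


# Simplified stand-ins for staging.highwater and staging.IISLog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE highwater (HighValue TEXT)")
conn.execute("INSERT INTO highwater (HighValue) VALUES ('0000-00-00')")
conn.execute("CREATE TABLE IISLog (LogDate TEXT, ProductID TEXT, PageHits INTEGER)")

# Hypothetical summarized log rows: (logdate, productid, page hits).
source = [("2014-04-01", "BK-M68B-38", 12), ("2014-04-02", "-", 40)]
first_count = run_extraction_cycle(conn, source)   # both rows are new
second_count = run_extraction_cycle(conn, source)  # nothing newer to stage
final_high_value = conn.execute("SELECT HighValue FROM highwater").fetchone()[0]
```

Running the cycle a second time over the same source stages no rows, because every log date is now at or below the recorded boundary. The default value 0000-00-00 sorts before any real date, which is why the first run loads everything.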
The Data Flow task in step 3 extracts the data from HDInsight and performs any necessary data type
validation and conversion before loading it into a staging table, as shown in Figure 2.

Figure 2 - SSIS data flow to extract data from HDInsight


The data flow consists of the following sequence of components:
1. An ODBC Source that extracts data from the iis_log Hive table in the HDInsight cluster.
2. A Data Type Conversion transformation that ensures data type compatibility and prevents field
truncation issues.
3. An OLE DB Destination that loads the extracted data into the staging.IISLog table.
If there is a substantial volume of data in the Hive table the extraction might take a long time, so it
might be necessary to set the timeout on the ODBC source component to a high value. Alternatively, the
refresh of data into the data warehouse could be carried out more often, such as every week. You
could create a separate Hive table for the extraction and automate running a HiveQL statement every
day to copy the log entries for that day to the extraction table after they've been uploaded into the
iis_page_hits table. At the end of the week you perform the extraction from the (smaller) extraction
table, and then empty it ready for next week's data.
Every SSIS Data Flow includes an Expressions property that can be used to dynamically assign values to
properties of the data flow or the components it contains, as shown in Figure 3.

Figure 3 - Assigning a property expression


In this data flow, the SqlCommand property of the Hive Table ODBC source component is assigned using
the following expression.
SSIS expression
"SELECT logdate, REGEXP_REPLACE(cs_uri_query, 'productid=', '') productid,
SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) PageHits
FROM iis_log WHERE logdate > '" + @[User::HighWaterMark] + "'
GROUP BY logdate, REGEXP_REPLACE(cs_uri_query, 'productid=', '')"

This expression consists of a HiveQL query to extract the required data, combined with the value of the
HighWaterMark SSIS variable to filter the data being extracted so that only rows with a logdate value
greater than the highest ones already in the data warehouse are included.
Based on the log files for the Adventure Works e-commerce site, the cs_uri_query values in the web
server log file contain either the value - (for requests with no query string) or productid=product-code
(where product-code is the product code for the requested product). The HiveQL query includes a
regular expression that parses the cs_uri_query value and removes the text productid=. The Hive
query therefore generates a result set that includes a productid column, which contains either the
product code value or -.
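
As a rough illustration of what the REGEXP_REPLACE call does to each cs_uri_query value, the following Python sketch applies the same literal pattern with the standard re module. The function name and the sample product code are assumptions made for the example, and Python's regular expression engine is not the one Hive uses, although the two behave the same way for a literal pattern such as this.

```python
import re


def extract_product_id(cs_uri_query):
    """Remove the literal text 'productid=' from a query string value."""
    return re.sub(r"productid=", "", cs_uri_query)


# A request for a product page leaves just the product code.
with_product = extract_product_id("productid=BK-M68B-38")

# A request with no query string is logged as '-' and passes through unchanged.
without_query = extract_product_id("-")
```

Because the value - survives the replacement unchanged, it can later be matched to the dimension row that represents requests for no product.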

The query string example in this scenario is deliberately simplistic in order to reduce complexity. In a
real-world solution, parsing query strings in a web server log may require a significantly more complex
expression, and may even require a user-defined Java function.
After the data has been extracted it flows to the Data Type Conversion transformation, which converts
the logdate and productid values to 50-character Unicode strings. The rows then flow to the Staging
Table destination, which loads them into the staging.IISLog table.

Loading staged data into a fact table


After the data has been staged, a second SSIS package is used to load it into the fact table in the data
warehouse. This package consists of a control flow that contains a single Execute SQL task to execute
the following Transact-SQL code.
Transact-SQL
INSERT INTO dbo.FactIISLog
(LogDateKey, ProductKey, BytesSent, BytesReceived, PageHits)
SELECT d.DateKey, p.ProductKey, s.BytesSent, s.BytesReceived, s.PageHits
FROM staging.IISLog s JOIN DimDate d ON s.LogDate = d.FullDateAlternateKey
JOIN DimProduct p ON s.productid = p.ProductAlternateKey
AND (p.StartDate <= s.LogDate AND (p.EndDate IS NULL OR p.EndDate > s.LogDate))
ORDER BY d.DateKey;

The code inserts rows from the staging.IISLog table into the dbo.FactIISLog table, looking up the
appropriate dimension keys for the date and product dimensions. The surrogate key for the date
dimension is an integer value derived from the year, month, and day. The LogDate value extracted
from HDInsight is a string in the format YYYY-MM-DD. SQL Server can implicitly convert values in this
format to the Date data type, so a join can be made to the FullDateAlternateKey column in the
DimDate table to find the appropriate surrogate DateKey value. The DimProduct dimension table
includes a row for None (with the ProductAlternateKey value -), and one or more rows for each
product.
Each product row has a unique ProductKey value (the surrogate key) and an alternate key that matches
the product code extracted by the Hive query. However, because Product is a slowly changing
dimension there may be multiple rows with the same ProductAlternateKey value, each representing the
same product at a different point in time. When loading the product data, the appropriate surrogate key
for the version of the product that was current when the web page was requested must be looked up
based on the alternate key and the start and end date values associated with the dimension record, so
the join for the DimProduct table in the Transact-SQL code includes a clause to check for a StartDate
value that is before the log date, and an EndDate value that is either after the log date or null (for the
record representing the current version of the product member).
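
The date-range lookup described above can be sketched in Python as follows. This is an illustration of the join condition only, not code from the Adventure Works solution: the function name, the surrogate key values, the dates, and the product code are all hypothetical.

```python
from datetime import date


def find_product_key(dim_product, alternate_key, log_date):
    """Return the surrogate key of the product version current on log_date."""
    for row in dim_product:
        if (row["ProductAlternateKey"] == alternate_key
                and row["StartDate"] <= log_date
                and (row["EndDate"] is None or row["EndDate"] > log_date)):
            return row["ProductKey"]
    return None  # no matching version of the product


# Hypothetical dimension rows: the same product in two versions over time.
dim_product = [
    {"ProductKey": 210, "ProductAlternateKey": "BK-M68B-38",
     "StartDate": date(2012, 1, 1), "EndDate": date(2014, 1, 1)},
    {"ProductKey": 345, "ProductAlternateKey": "BK-M68B-38",
     "StartDate": date(2014, 1, 1), "EndDate": None},
]

# A page hit in 2013 maps to the earlier version of the product.
old_key = find_product_key(dim_product, "BK-M68B-38", date(2013, 6, 15))

# A page hit in April 2014 maps to the current version (EndDate is null).
current_key = find_product_key(dim_product, "BK-M68B-38", date(2014, 4, 2))
```

Treating the end date as exclusive, as the Transact-SQL join does, means a log date that falls exactly on a version boundary matches only the newer row.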

Using the data


Now that the IIS log data has been summarized by HDInsight and loaded into a fact table in the data
warehouse, it can be used in corporate data models and reports throughout the organization. The data
has been conformed to the dimensional model of the data warehouse, and this deep integration enables
business users to intuitively aggregate IIS activity across dates and products. For example, Figure 4
shows how a user can create a Power View visualization in Excel that includes sales and page view
information for product categories and individual products.

Figure 4 - A Power View report showing data that is integrated in the data warehouse
The inclusion of IIS server log data from HDInsight in the data warehouse enables it to be used easily
throughout the entire BI ecosystem, in managed corporate reports and self-service BI scenarios. For
example, a business user can use Report Builder to create a report that includes web site activity as well
as sales revenue from a single dataset, as shown in Figure 5.

Figure 5 - Creating a self-service report with Report Builder


When published to an SSRS report server, this self-service report provides an integrated view of business
data, as shown in Figure 6.

Figure 6 - A self-service report based on a data warehouse that contains data from HDInsight

Collaborative self-service BI
In addition to the enterprise BI solution at Adventure Works, described in Scenario 4: BI integration,
business analysts use Excel and SharePoint Server to create and share their own analytical models. This
self-service BI approach has become increasingly useful at Adventure Works because it makes it easier
for business analysts to rapidly develop custom reports that combine internal and external data, without
over-burdening the IT department with requests for changes to the data warehouse. The company has
therefore added the Power BI service to its corporate Office 365 subscription, and encourages business
analysts to use it to share insights gained from their analysis.

Custom big data processing


The business analysts at Adventure Works are not software developers, and do not have the expertise
to create custom map/reduce components using Java or C#. However, some of the senior analysts are
proficient in higher-level Hadoop languages such as Hive and Pig, and they can use these languages to
generate analytical datasets from source data in HDInsight. This ability to perform custom processing
means that the business analysts can store analytical datasets as files in Azure blob storage without
relying on the HDInsight cluster remaining available to service Hive queries.
For example, a senior business analyst can use the following Pig script to generate a result set that is
saved as a file in Azure blob storage.
Pig Latin
-- Load the space-delimited IIS log files and name the fields.
Logs = LOAD '/data' USING PigStorage(' ')
  AS (log_date, log_time, c_ip, cs_username, s_ip, s_port, cs_method, cs_uri_stem,
      cs_uri_query, sc_status, sc_bytes:int, cs_bytes:int, time_taken:int,
      cs_user_agent, cs_referrer);
-- Discard header lines, which start with '#'.
CleanLogs = FILTER Logs BY SUBSTRING(log_date, 0, 1) != '#';
-- Count the page hits and total the bytes for each day.
GroupedLogs = GROUP CleanLogs BY log_date;
GroupedTotals = FOREACH GroupedLogs GENERATE group, COUNT(CleanLogs) AS page_hits,
  SUM(CleanLogs.sc_bytes) AS bytes_received, SUM(CleanLogs.cs_bytes) AS bytes_sent;
DailyTotals = FOREACH GroupedTotals GENERATE FLATTEN(group) as log_date, page_hits,
  bytes_received, bytes_sent;
-- Sort by date and store the results in the /webtraffic folder.
SortedDailyTotals = ORDER DailyTotals BY log_date ASC;
STORE SortedDailyTotals INTO '/webtraffic';

Running this Pig script produces a file named part-r-00000 in the /webtraffic folder in the Azure blob
storage container used by the HDInsight cluster. The file contains the date, total page hits, and total
bytes received and sent, for each day. This file will be persisted even if the HDInsight cluster is
deactivated.
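
For readers less familiar with Pig Latin, the following Python sketch mirrors the logic of the script on a few in-memory log lines. It is an illustration only: the function name and the sample log lines are invented for the example, and it assumes the same fifteen space-delimited fields that the LOAD statement declares.

```python
from collections import defaultdict


def daily_totals(lines):
    """Return (log_date, page_hits, sc_bytes total, cs_bytes total) per day."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines, as the FILTER statement does
        fields = line.split(" ")
        log_date = fields[0]
        sc_bytes = int(fields[10])  # eleventh field in the LOAD schema
        cs_bytes = int(fields[11])  # twelfth field in the LOAD schema
        day = totals[log_date]
        day[0] += 1
        day[1] += sc_bytes
        day[2] += cs_bytes
    # Sort by date, as the ORDER statement does.
    return [(d, t[0], t[1], t[2]) for d, t in sorted(totals.items())]


# Hypothetical log lines in the field order used by the Pig script.
sample_lines = [
    "#Fields: date time c-ip cs-username s-ip s-port cs-method ...",
    "2014-04-02 09:15:00 10.0.0.7 - 10.0.0.2 80 GET /default.aspx - 200 700 50 12 Mozilla -",
    "2014-04-01 00:00:01 10.0.0.5 - 10.0.0.2 80 GET /default.aspx - 200 1000 200 15 Mozilla -",
    "2014-04-01 00:05:42 10.0.0.6 - 10.0.0.2 80 GET /product.aspx productid=BK-M68B-38 200 500 100 20 Mozilla -",
]

results = daily_totals(sample_lines)
```

The tuples follow the column order of the file that the Pig script stores: the date, the page hit count, the sc_bytes total, and the cs_bytes total.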

Creating and sharing a query


After processing the source data in HDInsight, business analysts can use Power Query in Excel to retrieve
the results as a file from the Azure storage container used by the HDInsight cluster. Power Query
provides a query editing wizard that makes it easy to connect to a range of data sources (including Azure
storage), filter and shape the data that is retrieved, and import it into a worksheet or data model.
Additionally, business analysts can combine multiple queries that retrieve data from different sources to
generate mashup datasets for analysis.
Figure 1 shows how Power Query can be used to import the output file generated by the Pig script
shown previously.

Figure 1 - Using Power Query to retrieve Pig output


Having defined a query to retrieve the output generated by the Pig script, the business analyst can sign
into Power BI for Office 365 and share the query with other users in the organization, as shown in Figure
2.

Figure 2 - Sharing a Query


After the query has been shared, other users in the organization can discover the results through the
Power BI Online Search feature, which enables users to search public datasets that are curated by
Microsoft as well as organizational datasets that are curated by data stewards within the organization.
Figure 3 shows an online search for the term "page hits". The results include the Adventure Works Web
Traffic query that was shared previously.

Figure 3 - Discovering data with Online Search

Self-service analysis with discovered data


Having discovered the dataset, a business user can use Power Query to refine and import it into an Excel
data model, combine it with other data, and include it in analysis and reports (as discussed in Report
level integration). For example, Figure 4 shows a Power View report based on data imported from a
shared query based on output of the Pig script discussed previously.

Figure 4 - A Power View report based on data imported from a shared query
To share the insights gained from the data, business users can publish Excel workbooks that contain
PowerPivot data models and Power View visualizations as reports in a Power BI site, as shown in Figure
5.

Figure 5 - A report in a Power BI site


Users can view and interact with reports in a Power BI site using a web browser or the Windows Store
app, which makes Power BI reports available on tablets and other touch devices running Windows 8 or
Windows RT. In addition, the Q&A feature of Power BI sites enables users to query data models in
shared workbooks using natural language queries. For example, users can generate data visualizations
by entering a query such as "How many page hits were there in April?" or "Show average bytes by
month". Power BI automatically chooses an appropriate way to visualize the data, but users can
override this in queries such as "Show average page hits and average sales amount by month as line
chart". Figure 6 shows how a business user can use Q&A in a Power BI site.

Figure 6 - Using Q&A in a Power BI site


Self-service BI offers significant advantages for organizations that want to empower users to explore big
data. It minimizes the burden on IT, and increases the agility of the organization by reducing the time it
takes to incorporate the results of big data processing into data models and reports. Self-service
solutions for sharing and discovering data and insights through tools like Power BI for Office 365 can
complement more formally managed enterprise BI infrastructure based on SQL Server technologies such
as Analysis Services, Reporting Services, and Integration Services. HDInsight can be used to integrate
big data processing results into an organization's BI in multiple ways.

Implementing big data solutions using HDInsight
This section of the guide is aimed at developers, and explores the typical stages of implementing a big
data solution. The examples focus on Microsoft Azure HDInsight but, because the underlying big data
framework is Hadoop, much of the information about loading, querying, visualization, and automation
is equally applicable to big data solutions built on non-Microsoft operating systems, and using services
other than HDInsight.

This section is divided into convenient areas that make it easier to understand the challenges, options,
solutions, and considerations for each stage. It describes and demonstrates the individual tasks that are
part of typical end-to-end big data solutions.

The following sections demonstrate the three main stages of the process, followed by an exploration of
how you can combine and automate them to build a comprehensive managed solution. The sections
are:

Obtaining the data and submitting it to the cluster. During this stage you decide how you
will collect the data you have identified as the source, and how you will get it into your big
data solution for processing. Often you will store the data in its raw format to avoid losing
any useful contextual information it contains, though you may choose to do some
pre-processing before storing it to remove duplication or to simplify it in some other way. You
must also make several decisions about how and when you will initialize a cluster and the
associated storage. For more details, see Collecting and loading data into HDInsight.

Processing the data. After you have started to collect and store the data, the next stage is
to develop the processing solutions you will use to extract the information you need. While
you can usually use Hive and Pig queries for even quite complex data extraction, you will
occasionally need to create map/reduce components to perform more complex queries
against the data. For more details, see Processing, querying, and transforming data using
HDInsight.

Visualizing and analyzing the results. Once you are satisfied that the solution is working
correctly and efficiently, you can plan and implement the analysis and visualization
approach you require. This may be loading the data directly into an application such as
Microsoft Excel, or exporting it into a database or enterprise BI system for further analysis,
reporting, charting, and more. For more details, see Consuming and visualizing data from
HDInsight.

Building an automated end-to-end solution. At this point it will become clear whether the
solution should become part of your organization's business management infrastructure,
complementing the other sources of information that you use to plan and monitor business
performance and strategy. If this is the case you should consider how you might automate
and manage some or all of the solution to provide predictable behavior, and perhaps so
that it is executed on a schedule. For more details, see Building end-to-end solutions using
HDInsight.

Collecting and loading data into HDInsight


This section of the guide explores how you can load data into your Hadoop-based big data solutions. It
describes several different but typical data ingestion techniques that are generally applicable to any big
data solution. These techniques include handling streaming data and automating the ingestion process.
While the focus is primarily on Microsoft Azure HDInsight, many of the techniques described here are
equally relevant to solutions built on other Hadoop frameworks and platforms.
Figure 1 shows an overview of the techniques and technologies related to this section of the guide.

Figure 1 - Overview of data ingestion techniques and technologies for HDInsight


For more details of the tools shown in Figure 1 see the tables in Appendix A - Tools and technologies
reference.
The following topics in this section discuss the considerations for collecting and loading data into your
big data solutions:

Data types and data sources

Cluster and storage initialization

Performance and reliability

Pre-processing and serializing the data

Choosing tools and technologies

Building custom clients

Security is also a fundamental concern in all computing scenarios, and big data processing is no
exception. Security considerations apply during all stages of a big data process, and include securing
data while in transit over the network, securing data in storage, and authenticating and authorizing
users who have access to the tools and utilities you use as part of your process. For more details of how
you can maximize security of your HDInsight solutions see the topic Security in the section Building
end-to-end solutions using HDInsight.

Data types and data sources


The data sources for a big data solution are likely to be extremely variable. Typical examples of data
sources are web clickstreams, social media, server logs, devices and sensors, and geo-location data.
Some data may be persisted in a repository such as a database or a NoSQL store (including cloud-based
storage), while other data may be accessible only as a stream of events.
There are specific tools designed to handle different types of data and different data sources. For
example, streaming data may need to be captured and persisted so that it can be processed in batches.
Data may also need to be staged prior to loading it so that it can be pre-processed to convert it into a
form suitable for processing in a big data cluster.
However, you can collect and load almost any type of data from almost anywhere, even if you decide
not to stage and/or pre-process the data. For example, you can use a custom Input Formatter to load
data that is not exposed in a suitable format for the built-in Hadoop Input Formatters.
The blog page Analyzing Azure Table Storage data with HDInsight demonstrates how you can use a
custom Input Formatter to collect and load data from Azure table storage.
For information about handling streaming data and pre-processing data, see the topic Pre-processing
and serializing the data in this section of the guide. For information about choosing a tool specific to
data sources such as relational databases and server log files, see the topic Choosing tools and
technologies in this section of the guide.

Considerations
When planning how you will obtain the source data for your big data solution, consider the following:

You may need to load data from a range of different data sources such as websites, RSS feeds,
clickstreams, custom applications and APIs, relational databases, and more. It's vital to ensure
that you can submit this data efficiently and accurately to cluster storage, including performing
any preprocessing that may be required to capture the data and convert it into a suitable form.

In some cases, such as when the data source is an internal business application or database,
extracting the data into a file in a form that can be consumed by your solution is relatively
straightforward. In the case of external data obtained from sources such as governments and
commercial data providers, the data is often available for download in a suitable format.
However, in other cases you may need to extract data through a web service or other API,
perhaps by making a REST call or using code.

You may need to stage data before submitting it to a big data cluster for processing. For
example, you may want to persist streaming data so that it can be processed in batches, or
collect data from more than one data source and combine the datasets before loading this into
the cluster. Staging is also useful when combining data from multiple sources that have
different formats and velocity (rate of arrival).

Dedicated tools are available for handling specific types of data such as relational or server log
data. See Choosing tools and technologies for more information.

More information
For more information about HDInsight, see the Microsoft Azure HDInsight web page.
For a guide to uploading data to HDInsight, and some of the tools available to help, see Upload data to
HDInsight on the HDInsight website.
For more details of how HDInsight uses Azure blob storage, see Use Microsoft Azure Blob storage with
HDInsight on the HDInsight website.

Cluster and storage initialization


HDInsight stores its data in Azure blob storage. This enables you to manage the location of your data,
and retain the data if you need to delete and recreate a cluster. It also enables flexibility in that you can
store data for different applications in different storage accounts, initiate a cluster only when required,
process data in the relevant storage account, and then delete the cluster afterwards to minimize
running costs. To understand how you can do this, see the following sections of this topic:

Storage accounts and containers

Deleting and recreating clusters

For information about creating a cluster using scripts or code see Custom cluster management clients.

Storage accounts and containers


When you create a cluster, unless you specify otherwise, HDInsight creates a new storage account in the
same datacenter as the cluster and uses the default container in this account to store its data. This
account is automatically linked to the cluster (linked storage accounts are sometimes referred to as
connected, configured, or managed storage accounts). The default container in this storage account is
used by HDInsight as the root of the virtual HDFS store. When you access data using just the relative
path (such as /myfiles/thisone.txt) HDInsight uses the default container.
You can add additional storage accounts during the cluster creation process. You can specify existing
storage accounts, or you can allow HDInsight to create new storage accounts. These storage accounts
are also linked to the cluster. HDInsight stores the credentials required to access all of the linked storage
accounts in its configuration. Linked storage accounts are not deleted when you delete the cluster,
which means that the data in them is retained and can be accessed afterwards.
However, you can process data from, and store the results in any blob storage container in any Azure
storage account by specifying the account name and the storage key when you submit a job. This
provides flexibility, and can help to improve the security and manageability of your solution. For
example, you can store parts of your data in separate storage accounts to help protect and isolate
sensitive information, or use different storage accounts to stage data as part of your ingestion process.
You can also reduce runtime costs by creating the storage account and loading the data before you
create the cluster. Additionally, using non-linked storage accounts can help to maximize security by
isolating data for different users or tenants and allowing each one to manage their own storage account
and upload the data to it themselves, before you process the data in your HDInsight cluster.
Considerations
Keep in mind the following when deciding how and when you will create storage accounts for a cluster:

The main advantage of allowing HDInsight to create one or more storage accounts that are
automatically linked to the cluster during the creation process is that you do not need to specify
the storage account credentials, such as the storage account name and key, when you access
the data in a query or transformation process running on your HDInsight cluster. HDInsight
automatically stores the required credentials within its configuration. However, you will need to
obtain the storage key when you want to upload data to the storage account and access the
results.

The main advantage of using non-linked storage accounts and containers is the flexibility this
provides in choosing the storage account to use with each job. However, you must specify the
target storage account name and key within your query or transformation when you access
data stored in accounts that are not linked to the cluster.

You can specify the storage accounts that are linked to the cluster only when you create the
cluster. You cannot add or remove linked accounts after a cluster has been created. If you need
more than one storage account to be linked to your cluster, you must specify them all as part of
the cluster creation operation.

You can create the storage accounts before or after you create the cluster. Typically you will use
this capability to minimize cluster runtime cost by creating the storage accounts (or using
existing storage accounts) and loading the data before you create the cluster.

If you store parts of your data in different storage accounts, perhaps to separate sensitive data
such as personally identifiable information (PII) and account information from non-sensitive
data, you can create a cluster that uses just a subset of these as the linked accounts. This allows
you to isolate and protect parts of the data while avoiding the need to specify storage account
credentials in queries and transformations. Be aware, however, that code running in HDInsight
will have full access to all of the data in a linked account because the account name and key are
stored in the cluster configuration.

If you do not specify the storage account and path to the data when you submit a job, HDInsight
will use the default container. If you intend to use accounts and containers other than the
default, or delete and then recreate a cluster over the same data, specify the full path of the
account and container in all queries and transformation processes that you will execute on your
HDInsight cluster. This ensures that each job accesses the correct container, and prevents errors
if you subsequently delete and recreate the cluster with different default containers. The full
path and name of a container is in the form
wasbs://[container-name]@[storage-account-name].blob.core.windows.net.

Any storage accounts associated with an HDInsight cluster should be in the same datacenter as
the cluster, and must not be in an affinity group. Using a container in a storage account in a
different datacenter will result in delays as data is transmitted between datacenters, and you
will be billed for these data transfers.
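
As a small illustration of the full container path format mentioned in the considerations above, the following Python sketch builds a fully qualified wasbs path from its parts. The function name, the container name, and the storage account name are hypothetical; only the overall form of the path comes from this guide.

```python
def wasb_container_path(container_name, storage_account_name, relative_path=""):
    """Build a fully qualified wasbs:// path for a blob container."""
    root = "wasbs://{0}@{1}.blob.core.windows.net".format(
        container_name, storage_account_name
    )
    if not relative_path:
        return root
    # Avoid a doubled slash when the relative path already starts with one.
    return root + "/" + relative_path.lstrip("/")


# Hypothetical container and account names, for illustration only.
container_root = wasb_container_path("mycontainer", "myaccount")
full_path = wasb_container_path("mycontainer", "myaccount", "/myfiles/thisone.txt")
```

Using the full path in queries and transformations, rather than a relative path such as /myfiles/thisone.txt, means that a job does not depend on which container happens to be the default for the cluster.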

For more information see Use Azure Blob storage with HDInsight, Provision Hadoop clusters in
HDInsight, and Using an HDInsight Cluster with Alternate Storage Accounts and Metastores.

Deleting and recreating clusters


When you delete a cluster, the data for that cluster is retained in the associated Azure blob storage
containers. When you subsequently create a new cluster you can specify this container as the default
container, and all of the data will be available for processing in the cluster. You can specify multiple
existing storage accounts and containers when you create the cluster, which means that you can create
a cluster over just the data that you want to process.
However, some metadata for the cluster is stored in an Azure SQL Database. This includes the
definitions of any Hive tables you created with the EXTERNAL option, and HCatalog metadata that maps
data files to schemas (the data for Hive tables is in blob storage). By default this database is created and
populated with the cluster metadata automatically when a new cluster is created.
When the cluster is deleted, the database is also deleted. To avoid this you can use the option available
when creating the cluster that allows you to specify an existing database to hold the cluster metadata.
This database is not deleted when the cluster is deleted. When you recreate the cluster you can specify
this database, and all of the metadata it contains will be available to the cluster. The Hive tables (and
any indexes or other features they contain) and the HCatalog information will be available and
accessible in the new cluster.
Considerations
Keep in mind the following when deleting and recreating your HDInsight clusters:

If you want to retain the schema definitions of Hive tables and the HCatalog metadata, you
must specify an existing SQL Database instance when you create the cluster for the first time. If
you allow HDInsight to create the database, it will be deleted when you delete the cluster.

The data for Hive tables you create in the cluster is retained only if you specify the EXTERNAL
option when you create the tables.

You can back up and restore a SQL Database instance, and export or import the data, using the
tools provided by the Azure management portal or through scripting using the REST interface
for SQL Database.

Ensure you set the required configuration properties for a cluster when you create it. You can
change some properties at runtime for individual jobs (see Configuring and debugging solutions
for details), but you cannot change the properties of an existing cluster. See Custom cluster
management clients for information about automating the creation of clusters and setting
cluster properties.

Performance and reliability


Big data solutions typically work against vast quantities of source data, and are usually automated to
some extent. The results they produce may be used to make important business decisions, and so it is
vital to ensure that the processes you use to collect and load data both provide an appropriate level of
performance and operate in a consistent and reliable way. This topic discusses the requirements and
considerations for performance and scalability, and for reliability.

Performance and scalability


Big data solutions typically work against vast quantities of source data. In some cases this data may be
progressively added to the cluster's data store by appending it to existing files, or by adding new files, as
the data is collected. This is common when collecting streaming data such as clickstreams or sensor data
that will be processed later.
However, in many cases you may need to upload large volumes of data to a cluster before starting a
processing job. Uploading multiple terabytes, or even petabytes, of data will take many hours. During
this time, the freshness of the data is impaired and the results may even be obtained too late to actually
be useful.
Therefore, it is vital to consider how you can optimize upload processes in order to mitigate this latency.
Common approaches are to use multiple parallel upload streams, compress the data, use the most
efficient data transfer protocols, and ensure that data uploads can be resumed from the point where
they failed should network connectivity be interrupted.
Considerations for performance and scalability
Consider the following performance and scalability factors when designing your data ingestion
processes:

Consider if you can avoid the need to upload large volumes of data as a discrete operation
before you can begin processing it. For example, you might be able to append data to existing
files in the cluster, or upload it in small batches on a schedule.

If possible, choose or create a utility that can upload data in parallel using multiple threads to
reduce upload time, and that can resume uploads that are interrupted by a temporary loss of network
connectivity. Some utilities may be able to split the data into small blocks and upload multiple
blocks or small files in parallel, and then combine them into larger files after they have been
uploaded.

Bottlenecks when loading data are often caused by lack of network bandwidth. Adding more
threads may not improve throughput, and can cause additional latency due to the opening and
closing of the connection for each item. In many cases, reusing the connection (which avoids
the TCP ramp-up) is more important.

You can often reduce upload time considerably by compressing the data. If the data at the
destination should be uncompressed, consider compressing it before uploading it and then
decompressing it within the datacenter. Alternatively, you can use one of the HDInsight-compatible
compression codecs so that the data in compressed form can be read directly by
HDInsight. This can also improve the efficiency and reduce the running time of jobs that use
large volumes of data. Compression may be done as a discrete operation before you upload the
data, or within a custom utility as part of the upload process. For more details see
Pre-processing and serializing the data.

Consider if you can reduce the volume of data to upload by pre-processing it. For example, you
may be able to remove null values or empty rows, consolidate some parts of the data, or strip
out unnecessary columns and values. This should be done in staging, and you should ensure
that you keep a copy of the original data in case the information it contains is required
elsewhere or later. What may seem unnecessary today may turn out to be useful tomorrow.

If you are uploading data continuously or on a schedule by appending it to existing files,
consider how this may impact the query and transformation processes you execute on the data.
The source data folder contents for a query in HDInsight should not be modified while a query
process is running against this data. This may mean saving new data to a separate location and
combining it later, or scheduling processing so that it executes only when new data is not being
appended to the files.

Choose efficient transfer protocols for uploading data, and ensure that the process can resume
an interrupted upload from the point where it failed. For example, some tools such as Aspera,
Signiant, and FileCatalyst use UDP for the data transfer, with TCP working in parallel to validate
the uploaded data packages by ensuring each one is complete and has not been corrupted
during the process.

If one instance of the uploader tool does not meet the performance criteria, consider using
multiple instances to scale out and increase the upload velocity if the tool can support this.
Tools such as Flume, Storm, Kafka, and Samza can scale to multiple instances. SSIS can also be
scaled out, as described in the presentation Scaling Out SSIS with Parallelism (note that you will
require additional licenses for this). Each instance of the uploader you choose might create a
separate file or set of files that can be processed as a batch, or could be combined into fewer
files or a single file by a process running on the cluster servers.

Measure the performance of upload processes to confirm that the steps you
take to maximize performance are appropriate to different types and varying volumes of data.
What works for one type of upload process may not provide optimum performance for other
upload processes with different types of data. This is particularly the case when using
serialization or compression. Balance the effects of the processes you use to maximize upload
performance with the impact these have on subsequent query and transformation processing
jobs within the cluster.
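
The parallel, resumable upload approach described in the considerations above can be sketched as
follows. This is an illustration under stated assumptions rather than a definitive implementation:
upload_block is a hypothetical callable supplied by the caller (for example, a wrapper around a
storage client), the block size is arbitrary, and any OSError is treated as a transient failure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Optional, Set, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size in bytes; tune for your network


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (block_index, block_bytes) pairs that together cover the payload."""
    for index, offset in enumerate(range(0, len(data), block_size)):
        yield index, data[offset:offset + block_size]


def upload_in_parallel(
    data: bytes,
    upload_block: Callable[[int, bytes], None],  # hypothetical uploader supplied by the caller
    done: Optional[Iterable[int]] = None,
    workers: int = 4,
    block_size: int = BLOCK_SIZE,
) -> Set[int]:
    """Upload blocks not yet listed in `done`; return the set of completed block indexes."""
    completed = set(done or ())
    pending = [(i, block) for i, block in split_blocks(data, block_size) if i not in completed]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload_block, i, block): i for i, block in pending}
        for future, index in futures.items():
            try:
                future.result()
                completed.add(index)
            except OSError:
                # Treated as a transient network failure: the block stays out of
                # `completed`, so a later call uploads it again.
                pass
    return completed
```

Persisting the returned set between runs allows an interrupted upload to resume from the blocks
that failed instead of starting again. As noted above, adding workers helps only while network
bandwidth is not the bottleneck, so measure before increasing the thread count.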

Reliability
Data uploads must be reliable to ensure that the data is accurately represented in the cluster. For
example, you might need to validate the uploaded data before processing it. Transient failures or errors
that might occur during the upload process must be prevented from corrupting the data.
However, keep in mind that validation extends beyond just comparing the uploaded data with the
original files. For example, you may extend data validation to ensure that the original source data does
not contain values that are obviously inaccurate or invalid, and that there are no temporal or logical
gaps where data that should be included is missing.
To ensure reliability, and to be able to track and cure faults, you will also need to monitor the process.
Using logs to record upload success and failure, and capturing any available error messages, provides a
way to ensure the process is working as expected and to locate issues that may affect reliability.
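As an illustration of handling transient failures while logging each outcome, the following
minimal sketch retries an operation with exponential backoff. The function name, the parameters,
and the decision to treat OSError as transient are assumptions; the Azure Storage client libraries
provide their own retry handling, which is preferable where it is available.

```python
import logging
import time

log = logging.getLogger("upload")


def upload_with_retry(upload, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `upload()` until it succeeds, backing off exponentially between tries."""
    for attempt in range(1, attempts + 1):
        try:
            result = upload()
            log.info("upload succeeded on attempt %d", attempt)
            return result
        except OSError as error:  # assumed to indicate a transient failure
            log.warning("upload attempt %d failed: %s", attempt, error)
            if attempt == attempts:
                raise  # give up and surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter exists so that the delay can be replaced in tests; in normal use the default
waits 1, 2, 4, and then 8 seconds between the five attempts.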
Considerations for reliability
Consider the following reliability factors when designing your data ingestion processes:

Choose a technology or create an upload tool that can handle transient connectivity and
transmission failures, and can properly resume the process when the problem clears. Many of
the APIs exposed by Azure and HDInsight, and SDKs such as the Azure Storage client libraries,
include built-in transient fault handling. If you are building custom tools that do not use
these libraries or APIs, you can include this capability using a framework such as the Transient
Fault Handling Application Block.

Monitor upload processes so that failures are detected early and can be fixed before they have
an impact on the reliability of the solution and the accuracy or timeliness of the results. Also
ensure you log all upload operations, including both successes and failures, and any error
information that is available. This is invaluable when trying to trace problems. Some tools, such
as Flume and CloudBerry, can generate log files. AzCopy provides a command line option to log
the upload or download status. You can also enable the built-in monitoring and logging for
many Azure features such as storage, and use the APIs they expose to generate logs. If you are
building a custom data upload utility, you should ensure it can be configured to log all
operations.

Implement lineage tracking where possible by recording each stage involved in a process so that
the root cause of failures can be identified by tracing the issue back to its original source.

Consider validating the data after it has been uploaded to ensure consistency, integrity, and
accuracy of the results and to detect any loss or corruption that may have occurred during the
transmission to cluster storage. You might also consider validating the data before you upload
it, although a large volume of data arriving at high velocity may make this impossible. Common
types of validation include counting the number of rows or records, checking for values that
exceed specific minimum or maximum values, and comparing the overall totals for numeric
fields. You may also apply more in-depth approaches such as using a data dictionary to ensure
relevant values meet business rules and constraints, or cross-referencing fields to ensure that
matching values are present in the corresponding reference tables.
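
The simpler validation checks listed above (row counts, range checks, and totals for numeric
fields) can be sketched as shown here. The field name, the limits, and the sample data are
hypothetical, and the sketch assumes delimited text with a header row.

```python
import csv
import io


def summarize(csv_text, numeric_field, minimum, maximum):
    """Return the row count, field total, and out-of-range count for delimited text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[numeric_field]) for row in rows]
    return {
        "rows": len(rows),
        "total": round(sum(values), 6),
        "out_of_range": sum(1 for value in values if not minimum <= value <= maximum),
    }


def matches(source_summary, uploaded_summary):
    """Compare the row count and total computed before and after the upload."""
    return (source_summary["rows"] == uploaded_summary["rows"]
            and source_summary["total"] == uploaded_summary["total"])


source = "id,temp\n1,20.5\n2,21.0\n3,999\n"   # hypothetical sensor readings
uploaded = "id,temp\n1,20.5\n2,21.0\n"        # simulates a truncated upload
before = summarize(source, "temp", -50, 60)
after = summarize(uploaded, "temp", -50, 60)
print(before, matches(before, after))
```

Computing the summary at the source and again in cluster storage detects loss during
transmission, while the out-of-range count flags suspect values in the source data itself.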

Pre-processing and serializing the data


Big Data solutions such as HDInsight are designed to meet almost any type of data processing
requirement, with almost any type of source data. However, there may be times when you need to
perform some transformations or other operations on the source data, before or as you load it into
HDInsight. These operations may include capturing and staging streaming data, staging data from
multiple sources that arrives at different velocities, and compressing or serializing the data. This topic
discusses the considerations for pre-processing data and techniques for serialization and compression.

Pre-processing data
You may want to perform some pre-processing on the source data before you load it into the cluster.
For example, you may decide to pre-process the data in order to simplify queries or transformations, to
improve performance, or to ensure accuracy of the results. Pre-processing might also be required to
cleanse and validate the data before uploading it, to serialize or compress the data, or to improve
upload efficiency by removing irrelevant or unnecessary rows or values.
Considerations for pre-processing data
Consider the following pre-processing factors when designing your data ingestion processes:

Before you implement a mechanism to pre-process the source data, consider if this processing
could be better handled within your cluster as part of a query, transformation, or workflow.
Many of the data preparation tasks may not be practical, or even possible, when you have very
large volumes of data. They are more likely to be possible when you stream data from your data
sources, or extract it in small blocks on a regular basis. Where you have large volumes of data to
process you will probably perform these pre-processing tasks within your big data solution as the
initial steps in a series of transformations and queries.

You may need to handle data that arrives as a stream. You may choose to convert and buffer
the incoming data so that it can be processed in batches, or consider a real-time stream
processing technology such as Storm (see the section Overview of Storm in the topic Data
processing tools and techniques) or StreamInsight.

You may need to format individual parts of the data by, for example, combining fields in an
address, removing duplicates, converting numeric values to their text representation, or
changing date strings to standard numerical date values.

You may want to perform some automated data validation and cleansing by using a technology
such as SQL Server Data Quality Services before submitting the data to cluster storage. For
example, you might need to convert different versions of the same value into a single leading
value (such as changing NY and Big Apple into New York).

If reference data you need to combine with the source data is not already available as an
appropriately formatted file, you can prepare it for upload and processing using a tool such as
Excel to extract a relatively small volume of tabular data from a data source, reformat it as
required, and save it as a delimited text file. Excel supports a range of data sources, including
relational databases, XML documents, OData feeds, and the Azure Data Market. You can also
use Excel to import a table of data from any website, including an RSS feed. In addition to the
standard Excel data import capabilities, you can use add-ins such as Power Query to import and
transform data from a wide range of sources.

Be careful when removing information from the data; if possible keep a copy of the original
files. You may subsequently find the fields you removed are useful as you refine queries, or if
you use the data for a different analytical task.
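
Some of the cleansing steps described above (mapping variants of a value to a single leading
value, and removing empty and duplicate rows) can be sketched as follows. The mapping table and
the sample data are hypothetical, and the sketch assumes well-formed delimited text with a header
row. It returns new text and leaves the original untouched, in line with the advice to keep a copy
of the source files.

```python
import csv
import io

# Example mapping taken from the text; extend it for your own reference data.
CANONICAL = {"ny": "New York", "big apple": "New York", "new york": "New York"}


def cleanse(csv_text, field):
    """Normalize one field, then drop empty rows and exact duplicate rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    seen = set()
    for row in reader:
        if not any((value or "").strip() for value in row.values()):
            continue  # skip rows in which every field is empty
        text = (row.get(field) or "").strip()
        row[field] = CANONICAL.get(text.lower(), text)
        fingerprint = tuple(row.values())
        if fingerprint in seen:
            continue  # skip rows already written after normalization
        seen.add(fingerprint)
        writer.writerow(row)
    return output.getvalue()


raw = "id,city\n1,NY\n2,Big Apple\n1,New York\n,\n3,Seattle\n"
print(cleanse(raw, "city"))
```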

Serialization and compression


In many cases you can reduce data upload time for large source files by serializing and/or compressing
the source data before, or as, you upload it to your cluster storage. Serialization is useful when the data
can be assembled as a series of objects, or it is in a semi-structured format. Although there are many
different serialization formats available, the Avro serialization format is ideal because it works
seamlessly with Hadoop and contains both the schema and the data. Avro provides rich data
structures, a compact and fast binary data format suitable for data transmission using
remote procedure calls (RPC), a container file to store persistent data, and simple integration
with dynamic programming languages.
Other commonly used serialization formats are:

Protocol Buffers. This is an open source project from Google designed to provide a platform-neutral
and language-neutral inter-process communication (IPC) and serialization framework. See
ProtocolBuffers on the Hadoop Wiki for more information.

Optimized Row Columnar (ORC). This provides a highly efficient way to store Hive data in a way
that was designed to overcome limitations of the other Hive file formats. The ORC format can
improve performance when Hive is reading, writing, and processing data. See ORC File Format
for more information.

Compression can improve the performance of data processing on the cluster by reducing I/O and
network usage for each node in the cluster as it loads the data from storage into memory. However,
compression does increase the processing overhead for each node, and so it cannot be guaranteed to
reduce execution time. Compression is typically carried out using one of the standard algorithms for
which a compression codec is installed by default in Hadoop.

You can combine serialization and compression to achieve optimum performance when you use Avro
because, in addition to serializing the data, you can specify a codec that will compress it.
Tools for Avro serialization and compression
An SDK is available from NuGet that contains classes to help you work with Avro from programs and
tools you create using .NET languages. For more information see Serialize data with the Avro Library on
the Azure website and Apache Avro Documentation on the Apache website. A simple example of using
the Microsoft library for Avro is included in this guide; see Serializing data with the Microsoft .NET
Library for Avro.
To compress the source data if you are not using Avro or another utility that supports compression, you
can usually use the tools provided by the codec supplier. For example, the downloadable libraries for
both GZip and BZip2 include tools that can help you apply compression. For more details see the
distribution sources for GZip and BZip2 on SourceForge.
You can also use the classes in the .NET Framework to perform GZip and DEFLATE compression on your
source files, perhaps by writing command line utilities that are executed as part of an automated upload
and processing sequence. For more details see the GZipStream Class and DeflateStream Class reference
sections on MSDN.
Another alternative is to create a query job that is configured to write output in compressed form using
one of the built-in codecs, and then execute the job against existing uncompressed data in storage so
that it selects all or some part of the source data and writes it back to storage in compressed form. For
an example of using Hive to do this see the Microsoft White Paper Compression in Hadoop.
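The command line utility approach mentioned above can be illustrated with a short sketch. It uses
the Python standard library rather than the .NET classes, so it is a stand-in for the technique
and not the method used in this guide. The function name is an assumption; the .gz extension
follows the codecs table in this topic.

```python
import gzip
import shutil


def gzip_file(source_path, target_path=None):
    """Compress a file to GZip format, streaming in chunks; return the target path."""
    target_path = target_path or source_path + ".gz"
    with open(source_path, "rb") as source, gzip.open(target_path, "wb") as target:
        # Copy in 1 MB chunks so that large files are never loaded into memory whole.
        shutil.copyfileobj(source, target, length=1024 * 1024)
    return target_path
```

Keeping the standard .gz extension on the output means that the file type can be detected from
the name when the file is later processed.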
Compression libraries available in HDInsight
The following table shows the class name of the codecs provided with HDInsight when this guide was
written. The table shows the standard file extension for files compressed with the codec, and whether
the codec supports split file compression and decompression.
Format    Codec                                         Extension    Splittable
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec    .deflate     No
GZip      org.apache.hadoop.io.compress.GzipCodec       .gz          No
BZip2     org.apache.hadoop.io.compress.BZip2Codec      .bz2         Yes
Snappy    org.apache.hadoop.io.compress.SnappyCodec     .snappy      Yes
Snappy: this codec is not enabled by default in configuration.

A codec that supports splittable compression and decompression allows HDInsight to decompress the
data in parallel across multiple mapper and node instances, which typically provides better
performance. However, splittable codecs are less efficient at runtime, so there is a trade-off in efficiency
between the two types.
There is also a difference in the size reduction (compression ratio) that each codec can achieve. For the
same data, BZip2 tends to produce a smaller file than GZip but takes longer to perform the
decompression. The Snappy codec works best with container data formats such as Sequence Files or
Avro Data Files. It is fast and typically provides a good compression ratio.
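Because the size reduction depends on the data, it can be worth measuring it on a representative
sample before choosing a codec. The following sketch compares the standard library
implementations of DEFLATE, GZip, and BZip2; Snappy is omitted because it is not part of the
Python standard library. The sample data is hypothetical, and the results are indicative only
because they are not produced by the Hadoop codecs themselves.

```python
import bz2
import gzip
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of the data before and after each compression format."""
    return {
        "original": len(data),
        "deflate": len(zlib.compress(data)),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


sample = b"2014-06-01,GET,/index.html,200\n" * 5000  # hypothetical log data
for name, size in compressed_sizes(sample).items():
    print(f"{name:>8}: {size} bytes")
```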
Considerations for serialization and compression
Consider the following points when you are deciding whether to compress the source data:

Compression may not produce any improvement in performance with small files. However, with
very large files (for example, files over 100 GB) compression is likely to provide dramatic
improvement. The gains in performance also depend on the contents of the file and the level of
compression that was achieved.

When optimizing a job, enable compression within the process using the configuration settings
to compress the output of the mappers and the reducers before experimenting with
compression of the source data. Compression within the job stages often provides a more
substantial gain in performance compared to compressing the source data.

Consider using a splittable algorithm for very large files so that they can be decompressed in
parallel by multiple tasks.

Ensure that the format you choose is compatible with the processing tools you intend to use.
For example, ensure the format is compatible with Hive and Pig if you intend to use these to
query your data.

Use the default file extension for the files if possible. This allows HDInsight to detect the file
type and automatically apply the correct decompression algorithm. If you use a different file
extension you must set the io.compression.codec property for the job to indicate the codec
used.

If you are serializing the source data using Avro, you can apply a codec to the process so that
the serialized data is also compressed.

For more information about using the compression codecs in Hadoop see the documentation for the
CompressionCodecFactory and CompressionCodec classes on the Apache website.
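
The role of the default file extension can be illustrated with a simple lookup that mirrors the
idea of choosing a codec from the file name. This is a sketch only and not the Hadoop
implementation: the class names and extensions are those listed in the codecs table in this
topic, and the function name is an assumption.

```python
# Class names and extensions as listed in the codecs table in this topic.
CODECS_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_for(path):
    """Return the codec class name implied by the file extension, or None."""
    for extension, codec in CODECS_BY_EXTENSION.items():
        if path.endswith(extension):
            return codec
    return None  # unknown extension: the file is treated as uncompressed


print(codec_for("logs/2014-06.log.gz"))
print(codec_for("logs/2014-06.log"))
```

A file whose name does not end in a recognized extension yields no codec, which is why a
non-standard extension requires the codec to be indicated explicitly for the job.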

Choosing tools and technologies


Choosing an appropriate tool or technology for uploading data can make the process more efficient,
scalable, secure, and reliable. You can use scripts, tools, or custom utilities to perform the upload,
perhaps driven by a scheduler such as Windows Scheduled Tasks. In addition, if you intend to automate
the upload process, or the entire end-to-end solution, ensure that the tools you choose can be
instantiated and controlled as part of this process.
Several tools are specific to the types of data they handle. For example, there are tools designed to
handle the upload of relational data and server log files. The following sections explore some of these
scenarios. A list of popular tools is included in Appendix A - Tools and technologies reference. For
information on creating your own tools, see Custom data upload clients.

Interactive data ingestion


An initial exploration of data with a big data solution often involves experimenting with a relatively small
volume of data. In this case there is no requirement for a complex, automated data ingestion solution,
and the tasks of obtaining the source data, preparing it for processing, and uploading it to cluster
storage can be performed interactively.
You can upload the source data interactively using:

A UI-based tool such as CloudBerry Explorer, Microsoft Azure Storage Explorer, or Server
Explorer in Visual Studio. For a useful list of third party tools for uploading data to HDInsight
interactively see Upload data for Hadoop jobs in HDInsight on the Azure website.

PowerShell commands that take advantage of the PowerShell cmdlets for Azure. This capability
is useful if you are just experimenting or working on a proof of concept.

The hadoop dfs -copyFromLocal [source] [destination] command at the Hadoop command line
using a remote desktop connection.

A command line tool such as AzCopy if you need to upload large files.

Consider how you will handle very large volumes of data. While small volumes of data can be copied
into storage interactively, you will need to choose or build a more robust mechanism capable of
handling large files when you move beyond the experimentation stage.

Handling streaming data


Solutions that perform processing of streaming data, such as that arriving from device sensors or web
clickstreams, must be able to either completely process the data on an event-by-event basis in real-time,
or capture the events in a persistent or semi-persistent store so that they can be processed in batches.
For small batches (sometimes referred to as micro-batch processing), the events might be captured in
memory before being uploaded as small batches at short intervals to provide near real-time results. For
larger batches, the data is likely to be stored in a persistent mechanism such as a database, disk file, or
cloud storage.
However, some solutions may combine these two approaches by capturing events that are both
redirected to a real-time visualization solution and stored for submission in batches to the big data
solution for semi-real-time or historic analysis. This approach is also useful if you need to quickly detect
some specific events in the data stream, but store the rest for batch analysis. For example, financial
organizations may use real-time data processing to detect fraud or non-standard trading events, but also
retain the data to predict patterns and future trends.
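The micro-batch approach described above, in which events are held in memory and uploaded as
small batches at short intervals, might be sketched as follows. The class name, the thresholds,
and the flush callback are assumptions; a production solution would also need to flush on a
timer and on shutdown so that a quiet stream does not leave events stranded in memory.

```python
import time


class MicroBatcher:
    """Buffer events in memory and pass them on as small batches."""

    def __init__(self, flush, max_events=100, max_age_seconds=5.0, clock=time.monotonic):
        self._flush = flush            # hypothetical callback that uploads one batch
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._events = []
        self._started = None

    def add(self, event):
        if not self._events:
            self._started = self._clock()  # the age is measured from the first event
        self._events.append(event)
        too_many = len(self._events) >= self._max_events
        too_old = self._clock() - self._started >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._events:
            batch, self._events = self._events, []
            self._flush(batch)


batches = []
batcher = MicroBatcher(batches.append, max_events=3)
for reading in range(7):
    batcher.add(reading)
batcher.flush()  # push out whatever remains in the buffer
print(batches)
```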
The choice of tools and technologies depends on the platform and the type of processing you need to
accomplish. Typical tools to capture or process stream data are:

Microsoft StreamInsight. This is a complex event processing (CEP) engine with a framework API
for building applications that consume and process event streams. It can be run on-premises or
in a virtual machine. For more information about developing StreamInsight applications, see
Microsoft StreamInsight on MSDN.

Apache Storm. This is an open-source framework that can run on a Hadoop cluster to capture
streaming data. It uses other Hadoop-related technologies such as Zookeeper to manage the
data ingestion process. See the section Overview of Storm in the topic Data processing tools
and techniques and Apache Storm on the Hortonworks website for more information.

Other open source frameworks such as Kafka and Samza. These frameworks provide
capabilities to capture streaming data and process it in real time, including persisting the data
or messages as files for batch processing when required.

A custom event or stream capture solution that feeds the data into the cluster data store in real
time or in batches. The interval should be based on the frequency with which related query jobs will be
instantiated. You could use the Reactive Extensions (Rx) library to implement a real-time stream
capture utility.

For more information see Appendix A - Tools and technologies reference.

Loading relational data


Source data for analysis is often obtained from a relational database. It is quite common to use HDInsight
to query data extracted from a relational database. For example, you may use it to search for hidden
information in data exported from your corporate database or data warehouse as part of a sandboxed
experiment, without absorbing resources from the database or risking corruption of the existing data.
You can use Sqoop to extract the data you require from a table, view, or query in the source database
and save the results as a file in your cluster storage. In HDInsight this approach makes it easy to transfer
data from a relational database when your infrastructure supports direct connectivity from the
HDInsight cluster to the database server, such as when your database is hosted in Azure SQL Database
or a virtual machine in Azure.
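As an illustration, an automation script might assemble the Sqoop import command from parameters
instead of embedding it in a batch file. The sketch below only builds the argument list and does
not run Sqoop. The server, database, table, and path names are hypothetical, and it assumes a
database hosted in Azure SQL Database that is reached through the SQL Server JDBC driver.

```python
def sqoop_import_command(server, database, username, password_file, table, target_dir, mappers=4):
    """Build the argument list for a Sqoop import from Azure SQL Database."""
    connect = f"jdbc:sqlserver://{server}.database.windows.net:1433;database={database}"
    return [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


# Hypothetical values, for illustration only.
command = sqoop_import_command(
    "exampleserver", "Sales", "loader@exampleserver",
    "/user/loader/sql.password", "Orders", "/data/sales/orders",
)
print(" ".join(command))
```

Returning a list rather than a single string allows the arguments to be passed directly to a
process launcher without shell quoting problems.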
Some business intelligence (BI) and data warehouse implementations contain interfaces that support
connectivity to a big data cluster. For example, Microsoft Analytics Platform System (MAPS) contains
PolyBase, which exposes a SQL-based interface for accessing data stored in Hadoop and HDInsight.

Loading web server log files


Analysis of web server log data is a common use case for big data solutions, and requires log files to be
uploaded to the cluster storage. Flume is an open source project for an agent-based framework to copy
log files to a central location, and is often used to load web server logs to HDFS for processing in
Hadoop.
Flume was not included in HDInsight when this guide was written, but can be downloaded from the
Flume project website. See the blog post Using Apache Flume with HDInsight.

As an alternative to using Flume, you can use SSIS to implement an automated batch upload solution.
For more details of using SQL Server Integration Services (SSIS) see Scenario 4: BI integration and
Appendix A - Tools and technologies reference.

Building custom clients


You can use existing applications, utilities, and third party tools to manage clusters and storage
accounts, and to upload data to your cluster storage. Many of these are listed in Appendix A - Tools and
technologies reference. However, you may find that building your own custom tools and utilities, even if
only scripts that you execute on demand or as part of a scheduled process, is a useful approach that can
minimize operator errors and provide a standardized process.
Creating or adopting automated mechanisms can also help to make a solution more secure because you
can assign specific permissions to each operation or tool, control access to data by allowing it to be read
only through a specific tool, and hide sensitive configuration settings (such as keys and credentials) from
users.

Automating data upload


You can upload data to the storage accounts that HDInsight uses before or after you have initialized a
cluster. While there are many tools available that you can use to upload data files to Azure storage, it is
common in many big data processing scenarios to implement custom data loading code in a client
application or utility. In some cases this code may take the form of a script or simple command line
utility that simplifies the upload of data for a repeatable data processing task. In others, the code may
be used to integrate big data processing into a business application or solution. Techniques for
automating data upload are described in Custom data upload clients.

Automating cluster management


Before you can use HDInsight to process data you have uploaded, you must provision an HDInsight
cluster. In environments where big data processing is a constant activity, you might choose to do this
once and leave the cluster running. However, if data is processed only periodically you can reduce
operational costs by provisioning the cluster just when you need it, and deleting it when each batch of
data processing tasks is complete. Techniques for automating cluster management are described in
Custom cluster management clients.
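
The provision-on-demand pattern described above can be sketched as a wrapper that always deletes
the cluster when processing finishes, even if a job fails. This is an illustration only: the
client object and its create_cluster and delete_cluster methods are hypothetical stand-ins and
not part of any real SDK, and RecordingClient exists solely to make the example runnable.

```python
from contextlib import contextmanager


class RecordingClient:
    """Stand-in for a real management client; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def create_cluster(self, name, **settings):
        self.calls.append(("create", name))
        return {"name": name, **settings}

    def delete_cluster(self, name):
        self.calls.append(("delete", name))


@contextmanager
def temporary_cluster(client, name, **settings):
    """Provision a cluster, hand it to the caller, and always delete it afterwards."""
    cluster = client.create_cluster(name, **settings)
    try:
        yield cluster
    finally:
        client.delete_cluster(name)  # runs even when a processing job raises an error


client = RecordingClient()
with temporary_cluster(client, "nightly-batch", nodes=4) as cluster:
    client.calls.append(("run-jobs", cluster["name"]))  # submit processing jobs here
print(client.calls)
```

Deleting the cluster in the cleanup step avoids paying for a cluster that is left running after
a failed batch.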

Custom data upload clients


The following topics demonstrate how you can use PowerShell and the .NET SDKs to upload and to
serialize data:

Uploading data with Windows PowerShell

Uploading data with the Microsoft .NET Framework

Uploading data with the Azure Storage SDK

Serializing data with the Microsoft .NET Library for Avro

You can also use the AzCopy utility in scripts to automate uploading data to HDInsight. For more details
see AzCopy Uploading/Downloading files for Windows Azure Blobs on the Azure storage team blog. In
addition, a library called Casablanca can be used to access Azure storage from native C++ code. For more
details see Announcing Casablanca, a Native Library to Access the Cloud From C++.
Considerations
Consider the following factors when designing your automated data ingestion processes:

Consider how much effort is required to create an automated upload solution, and balance this
with the advantages it provides. If you are simply experimenting with data in an iterative
scenario, you may not need an automated solution. Creating automated processes to upload
data is probably worthwhile only when you will repeat the operation on a regular basis, or when
you need to integrate big data processing into a business application.

When creating custom tools or scripts to upload data to a cluster, consider including the ability
to accept command-line parameters so that the tools can be used in a range of automation
processes.

Consider how you will protect the data, the cluster, and the solution as a whole from
inappropriate use of custom upload tools and applications. It may be possible to set permissions
on tools, files, folders, and other resources to restrict access to only authorized users.

PowerShell is a good solution for uploading data files in scenarios where users are exploring
data iteratively and need a simple, repeatable way to upload source data for processing. You
can also use PowerShell as part of an automated processing solution in which data is uploaded
automatically by a scheduled operating system task or SQL Server Integration Services package.

.NET Framework code that uses the .NET SDK for HDInsight can be used to upload data for
processing by HDInsight jobs. This may be a better choice than using PowerShell for large
volumes of data.

In addition to the HDInsight-specific APIs for uploading data to the cluster, the more general
Azure Storage API offers greater flexibility by allowing you to upload data directly to Azure blob
storage as files, or write data directly to blobs in an Azure blob storage container. This enables
you to build client applications that capture real-time data and write it directly to a blob for
processing in HDInsight without first storing the data in local files.

Other tools and frameworks are available that can help you to build data ingestion mechanisms.
For example, Falcon provides an automatable system for data replication, data lifecycle
management (such as data eviction), data lineage and tracing, and process coordination and
scheduling based on a declarative programming model.
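
As a minimal sketch of the advice above about command-line parameters, the upload logic could be wrapped in a function with a param block so that the same code can be called from a scheduled task, an SSIS package, or an interactive session. The function names Send-DataFolder and Get-BlobName and all of the parameter names are hypothetical; the storage cmdlets are the ones from the Azure PowerShell module used elsewhere in this section, and the sketch assumes that module is installed and connected to your subscription.

```powershell
# Hypothetical helper: build a blob name from a destination folder and a file name.
function Get-BlobName {
    param(
        [string]$DestFolder,
        [Parameter(Mandatory = $true)][string]$FileName
    )
    # Remove leading and trailing slashes so the result never contains "//".
    $folder = $DestFolder.Trim('/')
    if ($folder) { "$folder/$FileName" } else { $FileName }
}

# Hypothetical wrapper: upload every file in a local folder to a blob container.
function Send-DataFolder {
    param(
        [Parameter(Mandatory = $true)][string]$StorageAccountName,
        [Parameter(Mandatory = $true)][string]$ContainerName,
        [string]$LocalFolder = ".\data",
        [string]$DestFolder = "data"
    )
    $key = (Get-AzureStorageKey -StorageAccountName $StorageAccountName).Primary
    $context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $key
    # Skip subfolders; only files are uploaded.
    foreach ($file in Get-ChildItem -Path $LocalFolder | Where-Object { -not $_.PSIsContainer }) {
        $blobName = Get-BlobName -DestFolder $DestFolder -FileName $file.Name
        Set-AzureStorageBlobContent -File $file.FullName -Container $ContainerName -Blob $blobName -Context $context -Force
    }
}
```

A caller would then supply only the values that differ between runs, for example Send-DataFolder -StorageAccountName "storage-account-name" -ContainerName "container-name".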

More information
For information about creating end-to-end automated solutions that include automated upload stages,
see Building end-to-end solutions using HDInsight.
For more details of the tools and technologies available for automating upload processes see Appendix
A - Tools and technologies reference.
For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference
Documentation.
For information on using the HDInsight SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.
Uploading data with Windows PowerShell
The Azure module for Windows PowerShell includes a range of cmdlets that you can use to work with
Azure services programmatically, including Azure storage. You can run PowerShell scripts interactively in
a Windows command line window or in a PowerShell-specific command line console. Additionally, you
can edit and run PowerShell scripts in the Windows PowerShell Interactive Scripting Environment (ISE),
which provides IntelliSense and other user interface enhancements that make it easier to write
PowerShell code. You can schedule the execution of PowerShell scripts using Windows Scheduler, SQL
Server Agent, or other tools as described in Building end-to-end solutions using HDInsight.
Before you use PowerShell to work with HDInsight you must configure the PowerShell environment to
connect to your Azure subscription. To do this you must first download and install the Azure PowerShell
module, which is available through the Microsoft Web Platform Installer. For more details see How to
install and configure Azure PowerShell.
To upload data files to the Azure blob store, you can use the Set-AzureStorageBlobContent cmdlet, as
shown in the following code example.
Windows PowerShell
# Azure subscription-specific variables.
$storageAccountName = "storage-account-name"
$containerName = "container-name"

# Find the parent of the current location, which contains the data subfolder.
$currentLocation = Get-Location
$thisfolder = Split-Path -Parent $currentLocation

# Upload files in data subfolder to Azure.
$localFolder = "$thisfolder\data"
$destfolder = "data"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach($file in $files)
{
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"

Note that the code uses the New-AzureStorageContext cmdlet to create a context for the Azure storage
account where the files are to be uploaded. This context requires the access key for the storage account,
which is obtained using the Get-AzureStorageKey cmdlet. Authentication to obtain the key is based on
the credentials or certificate used to connect the local PowerShell environment with the Azure
subscription.
The code shown above also iterates over all of the files to be uploaded and uses the
Set-AzureStorageBlobContent cmdlet to upload each one in turn. It does this in order to store each one in a
specific path that includes the destination folder name. If all of the files you need to upload are in a
folder structure that is the same as the required target paths, you could use the following code to
upload all of the files in one operation instead of iterating over them in your PowerShell script.
Windows PowerShell
cd [root-data-folder]
ls -Recurse -Path $localFolder | Set-AzureStorageBlobContent -Container $containerName -Context $blobContext

Uploading data with the Microsoft .NET Framework


The .NET API for Hadoop WebClient is a component of the .NET SDK for HDInsight that you can add to a
project using NuGet. The library includes a range of classes that enable integration with HDInsight and
the Azure blob store that hosts the HDFS folder structure that HDInsight uses.
One of these classes is the WebHDFSClient class, which you can use to upload local files to Azure
storage for processing by HDInsight. The WebHDFSClient class enables you to treat blob storage like an
HDFS volume, navigating blobs within an Azure blob container as if they are directories and files.
The following code example shows how you can use the WebHDFSClient class to upload locally stored
data files to Azure blob storage. The example is deliberately kept simple by including the credentials in
the code so that you can copy and paste it while you are experimenting with HDInsight. In a production
system you must protect credentials, as described in Securing credentials in scripts and applications in
the Security section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;

namespace DataUploader
{
  class Program
  {
    static void Main(string[] args)
    {
      UploadFiles().Wait();
      Console.WriteLine("Upload complete!");
      Console.WriteLine("Press a key to end");
      Console.Read();
    }

    private static async Task UploadFiles()
    {
      var localDir = new DirectoryInfo(@".\data");
      var hdInsightUser = "user-name";
      var storageName = "storage-account-name";
      var storageKey = "storage-account-key";
      var containerName = "container-name";
      var blobDir = "/data/";
      var hdfsClient = new WebHDFSClient(hdInsightUser,
        new BlobStorageAdapter(storageName, storageKey, containerName, false));

      // Remove any existing blobs in the target path before uploading.
      await hdfsClient.DeleteDirectory(blobDir);
      foreach (var file in localDir.GetFiles())
      {
        Console.WriteLine("Uploading " + file.Name + " to " + blobDir + file.Name + " ...");
        await hdfsClient.CreateFile(file.FullName, blobDir + file.Name);
      }
    }
  }
}

Note that the code uses the DeleteDirectory method to delete all existing blobs in the specified path,
and then uses the CreateFile method to upload each file in the local data folder. All of the methods
provided by the WebHDFSClient class are asynchronous, enabling you to upload large volumes of data
to Azure without blocking the client application.
Uploading data with the Azure Storage SDK
The .NET Azure Storage Client, part of the Azure Storage library available from NuGet, offers a flexible
mechanism for uploading data to the Azure blob store as files or writing data directly to blobs in an
Azure blob storage container, including writing streams of data directly to Azure storage without first
storing the data in local files.
The following code example shows how you can use the .NET Azure Storage Client to write data in a
stream directly to a blob in Azure storage. The example is deliberately kept simple by including the
credentials in the code so that you can copy and paste it while you are experimenting with HDInsight. In
a production system you must protect credentials, as described in Securing credentials in scripts and
applications in the Security section of this guide.
C#
using System;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;

namespace AzureBlobClient
{
  class Program
  {
    const string AZURE_STORE_CONN_STR = "DefaultEndpointsProtocol=https;"
      + "AccountName=storage-account-name;AccountKey=storage-account-key";

    static void Main(string[] args)
    {
      Stream Observations = GetData();
      CloudStorageAccount storageAccount = CloudStorageAccount.Parse(AZURE_STORE_CONN_STR);
      CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
      CloudBlobContainer container = blobClient.GetContainerReference("container-name");
      var blob = container.GetBlockBlobReference("data/weather.txt");

      // Write the stream directly to the blob without creating a local file.
      blob.UploadFromStreamAsync(Observations).Wait();
      Console.WriteLine("Data Uploaded!");
      Console.WriteLine("Press a key to end");
      Console.Read();
    }

    static Stream GetData()
    {
      // Placeholder: replace with code to retrieve data as a stream.
      throw new NotImplementedException();
    }
  }
}

For information about the features of the Azure Storage Client libraries, see What's new for Microsoft
Azure Storage at TechEd 2014.
Serializing data with the Microsoft .NET Library for Avro
The .NET Library for Avro is a component of the .NET SDK for HDInsight that you can use to serialize and
deserialize data using the Avro serialization format. Avro enables you to include schema metadata in a
data file, and is widely used in Hadoop (including in HDInsight) as a language-neutral means of
exchanging complex data structures between operations.
For example, consider a weather monitoring application that records meteorological observations. In
the application, each observation can be represented as an object with properties that contain the
specific data values for the observation. These properties might be simple values such as the date, the
time, the wind speed, and the temperature. However, some values might be complex structures such as
the geo-coded location of the monitoring station, which contains longitude and latitude coordinates.
The following code example shows how a list of weather observations in this complex data structure can
be serialized in Avro format and uploaded to Azure storage. The example is deliberately kept simple by
including the credentials in the code so that you can copy and paste it while you are experimenting with
HDInsight. In a production system you must protect credentials, as described in Securing credentials in
scripts and applications in the Security section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Runtime.Serialization;
using System.Configuration;
using Microsoft.Hadoop.Avro.Container;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;

namespace AvroClient
{
  // Class representing a weather observation.
  [DataContract(Name = "Observation", Namespace = "WeatherData")]
  internal class Observation
  {
    [DataMember(Name = "obs_date")]
    public DateTime Date { get; set; }
    [DataMember(Name = "obs_time")]
    public string Time { get; set; }
    [DataMember(Name = "obs_location")]
    public GeoLocation Location { get; set; }
    [DataMember(Name = "wind_speed")]
    public double WindSpeed { get; set; }
    [DataMember(Name = "temperature")]
    public double Temperature { get; set; }
  }

  // Struct for geo-location coordinates.
  [DataContract]
  internal struct GeoLocation
  {
    [DataMember]
    public double lat { get; set; }
    [DataMember]
    public double lon { get; set; }
  }

  class Program
  {
    static void Main(string[] args)
    {
      // Get a list of Observation objects.
      List<Observation> Observations = GetData();

      // Serialize Observation objects to a file in Avro format.
      string fileName = "observations.avro";
      string filePath = new DirectoryInfo(".") + @"\" + fileName;
      using (var dataStream = new FileStream(filePath, FileMode.Create))
      {
        // Compress the data using the Deflate codec.
        using (var avroWriter = AvroContainer.CreateWriter<Observation>(dataStream, Codec.Deflate))
        {
          using (var seqWriter = new SequentialWriter<Observation>(avroWriter, 24))
          {
            // Serialize the data to stream using the sequential writer.
            Observations.ForEach(seqWriter.Write);
          }
        }
      }

      // Upload the serialized data (the file stream is closed at this point).
      var hdInsightUser = "user-name";
      var storageName = "storage-account-name";
      var storageKey = "storage-account-key";
      var containerName = "container-name";
      var destFolder = "/data/";
      var hdfsClient = new WebHDFSClient(hdInsightUser,
        new BlobStorageAdapter(storageName, storageKey, containerName, false));
      hdfsClient.CreateFile(filePath, destFolder + fileName).Wait();
      Console.WriteLine("The data has been uploaded in Avro format");
      Console.WriteLine("Press a key to end");
      Console.Read();
    }

    static List<Observation> GetData()
    {
      List<Observation> Observations = new List<Observation>();
      // Code to capture a list of Observation objects.
      return Observations;
    }
  }
}

The class Observation used to represent a weather observation, and the struct GeoLocation used to
represent a geographical location, include metadata to describe the schema. This schema information is
included in the serialized file that is uploaded to Azure storage, enabling an HDInsight process such as a
Pig job to deserialize the data into an appropriate data structure. Notice also that the data is
compressed using the Deflate codec as it is serialized, reducing the size of the file to be uploaded.

Custom cluster management clients


Options for provisioning (creating) and deleting an HDInsight cluster include:

Using the Azure management portal to create and delete the cluster interactively. For more
information see Manage Hadoop clusters in HDInsight using the Azure Management Portal.

Using Windows PowerShell scripts to automate provisioning and deletion of clusters. For more
information see Automating cluster management with PowerShell.

Using the SDK for HDInsight to integrate cluster management into a .NET Framework
application. For more information see Automating cluster management in a .NET application.

The correct approach to cluster provisioning depends on the specific business requirements and
constraints, but the following table describes typical approaches in relation to the common big data use
cases and models discussed in this guide.
Use case: Considerations

Iterative data exploration: Creating and deleting the cluster manually when required through the Azure
management portal may be acceptable for data exploration scenarios where data processing and analysis
is performed interactively on an occasional basis by a dedicated team of data analysts. However, if the
analysis is more frequent the analysts might benefit from creating a simple script or command line
utility to automate the process of creating and deleting the cluster.

Data warehouse on demand: Data warehouses built on HDInsight are usually based on Hive tables, and the
cluster must be running to service Hive queries. If the data warehouse is queried directly by users and
applications, you may need to keep the cluster running continually. However, if the data warehouse is
used only as a data source for analytical data models (for example, in SQL Server Analysis Services or
PowerPivot workbooks) or for cached reports you can create the cluster on demand to enable new data to
be processed, refresh the dependent data models and reports, and then delete the cluster.

ETL automation: When HDInsight is used to filter and shape data in an ETL process, the destination of
the transformed data is usually another data store such as a SQL Server database. Depending on the
frequency of the ETL cycle, you may choose to include provisioning and deletion of the cluster in the
ETL process itself. In this case, cluster creation and deletion are likely to be automated along with
data ingestion, job execution, and the data transfer tasks of the ETL workflow.

BI integration: In a managed BI solution, where HDInsight is used primarily as a means of preparing
big data for inclusion in an existing enterprise BI data warehouse or data models, the cluster
provisioning requirements are likely to be similar to those of the data warehouse on demand and ETL
automation models. If the HDInsight cluster must support self-service BI that includes direct big data
processing by business users, you may need to consider keeping the cluster online continually.

Considerations
When planning how you will create a cluster for your solution, also consider the following points:

As part of the cluster provision process you may also need to create or manage storage
accounts. Often you will do this only once, and use the storage account each time you run your
automated solution. For more information see Cluster and storage initialization.

You should set all the properties for your cluster when you create it, using the techniques
described in this section of the guide. This ensures that the configuration is fixed in the cluster
definition, and will be reapplied to any virtual servers that make up the cluster if they are
automatically restarted after a failure or an upgrade. Virtual server management within the
datacenter may occur at any time, and you cannot control this. If you edit the configuration files
directly, any changes will be lost when a server restarts. However, you can change some cluster
properties for individual jobs; see Configuring and debugging solutions for details.

Be careful how and when you delete a cluster as part of an automated solution. You may need
to implement a task that backs up the data and/or the metadata first. Ensure tools that allow
users to delete clusters perform user authentication and authorization to protect against
accidental and malicious use.

More information
For information about creating end-to-end automated solutions that include automated cluster
management stages, see Building end-to-end solutions using HDInsight.
For more details of the tools and technologies available for automating cluster management see
Appendix A - Tools and technologies reference.
For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference
Documentation.
For information on using the HDInsight SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.
The topic Provision HDInsight clusters on the Azure website shows several ways that you can provision a
cluster.
Automating cluster management with PowerShell
You can use Windows PowerShell to create an HDInsight cluster by executing PowerShell commands
interactively, or by creating a PowerShell script that can be executed when required.
Before you use PowerShell to work with HDInsight you must configure the PowerShell environment to
connect to your Azure subscription. To do this you must first download and install the Azure PowerShell
module, which is available through the Microsoft Web Platform Installer. For more details see How to
install and configure Azure PowerShell.
Creating a cluster with the default configuration
When using PowerShell to create an HDInsight cluster, you use the New-AzureHDInsightCluster cmdlet
and specify the following configuration settings to create a cluster with the default settings for Hadoop
services:

A globally unique name for the cluster.

The geographical region where you want to create the cluster.

The Azure storage account to be used by the cluster.

The access key for the storage account.

The blob container in the storage account to be used by the cluster.

The number of data nodes to be created in the cluster.

The credentials to be used for administrative access to the cluster.

The version of HDInsight to be used.

If you do not intend to use an existing Azure storage account, you can create a new one with a globally
unique name using the New-AzureStorageAccount cmdlet, and then create a new blob container with
the New-AzureStorageContainer cmdlet. Many Azure services require a globally unique name. You can
determine if a specific name is already in use by an Azure service by using the Test-AzureName cmdlet.
The following code example creates an Azure storage account and an HDInsight cluster in the Southeast
Asia region (note that each command should be on a single, unbroken line). The example is deliberately
kept simple by including the credentials in the script so that you can copy and paste the code while you
are experimenting with HDInsight. In a production system you must protect credentials, as described in
Securing credentials in scripts and applications in the Security section of this guide.
Windows PowerShell
$storageAccountName = "unique-storage-account-name"
$containerName = "container-name"
$clusterName = "unique-cluster-name"
$userName = "user-name"
$password = ConvertTo-SecureString "password" -AsPlainText -Force
$location = "Southeast Asia"
$clusterNodes = 4

# Create a storage account.
Write-Host "Creating storage account..."
New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

# Create a Blob storage container.
Write-Host "Creating container..."
New-AzureStorageContainer -Name $containerName -Context $destContext

# Create a cluster.
Write-Host "Creating HDInsight cluster..."
$credential = New-Object System.Management.Automation.PSCredential ($userName, $password)
New-AzureHDInsightCluster -Name $clusterName -Location $location -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainerName $containerName -ClusterSizeInNodes $clusterNodes -Credential $credential -Version 3.0
Write-Host "Finished!"

Notice that this script uses the ConvertTo-SecureString cmdlet to encrypt the password in memory.
The password and the user name are passed to the New-Object cmdlet to create a PSCredential object
for the cluster credentials. Notice also that the access key for the storage account is obtained using the
Get-AzureStorageKey cmdlet.
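
Because the storage account name must be globally unique, a script might check the name with Test-AzureName before it attempts to create the account. The following is a minimal sketch of that check. The function name New-StorageAccountIfAvailable is hypothetical, and the sketch assumes that Test-AzureName with the -Storage switch returns $true when the name is already in use, and that the Azure PowerShell module is installed and connected to your subscription.

```powershell
# Hypothetical helper: create the storage account only if the name is not already taken.
function New-StorageAccountIfAvailable {
    param(
        [Parameter(Mandatory = $true)][string]$StorageAccountName,
        [Parameter(Mandatory = $true)][string]$Location
    )
    # Test-AzureName returns $true when the name is already in use.
    if (Test-AzureName -Storage $StorageAccountName) {
        Write-Warning "The storage account name '$StorageAccountName' is already in use."
        return $false
    }
    # Discard the cmdlet output so that the function returns only a Boolean.
    New-AzureStorageAccount -StorageAccountName $StorageAccountName -Location $Location | Out-Null
    return $true
}
```

A provisioning script could call this function first and stop, or prompt for a different name, when it returns $false.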

Creating a cluster with a customized configuration


The previous example creates a new HDInsight cluster with default configuration settings. If you require
a more customized cluster configuration, you can use the New-AzureHDInsightClusterConfig cmdlet to
create a base configuration for a cluster with a specified number of nodes. You can then use the
following cmdlets to define the settings you want to apply to your cluster:

Set-AzureHDInsightDefaultStorage: Specify the storage account and blob container to be used
by the cluster.

Add-AzureHDInsightStorage: Specify an additional storage account that the cluster can use.

Add-AzureHDInsightMetastore: Specify a custom Azure SQL Database instance to host Hive and
Oozie metadata.

Add-AzureHDInsightConfigValues: Add specific configuration settings for HDFS, map/reduce,
Hive, Oozie, or other Hadoop technologies in the cluster.

After you have added the required configuration settings, you can pass the cluster configuration variable
returned by New-AzureHDInsightClusterConfig to the New-AzureHDInsightCluster cmdlet to create the
cluster.
You can also specify a folder to store shared libraries and upload these so that they are available for use
in HDInsight jobs. Examples include UDFs for Hive and Pig, or custom SerDe components for use in Avro.
For more information see the section Create cluster with custom Hadoop configuration values and
shared libraries in the topic Microsoft .NET SDK For Hadoop on the CodePlex website.
For more information about using PowerShell to manage an HDInsight cluster see the HDInsight
PowerShell Cmdlets Reference Documentation.
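
The configuration cmdlets described in this topic can be chained in a pipeline. The following is a minimal sketch of that pattern rather than a definitive script. It assumes that the variables from the default configuration example ($storageAccountName, $storageAccountKey, $containerName, $clusterName, $location, $clusterNodes, and $credential) are already set. The additional storage account name and key, and the io.file.buffer.size value, are hypothetical placeholders, and the parameter names should be verified against the HDInsight PowerShell Cmdlets Reference Documentation for the module version you have installed.

```powershell
# Hypothetical values for an additional storage account the cluster can read from.
$extraAccountName = "additional-storage-account-name"
$extraAccountKey = "additional-storage-account-key"

# Hypothetical core-site setting, supplied as a hashtable of key/value pairs.
$coreSettings = @{ "io.file.buffer.size" = "131072" }

# Build the configuration step by step, starting from a base configuration.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "$storageAccountName.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $containerName |
    Add-AzureHDInsightStorage -StorageAccountName "$extraAccountName.blob.core.windows.net" -StorageAccountKey $extraAccountKey |
    Add-AzureHDInsightConfigValues -Core $coreSettings

# Pass the completed configuration to New-AzureHDInsightCluster to create the cluster.
$config | New-AzureHDInsightCluster -Name $clusterName -Location $location -Credential $credential
```

Keeping the configuration in a variable makes it possible to inspect it, or to reuse it for several clusters, before any resources are created.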
Deleting a cluster
When you have finished with the cluster you can use the Remove-AzureHDInsightCluster cmdlet to
delete it. If you are also finished with the storage account, you can delete it after the cluster has been
deleted by using the Remove-AzureStorageAccount cmdlet.
The following code example shows a PowerShell script to delete an HDInsight cluster and the storage
account it was using.
Windows PowerShell
$storageAccountName = "storage-account-name"
$clusterName = "cluster-name"
# Delete HDInsight cluster.
Write-Host "Deleting $clusterName HDInsight cluster..."
Remove-AzureHDInsightCluster -Name $clusterName
# Delete storage account.
Write-Host "Deleting $storageAccountName storage account..."

Remove-AzureStorageAccount -StorageAccountName $storageAccountName
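
When a cluster is created only for the duration of a batch of processing tasks, it is worth making sure that the deletion step still runs if a processing step fails, so that the cluster is not left running and incurring cost. The following is a minimal sketch of one way to do this with try/finally. The function name Invoke-WithTransientCluster and its parameters are hypothetical, and the cleanup step runs only for terminating errors (for example, those raised with throw or with -ErrorAction Stop).

```powershell
# Hypothetical wrapper: provision resources, run the work, and always clean up afterwards.
function Invoke-WithTransientCluster {
    param(
        [Parameter(Mandatory = $true)][scriptblock]$Provision,
        [Parameter(Mandatory = $true)][scriptblock]$Work,
        [Parameter(Mandatory = $true)][scriptblock]$Cleanup
    )
    # If provisioning throws, nothing was created, so no cleanup is attempted.
    & $Provision
    try {
        & $Work
    }
    finally {
        # Runs whether $Work completes or throws a terminating error.
        & $Cleanup
    }
}
```

For example, the -Provision script block could wrap the New-AzureHDInsightCluster call shown earlier, -Work could upload data and submit jobs, and -Cleanup could call Remove-AzureHDInsightCluster -Name $clusterName. Any backup of data or metadata that must happen before deletion belongs at the start of the cleanup block.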

Automating cluster management in a .NET application


When you need to integrate cluster management into an application or service, you can use the .NET
SDK for HDInsight to provision and delete clusters as required. Adding the Microsoft Azure HDInsight
NuGet package to a project makes classes and interfaces in the
Microsoft.WindowsAzure.Management.HDInsight namespace available, and you can use these to
provision and manage HDInsight clusters.
Many of the techniques used to initiate jobs from a .NET application require the use of an Azure
management certificate to authenticate the request. To obtain a certificate you can:

Use the makecert command in a Visual Studio command line to create a certificate and upload
it to your subscription in the Azure management portal as described in Create and Upload a
Management Certificate for Azure.

Use the Get-AzurePublishSettingsFile and Import-AzurePublishSettingsFile Windows
PowerShell cmdlets to generate, download, and install a new certificate from your Azure
subscription as described in the section How to: Connect to your subscription of the topic How
to install and configure Azure PowerShell. If you want to use the same certificate on more than
one client computer you can copy the Azure publishsettings file to each one and use the
Import-AzurePublishSettingsFile cmdlet to import it.

After you have created and installed your certificate, it will be stored in the Personal certificate store on
your computer. You can view the details by using the certmgr.msc console.
To create a cluster programmatically, you must create an instance of the ClusterCreateParameters class,
specifying the following information:

A globally unique name for the cluster.

The geographical region where you want to create the cluster.

The default Azure storage account to be used by the cluster.

The access key for the storage account.

The blob container in the storage account to be used by the cluster.

The number of data nodes to be created in the cluster.

The credentials to be used for administrative access to the cluster.

The version of HDInsight to be used.

After you have created the initial ClusterCreateParameters class, you can optionally customize the
default HDInsight configuration settings by using the following properties:

AdditionalStorageAccounts: Use this property to enable the cluster to access additional
Azure storage accounts if required.

CoreConfiguration: Specify a ConfigValuesCollection object that contains custom Hadoop
configuration settings as key/value pairs.

HdfsConfiguration: Specify a ConfigValuesCollection object that contains custom HDFS
configuration settings as key/value pairs.

HiveConfiguration: Specify a ConfigValuesCollection object that contains custom Hive
configuration settings as key/value pairs.

HiveMetastore: Specify a custom Azure SQL Database instance in which to store Hive metadata.

MapReduceConfiguration: Specify a ConfigValuesCollection object that contains custom
map/reduce configuration settings as key/value pairs.

OozieConfiguration: Specify a ConfigValuesCollection object that contains custom Oozie
configuration settings as key/value pairs.

OozieMetastore: Specify a custom Azure SQL Database instance in which to store Oozie
metadata.

YarnConfiguration: Specify a ConfigValuesCollection object that contains custom YARN
configuration settings as key/value pairs.

When you are ready to create the cluster, you must use a locally stored Azure management certificate
to create an HDInsightCertificateCredential object and then use this object with the HDInsightClient
static class to connect to Azure and create a client object based on the IHDInsightClient interface. The
IHDInsightClient interface provides the CreateCluster method that you can use to create an HDInsight
cluster synchronously, and a CreateClusterAsync method you can use to create the cluster
asynchronously.
The following code example shows a simple console application that creates an HDInsight cluster using
an existing Azure storage account and container. The example is deliberately kept simple by including
the credentials in the code so that you can copy and paste it while you are experimenting with
HDInsight. In a production system you must protect credentials, as described in Securing credentials in
scripts and applications in the Security section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.WindowsAzure.Management.HDInsight.ClusterProvisioning;

namespace ClusterMgmt
{
  class Program
  {
    static void Main(string[] args)
    {
      string subscriptionId = "subscription-id";
      string certFriendlyName = "certificate-friendly-name";
      string clusterName = "unique-cluster-name";
      string storageAccountName = "storage-account-name";
      string storageAccountKey = "storage-account-key";
      string containerName = "container-name";
      string userName = "user-name";
      string password = "password";
      string location = "Southeast Asia";
      int clusterSize = 4;

      // Get the certificate object from certificate store
      // using the friendly name to identify it.
      X509Store store = new X509Store();
      store.Open(OpenFlags.ReadOnly);
      X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.FriendlyName == certFriendlyName);

      // Create an HDInsightClient object.
      HDInsightCertificateCredential creds =
        new HDInsightCertificateCredential(new Guid(subscriptionId), cert);
      var client = HDInsightClient.Connect(creds);

      // Supply cluster information.
      ClusterCreateParameters clusterInfo = new ClusterCreateParameters()
      {
        Name = clusterName,
        Location = location,
        DefaultStorageAccountName = storageAccountName + ".blob.core.windows.net",
        DefaultStorageAccountKey = storageAccountKey,
        DefaultStorageContainer = containerName,
        UserName = userName,
        Password = password,
        ClusterSizeInNodes = clusterSize,
        Version = "3.0"
      };

      // Create the cluster.
      Console.WriteLine("Creating the HDInsight cluster ...");
      ClusterDetails cluster = client.CreateCluster(clusterInfo);
      Console.WriteLine("Created cluster: {0}.", cluster.ConnectionUrl);
      Console.WriteLine("Press a key to end.");
      Console.Read();
    }
  }
}

Note that this example uses a pre-existing Azure storage account and container, which must be hosted
in the same geographical region as the cluster (in this case, Southeast Asia).
To delete a cluster you can use the DeleteCluster method of the HDInsightClient class.
For more information about using the .NET SDK for HDInsight to provision and delete HDInsight clusters
see HDInsight SDK Reference Documentation.

Processing, querying, and transforming data using HDInsight


This section of the guide explores the tools, techniques, and technologies for processing data in a big
data solution. This processing may include executing queries to extract data, transformations to modify
and shape data, and a range of other operations such as creating tables or executing workflows.
Microsoft big data solutions, including HDInsight on Microsoft Azure, are based on a Hadoop distribution
called the Hortonworks Data Platform (HDP). It uses the YARN resource manager to implement a
runtime platform for a wide range of data query, transformation, and storage tools and applications.
Figure 1 shows the high-level architecture of HDP, and how it supports the tools and applications
described in this guide.

Figure 1 - High-level architecture of the Hortonworks Data Platform


The three most commonly used tools for processing data by executing queries and transformations, in
order of popularity, are Hive, Pig, and map/reduce.
HCatalog is a feature of Hive that provides, amongst other capabilities, a way to remove dependencies on
literal file paths in order to stabilize and unify solutions that incorporate multiple steps.
Mahout is a scalable machine learning library for clustering, classification, and collaborative filtering that
you can use to examine data files in order to extract specific types of information.

Storm is a real-time data processing application that is designed to handle streaming data.
These applications can be used for a wide variety of tasks, and many of them can be easily combined
into multi-step workflows by using Oozie.
This section of the guide contains the following topics:

Data processing tools and techniques

Workflow and job orchestration

Choosing tools and technologies

Building custom clients

HBase is a database management system that can provide scalability for storing vast amounts of data,
support for real-time querying, consistent reads and writes, automatic and configurable sharding of
tables, and high reliability with automatic failover. For more information see Data storage in the topic
Specifying the infrastructure.

Evaluating the results


After you have used HDInsight to process the source data you can use the data for analysis and
reporting, which forms the foundation for business decision making. However, before making critical
business decisions based on the results you must carefully evaluate them to ensure they are:

Meaningful. The values in the results, when combined and analyzed, relate to one another in a
meaningful way.

Accurate. The results appear to be correct, or are within the expected range.

Useful. The results are applicable to the business decision they will support, and provide
relevant metrics that help inform the decision making process.

You will often need to employ the services of a business user who intimately understands the business
context for the data to perform the role of a data steward and sanity check the results to determine
whether or not they fall within expected parameters. It may not be possible to validate all of the source
data for a query, especially if it is collected from external sources such as social media sites. However,
depending on the complexity of the processing, you might decide to select a number of data inputs for
spot-checking and trace them through the process to ensure that they produce the expected outcome.
When you are planning to use HDInsight to perform predictive analysis, it can be useful to evaluate the
process against known values. For example, if your goal is to use demographic and historical sales data
to determine the likely revenue for a proposed retail store, you can validate the processing model by
using appropriate source data to predict revenue for an existing store and compare the resulting
prediction to the actual revenue value. If the results of the data processing you have implemented vary
significantly from the actual revenue, then it seems unlikely that the results for the proposed store will
be reliable.
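
The comparison against a known value can be sketched as a simple tolerance check. This is a minimal illustration only: the function names, the 10% tolerance, and the revenue figures are assumptions chosen for the example, not values taken from this guide.

```python
def relative_error(predicted: float, actual: float) -> float:
    """Return the absolute error as a fraction of the known actual value."""
    if actual == 0:
        raise ValueError("actual value must be non-zero")
    return abs(predicted - actual) / abs(actual)


def is_model_plausible(predicted: float, actual: float, tolerance: float = 0.10) -> bool:
    """True when the prediction falls within the chosen tolerance of the known value."""
    return relative_error(predicted, actual) <= tolerance


# Hypothetical figures for an existing store, used only to exercise the check.
predicted_revenue = 1_050_000.0
actual_revenue = 1_000_000.0
print(is_model_plausible(predicted_revenue, actual_revenue))
```

The appropriate tolerance depends on the business decision; a wide tolerance may be acceptable for exploratory work but not for an investment decision.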

Considerations
Consider the following points when designing and developing data processing solutions:

Big data frameworks offer a huge range of tools that you can use with the Hadoop core engine,
and choosing the most appropriate can be difficult. Azure HDInsight simplifies the process
because all of the tools it includes are guaranteed to be compatible and work correctly
together. This doesn't mean you can't incorporate other tools and frameworks in your solution.

Of the query and transformation applications, Hive is the most popular. However, many
HDInsight processing solutions are actually incremental in nature; they consist of multiple
queries, each operating on the output of the previous one. These queries may use different
query applications. For example, you might first use a custom map/reduce job to summarize a
large volume of unstructured data, and then create a Pig script to restructure and group the
data values produced by the initial map/reduce job. Finally, you might create Hive tables based
on the output of the Pig script so that client applications such as Excel can easily consume the
results.

If you decide to use a resource-intensive application such as HBase or Storm, you should
consider running it on a separate cluster from your Hadoop-based big data batch processing
solution to avoid contention and consequent loss of performance for the application and your
solution as a whole.

The challenges don't end with simply writing and running a job. As in any data processing
scenario, it's vitally important to check that the results generated by queries are realistic, valid,
and useful before you invest a lot of time and effort (and cost) in developing and extending your
solution. A common use of HDInsight is simply to experiment with data to see if it can offer
insights into previously undiscovered information. As with any investigational or experimental
process, you need to be convinced that each stage is producing results that are both valid
(otherwise you gain nothing from the answers) and useful (in order to justify the cost and
effort).

Unless you are simply experimenting with data to find the appropriate questions to ask, you will
want to automate some or all of the tasks and be able to run the solution from a remote
computer. For more information see Building custom clients and Building end-to-end solutions
using HDInsight.

Security is a fundamental concern in all computing scenarios, and big data processing is no
exception. Security considerations apply during all stages of a big data process, and include
securing data while in transit over the network, securing data in storage, and authenticating and
authorizing users who have access to the tools and utilities you use as part of your process. For
more details of how you can maximize security of your HDInsight solutions see the topic
Security in the section Building end-to-end solutions using HDInsight.

More information
For more information about HDInsight, see the Microsoft Azure HDInsight web page.
A central point for TechNet articles about HDInsight is HDInsight Services For Windows.
For examples of how you can use HDInsight, see the following tutorials on the HDInsight website:

Use Hive with HDInsight

Use Pig with HDInsight

Develop and deploy Java MapReduce jobs to HDInsight

Develop C# Hadoop streaming programs for HDInsight

Data processing tools and techniques


There is a huge range of tools and frameworks you can use with a big data solution based on Hadoop.
This guide focuses on Microsoft big data solutions, and specifically those built on HDInsight running in
Azure. The following sections of this topic describe the data processing tools you are most likely to use
in an HDInsight solution, and the techniques that you are likely to apply.

Overview of Hive

Overview of Pig

Custom map/reduce components

User-defined functions

Overview of HCatalog

Overview of Mahout

Overview of Storm

Configuring and debugging solutions

Overview of Hive
Hive is an abstraction layer over the Hadoop query engine that provides a query language called HiveQL,
which is syntactically very similar to SQL and supports the ability to create tables of data that can be
accessed remotely through an ODBC connection.
In effect, Hive enables you to create an interface to your data that can be used in a similar way to a
traditional relational database. Business users can use familiar tools such as Excel and SQL Server
Reporting Services to consume data from HDInsight in a similar way as they would from a database
system such as SQL Server. Installing the ODBC driver for Hive on a client computer enables users to
connect to an HDInsight cluster and submit HiveQL queries that return data to an Excel worksheet, or to
any other client that can consume results through ODBC. HiveQL also allows you to plug in custom
mappers and reducers to perform more sophisticated processing.
Hive is a good choice for data processing when:

You want to process large volumes of immutable data to perform summarization, ad hoc
queries, and analysis.

The source data has some identifiable structure, and can easily be mapped to a tabular schema.

You want to create a layer of tables through which business users can easily query source data,
and data generated by previously executed map/reduce jobs or Pig scripts.

You want to experiment with different schemas for the table format of the output.

You are familiar with SQL-like languages and syntax.

The processing you need to perform can be expressed effectively as HiveQL queries.

The latest versions of HDInsight incorporate a technology called Tez, part of the Stinger initiative for
Hadoop, that vastly increases the performance of Hive. For more details see Stinger: Interactive Query
for Hive on the Hortonworks website.
If you are not familiar with Hive, a basic introduction to using HiveQL can be found in the topic
Processing data with Hive. You can also experiment with Hive by executing HiveQL statements in the
Hive Editor page of the HDInsight management portal. See Monitoring and logging for more details.
Overview of Pig
Pig is a query interface that provides a workflow semantic for processing data in HDInsight. Pig enables
you to perform complex processing of your source data to generate output that is useful for analysis and
reporting.
Pig statements are expressed in a language named Pig Latin, and generally involve defining relations
that contain data, either loaded from a source file or as the result of a Pig Latin expression on an existing
relation. Relations can be thought of as result sets, and can be based on a schema (which you define in
the Pig Latin statement used to create the relation) or can be completely unstructured.
Pig is a good choice when you need to:

Restructure source data by defining columns, grouping values, or converting columns to rows.

Perform data transformations such as merging and filtering data sets, and applying functions to
all or subsets of records.

Use a workflow-based approach to process data as a sequence of operations, which is often a
logical way to approach many data processing tasks.

If you are not familiar with Pig, a basic introduction to using Pig Latin can be found in the topic
Processing data with Pig.

Custom map/reduce components


Map/reduce code consists of two separate functions implemented as map and reduce components. The
map component is run in parallel on multiple cluster nodes, each node applying it to its own subset of
the data. The reduce component collates and summarizes the results from all of the map functions (see
How do big data solutions work? for more details of these two components).
In most HDInsight processing scenarios it is simpler and more efficient to use a higher-level abstraction
such as Pig or Hive, although you can create custom map and reduce components for use within Hive
scripts in order to perform more sophisticated processing.
Custom map/reduce components are typically written in Java. However, Hadoop provides a streaming
interface that allows components to be used that are developed in other languages such as C#, F#,
Visual Basic, Python, JavaScript, and more. For more information see the section Using Hadoop
Streaming in the topic Writing map/reduce code.
You might consider creating your own map and reduce components when:

You want to process data that is completely unstructured by parsing it and using custom logic in
order to obtain structured information from it.

You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive
without resorting to creating a UDF. For example, you might need to use an external geocoding
service to convert latitude and longitude coordinates or IP addresses in the source data to
geographical location names.

You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components.
You can do this using the Hadoop streaming interface.

If you are not familiar with writing map/reduce components, a basic introduction and information about
using Hadoop streaming can be found in the topic Writing map/reduce code.
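
As an illustration of the two components described above, the following sketch implements a word count as a map function and a reduce function, and runs them locally. It is not the Hadoop API: the names mapper, reducer, and run_local are assumptions made for this example, and the sort inside run_local only stands in for the shuffle-and-sort step that Hadoop performs between the two phases.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, Tuple


def mapper(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Reduce phase: sum the counts for each word (pairs must be sorted by key)."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines: Iterable[str]) -> Dict[str, int]:
    """Run both phases in one process; sorting stands in for Hadoop's shuffle."""
    shuffled = sorted(mapper(lines))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["big data big insight", "data data"]
    print(run_local(sample))
```

In a real streaming job the mapper and reducer would typically be separate scripts that read from standard input and write to standard output; keeping them as plain functions here makes the logic easy to test before deploying it to a cluster.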
User-defined functions
Developers often find that they reuse the same code in several locations, and the typical way to
optimize this is to create a user-defined function (UDF) that can be imported into other projects when
required. Often a series of UDFs that accomplish related functions are packaged together in a library so
that the library can be imported into a project. Hive and Pig can take advantage of any of the UDFs it
contains. For more information see User-defined functions.
Overview of HCatalog
Technologies such as Hive, Pig, and custom map/reduce code can be used to process data in an
HDInsight cluster. In each case you use code to project a schema onto data that is stored in a particular
location, and then apply the required logic to filter, transform, summarize, or otherwise process the
data to generate the required results.
The code must load the source data from wherever it is stored, and convert it from its current format to
the required schema. This means that each script must include assumptions about the location and
format of the source data. These assumptions create dependencies that can cause your scripts to break
if an administrator chooses to change the location, format, or schema of the source data.
Additionally, each processing interface (Hive, Pig, or custom map/reduce) requires its own definition of
the source data, and so complex data processes that involve multiple steps in different interfaces
require consistent definitions of the data to be maintained across all of the scripts.
HCatalog provides a tabular abstraction layer that helps unify the way that data is interpreted across
processing interfaces, and provides a consistent way for data to be loaded and stored, regardless of the
specific processing interface being used. This abstraction exposes a relational view over the data,
including support for partitions.
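
The idea of removing path and format assumptions from scripts can be illustrated with a toy catalog. This sketch does not use the HCatalog API; the CATALOG and STORAGE dictionaries, the table name, and the paths are all hypothetical stand-ins that show why a script that refers to a table by name keeps working when the data is moved.

```python
import csv
import io

# Hypothetical catalog entry: location, format, and schema are recorded once.
CATALOG = {
    "sales": {
        "location": "/mydata/sales",
        "delimiter": "\t",
        "columns": [("store", str), ("amount", float)],
    }
}

# In-memory stand-in for files in cluster storage.
STORAGE = {"/mydata/sales": "store-a\t125.50\nstore-b\t80.00\n"}


def load_table(name):
    """Resolve location, delimiter, and schema from the catalog, then read the rows."""
    entry = CATALOG[name]
    reader = csv.reader(io.StringIO(STORAGE[entry["location"]]),
                        delimiter=entry["delimiter"])
    return [
        {column: convert(value)
         for (column, convert), value in zip(entry["columns"], fields)}
        for fields in reader
    ]


def total_amount():
    """A 'script' that depends only on the table name, not on a literal path."""
    return sum(row["amount"] for row in load_table("sales"))


print(total_amount())

# An administrator moves the data; only the catalog entry changes.
STORAGE["/archive/sales"] = STORAGE.pop("/mydata/sales")
CATALOG["sales"]["location"] = "/archive/sales"
print(total_amount())
```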
The following factors will help you decide whether to incorporate HCatalog in your HDInsight solution:

It makes it easy to abstract the data storage location, format, and schema from the code used
to process it.

It minimizes fragile dependencies between scripts in complex data processing solutions where
the same data is processed by multiple tasks.

It enables notification of data availability, making it easier to write applications that perform
multiple jobs.

It is easy to incorporate into solutions that include Hive and Pig scripts, requiring very little extra
code. However, if you use only Hive scripts and queries, or you are creating a one-shot solution
for experimentation purposes and do not intend to use it again, HCatalog is unlikely to provide
any benefit.

Files in JSON, SequenceFile, CSV, and RC format can be read and written by default, and a
custom serializer/deserializer component (SerDe) can be used to read and write files in other
formats (see SerDe on the Apache wiki for more details).

Additional effort is required to use HCatalog in custom map/reduce components because you
must create your own custom load and store functions.

For more information, see Unifying and stabilizing jobs with HCatalog.
Overview of Mahout
Mahout is a data mining query library that you can use to examine data files in order to extract specific
types of information. It provides an implementation of several machine learning algorithms, and is
typically used with source data files containing relationships between the items of interest in a data
processing solution. For example, it can use a data file containing the similarities between different
movies and TV shows to create a list of recommendations for customers based on items they have
already viewed or purchased. The source data could be obtained from a third party, or generated and
updated by your application based on purchases made by other customers.
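
The recommendation scenario can be sketched without Mahout to show the kind of input and output involved. The similarity scores, item names, and the recommend function below are invented for illustration; Mahout's own algorithms are more sophisticated and run as distributed jobs.

```python
# Hypothetical item-to-item similarity data, as might be held in a source data file.
SIMILARITIES = {
    ("movie-a", "movie-b"): 0.9,
    ("movie-a", "movie-c"): 0.2,
    ("movie-b", "movie-d"): 0.7,
}


def recommend(viewed, similarities, top_n=2):
    """Score unseen items by their total similarity to items already viewed."""
    scores = {}
    for (left, right), score in similarities.items():
        # Similarity is treated as symmetric, so check both directions.
        for seen, candidate in ((left, right), (right, left)):
            if seen in viewed and candidate not in viewed:
                scores[candidate] = scores.get(candidate, 0.0) + score
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


print(recommend({"movie-a"}, SIMILARITIES))
print(recommend({"movie-a", "movie-b"}, SIMILARITIES))
```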

Mahout queries are typically executed as a separate process, perhaps based on a schedule, to update
the results. These results are usually stored as a file within the cluster storage, though they may be
exported to a database or to visualization tools. Mahout can also be executed as part of a workflow.
However, it is a batch-based process that may take some time to execute with large source datasets.
Mahout is a good choice when you need to:

Apply clustering algorithms to group documents or data items that contain similar content.

Apply recommendation mining algorithms to discover users' preferences from their behavior.

Apply classification algorithms to assign new documents or data items to a category based on
the existing categorizations.

Perform frequent data mining operations based on the most recent data.

For more information see the Apache Mahout website.


Overview of Storm
Storm is a scalable, fault-tolerant, distributed, real-time computation system for processing fast and
large streams of data. It allows you to build trees and directed acyclic graphs (DAGs) that
asynchronously process data items using a user-defined number of parallel tasks. It can be used for
real-time analytics, online machine learning, continuous computation, Extract Transform Load (ETL) tasks,
and more.
Storm processes messages or stream inputs as individual data items, which it refers to as tuples, using a
user-defined number of parallel tasks. Input data is exposed by spouts that connect to an input stream
such as a message queue, and pass data as messages to one or more bolts. Each bolt is a processing task
that can be configured to run as multiple instances. Bolts can pass data as messages to other bolts using
a range of options. For example, a bolt might pass the results of all the messages it processes to several
different bolts, to a specific single bolt, or to a range of bolts based on filtering a value in the message.
This flexible and configurable routing system allows you to construct complex graphs of tasks to perform
real-time processing. Bolts can maintain state for aggregation operations, and output results to a range
of different types of storage including relational databases. This makes it ideal for performing ETL tasks
as well as providing a real-time filtering, validation, and alerting solution for streaming data. A high-level
language called Trident can be used to build complex queries and processing solutions with Storm.
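
The spout and bolt model described above can be sketched as a chain of Python generators. This is a single-process toy, not the Storm API: the function names, the threshold, and the sample readings are assumptions for illustration, and a real topology would run many instances of each bolt in parallel across the cluster.

```python
from typing import Iterable, Iterator, List, Tuple


def reading_spout(readings: Iterable[float]) -> Iterator[dict]:
    """Spout: expose each item from the input stream as a tuple."""
    for value in readings:
        yield {"value": value}


def validation_bolt(tuples: Iterable[dict]) -> Iterator[dict]:
    """Bolt: filter out tuples that fail a simple validity check."""
    for item in tuples:
        if item["value"] >= 0:
            yield item


def alert_bolt(tuples: Iterable[dict], threshold: float,
               alerts: List[float]) -> Iterator[dict]:
    """Bolt: record an alert for out-of-band values, then pass every tuple on."""
    for item in tuples:
        if item["value"] > threshold:
            alerts.append(item["value"])
        yield item


def run_topology(readings: Iterable[float],
                 threshold: float = 100.0) -> Tuple[int, List[float]]:
    """Wire the spout to the bolts and count the tuples that reach the end."""
    alerts: List[float] = []
    stream = alert_bolt(validation_bolt(reading_spout(readings)), threshold, alerts)
    processed = sum(1 for _ in stream)
    return processed, alerts


print(run_topology([12.0, -1.0, 250.0, 99.5]))
```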
Storm is a good choice when you need to:

Pre-process data before loading it into a big data solution.

Handle huge volumes of data or messages that arrive at a very high rate.

Filter and sort incoming stream data for storing in separate files, repositories, or database
tables.

Examine the data stream in real time, perhaps to raise alerts for out-of-band values or specific
combinations of events, before analyzing it later using one of the batch-oriented query
mechanisms such as Hive or Pig.

For more information, see the Tutorial on the Storm documentation website.

Processing data with Hive


Hive uses tables to impose a schema on data, and to provide a query interface for client applications.
The key difference between Hive tables and those in traditional database systems, such as SQL Server, is
that Hive adopts a "schema on read" approach that enables you to be flexible about the specific
columns and data types that you want to project onto your data.
You can create multiple tables with different schemas from the same underlying data, depending on
how you want to use that data. You can also create views and indexes over Hive tables, and partition
tables. Moving data into a Hive-controlled namespace is usually an instantaneous operation.
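
The effect of applying a schema when data is read, rather than when it is written, can be sketched as follows. This is a conceptual illustration in Python rather than Hive itself; the sample data, schema names, and the project function are assumptions made for the example.

```python
import csv
import io

# The same underlying tab-delimited data is never modified.
RAW_DATA = "2014-06-01\tstore-a\t125.50\n2014-06-02\tstore-b\t80.00\n"

# Two hypothetical schemas projected onto that data.
SALES_SCHEMA = [("sale_date", str), ("store", str), ("amount", float)]
DATES_ONLY_SCHEMA = [("sale_date", str)]


def project(raw, schema):
    """Apply a schema (column name plus converter per field) at read time."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t")
    return [
        {name: convert(value) for (name, convert), value in zip(schema, fields)}
        for fields in reader
    ]


print(project(RAW_DATA, SALES_SCHEMA))
print(project(RAW_DATA, DATES_ONLY_SCHEMA))
```

Because the schema is only metadata, defining a second "table" over the same data costs nothing in storage, which is what makes it practical to experiment with different schemas.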
You can use the Hive command line on the HDInsight cluster to work with Hive tables, and build an
automated solution that includes Hive queries by using the HDInsight .NET SDKs or a range of
Hadoop-related tools such as Oozie and WebHCat. You can also use the Hive ODBC driver to connect to
Hive from any ODBC-capable client application.
The topics covered here are:

Creating tables with Hive

Managing Hive table data location and lifetime

Loading data into Hive tables

Partitioning the data

Querying tables with HiveQL

In addition to its more usual use as a querying mechanism, Hive can be used to create a simple data
warehouse containing table definitions applied to data that you have already processed into the
appropriate format. Azure storage is relatively inexpensive, and so this is a good way to create a
commodity storage system when you have huge volumes of data. An example of this can be found in
Scenario 2: Data warehouse on demand.
Creating tables with Hive
You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar
to the equivalent statement in Transact-SQL. You specify the schema in the form of a series of column
names and types, and the type of delimiter that Hive will use to delineate each column value as it parses
the data. You can also specify the format for the files in which the table data will be stored if you do not
want to use the default format (where data files are delimited by an ASCII code 1 (Octal \001) character,
equivalent to Ctrl + A). For example, the following code creates a table named mytable and specifies
that the data files for the table should be tab-delimited.
HiveQL
CREATE TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

You can also create a table and populate it as one operation by using a CREATE TABLE statement that
includes a SELECT statement to query an existing table, as described later in this topic.
Hive supports a sufficiently wide range of data types to suit almost any requirement. The primitive data
types you can use for columns in a Hive table are TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT,
DOUBLE, STRING, BINARY, DATE, TIMESTAMP, CHAR, VARCHAR, DECIMAL (though the last five of these
are not available in older versions of Hive). In addition to these primitive types you can define columns
as ARRAY, MAP, STRUCT, and UNIONTYPE. For more information see Hive Data Types in the Apache Hive
language manual.
Managing Hive table data location and lifetime
Hive tables are simply metadata definitions imposed on data in underlying files. By default, Hive stores
table data in the user/hive/warehouse/table_name path in storage (the default path is defined in the
configuration property hive.metastore.warehouse.dir), so the previous code sample will create the
table metadata definition and an empty folder at user/hive/warehouse/mytable. When you delete the
table by executing the DROP TABLE statement, Hive will delete the metadata definition from the Hive
database and it will also remove the user/hive/warehouse/mytable folder and its contents.
Table and column names in Hive are not case-sensitive so, for example, the names MyTable and mytable
refer to the same table.
However, you can specify an alternative path for a table by including the LOCATION clause in the
CREATE TABLE statement. The ability to specify a non-default location for the table data is useful when
you want to enable other applications or users to access the files outside of Hive. This allows data to be
loaded into a Hive table simply by copying data files of the appropriate format into the folder, or
downloaded directly from storage. When the table is queried using Hive, the schema defined in its
metadata is automatically applied to the data in the files.
An additional benefit of specifying the location is that this makes it easy to create a table for data that
already exists in that location (perhaps the output from a previously executed map/reduce job or Pig
script). After creating the table, the existing data in the folder can be retrieved immediately with a
HiveQL query.
However, one consideration for using managed tables is that, when the table is deleted, the folder it
references will also be deleted, even if it already contained other data files when the table was created.
If you want to manage the lifetime of the folder containing the data files separately from the lifetime of
the table, you must use the EXTERNAL keyword in the CREATE TABLE statement to indicate that the
folder will be managed externally from Hive, as shown in the following code sample.

HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';

In HDInsight the location shown in this example corresponds to
wasbs://[container-name]@[storage-account-name].blob.core.windows.net/mydata/mytable in Azure storage.
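
The mapping from a cluster path to its full Azure storage address can be expressed as a small helper. This is a sketch only: the function name to_wasbs_uri and the sample container and account names are hypothetical, and the pattern is taken directly from the address format shown above.

```python
def to_wasbs_uri(container: str, storage_account: str, path: str) -> str:
    """Build the wasbs address for a path in the given container and account."""
    return "wasbs://{0}@{1}.blob.core.windows.net/{2}".format(
        container, storage_account, path.lstrip("/"))


# Hypothetical container and storage account names.
print(to_wasbs_uri("mycontainer", "myaccount", "/mydata/mytable"))
```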
This ability to manage the lifetime of the table data separately from the metadata definition of the table
means that you can create several tables and views over the same data, but each can have a different
schema. For example, you may want to include fewer columns in one table definition to reduce the
network load when you transfer the data to a specific analysis tool, but have all of the columns available
for another tool.
As a general guide you should:

Use INTERNAL tables (the default, commonly referred to as managed tables) when you want
Hive to manage the lifetime of the table or when the data in the table is temporary; for
example, when you are running experimental or one-off queries over the source data.

Use INTERNAL tables and also specify the LOCATION for the data files when you want to access
the data files from outside of Hive; for example, if you want to upload the data for the table
directly into the Azure storage location.

Use EXTERNAL tables when you want to manage the lifetime of the data, when data is used by
processes other than Hive, or if the data files must be preserved when the table is dropped.
However, note that you cannot use EXTERNAL tables when you implicitly create the table by
executing a SELECT query against an existing table.

Loading data into Hive tables


In addition to simply uploading data into the location specified for a Hive table, as described in the
previous section of this topic, you can use the HiveQL LOAD statement to load data from an existing file
into a Hive table. When the LOCAL keyword is included, the LOAD statement copies the file from the
local file system to the folder associated with the table.
An alternative is to use a CREATE TABLE statement that includes a SELECT statement to query an
existing table. The results of the SELECT statement are used to create data files for the new table. When
you use this technique the new table must be an internal, non-partitioned table.
You can use an INSERT statement to insert the results of a SELECT query into an existing table, and in
this case the table can be partitioned. You can use the OVERWRITE keyword with both the LOAD and
INSERT statements to replace any existing data with the new data being inserted. You can use the
OVERWRITE DIRECTORY keywords to effectively export the data returned by the SELECT part of the
statement to an output file.
As a general guide you should:

Use the LOAD statement when you need to create a table from the results of a map/reduce job
or a Pig script. These scripts generate log and status files as well as the output file when they
execute, and using the LOAD method enables you to easily add the output data to a table
without having to deal with the additional files that you do not want to include in the table.
Alternatively you can move the output file to a different location before you create a Hive table
over it.

Use the INSERT statement when you want to load data from an existing table into a different
table. A common use of this approach is to upload source data into a staging table that matches
the format of the source data (for example, tab-delimited text). Then, after verifying the staged
data, compress and load it into a table for analysis, which may be in a different format such as
SequenceFile.

Use a SELECT query in a CREATE TABLE statement to generate the table dynamically when you
just want simplicity and flexibility. You do not need to know the column details to create the
table, and you do not need to change the statement when the source data or the SELECT
statement changes. You cannot, however, create an EXTERNAL or partitioned table this way
and so you cannot control the data lifetime of the new table separately from the metadata
definition.

To compress data as you insert it from one table to another, you must set some Hive parameters to
specify that the results of the query should be compressed, and specify the compression algorithm to be
used. The raw data for the table is in TextFile format, which is the default storage format. However,
a compressed text file may not be splittable, in which case Hadoop cannot divide it into chunks and
process them with multiple map tasks in parallel, and the cluster resources are under-utilized. The
recommended practice is to insert the data into another table that is stored in SequenceFile format.
Hadoop can split data in SequenceFile format and distribute it across multiple map tasks.
For example, the following HiveQL statements load compressed data from a staging table into another
table.
HiveQL
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/path/file.gz' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value for io.seqfile.compression.type determines how the compression is performed. The options
are NONE, RECORD, and BLOCK. RECORD compresses each value individually, while BLOCK buffers up
1MB (by default) before beginning compression.
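The difference between RECORD and BLOCK compression can be illustrated outside Hadoop. The following Python sketch does not show how SequenceFile is implemented. It assumes zlib as a stand-in codec and a hypothetical set of short, similar records, and simply compares compressing each value on its own with compressing a buffered batch of values together.

```python
import zlib

def record_compressed_size(values):
    """Total size when each value is compressed individually (RECORD style)."""
    return sum(len(zlib.compress(v)) for v in values)

def block_compressed_size(values, block_bytes=1024 * 1024):
    """Total size when values are buffered into blocks before compression (BLOCK style)."""
    total, buffer, buffered = 0, [], 0
    for v in values:
        buffer.append(v)
        buffered += len(v)
        if buffered >= block_bytes:
            total += len(zlib.compress(b"".join(buffer)))
            buffer, buffered = [], 0
    if buffer:
        # Compress whatever is left in the final, partially filled block.
        total += len(zlib.compress(b"".join(buffer)))
    return total

# Hypothetical sample data: many short lines that share most of their text.
values = [("2014-06-01\tValue%d\t%d\n" % (i % 3, i)).encode("utf-8")
          for i in range(10000)]

raw = sum(len(v) for v in values)
print("raw bytes:   ", raw)
print("RECORD bytes:", record_compressed_size(values))
print("BLOCK bytes: ", block_compressed_size(values))
```

With short values such as these, compressing each value separately gains little because the codec has almost no repetition to exploit within a single value, whereas a buffered block exposes the repetition across values.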

For more information about creating tables with Hive, see Hive Data Definition Language on the Apache
Hive site. For a more detailed description of using Hive see Hive Tutorial.
Partitioning the data
Advanced options when creating a table include the ability to partition, skew, and cluster the data
across multiple files and folders:

You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a
specified column (for example, to store a file of daily data for each date in a separate folder).

You can use the SKEWED BY clause to create separate files for rows where a specified column
value is in a list of specified values. Rows with values not listed are stored together in a separate
single file.

You can use the CLUSTERED BY clause to distribute data across a specified number of files
(described as buckets) based on hashes of the values of specified columns.
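The idea behind CLUSTERED BY can be sketched in a few lines of Python. This is an illustration only: it uses zlib.crc32 as a stand-in hash, which is an assumption and not the hash function that Hive itself applies, and the function name bucket_for and the sample rows are hypothetical.

```python
import zlib

def bucket_for(value, num_buckets):
    """Return the bucket index for a column value: hash(value) modulo the bucket count."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Hypothetical rows of (col1, col2); the table is bucketed on col1.
rows = [("ValueA1", 1), ("ValueA2", 2), ("ValueB1", 3), ("ValueA1", 4)]
num_buckets = 4

buckets = {}
for col1, col2 in rows:
    buckets.setdefault(bucket_for(col1, num_buckets), []).append((col1, col2))

for index in sorted(buckets):
    print("bucket", index, buckets[index])
```

Because the bucket is derived only from the column value, rows with the same value always land in the same bucket, which is what makes bucketing useful for sampling and for joins on the bucketed column.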

When you partition a table, the partitioning columns are not included in the main table schema section
of the CREATE TABLE statement. Instead, they must be included in a separate PARTITIONED BY clause.
The partitioning columns can, however, still be referenced in SELECT queries. For example, the following
HiveQL statement creates a table in which the data is partitioned by a string value named partcol1.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
PARTITIONED BY (partcol1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';

When data is loaded into the table, subfolders are created for each partition column value. For example,
you could load the following data into the table.
col1     col2  partcol1
ValueA1  ...   A
ValueA2  ...   A
ValueB1  ...   B
ValueB2  ...   B

After this data has been loaded into the table, the /mydata/mytable folder will contain a subfolder
named partcol1=A and a subfolder named partcol1=B, and each subfolder will contain the data files for
the values in the corresponding partitions.
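The resulting folder layout can be sketched as follows. This Python example only mimics the naming convention described above. The function name partition_folders is hypothetical and the col2 values are made up for illustration. Note that the partition value is carried by the folder name, so it is removed from the rows that would be written to the data files.

```python
from collections import defaultdict

def partition_folders(rows, table_location, partition_column):
    """Group rows by partition value and map each group to its subfolder path."""
    folders = defaultdict(list)
    for row in rows:
        row = dict(row)                    # copy so the caller's data is unchanged
        value = row.pop(partition_column)  # the partition value lives in the folder name
        path = "%s/%s=%s" % (table_location, partition_column, value)
        folders[path].append(row)
    return dict(folders)

# Hypothetical rows; the col2 values are assumptions.
rows = [
    {"col1": "ValueA1", "col2": 1, "partcol1": "A"},
    {"col1": "ValueA2", "col2": 2, "partcol1": "A"},
    {"col1": "ValueB1", "col2": 3, "partcol1": "B"},
    {"col1": "ValueB2", "col2": 4, "partcol1": "B"},
]

layout = partition_folders(rows, "/mydata/mytable", "partcol1")
for folder in sorted(layout):
    print(folder, layout[folder])
```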
When you need to load data into a partitioned table you must include the partitioning column values. If
you are loading a single partition at a time, and you know the partitioning value, you can specify explicit
partitioning values as shown in the following HiveQL INSERT statement.

HiveQL
FROM staging_table s
INSERT INTO TABLE mytable PARTITION(partcol1='A')
SELECT s.col1, s.col2
WHERE s.col3 = 'A';

Alternatively, you can use dynamic partition allocation so that Hive creates new partitions as required by
the values being inserted. To use this approach you must enable the non-strict option for the dynamic
partition mode, as shown in the following code sample.
HiveQL
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
FROM staging_table s
INSERT INTO TABLE mytable PARTITION(partcol1)
SELECT s.col1, s.col2, s.col3 AS partcol1;

Querying tables with HiveQL


After you have created tables and loaded data files into the appropriate locations you can query the
data by executing HiveQL SELECT statements against the tables. HiveQL SELECT statements are similar to
SQL, and support common operations such as JOIN, UNION, GROUP BY, and ORDER BY. For example,
you could use the following code to query the mytable table described earlier.
HiveQL
SELECT col1, SUM(col2) AS total
FROM mytable
GROUP BY col1;

When designing an overall data processing solution with HDInsight, you may choose to perform complex
processing logic in custom map/reduce components or Pig scripts and then create a layer of Hive tables
over the results of the earlier processing, which can be queried by business users who are familiar with
basic SQL syntax. Alternatively, you can use Hive for all processing, in which case some queries may
require logic that cannot be expressed using the standard HiveQL functions.
In addition to common SQL semantics, HiveQL supports the use of:

Custom map/reduce scripts embedded in a query through the MAP and REDUCE clauses.

Custom user-defined functions (UDFs) that are implemented in Java, or that call Java functions
available in the existing installed libraries. UDFs are discussed in more detail in the topic User-defined functions.

XPath functions for parsing XML data using XPath. See Hive and XML File Processing for more
information.

This extensibility enables you to use HiveQL to perform complex transformations on data as it is queried.
To help you decide on the right approach, consider the following guidelines:

If the source data must be extensively transformed using complex logic before being consumed
by business users, consider using custom map/reduce components or Pig scripts to perform
most of the processing, and create a layer of Hive tables over the results to make them easily
accessible from client applications.

If the source data is already in an appropriate structure for querying and only a few specific but
complex transforms are required, consider using map/reduce scripts embedded in HiveQL
queries to generate the required results.

If queries will be created mostly by business users, but some complex logic is still regularly
required to generate specific values or aggregations, consider encapsulating that logic in custom
UDFs because these will be simpler for business users to include in their HiveQL queries than a
custom map/reduce script.

For more information about selecting data from Hive tables, see Language Manual Select on the Apache
Hive website. For some useful tips on using the SET command to configure headers and directory
recursion in Hive see Useful Hive settings.

Processing data with Pig


Pig Latin syntax has some similarities to LINQ, and encapsulates many functions and expressions that
make it easy to create a sequence of complex data transformations with just a few lines of simple code.
Pig Latin is a good choice for creating relations and manipulating sets, and for working with unstructured
source data. You can always create a Hive table over the results of a Pig Latin query if you want table
format output. However, the syntax of Pig Latin can be complex for non-programmers to master. Pig
Latin is not as familiar or as easy to use as HiveQL, but Pig can achieve some tasks that are difficult, or
even impossible, when using Hive.
You can run Pig Latin statements interactively in the Hadoop command line window or in a command
line Pig shell named Grunt. You can also combine a sequence of Pig Latin statements in a script that can
be executed as a single job, and use user-defined functions you previously uploaded to HDInsight. The
Pig Latin statements are used by the Pig interpreter to generate jobs, but the jobs are not actually
generated and executed until you call either a DUMP statement (which is used to display a relation in
the console, and is useful when interactively testing and debugging Pig Latin code) or a STORE
statement (which is used to store a relation as a file in a specified folder).
Pig scripts generally save their results as text files in storage, where they can easily be viewed on
demand, perhaps by using the Hadoop command line window. However, the results can be difficult to
consume or process in client applications unless you copy the output files and import them into client
tools such as Excel.
Executing a Pig script
As an example of using Pig, suppose you have a tab-delimited text file containing source data similar to
the following.

Data
Value1 1
Value2 3
Value3 2
Value1 4
Value3 6
Value1 2
Value2 8
Value2 5

You could process the data in the source file with the following simple Pig Latin script.
Pig Latin
A = LOAD '/mydata/sourcedata.txt' USING PigStorage('\t') AS (col1, col2:long);
B = GROUP A BY col1;
C = FOREACH B GENERATE group, SUM(A.col2) as total;
D = ORDER C BY total;
STORE D INTO '/mydata/results';

This script loads the tab-delimited data into a relation named A imposing a schema that consists of two
columns: col1, which uses the default byte array data type, and col2, which is a long integer. The script
then creates a relation named B in which the rows in A are grouped by col1, and then creates a relation
named C in which the col2 value is aggregated for each group in B.
After the data has been aggregated, the script creates a relation named D in which the data is sorted
based on the total that has been generated. The relation D is then stored as a file in the /mydata/results
folder, which contains the following text.
Data
Value1 7
Value3 8
Value2 16
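To check the expected output by hand, the same group, sum, and order steps can be emulated in a few lines of Python. This is only a sketch of what the script computes, not of how Pig executes it. The inline rows assume that each key in the source data is paired with the number in the same position, and the function name summarize is hypothetical.

```python
from collections import defaultdict

# Source rows, assuming each key pairs with the number in the same position.
source = [("Value1", 1), ("Value2", 3), ("Value3", 2), ("Value1", 4),
          ("Value3", 6), ("Value1", 2), ("Value2", 8), ("Value2", 5)]

def summarize(rows):
    """Group by col1, sum col2 for each group, then order by the total."""
    totals = defaultdict(int)
    for col1, col2 in rows:          # GROUP ... BY col1 and SUM(col2)
        totals[col1] += col2
    return sorted(totals.items(), key=lambda item: item[1])  # ORDER ... BY total

for key, total in summarize(source):
    print("%s\t%d" % (key, total))
```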

For more information about Pig Latin syntax, see Pig Latin Reference Manual 2 on the Apache Pig
website. For a more detailed description of using Pig see Pig Tutorial.

Writing map/reduce code


You can implement map/reduce code in Java (which is the native language for map/reduce jobs in all
Hadoop distributions), or in a number of other supported languages including JavaScript, Python, C#,
and F# through Hadoop Streaming.
Creating map and reduce components
The following code sample shows a commonly referenced JavaScript map/reduce example that counts
the words in a source that consists of unstructured text data.
JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

The map function splits the contents of the text input into an array of strings using anything that is not
an alphabetic character as a word delimiter. Each string in the array is then used as the key of a new
key/value pair with the value set to 1.
Each key/value pair generated by the map function is passed to the reduce function, which sums the
values in key/value pairs that have the same key. Working together, the map and reduce functions
determine the total number of times each unique word appeared in the source data, as shown here.
Data
Aardvark 2
About 7
Above 12
Action 3
...

For more information about writing map/reduce code, see MapReduce Tutorial on the Apache Hadoop
website and Develop Java MapReduce programs for HDInsight on the Azure website.
Using Hadoop streaming
The Hadoop core within HDInsight supports a technology called Hadoop Streaming that allows you to
interact with the map/reduce process and run your own code outside of the Hadoop core as a separate
executable process. Figure 1 shows a high-level overview of the way that streaming works.

Figure 1 - High level overview of Hadoop Streaming


Figure 1 shows how streaming executes the map and reduce components as separate processes. The
schematic does not attempt to illustrate all of the standard map/reduce stages, such as sorting and
merging the intermediate results or using multiple instances of the reduce component.
When using Hadoop Streaming, each node passes the data for the map part of the process to a separate
process through the standard input (stdin), and accepts the results from the code through the
standard output (stdout), instead of internally invoking a map component written in Java. In the same
way, the node(s) that execute the reduce process pass the data as a stream to the specified code or
component, and accept the results from the code as a stream, instead of internally invoking a Java
reduce component.
Streaming has the advantage of decoupling the map/reduce functions from the Hadoop core, allowing
almost any type of components to be used to implement the mapper and the reducer. The only
requirement is that the components must be able to read from and write to the standard input and
output.
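As an illustration of this contract, the following Python sketch implements a word count mapper and reducer that communicate only through text lines. It relies on the default Hadoop Streaming convention that the text before the first tab on each line is the key, and on the framework sorting the mapper output by key before it reaches the reducer. The script name wordcount.py and the map and reduce command line arguments are assumptions made for this example.

```python
import re
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every alphabetic word in the input lines."""
    for line in lines:
        for word in re.split(r"[^a-zA-Z]+", line):
            if word:
                yield "%s\t1" % word.lower()

def reducer(lines):
    """Sum the counts for each word; the input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

# Run as "python wordcount.py map" or "python wordcount.py reduce":
# read records from stdin and write results to stdout.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

Assuming a Python interpreter is available on the cluster nodes, a script like this could be passed to the Hadoop Streaming jar through its -mapper and -reducer options, for example -mapper "python wordcount.py map" and -reducer "python wordcount.py reduce". Because the components only read and write text streams, the same functions can also be exercised locally by piping a file through the mapper, a sort, and the reducer.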

Using the streaming interface does have a minor impact on performance. The additional movement of
the data over the streaming interface can marginally increase query execution time. Streaming tends to
be used mostly to enable the creation of map and reduce components in languages other than Java. It is
quite popular when using Python, and also enables the use of .NET languages such as C# and F# with
HDInsight.
The Azure SDK contains a series of classes that make it easier to use the streaming interface from .NET
code. For more information see Microsoft .NET SDK For Hadoop on CodePlex.
For more details see Hadoop Streaming on the Apache website. For information about writing HDInsight
map/reduce jobs in languages other than Java, see Develop C# Hadoop streaming programs for
HDInsight and Hadoop Streaming Alternatives.

User-defined functions
You can create user-defined functions (UDFs) and libraries of UDFs for use with HDInsight queries and
transformations. Typically, the UDFs are written in Java and they can be referenced and used in a Hive or
Pig script, or (less commonly) in custom map/reduce code. You can write UDFs in Python for use with Pig,
but the techniques are different from those described in this topic.
UDFs can be used not only to centralize code for reuse, but also to perform tasks that are difficult (or
even impossible) in the Hive and Pig scripting languages. For example, a UDF could perform complex
validation of values, concatenation of column values based on complex conditions or formats,
aggregation of rows, replacement of specific values with nulls to prevent errors when processing bad
records, and much more.
The topics covered here are:

Creating and using UDFs with Hive

Creating and using UDFs with Pig

Creating and using UDFs with Hive


In Hive, UDFs are often referred to as plugins. You can create three different types of Hive plugin. A
standard UDF is used to perform operations on a single row of data, such as transforming a single input
value into a single output value. A user-defined table generating function (UDTF) is used to perform
operations on a single row, but can produce multiple output rows. A user-defined aggregating function
(UDAF) is used to perform operations on multiple rows, and it outputs a single row.
Each type of UDF extends a specific Hive class in Java. For example, the standard UDF must extend the
built-in Hive class UDF or GenericUDF and accept text string values (which can be column names or
specific text strings). As a simple example, you can create a UDF named your-udf-name as shown in the
following code.
Java
// The hyphenated names are placeholders; replace them with valid Java identifiers.
package your-udf-name.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class your-udf-name extends UDF {
    public Text evaluate(final Text s) {
        // Implementation here.
        // Return the result.
    }
}

The body of the UDF, the evaluate function, accepts one or more string values. Hive passes the values of
columns in the dataset to these parameters at runtime, and the UDF generates a result. This might be a
text string that is returned within the dataset, or a Boolean value if the UDF performs a comparison test
against the values in the parameters. The arguments must be types that Hive can serialize.
After you create and compile the UDF, you upload it to HDInsight at the start of the Hive session using a
script with the following command.
Hadoop command
add jar /path/your-udf-name.jar;

Alternatively, you can upload the UDF to a shared library folder when you create the cluster, as
described in Automating cluster management with PowerShell. Then you must register it using a
command such as the following.
Hive script
CREATE TEMPORARY FUNCTION local-function-name
AS 'package-name.function-name';

You can then use the UDF in your Hive query or transformation. For example, if the UDF returns a text
string value you can use it as shown in the following code to replace the value in the specified column of
the dataset with the value generated by the UDF.
Hive script
SELECT local-function-name(column-name) FROM your-data;

If the UDF performs a simple task such as reversing the characters in a string, the result would be a
dataset where the value in the specified column of every row would have its contents reversed.
However, the registration only makes the UDF available for the current session, and you will need to re-register it each time you connect to Hive.
For more information about creating and using standard UDFs in Hive, see HivePlugins on the Apache
website. For more information about creating different types of Hive UDF, see User Defined Functions in
Hive, Three Little Hive UDFs: Part 1, Three Little Hive UDFs: Part 2, and Three Little Hive UDFs: Part 3 on
the Oracle website.

Creating and using UDFs with Pig


You can create different types of UDF for use with Pig. The most common is an evaluation function that
extends the class EvalFunc. The function must accept an input of type Tuple, and return the required
result (which can be null). The following outline shows a simple example.
Java
package your-udf-package;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class your-udf-name extends EvalFunc<String>
{
    public String exec(Tuple input) {
        // Implementation here.
        // Return the result.
    }
}

After you create and compile the UDF, you upload it to HDInsight at the start of the Pig session with the
following command.
Pig command
REGISTER '/path/your-udf-name.jar';

Alternatively, you can upload the UDF to a shared library folder when you create the cluster, as
described in Automating cluster management with PowerShell. The REGISTER command at the start of a
Pig script will then make the UDF available and you can use it in your Pig queries and transformations.
For example, if the UDF returns the lower-cased equivalent of the input string, you can use it as shown
in this query to generate a list of lower-cased equivalents of the text strings in the first column of the
input data.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FOREACH A GENERATE your-udf-name.function-name(column1);
DUMP B;

A second type of UDF in Pig is a filter function that you can use to filter data. A filter function must
extend the class FilterFunc, accept one or more values as Tuple instances, and return a Boolean value.
The UDF can then be used to filter rows based on values in a specified column of the dataset. For
example, if a UDF named IsShortString returns true for any input value less than five characters in
length, you could use the following script to remove any rows where the first column has a value less
than five characters.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FOREACH A GENERATE your-udf-name.function-name(column1) AS column1;
C = FILTER B BY NOT IsShortString(column1);
DUMP C;

For more information about creating and using UDFs in Pig, see the Pig UDF Manual on the Apache Pig
website.

Unifying and stabilizing jobs with HCatalog


HCatalog makes it easier to create complex, multi-step data processing solutions that enable you to
operate on the same data by using Hive, Pig, or custom map/reduce code without having to handle
storage details in each script. HCatalog does not change the way scripts and queries work. It just abstracts
the details of the data file location and the schema so that your code becomes less fragile because it has
fewer dependencies, and the solution is also much easier to administer.
An overview of the use of HCatalog as a storage abstraction layer is shown in Figure 1.

Figure 1 - Unifying different processing mechanisms with HCatalog


To understand the benefits of using HCatalog, consider a typical scenario shown in Figure 2.

Figure 2 - An example solution that benefits from using HCatalog

In the example, Hive scripts create the metadata definition for two tables that have different schemas.
The table named mydata is created over some source data uploaded as a file to HDInsight, and this
defines a Hive location for the table data (step 1 in Figure 2). Next, a Pig script reads the data defined in
the mydata table, summarizes it, and stores it back in the second table named mysummary (steps 2 and
3). However, in reality, the Pig script does not access the Hive tables (which are just metadata
definitions). It must access the source data file in storage, and write the summarized result back into
storage, as shown by the dotted arrow in Figure 2.
In Hive, the path or location of these two files is (by default) denoted by the name used when the tables
were created. For example, the following HiveQL shows the definition of the two Hive tables in this
scenario, and the code that loads a data file into the mydata table.
HiveQL
CREATE TABLE mydata (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/mydata/data.txt' INTO TABLE mydata;
CREATE TABLE mysummary (col1 STRING, col2 BIGINT);

Hive scripts can now access the data simply by using the table names mydata and mysummary. Users
can use HiveQL to query the Hive table without needing to know anything about the underlying file
location or the data format of that file.
However, the Pig script that will group and aggregate the data, and store the results in the mysummary
table, must know both the location and the data format of the files. Without HCatalog, the script must
specify the full path to the mydata table source file, and be aware of the source schema in order to
apply an appropriate schema (which must be defined in the script). In addition, after the processing is
complete, the Pig script must specify the location associated with the mysummary table when storing
the result back in storage, as shown in the following code sample.
Pig Latin
A = LOAD '/mydata/data.txt'
USING PigStorage('\t') AS (col1, col2:long);
...
...
...
STORE X INTO '/mysummary/data.txt';

The file locations, the source format, and the schema are hard-coded in the script, creating some
potentially problematic dependencies. For example, if an administrator moves the data files, or changes
the format by adding a column, the Pig script will fail.
Using HCatalog removes these dependencies by enabling Pig to use the Hive metadata that defines the
tables. To use HCatalog with Pig you must specify the -useHCatalog parameter, and the path to the
HCatalog installation files must be registered as an environment variable named HCAT_HOME. For
example, you could use the following Hadoop command line statements to launch the Grunt interface
with HCatalog enabled.
Command Line to start Pig
SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
Pig -useHCatalog

With the HCatalog support loaded you can now use the HCatLoader and HCatStorer objects in the Pig
script, enabling you to access the data through the Hive metadata instead of requiring direct access to
the data file storage.
Pig Latin
A = LOAD 'mydata'
USING org.apache.hcatalog.pig.HCatLoader();
...
...
...
STORE X INTO 'mysummary'
USING org.apache.hcatalog.pig.HCatStorer();

The script stores the summarized data in the location denoted for the mysummary table defined in
Hive, and so it can be queried using HiveQL as shown in the following example.
HiveQL
SELECT * FROM mysummary;

HCatalog also exposes notification events that can be used by other tools, such as Oozie, to detect when
certain storage events occur.
For more information see HCatalog in the Apache Hive Confluence Spaces.

Configuring and debugging solutions


HDInsight clusters are automatically configured when they are created, and you should not attempt to
change the configuration of a cluster itself by editing the cluster configuration files. You can set a range
of properties that you require for the cluster when you create it, but in general you will set specific
properties for individual jobs at runtime. You may also need to debug your queries and transformations
if they fail or do not produce the results you expect.
Runtime job configuration
You may need to change the cluster configuration properties for a job, such as a query or
transformation, before you execute it. HDInsight allows configuration settings (or configurable
properties) to be specified at runtime for individual jobs.
As an example, in a Hive query you can use the SET statement to set the value of a property. The
following statements at the start of a query set the compression options for that job.
HiveQL
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.exec.compress.intermediate=true;

Alternatively, when executing a Hadoop command, you can use the -D parameter to set property values,
as shown here.
Command Line
hadoop [COMMAND] -D property=value

The space between the -D and the property name can be omitted. For more details of the options you
can use in a job command, see the Apache Hadoop Commands Manual.
For information about how you can set the configuration of a cluster for all jobs, rather than configuring
each job at runtime, see Custom cluster management clients.
Debugging and testing
Debugging and testing a Hadoop-based solution is more difficult than debugging and testing a typical local application.
Executable applications that run on the local development machine, and web applications that run on a
local development web server such as the one built into Visual Studio, can easily be run in debug mode
within the integrated development environment. Developers use this technique to step through the
code as it executes, view variable values and call stacks, monitor procedure calls, and much more.
None of these functions are available when running code in a remote cluster. However, there are some
debugging techniques you can apply. This section contains information that will help you to understand
how to go about debugging and testing your solutions:

Writing out significant values or messages during execution. You can add extra statements or
instructions to your scripts or components to display the values of variables, export datasets,
write messages, or increment counters at significant points during the execution.

Obtaining debugging information from log files. You can monitor log files and standard error
files for evidence that will help you locate failures or problems.

Using a single-node local cluster for testing and debugging. Running the solution in a local or
remote single-node cluster can help to isolate issues with parallel execution of mappers and
reducers.

Hadoop jobs may fail for reasons other than an error in the scripts or code. The two primary reasons are
timeouts and unhandled errors due to bad input data. By default, Hadoop will abandon a job if it does
not report its status or perform I/O activity every ten minutes. Typically most jobs will do this
automatically, but some processor-intensive tasks may take longer.
If a job fails due to bad input data that the map and reduce components cannot handle, you can instruct
Hadoop to skip bad records. While this may affect the validity of the output, skipping small volumes of
the input data may be acceptable. For more information, see the section Skipping Bad Records in the
MapReduce Tutorial on the Apache website.

Writing out significant values or messages during execution


A traditional debugging technique is to simply output values as a program executes to indicate progress.
It allows developers to check that the program is executing as expected, and helps to isolate errors in
the code. You can use this technique in HDInsight in several ways:

If you are using Hive you might be able to split a complex script into separate simpler scripts,
and display the intermediate datasets to help locate the source of the error.

If you are using a Pig script you can:

Dump messages and/or intermediate datasets generated by the script to disk before
executing the next command. This can indicate where the error occurs, and provide a
sample of the data for you to examine.

Call methods of the EvalFunc class that is the base class for most evaluation functions in
Pig. You can use this approach to generate heartbeat and progress messages that prevent
timeouts during execution, and to write to the standard log file. See Class EvalFunc<T> on
the Pig website for more information.

If you are using custom map and reduce components you can write debugging messages to the
standard output file from the mapper and then aggregate them in the reducer, or generate
status messages from within the components. See the Reporter class on the Apache website for
more information.

If the problem arises only occasionally, or on only one node, it may be due to bad input data. If
skipping bad records is not appropriate, add code to your mapper class that validates the input
data and reports any errors encountered when attempting to parse or manipulate it. Write the
details and an extract of the data that caused the problem to the standard error file.

You can kill a task while it is operating and view the call stack and other useful information such
as deadlocks. To kill a job, execute the command kill -QUIT [job_id]. The job ID can be found in
the Hadoop YARN Status portal. The debugging information is written to the standard output
(stdout) file.
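The validation approach in this list can be sketched for a streaming mapper as follows. This Python example assumes tab-delimited input with a text key and an integer value. It writes the details of each bad record to the standard error stream, and also emits a counter update using the reporter:counter line format that Hadoop Streaming recognizes on stderr. The function names and the DataQuality and BadRecords counter names are hypothetical.

```python
import io
import sys

def parse_record(line):
    """Parse a 'col1<TAB>col2' line; raise ValueError if it is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError("expected 2 fields, found %d" % len(fields))
    return fields[0], int(fields[1])   # int() raises ValueError for non-numeric text

def safe_mapper(lines, log=sys.stderr):
    """Yield 'key<TAB>value' for valid lines and report bad lines on the error stream."""
    for number, line in enumerate(lines, start=1):
        try:
            key, value = parse_record(line)
        except ValueError as error:
            # Write the details and an extract of the offending data to the error log.
            log.write("bad record at line %d (%s): %r\n" % (number, error, line[:80]))
            # Counter update for Hadoop Streaming; group and counter names are hypothetical.
            log.write("reporter:counter:DataQuality,BadRecords,1\n")
            continue
        yield "%s\t%d" % (key, value)

# Local demonstration with an in-memory error stream instead of the real stderr.
sample = ["Value1\t1\n", "broken line\n", "Value2\tx\n", "Value3\t2\n"]
errors = io.StringIO()
good = list(safe_mapper(sample, log=errors))
print(good)
print(errors.getvalue(), end="")
```

Counting the bad records, rather than only logging them, makes it easier to judge from the job counters whether the volume of skipped data is small enough to be acceptable.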

For information about the Hadoop YARN Status portal, see Monitoring and logging.
Obtaining debugging information from log files
The core Hadoop engine in HDInsight generates a range of information in log files, counters, and status
messages that is useful for debugging and testing the performance of your solutions. Much of this
information is accessible through the Hadoop YARN Status portal.
The following list contains some suggestions to help you obtain debugging information from HDInsight:

Use the Applications section of the Hadoop YARN Status portal to view the status of jobs. Select
FAILED in the menu to see failed jobs. Select the History link for a job to see more details. In the
details page are menu links to show the job counters, and details of each map and reduce task.
The Task Details page shows the errors, and provides links to the log files and the values of
custom and built-in counters.

View the history, job configuration, syslog, and other log files. The Tools section of the Hadoop
YARN Status portal contains a link that opens the log files folder where you can view the logs,
and also a link to view the current configuration of the cluster. In addition, see Monitoring and
logging in the section Building end-to-end solutions using HDInsight.

Run a debug information script automatically to analyze the contents of the standard error,
standard output, job configuration, and syslog files. For more information about running debug
scripts, see How to Debug Map/Reduce Programs and Debugging in the MapReduce Tutorial on
the Apache website.

Using a single-node local cluster for testing and debugging


Performing runtime debugging and single-stepping through the code in map and reduce components is
not possible within the Hadoop environment. If you want to perform this type of debugging, or run unit
tests on components, you can create an application that executes the components locally, outside of
Hadoop, in a development environment that supports debugging. You can then use a mocking
framework and test runner to perform unit tests.
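As a rough sketch of this approach, the following example runs a mapper and a reducer locally in a single process and tests them with a standard test runner. It assumes the components are plain Python functions (as they might be for Hadoop Streaming); the word-count logic and the names map_line, reduce_key, and run_locally are hypothetical and exist only to show the pattern. The sort-and-group step stands in for the shuffle that Hadoop performs between the map and reduce phases.

```python
import itertools
import unittest


def map_line(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_key(key, values):
    """Hypothetical reducer: sum the counts for one key."""
    return key, sum(values)


def run_locally(lines):
    """Emulate map -> sort/group by key -> reduce in a single process."""
    pairs = [pair for line in lines for pair in map_line(line)]
    pairs.sort(key=lambda kv: kv[0])  # reducers receive keys in sorted order
    grouped = itertools.groupby(pairs, key=lambda kv: kv[0])
    return [reduce_key(key, (v for _, v in group)) for key, group in grouped]


class WordCountTests(unittest.TestCase):
    def test_map_line_lowercases_and_splits(self):
        self.assertEqual(list(map_line("Big DATA big")),
                         [("big", 1), ("data", 1), ("big", 1)])

    def test_pipeline_counts_words(self):
        self.assertEqual(run_locally(["big data", "Big cluster"]),
                         [("big", 2), ("cluster", 1), ("data", 1)])
```

Saved as a module, the tests can be run with python -m unittest, and the functions can be single-stepped in any debugger because nothing here depends on a running cluster.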
Sometimes errors can occur due to the parallel execution of multiple map and reduce components on a
multi-node cluster. Consider emulating distributed testing by running multiple jobs on a single node
cluster at the same time to detect errors, then expand this approach to run multiple jobs concurrently
on clusters containing more than one node in order to help isolate the issue.
You can create a single-node HDInsight cluster in Azure by specifying the advanced option when creating
the cluster. Alternatively you can install a single-node development environment on your local computer
and execute the solution there. A single-node local development environment for Hadoop-based
solutions that is useful for initial development, proof of concept, and testing is available from
Hortonworks. For more details, see Hortonworks Sandbox.
By using a single-node local cluster you can rerun failed jobs and adjust the input data, or use smaller
datasets, to help you isolate the problem. To rerun the job go to the \taskTracker\task-id\work folder
and execute the command % hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml. This runs
the failed task in a single Java virtual machine over the same input data.
You can also turn on debugging in the Java virtual machine to monitor execution. More details of
configuring the parameters of a Java virtual machine can be found on several websites, including Java
virtual machine settings on the IBM website.

Workflow and job orchestration


Many data processing solutions require the coordinated processing of multiple jobs, often with a
conditional workflow. For example, consider a solution in which data files containing web server log
entries are uploaded each day, and must be parsed to load the data they contain into a Hive table. A
workflow to accomplish this might consist of the following steps:
1. Insert data from the files located in the /uploads folder into the Hive table.
2. Delete the source files, which are no longer required.
This workflow is relatively simple, but could become more complex when other required tasks are
added. For example:
1. If there are no files in the /uploads folder, go to step 5.
2. Insert data from the files into the Hive table.
3. Delete the source files, which are no longer required.
4. Send an email message to an operator indicating success, and stop.
5. Send an email message to an operator indicating failure.
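The control flow of this second workflow can be sketched in ordinary code. The example below is only an illustration of the branching, assuming a custom script drives the workflow; the function run_daily_load and its four step parameters are hypothetical, and each step is passed in as a function so that the logic can be exercised without a cluster or a mail server.

```python
def run_daily_load(list_uploads, load_into_hive, delete_files, send_email):
    """Run the five-step workflow; each step is a caller-supplied function."""
    files = list_uploads()
    if not files:
        # Step 1: no files found, so go straight to step 5.
        send_email("failure: no files found in /uploads")
        return False
    load_into_hive(files)                                   # step 2
    delete_files(files)                                     # step 3
    send_email("success: loaded %d file(s)" % len(files))   # step 4
    return True


# Demonstration with stand-in steps that only record what was called.
log = []
ok = run_daily_load(
    list_uploads=lambda: ["/uploads/log1.txt", "/uploads/log2.txt"],
    load_into_hive=lambda files: log.append(("load", len(files))),
    delete_files=lambda files: log.append(("delete", len(files))),
    send_email=lambda message: log.append(("email", message)),
)
print(ok, log)
```

Even this small example shows why a workflow tool becomes attractive as the number of conditions and steps grows.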
Implementing these kinds of workflows is possible in a range of ways. For example, you could:

Use the Oozie framework that is installed with HDInsight, and PowerShell or the Oozie client in
the HDInsight .NET SDK to execute it. This is a good option when:

You are familiar with the syntax and usage of Oozie.

You want to execute workflows from within a program running on a client computer.

You are familiar with .NET and prepared to write programs that use the .NET Framework.
For more information see Use Oozie with HDInsight. If you are not familiar with Oozie, see the
next section, "Creating workflows with Oozie," for an overview of how it can be used. A
demonstration of using an Oozie workflow can also be found in the topic Scenario 3: ETL
automation.

Use SQL Server Integration Services (SSIS) or a similar integration framework. This is a good
option when:

You have SQL Server installed and are experienced with writing SSIS workflows.

You want to take advantage of the powerful capabilities of SSIS workflows.


The process for creating SSIS workflows is described in more detail in Scenario 4: BI
integration.

Use the Cascading abstraction layer software. This is a good choice when:

You want to execute complex data processing workflows written in any language that runs
on the Java virtual machine.

You have complex multi-level workflows that you need to combine into a single task.

You want to control the execution of the map and reduce phases of jobs directly in code.

Create a custom application or script that executes the tasks as a workflow. This is a good
option when:

You need a fairly simple workflow that can be expressed using your chosen programming or
scripting language.

You want to run scripts on a schedule, perhaps driven by Windows Scheduled Tasks.

You are prepared to use a Remote Desktop connection to communicate with the cluster to
administer the processes.

Third-party workflow frameworks such as Hamake or Azkaban are also available and are a good option
when you are familiar with these tools, or if they offer a capability you need that is not available in other
tools. However, they are not currently supported on HDInsight.
More information
Oozie workflows can be executed using the Oozie time-based coordinator, or by using the classes in the
HDInsight SDK. The topic Use Oozie with HDInsight on the Azure website describes how you can use
Oozie, and the topic Use time-based Oozie Coordinator with HDInsight extends this to show time-based
coordination of a workflow.
For information about automating an entire solution see Building end-to-end solutions using HDInsight.

Creating workflows with Oozie


Oozie is the most commonly used mechanism for workflow development in Hadoop. It is a tool that
enables you to create repeatable, dynamic workflows for tasks to be performed in an HDInsight cluster.
The tasks themselves are specified in a control dependency directed acyclic graph (DAG), which is stored in
a Hadoop Process Definition Language (hPDL) file named workflow.xml in a folder in the HDFS file
system on the cluster. For example, the following hPDL document contains a DAG in which a single step
(or action) is defined.
hPDL
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>default</value>
        </property>
        <property>
          <name>oozie.hive.defaults</name>
          <value>hive-default.xml</value>
        </property>
      </configuration>
      <script>script.q</script>
      <param>INPUT_TABLE=HiveSampleTable</param>
      <param>OUTPUT=/results/sampledata</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

The action itself is a Hive job defined in a HiveQL script file named script.q, with two parameters named
INPUT_TABLE and OUTPUT. The code in script.q is shown in the following example.
HiveQL
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE}
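To see what the script looks like once the two parameters from the workflow have been applied, the substitution can be approximated outside Hive. The sketch below uses Python's string.Template, which happens to share the ${NAME} placeholder syntax; it is an approximation for illustration only and is not how Hive or Oozie perform the substitution internally.

```python
from string import Template

# The HiveQL script and the <param> values from the workflow definition.
script = "INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE}"
params = {"INPUT_TABLE": "HiveSampleTable", "OUTPUT": "/results/sampledata"}

# Replace each ${NAME} placeholder with its parameter value.
resolved = Template(script).substitute(params)
print(resolved)
```

Template.substitute raises a KeyError if a placeholder has no matching parameter, which makes a missing param element easy to spot when checking a script this way.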

The script file is stored in the same folder as the workflow.xml hPDL file, along with a standard
configuration file for Hive jobs named hive-default.xml.
A configuration file named job.properties is stored on the local file system of the computer on which
the Oozie client tools are installed. This file, shown in the following example, contains the settings that
will be used to execute the job.
job.properties
nameNode=wasb://my_container@my_asv_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/workflowfiles/

To initiate the workflow, the following command is executed on the computer where the Oozie client
tools are installed.
Command Line
oozie job -oozie http://localhost:11000/oozie/ -config c:\scripts\job.properties run

When Oozie starts the workflow it returns a job ID in the format 0000001-123456789123456-oozie-hdp-W.
You can check the status of a job by opening a Remote Desktop connection to the cluster and using a
web browser to navigate to http://localhost:11000/oozie/v0/job/job-id?show=log.
You can also initiate an Oozie job by using Windows PowerShell or the .NET SDK for HDInsight. For more
details see Initiating an Oozie workflow with PowerShell and Initiating an Oozie workflow from a .NET
application.
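If you prefer to check job status from code rather than a browser, the same Oozie web services URLs can be built and their responses interpreted programmatically. The following is a minimal sketch, not a complete client: the helper names job_url and parse_status are hypothetical, the sample response is a hand-written stand-in containing only the status field (a real response contains many more fields), and no HTTP request is actually sent.

```python
import json

# Workflow job states after which no further progress is expected.
TERMINAL_STATES = {"SUCCEEDED", "KILLED", "FAILED"}


def job_url(base_url, job_id, show="info"):
    """Build the Oozie web services URL for one job ('info' or 'log')."""
    return "%s/job/%s?show=%s" % (base_url.rstrip("/"), job_id, show)


def parse_status(info_json):
    """Return (status, finished) from the JSON returned by show=info."""
    status = json.loads(info_json)["status"]
    return status, status in TERMINAL_STATES


# Stand-in values: "job-id" is the placeholder used in the text above.
url = job_url("http://localhost:11000/oozie/v0", "job-id", "log")
sample_response = '{"id": "job-id", "status": "RUNNING"}'
print(url)
print(parse_status(sample_response))
```

Treating FAILED as a finished state alongside SUCCEEDED and KILLED prevents a polling loop from waiting indefinitely on a workflow that has already failed.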

For more information about Oozie see Apache Oozie Workflow Scheduler for Hadoop. An example of
using Oozie can be found in Scenario 3: ETL automation.

Choosing tools and technologies


Hadoop-based big data systems such as HDInsight enable data processing using a wide range of tools
and technologies, many of which are described earlier in this section of the guide. This topic provides
comparisons between the commonly used tools and technologies to help you choose the most
appropriate for your own scenarios. The following table shows the main advantages and considerations
for each one.
Query mechanism: Hive using HiveQL

Advantages:

An excellent solution for batch processing and analysis of large amounts of immutable data, for data
summarization, and for ad hoc querying.

It uses a familiar SQL-like syntax.

It can be used to produce persistent tables of data that can be easily partitioned and indexed.

Multiple external tables and views can be created over the same data.

It supports a simple data warehouse implementation that provides massive scale out and fault
tolerance capabilities for data storage and processing.

Considerations:

It requires the source data to have at least some identifiable structure.

It is not suitable for real-time queries and row level updates. It is best used for batch jobs over
large sets of data.

It might not be able to carry out some types of complex processing tasks.

Query mechanism: Pig using Pig Latin

Advantages:

An excellent solution for manipulating data as sets, merging and filtering datasets, applying
functions to records or groups of records, and for restructuring data by defining columns, by
grouping values, or by converting columns to rows.

It can use a workflow-based approach as a sequence of operations on data.

Considerations:

SQL users may find Pig Latin is less familiar and more difficult to use than HiveQL.

The default output is usually a text file and so it is more difficult to use with visualization
tools such as Excel. Typically you will layer a Hive table over the output.

Query mechanism: Custom map/reduce

Advantages:

It provides full control over the map and reduce phases and execution.

It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize
the load on the servers and the network.

The components can be written in a range of widely known languages that most developers are likely
to be familiar with.

Considerations:

It is more difficult than using Pig or Hive because you must create your own map and reduce
components.

Processes that require the joining of sets of data are more difficult to implement.

Even though there are test frameworks available, debugging code is more complex than a normal
application because they run as a batch job under the control of the Hadoop job scheduler.

Query mechanism: HCatalog

Advantages:

It abstracts the path details of storage, making administration easier and removing the need for
users to know where the data is stored.

It enables notification of events such as data availability, allowing other tools such as Oozie to
detect when operations have occurred.

It exposes a relational view of data, including partitioning by key, and makes the data easy to
access.

Considerations:

It supports RCFile, CSV text, JSON text, SequenceFile and ORC file formats by default, but you may
need to write a custom SerDe if you use other formats.

HCatalog is not thread-safe.

There are some restrictions on the data types for columns when using the HCatalog loader in Pig
scripts. See HCatLoader Data Types in the Apache HCatalog documentation for more details.

Query mechanism: Mahout

Advantages:

It can make it easier and quicker to build intelligent applications that require data mining and
data classification.

It contains pre-created algorithms for many common machine learning and data mining scenarios.

It can be scaled across distributed nodes to maximize performance.

Considerations:

Accurate results are usually obtained only when there are large pre-categorized sets of reference
data.

Learning from scratch, such as in a related products (collaborative filtering) scenario where
reference data is gradually accumulated, can take some time to provide accurate results.

As a framework, it often requires writing considerable supporting code to use it within a solution.

Query mechanism: Storm

Advantages:

It provides an easy way to implement highly scalable, fault-tolerant, and reliable real-time
processing for streaming data.

It makes it easier to build complex parallel processing topologies.

It is useful for monitoring and raising alerts in real time while storing incoming stream data for
analysis later.

It supports a high-level processing language called Trident that implements an intuitive fluent
interface.

Considerations:

Enabling the full set of features for guaranteed processing can impose a performance penalty.

Typically, you will use the simplest of these approaches that can provide the results you require. For
example, it may be that you can achieve these results by using just Hive, but for more complex scenarios
you may need to use Pig or even write your own map and reduce components. You may also decide,
after experimenting with Hive or Pig, that custom map and reduce components can provide better
performance by allowing you to fine tune and optimize the processing.
The following table shows some of the more general suggestions that will help you make the
appropriate choice of query technology depending on the requirements of your task.
Requirement | Appropriate technologies
Table or dataset joins, or manipulating nested data. | Pig and Hive.
Ad hoc data analysis. | Pig, map/reduce (including Hadoop Streaming for non-Java components).
SQL-like data analysis and data warehousing. | Hive.
Working with binary data or SequenceFiles. | Hive with Avro, Java map/reduce components.
Working with existing Java or map/reduce libraries. | Java map/reduce components, UDFs in Hive and Pig.
Maximum performance for large or recurring jobs. | Well-designed Java map/reduce components.
Using scripting or non-Java languages. | Hadoop Streaming.
Abstracting storage paths for Hive and Pig to simplify administration. | HCatalog.
Pre-processing or fully processing streaming data in real time. | Storm.
Performing classification and data mining through collaborative filtering and machine learning. | Mahout.

For more information about these tools and technologies see Data processing tools and techniques.

Building custom clients


In most cases the execution of big data processing jobs forms part of a larger business process or BI
solution. While you can execute jobs interactively using the Hadoop command line interface in a remote
desktop connection to the cluster, it is common to incorporate the logic to submit and coordinate
individual jobs and Oozie workflows from client applications. This may be in the form of dedicated job
management utilities and scripts, or as complete applications that fully integrate big data processing
into business processes.
On the Windows platform you can choose from a variety of technologies and APIs for submitting jobs to
an HDInsight cluster. The primary APIs are provided through the Azure and HDInsight modules for
Windows PowerShell, and the .NET SDK for HDInsight.

Building custom clients with Windows PowerShell scripts


PowerShell is a good choice for automating HDInsight jobs when an individual data analyst wants to
experiment interactively with data, or when you want to automate big data processing through the use
of command line scripts that can be scheduled to run when required.
You can run PowerShell scripts interactively in a Windows command line window or in a PowerShell-specific command line console. Additionally, you can edit and run PowerShell scripts in the Windows
PowerShell Interactive Scripting Environment (ISE), which provides IntelliSense and other user interface
enhancements that make it easier to write PowerShell code. You can schedule the execution of
PowerShell scripts using Windows Scheduler, SQL Server Agent, or other tools as described in Building
end-to-end solutions using HDInsight.

Before you use PowerShell to work with HDInsight you must configure the PowerShell environment to
connect to your Azure subscription. To do this you must first download and install the Azure PowerShell
module, which is available through the Microsoft Web Platform Installer. For more details see How to
install and configure Azure PowerShell.
The following examples demonstrate some common scenarios for submitting and running jobs using
PowerShell:

Running a map/reduce job with Windows PowerShell

Running Pig and Hive jobs with Windows PowerShell

Initiating an Oozie workflow with PowerShell

Building custom clients with the .NET Framework


The .NET SDK for HDInsight provides classes that enable developers to submit jobs to HDInsight from
applications built on the .NET Framework. It is a good choice when you need to build custom data
processing applications, or integrate HDInsight data processing into existing business applications that
are based on the .NET Framework.
Many of the techniques used to initiate jobs from a .NET application require the use of an Azure
management certificate to authenticate the request. To do this you can:

Use the makecert command in a Visual Studio command line to create a certificate and upload
it to your subscription in the Azure management portal as described in Create and Upload a
Management Certificate for Azure.

Use the Get-AzurePublishSettingsFile and Import-AzurePublishSettingsFile Windows PowerShell
cmdlets to generate, download, and install a new certificate from your Azure subscription as
described in the section How to: Connect to your subscription of the topic How to install and
configure Azure PowerShell. If you want to use the same certificate on more than one client
computer you can copy the Azure publishsettings file to each one and use the
Import-AzurePublishSettingsFile cmdlet to import it.

After you have created and installed your certificate, it will be stored in the Personal certificate store on
your computer. You can view the details in the certmgr.msc console.
The following examples demonstrate some common scenarios for submitting and running jobs using
.NET Framework code:

Submitting a map/reduce job from a .NET application

Submitting Pig and Hive jobs from a .NET application

Initiating an Oozie workflow from a .NET application

Running a map/reduce job with Windows PowerShell


To submit a map/reduce job that uses a Java .jar file to process data you can use the
New-AzureHDInsightMapReduceJobDefinition cmdlet to define the job and its parameters, and then
initiate the job by using the Start-AzureHDInsightJob cmdlet. The job is run asynchronously. If you
want to show the output generated by the job you must wait for it to complete by using the
Wait-AzureHDInsightJob cmdlet with a suitable timeout value, and then display the job output with
the Get-AzureHDInsightJobOutput cmdlet.
Note that the job output in this context does not refer to the data files generated by the job, but to the
status and outcome messages generated while the job is in progress. This is the same output as
displayed in the console window when the job is executed interactively at the command line.
The following code example shows a PowerShell script that uses the mymapreduceclass class in
mymapreducecode.jar with arguments to indicate the location of the data to be processed and the
folder where the output files should be stored. The script waits up to 3600 seconds for the job to
complete, and then displays the output that was generated by the job.
Windows PowerShell
$clusterName = "cluster-name"
$jobDef = New-AzureHDInsightMapReduceJobDefinition
  -JarFile "wasb:///mydata/jars/mymapreducecode.jar"
  -ClassName "mymapreduceclass"
  -Arguments "wasb:///mydata/source", "wasb:///mydata/output"
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Map/Reduce job submitted..."
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError

Due to page width limitations we have broken some of the commands in the code above across several
lines for clarity. In your code each command must be on a single, unbroken line.
The Azure PowerShell module also provides the New-AzureHDInsightStreamingMapReduceJobDefinition
cmdlet, which you can use to execute map/reduce jobs that are implemented in .NET assemblies and
that use the Hadoop Streaming API. This cmdlet enables you to specify discrete .NET executables for
the mapper and reducer to be used in the job.

Running Pig and Hive jobs with Windows PowerShell


You can submit Pig and Hive jobs in a Windows PowerShell script by using the
New-AzureHDInsightPigJobDefinition and New-AzureHDInsightHiveJobDefinition cmdlets. After defining
the job you can initiate it with the Start-AzureHDInsightJob cmdlet, wait for it to complete with
the Wait-AzureHDInsightJob cmdlet, and retrieve the completion status with the
Get-AzureHDInsightJobOutput cmdlet.
The following code example shows a PowerShell script that executes a Hive job based on hard-coded
HiveQL code in the PowerShell script. A Query parameter is used to specify the HiveQL code to be
executed, and in this example some of the code is generated dynamically based on a PowerShell
variable.
Windows PowerShell
$clusterName = "cluster-name"
$tableFolder = "/data/mytable"
$hiveQL = "CREATE TABLE mytable"
$hiveQL += " (id INT, val STRING)"
$hiveQL += " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
$hiveQL += " STORED AS TEXTFILE LOCATION '$tableFolder';"
$jobDef = New-AzureHDInsightHiveJobDefinition -Query $hiveQL
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "HiveQL job submitted..."
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError

As an alternative to hard-coding HiveQL or Pig Latin code in a PowerShell script, you can use the File
parameter to reference a file in Azure storage that contains the Pig Latin or HiveQL code to be executed.
In the following code example a PowerShell script uploads a Pig Latin code file that is stored locally in
the same folder as the PowerShell script, and then uses it to execute a Pig job.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
# Find the folder where this PowerShell script is saved
$localfolder = Split-Path -parent $MyInvocation.MyCommand.Definition
$destfolder = "data/scripts"
$scriptFile = "ProcessData.pig"
# Upload Pig Latin script to Azure Storage
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$scriptFile"
$filename = "$localfolder\$scriptFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
write-host "$scriptFile uploaded to $containerName!"
# Run the Pig Latin script
$jobDef = New-AzureHDInsightPigJobDefinition -File "wasb:///$destfolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Pig job submitted..."
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError

In addition to the New-AzureHDInsightHiveJobDefinition cmdlet, you can execute HiveQL commands
using the Invoke-AzureHDInsightHiveJob cmdlet (which can be abbreviated to Invoke-Hive). Generally,
when the purpose of the script is simply to retrieve and display the results of a Hive SELECT query, the
Invoke-Hive cmdlet is the preferred option because it requires significantly less code. For more details
about using Invoke-Hive, see Querying Hive tables with Windows PowerShell.

Initiating an Oozie workflow with PowerShell


The Azure PowerShell module does not provide a cmdlet specifically for initiating Oozie jobs. However,
you can use the Invoke-RestMethod cmdlet to submit the job as an HTTP request to the Oozie
application on the HDInsight cluster. This technique requires that your PowerShell code constructs the
appropriate XML job configuration body for the request, and you will need to decipher the JSON
responses returned by the cluster to determine job status and results.
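The XML job configuration body mentioned above is simply a list of name and value pairs wrapped in a configuration element. As a language-neutral illustration of how such a body can be generated rather than written by hand, the following sketch builds it from a dictionary. The function name build_oozie_config is hypothetical, and the property values are the placeholders used elsewhere in this topic; nothing is sent to a cluster.

```python
import xml.etree.ElementTree as ET


def build_oozie_config(properties):
    """Serialize a dict of settings as a <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # ElementTree escapes special characters such as & and < in the values.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


payload = build_oozie_config({
    "nameNode": "wasb://container-name@storage-account-name.blob.core.windows.net",
    "jobTracker": "jobtrackerhost:9010",
    "oozie.use.system.libpath": "true",
    "oozie.wf.application.path": "/mydata/oozieworkflow",
    "user.name": "user-name",
})
print(payload)
```

Generating the body this way avoids malformed XML when a parameter value contains characters that need escaping.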
The following code example shows a PowerShell script that uploads the files used by an Oozie workflow,
constructs an Oozie job request to start the workflow, submits the request to a cluster, and displays
status information as the workflow progresses. The example is deliberately kept simple by including the
credentials in the script so that you can copy and paste the code while you are experimenting with
HDInsight. In a production system you must protect credentials, as described in Securing credentials in
scripts and applications in the Security section of this guide.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$clusterUser = "user-name"
$clusterPwd = "password"
$passwd = ConvertTo-SecureString $clusterPwd -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential($clusterUser, $passwd)
$storageUri = "wasb://$containerName@$storageAccountName.blob.core.windows.net"
$ooziePath = "/mydata/oozieworkflow"
$tableName = "MyTable"
$tableFolder = "/mydata/MyTable"
# Find the local folder containing the workflow files.
$thisfolder = Split-Path -parent $MyInvocation.MyCommand.Definition
$localfolder = "$thisfolder\oozieworkflow"
# Upload workflow files.
$destfolder = "mydata/oozieworkflow"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach($file in $files)
{
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"
# Oozie job properties.
$OoziePayload = @"
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>nameNode</name>
    <value>$storageUri</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>jobtrackerhost:9010</value>
  </property>
  <property>
    <name>queueName</name>
    <value>default</value>
  </property>
  <property>
    <name>oozie.use.system.libpath</name>
    <value>true</value>
  </property>
  <property>
    <name>TableName</name>
    <value>$tableName</value>
  </property>
  <property>
    <name>TableFolder</name>
    <value>$tableFolder</value>
  </property>
  <property>
    <name>user.name</name>
    <value>$clusterUser</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>$ooziePath</value>
  </property>
</configuration>
"@
# Create Oozie job.
$clusterUriCreateJob = "https://$clusterName.azurehdinsight.net:443/oozie/v2/jobs"
$response = Invoke-RestMethod -Method Post
  -Uri $clusterUriCreateJob
  -Credential $creds
  -Body $OoziePayload
  -ContentType "application/xml" -OutVariable $OozieJobName
$jsonResponse = ConvertFrom-Json(ConvertTo-Json -InputObject $response)
$oozieJobId = $jsonResponse[0].("id")
Write-Host "Oozie job id is $oozieJobId..."
# Start Oozie job.
Write-Host "Starting the Oozie job $oozieJobId..."
$clusterUriStartJob = "https://$clusterName.azurehdinsight.net:443/oozie/v2/job/" +
  $oozieJobId + "?action=start"
$response = Invoke-RestMethod -Method Put -Uri $clusterUriStartJob -Credential $creds
  | Format-Table -HideTableHeaders
# Get job status.
Write-Host "Waiting until the job metadata is populated in the Oozie metastore..."
Start-Sleep -Seconds 10
Write-Host "Getting job status and waiting for the job to complete..."
$clusterUriGetJobStatus = "https://$clusterName.azurehdinsight.net:443/oozie/v2/job/"
  + $oozieJobId + "?show=info"
$response = Invoke-RestMethod -Method Get -Uri $clusterUriGetJobStatus -Credential $creds
$jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
$JobStatus = $jsonResponse[0].("status")
# FAILED is included so that the loop also ends if the workflow fails.
while($JobStatus -notmatch "SUCCEEDED|KILLED|FAILED")
{
  Write-Host "$(Get-Date -format 'G'): $oozieJobId is $JobStatus ...waiting for the job to complete..."
  Start-Sleep -Seconds 10
  $response = Invoke-RestMethod -Method Get -Uri $clusterUriGetJobStatus -Credential $creds
  $jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
  $JobStatus = $jsonResponse[0].("status")
}
Write-Host "$(Get-Date -format 'G'): $oozieJobId $JobStatus!"

Due to page width limitations we have broken some of the commands in the code above across several
lines for clarity. In your code each command must be on a single, unbroken line.
The request to submit the Oozie job consists of an XML configuration document that contains the
properties to be used by the workflow. These properties include configuration settings for Oozie, and
any parameters that are required by actions defined in the Oozie job. In this example the properties
include parameters named TableName and TableFolder that are used by the following action in the
workflow.xml file uploaded in the oozieworkflow folder.
Partial Oozie Workflow XML
<action name='CreateTable'>
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>default</value>
      </property>
      <property>
        <name>oozie.hive.defaults</name>
        <value>hive-default.xml</value>
      </property>
    </configuration>
    <script>CreateTable.q</script>
    <param>TABLE_NAME=${TableName}</param>
    <param>LOCATION=${TableFolder}</param>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>

This action passes the parameter values to the CreateTable.q file, also in the oozieworkflow folder,
which is shown in the following code example.
HiveQL
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME} (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${LOCATION}';
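
The XML configuration document that carries the workflow properties uses the standard Hadoop layout of name and value pairs inside a configuration element. As a minimal sketch of how such a document could be assembled, the following Python code builds one from a dictionary. The function name build_oozie_configuration is hypothetical, and every value shown is a placeholder rather than a setting taken from a real cluster.

```python
import xml.etree.ElementTree as ET


def build_oozie_configuration(properties):
    """Return a Hadoop-style <configuration> XML string for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; the storage, host, and path names are placeholders.
config_xml = build_oozie_configuration({
    "nameNode": "wasb://container-name@storage-account-name.blob.core.windows.net",
    "jobTracker": "jobtrackerhost:9010",
    "oozie.wf.application.path": "/data/oozieworkflow/",
    "TableName": "MyTable",
    "TableFolder": "/Data/MyTable",
})
print(config_xml)
```

Building the document with an XML library rather than string concatenation means that special characters in property names or values are escaped automatically.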

Submitting a map/reduce job from a .NET application


You can use the classes in the Microsoft Azure HDInsight NuGet package to submit map/reduce jobs to
an HDInsight cluster from a .NET application. After adding this package you can use the
MapReduceJobCreateParameters class to define a map/reduce job with a specified .jar file path and
class name. You can then add any required arguments, such as paths for the source data and output
directory.
Next you must add code to load your Azure management certificate and use it to create a credential for
the HDInsight cluster. You use this credential with the JobSubmissionClientFactory class to connect
to the cluster and create a job submission client object that implements the IJobSubmissionClient
interface, and then use the client object's CreateMapReduceJob method to submit the job you defined
earlier. When you submit a job, its unique job ID is returned.
You can leave the job to run, or write code to await its completion and display progress status by
examining the StatusCode property (a JobStatusCode value) of a JobDetails object retrieved using the
job ID. In the following code example the client application checks the job progress every ten seconds
until the job has completed or failed.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;

namespace MRClient
{
    class Program
    {
        static void Main(string[] args)
        {
            // Azure variables.
            string subscriptionID = "subscription-id";
            string certFriendlyName = "certificate-friendly-name";
            string clusterName = "cluster-name";

            // Define the MapReduce job.
            MapReduceJobCreateParameters mrJobDefinition = new MapReduceJobCreateParameters()
            {
                JarFile = "wasb:///mydata/jars/mymapreducecode.jar",
                ClassName = "mymapreduceclass"
            };
            mrJobDefinition.Arguments.Add("wasb:///mydata/source");
            mrJobDefinition.Arguments.Add("wasb:///mydata/Output");

            // Get the certificate object from certificate store
            // using the friendly name to identify it.
            X509Store store = new X509Store();
            store.Open(OpenFlags.ReadOnly);
            X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
                .First(item => item.FriendlyName == certFriendlyName);
            JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(
                new Guid(subscriptionID), cert, clusterName);

            // Create a Hadoop client to connect to HDInsight.
            var jobClient = JobSubmissionClientFactory.Connect(creds);

            // Run the MapReduce job.
            JobCreationResults mrJobResults = jobClient.CreateMapReduceJob(mrJobDefinition);

            // Wait for the job to complete, checking its status every ten seconds.
            Console.Write("Job running...");
            JobDetails jobInProgress = jobClient.GetJob(mrJobResults.JobId);
            while (jobInProgress.StatusCode != JobStatusCode.Completed
                && jobInProgress.StatusCode != JobStatusCode.Failed)
            {
                Console.Write(".");
                jobInProgress = jobClient.GetJob(jobInProgress.JobId);
                Thread.Sleep(TimeSpan.FromSeconds(10));
            }

            // Job is complete.
            Console.WriteLine("!");
            Console.WriteLine("Job complete!");
            Console.WriteLine("Press a key to end.");
            Console.Read();
        }
    }
}

Notice the variables required to configure the Hadoop client. These include the unique ID of the
subscription in which the cluster is defined (which you can view in the Azure management portal), the
friendly name of the Azure management certificate to be loaded (which you can view in certmgr.msc),
and the name of your HDInsight cluster.

The Microsoft Azure HDInsight NuGet package also includes a
StreamingMapReduceJobCreateParameters class, which you can use to submit a streaming
map/reduce job that uses .NET executable assemblies to implement the mapper and reducer for the job.
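
In a streaming map/reduce job, the mapper and reducer are ordinary executables that receive records as lines of text and emit results as lines of text, with the key and value separated by a tab by default. The following sketch illustrates that contract using a word count. It is written in Python rather than .NET purely to show the data flow; the function names are hypothetical, and the sorted call stands in for the shuffle and sort that Hadoop performs between the two phases.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))


# Simulate the shuffle/sort that Hadoop performs between the two phases.
mapped = sorted(map_lines(["the quick brown fox", "The lazy dog"]))
reduced = list(reduce_lines(mapped))
```

In a real streaming job the two functions would live in separate executables that read from standard input and write to standard output.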

Submitting Pig and Hive jobs from a .NET application


You can define Pig and Hive jobs in a .NET client application by using the HiveJobCreateParameters and
PigJobCreateParameters classes from the Microsoft Azure HDInsight NuGet package. You can then
submit the jobs to an HDInsight cluster by using the CreateHiveJob and CreatePigJob methods of the
IJobSubmissionClient interface, which is implemented by the object returned by the Connect method
of the JobSubmissionClientFactory class.
The following code example shows how to define and submit a Hive job that executes a HiveQL
statement to create a table.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;

namespace HiveClient
{
    class Program
    {
        static void Main(string[] args)
        {
            // Azure variables.
            string subscriptionID = "subscription-id";
            string certFriendlyName = "certificate-friendly-name";
            string clusterName = "cluster-name";
            string hiveQL = @"CREATE TABLE mytable (id INT, val STRING)
                ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                STORED AS TEXTFILE LOCATION '/data/mytable';";

            // Define the Hive job.
            HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
            {
                JobName = "Create Table",
                StatusFolder = "/CreateTableStatus",
                Query = hiveQL
            };

            // Get the certificate object from certificate store
            // using the friendly name to identify it.
            X509Store store = new X509Store();
            store.Open(OpenFlags.ReadOnly);
            X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
                .First(item => item.FriendlyName == certFriendlyName);
            JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(
                new Guid(subscriptionID), cert, clusterName);

            // Create a Hadoop client to connect to HDInsight.
            var jobClient = JobSubmissionClientFactory.Connect(creds);

            // Run the Hive job.
            JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);

            // Wait for the job to complete, checking its status every ten seconds.
            Console.Write("Job running...");
            JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
            while (jobInProgress.StatusCode != JobStatusCode.Completed
                && jobInProgress.StatusCode != JobStatusCode.Failed)
            {
                Console.Write(".");
                jobInProgress = jobClient.GetJob(jobInProgress.JobId);
                Thread.Sleep(TimeSpan.FromSeconds(10));
            }

            // Job is complete.
            Console.WriteLine("!");
            Console.WriteLine("Job complete!");
            Console.WriteLine("Press a key to end.");
            Console.Read();
        }
    }
}

Notice the variables required to configure the Hadoop client. These include the unique ID of the
subscription in which the cluster is defined (which you can view in the Azure management portal), the
friendly name of the Azure management certificate to be loaded (which you can view in certmgr.msc),
and the name of your HDInsight cluster.
In the previous example, the HiveQL command to be executed was specified as the Query property of the
HiveJobCreateParameters object. A similar approach is used to specify the Pig Latin statements to be
executed when using the PigJobCreateParameters class. Alternatively, you can use the File property to
specify a file in Azure storage that contains the HiveQL or Pig Latin code to be executed. The following
code example shows how to submit a Pig job that executes the Pig Latin code in a file that already exists
in Azure storage.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;

namespace PigClient
{
    class Program
    {
        static void Main(string[] args)
        {
            // Azure variables.
            string subscriptionID = "subscription-id";
            string certFriendlyName = "certificate-friendly-name";
            string clusterName = "cluster-name";

            // Define the Pig job.
            PigJobCreateParameters pigJobDefinition = new PigJobCreateParameters()
            {
                StatusFolder = "/PigJobStatus",
                File = "/weather/scripts/SummarizeWeather.pig"
            };

            // Get the certificate object from certificate store
            // using the friendly name to identify it.
            X509Store store = new X509Store();
            store.Open(OpenFlags.ReadOnly);
            X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
                .First(item => item.FriendlyName == certFriendlyName);
            JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(
                new Guid(subscriptionID), cert, clusterName);

            // Create a Hadoop client to connect to HDInsight.
            var jobClient = JobSubmissionClientFactory.Connect(creds);

            // Run the Pig job.
            JobCreationResults jobResults = jobClient.CreatePigJob(pigJobDefinition);

            // Wait for the job to complete, checking its status every ten seconds.
            Console.Write("Job running...");
            JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
            while (jobInProgress.StatusCode != JobStatusCode.Completed
                && jobInProgress.StatusCode != JobStatusCode.Failed)
            {
                Console.Write(".");
                jobInProgress = jobClient.GetJob(jobInProgress.JobId);
                Thread.Sleep(TimeSpan.FromSeconds(10));
            }

            // Job is complete.
            Console.WriteLine("!");
            Console.WriteLine("Job complete!");
            Console.WriteLine("Press a key to end.");
            Console.Read();
        }
    }
}

You can combine this approach with any of the data upload techniques described in Uploading data with
the Microsoft .NET Framework to build a client application that uploads source data and the Pig Latin or
HiveQL code files required to process it, and then submits a job to initiate processing.

Initiating an Oozie workflow from a .NET application


When an application needs to perform complex data processing as a sequence of dependent actions,
you can define an Oozie workflow to coordinate the data processing tasks and initiate it from a .NET
client application.
The Microsoft .NET API for Hadoop WebClient NuGet package is part of the .NET SDK for HDInsight, and
includes the OozieHttpClient class. You can use this class to connect to the Oozie application on an
HDInsight cluster and initiate an Oozie workflow job. The following example code shows a .NET client
application that uploads the workflow files required by the Oozie workflow, and then initiates an Oozie
workflow job that uses these files. The example is deliberately kept simple by including the credentials in
the code so that you can copy and paste it while you are experimenting with HDInsight. In a production
system you must protect credentials, as described in Securing credentials in scripts and applications in
the Security section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
using Microsoft.Hadoop.WebClient;
using Microsoft.Hadoop.WebClient.OozieClient;
using Microsoft.Hadoop.WebClient.OozieClient.Contracts;
using Newtonsoft.Json;

namespace OozieClient
{
    class Program
    {
        const string hdInsightUser = "user-name";
        const string hdInsightPassword = "password";
        const string hdInsightCluster = "cluster-name";
        const string azureStore = "storage-account-name";
        const string azureStoreKey = "storage-account-key";
        const string azureStoreContainer = "container-name";
        const string workflowDir = "/data/oozieworkflow/";
        const string inputPath = "/data/source/";
        const string outputPath = "/data/output/";

        static void Main(string[] args)
        {
            try
            {
                UploadWorkflowFiles().Wait();
                CreateAndExecuteOozieJob().Wait();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            finally
            {
                Console.WriteLine("Press a key to end");
                Console.Read();
            }
        }

        private static async Task UploadWorkflowFiles()
        {
            try
            {
                var workflowLocalDir = new DirectoryInfo(@".\oozieworkflow");
                var hdfsClient = new WebHDFSClient(hdInsightUser,
                    new BlobStorageAdapter(azureStore, azureStoreKey, azureStoreContainer, false));
                Console.WriteLine("Uploading workflow files...");

                // Remove any previous copy of the workflow folder, then upload each local file.
                await hdfsClient.DeleteDirectory(workflowDir);
                foreach (var file in workflowLocalDir.GetFiles())
                {
                    await hdfsClient.CreateFile(file.FullName, workflowDir + file.Name);
                }
            }
            catch (Exception)
            {
                Console.WriteLine("An error occurred while uploading files");
                // Rethrow without resetting the stack trace.
                throw;
            }
        }

        private static async Task CreateAndExecuteOozieJob()
        {
            try
            {
                var nameNodeHost = "wasb://" + azureStoreContainer + "@" + azureStore
                    + ".blob.core.windows.net";
                var tableName = "MyTable";
                var tableFolder = "/Data/MyTable";
                var clusterAddress = "https://" + hdInsightCluster + ".azurehdinsight.net";
                var clusterUri = new Uri(clusterAddress);

                // Create an Oozie job and execute it.
                Console.WriteLine("Starting oozie workflow...");
                var client = new OozieHttpClient(clusterUri, hdInsightUser, hdInsightPassword);
                var prop = new OozieJobProperties(
                    hdInsightUser,
                    nameNodeHost,
                    "jobtrackerhost:9010",
                    workflowDir,
                    inputPath,
                    outputPath);
                var parameters = prop.ToDictionary();
                parameters.Add("oozie.use.system.libpath", "true");
                parameters.Add("TableName", tableName);
                parameters.Add("TableFolder", tableFolder);

                // Submit the job, read the job ID from the JSON response, and start the job.
                var newJob = await client.SubmitJob(parameters);
                var content = await newJob.Content.ReadAsStringAsync();
                var serializer = new JsonSerializer();
                dynamic json = serializer.Deserialize(
                    new JsonTextReader(new StringReader(content)));
                string id = json.id;
                await client.StartJob(id);
                Console.WriteLine("Oozie job started");
                Console.WriteLine("View workflow progress at " + clusterAddress
                    + "/oozie/v0/job/" + id + "?show=log");
            }
            catch (Exception)
            {
                Console.WriteLine("An error occurred while initiating the Oozie workflow job");
                // Rethrow without resetting the stack trace.
                throw;
            }
        }
    }
}

Notice that an OozieJobProperties object contains the properties to be used by the workflow. These
properties include configuration settings for Oozie as well as any parameters that are required by
actions defined in the Oozie job. In this example the properties include parameters named TableName
and TableFolder, which are used by the following action in the workflow.xml file that is uploaded in the
oozieworkflow folder.
Partial Oozie Workflow XML
<action name='CreateTable'>
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>default</value>
      </property>
      <property>
        <name>oozie.hive.defaults</name>
        <value>hive-default.xml</value>
      </property>
    </configuration>
    <script>CreateTable.q</script>
    <param>TABLE_NAME=${TableName}</param>
    <param>LOCATION=${TableFolder}</param>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>

This action passes the parameter values to the CreateTable.q file. This file is also in the oozieworkflow
folder, and is shown in the following code example.
HiveQL
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME} (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${LOCATION}';
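
To see what the script looks like once the workflow parameters have been applied, the following Python sketch performs the same ${NAME} substitution locally by using the standard string.Template class. This only illustrates the substitution; it is not how Oozie or Hive implement it. The values are the TableName and TableFolder values used in the C# example above.

```python
from string import Template

HIVE_SCRIPT = Template(
    "DROP TABLE IF EXISTS ${TABLE_NAME};\n"
    "CREATE EXTERNAL TABLE ${TABLE_NAME} (id INT, val STRING)\n"
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '\n"
    "STORED AS TEXTFILE LOCATION '${LOCATION}';"
)

# Values taken from the workflow parameters in the example above.
resolved = HIVE_SCRIPT.substitute(TABLE_NAME="MyTable", LOCATION="/Data/MyTable")
print(resolved)
```

Resolving the script locally in this way can be a quick check that every placeholder in the script has a matching workflow parameter, because substitute raises an error for any name that is not supplied.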

Consuming and visualizing data from HDInsight


This section of the guide describes a variety of ways to consume data from HDInsight for analysis and
reporting. It explores the commonly used tools in the Microsoft data platform that can integrate directly
or indirectly with HDInsight. It also demonstrates how you can create useful and attractive visualizations
of information that you generate using HDInsight. Figure 1 shows a high level view of the options for
consuming and visualizing data from HDInsight.

Figure 1 - Some typical options for consuming data from HDInsight

Figure 1 represents a map that shows the various routes through which data can be delivered from an
HDInsight cluster to client applications. The client application destinations shown on the map include
Excel and SQL Server Reporting Services (SSRS). You can also create custom applications, or use common
enterprise BI data visualization technologies such as PerformancePoint Services in SharePoint Server to
consume data from HDInsight, directly or indirectly, using the same interfaces.
The topics and technologies discussed in this section of the guide are:

Microsoft Excel

SQL Server Reporting Services

SQL Server Analysis Services

SQL Server database

Building custom clients

Security is also a fundamental concern in all computing scenarios, and big data processing is no
exception. Security considerations apply during all stages of a big data process, and include securing
data while in transit over the network, securing data in storage, and authenticating and authorizing
users who have access to the tools and utilities you use as part of your process. For more details of how
you can maximize security of your HDInsight solutions see the topic Security in the section Building
end-to-end solutions using HDInsight.

More information
For more information about HDInsight, see Microsoft Azure HDInsight.
For more information about the tools and add-ins for Excel see Power BI for Office 365.
For more information about the HDInsight .NET SDKs see HDInsight SDK Reference Documentation on
MSDN.

Microsoft Excel
Excel is one of the most widely used data manipulation and visualization applications in the world, and is
commonly used as a tool for interactive data analysis and reporting. It supports comprehensive data
import and connectivity options that include built-in data connectivity to a wide range of data sources,
and the availability of add-ins such as Power Query, Power View, PowerPivot, and Power Map.
Additionally, Power BI for Office 365 provides a cloud-based platform for sharing data and reports in
Excel workbooks across the enterprise.
Excel includes a range of analytical tools and visualizations that you can apply to tables of data in one or
more worksheets within a workbook. All Excel 2013 workbooks encapsulate a data model in which you
can define tables of data and relationships between them. These data models make it easier to slice
and dice data in PivotTables and PivotCharts, and to create Power View visualizations.

Office 2013 and Office 365 ProPlus are available in 32-bit and 64-bit versions. If you plan to use Excel to
build data models and perform analysis of big data processing results, the 64-bit version of Office is
recommended because of its ability to handle larger volumes of data.
Excel is especially useful when you want to add value and insight by augmenting the results of your data
analysis with external data. For example, you may perform an analysis of social media sentiment data by
geographical region in HDInsight, and consume the results in Excel. This geographically oriented data
can be enhanced by subscribing to a demographic dataset in the Azure Marketplace. The socioeconomic and population data may provide an insight into why your organization is more popular in
some locations than in others.
As well as datasets that you can download and use to augment your results, Azure Marketplace includes
a number of data services that you can use for data validation (for example, verifying that telephone
numbers and postal codes are valid) and for data transformation (for example, looking up the country or
region, state, and city for a particular IP address or longitude/latitude value).
The following topics describe the tools and techniques you can use to import and visualize data using
Excel:

Built-in data connectivity

Power Query

PowerPivot

Visualizing data in Excel

Power BI for Microsoft Office 365

Choosing an Excel technology

Built-in data connectivity


Excel includes native support for importing data from a wide range of data sources including web sites,
OData feeds, relational databases, and more. The data connectivity options available on the Data tab of
the ribbon enable business users to specify data source connection information, and select the data to
be imported, using a wizard-style interface. The data is imported into a worksheet, and in Excel 2013 it
can be added to the workbook data model and used for data visualization in a PivotTable report,
PivotChart, Power View report, or Power Map tour (depending on the specific edition of Excel).
The built-in data connectivity capabilities include support for ODBC sources, and are an appropriate
option for consuming data from HDInsight when that data is accessible through Hive tables. To use this
option, the Hive ODBC driver must be installed on the client computer where Excel will be used.
Both 32-bit and 64-bit versions of the driver are available for download from Microsoft Hive ODBC
Driver. You should install the one that matches the version of Windows on the target computer. When
you install the 64-bit version it also installs the 32-bit version, so you will be able to use it to connect to
Hive from both 64-bit and 32-bit applications.
You can simplify the process of connecting to HDInsight by using the Data Sources (ODBC)
administrative tool to create a data source name (DSN) that encapsulates the ODBC connection
information, as shown in Figure 1. Creating a DSN makes it easier for business users with limited
experience of configuring data connections to import data from Hive tables that are defined in
HDInsight. If you set up both 32-bit and 64-bit DSNs using the same name, client applications will
automatically use the appropriate one.

Figure 1 - Creating an ODBC DSN for Hive


After you create a DSN, users can specify it in the data connection wizard in Excel and then select a Hive
table or view to be imported into the workbook, as shown in Figure 2.

Figure 2 - Importing a Hive table into Excel


The Hive tables shown in Figure 2 are used as an example throughout this section of the guide. These
tables contain UK meteorological observation data for 2012, obtained from the UK Met Office Weather
Open Data dataset in Azure Marketplace.
Using the built-in data connectivity capabilities of Excel is a good solution when you need to perform
simple analysis and reporting based on the results of an HDInsight query, and the data can be
encapsulated in Hive tables. You can import the data on any client computer that has the ODBC driver
for Hive installed, and in all editions of Excel.
In Excel 2013 you can add the imported data to the workbook data model and combine it with data from
other sources to create analytical mashups. However, in scenarios where complex data modeling is the
primary goal, using an edition of Excel that supports PowerPivot offers greater flexibility and modeling
functionality. It also makes it easier to create PivotTable Report, PivotChart, Power View, and Power
Map visualizations. These can help you create more meaningful and immersive results.
The following table describes specific considerations for using Excel's built-in data connectivity in the
HDInsight use cases and models described in this guide.

Use case: Iterative data exploration
Considerations: Built-in data connectivity in Excel is a suitable choice when the results of the data
processing can be encapsulated in a Hive table, or a query with simple joins can be
encapsulated in a Hive view, and the volume of data is sufficiently small to support
interactive connectivity with tolerable response times.

Use case: Data warehouse on demand
Considerations: When HDInsight is used to create a basic data warehouse containing Hive tables,
business users can use the built-in data connectivity in Excel to consume data from those
tables for analysis and reporting. However, for complex data models that require multiple
related tables and queries with complex joins, PowerPivot may be a better choice.

Use case: ETL automation
Considerations: Most ETL scenarios are designed to transform big data into a suitable structure and
volume for storage in a relational data source for further analysis and querying. While
Excel may be used to consume the data from the relational data source after it has been
transferred from HDInsight, it is unlikely that an Excel workbook would be the direct target
for the ETL process.

Use case: BI integration
Considerations: Importing data from a Hive table and combining it with data from a BI data source (such
as a relational data warehouse or corporate data model) is an effective way to accomplish
report-level integration with an enterprise BI solution. However, in self-service analysis
scenarios, advanced users such as business analysts may require a more comprehensive
data modeling solution such as that offered by PowerPivot, and can benefit from the ability
to share queries, data models, and reports with Power BI for Office 365.

Guidelines for using the Hive ODBC Driver in Excel


When planning to use the native data import functionality in Excel, consider the following guidelines:

Install both 32-bit and 64-bit Hive ODBC Drivers and create 32-bit and 64-bit ODBC DSNs with
the same name. This enables 32-bit and 64-bit clients to use the same connection string when
connecting to Hive.

Importing data into a table in a worksheet makes it possible to filter the data, use data bars and
conditional formatting, and create charts. Tables in worksheets are automatically included in
the workbook data model. However, if you need to define relationships between multiple
tables, or create custom columns and aggregations, it may be more efficient to import data
directly into a PowerPivot data model.

Imported data can be refreshed from the original data source. When importing data from Hive
tables you will be able to refresh the tables only while the HDInsight cluster is running.

Power Query
The Power Query add-in enhances Excel by providing a comprehensive interface for querying a wide
range of data sources. It can also be used to perform data enhancements such as cleansing data by
replacing values, and combining data sets from different sources. Power Query includes a data source
provider for HDInsight, which enables users to browse the folders in Azure blob storage that are
associated with an HDInsight cluster. You can download the Power Query add-in from the Office
website.
By connecting directly to blob storage, users can import data from files such as those generated by
map/reduce jobs and Pig scripts, and import the underlying data files associated with Hive tables, even
if the cluster is not running or has been deleted. This enables organizations to consume the results of
HDInsight processing, while significantly reducing costs if no further HDInsight processing is required.
Keeping an HDInsight cluster running when it is not executing queries just so that you can access the
data incurs charges to your Azure account. If you are not using the cluster, you can close it down but still
be able to access the data at any time using Power Query in Excel, or any other tool that can access
Azure blob storage.
With the Power Query add-in installed, you can use the From Other Sources option on the Power Query
tab on the Excel ribbon to import data from HDInsight. You must specify the account name and key of
the Azure blob store, not the HDInsight cluster itself. After connecting to the Azure blob store you can
select a data file, convert its contents to a table by specifying the appropriate delimiter, and modify the
data types of the columns before importing it into a worksheet, as shown in Figure 1.

Figure 1 - Importing data from HDInsight with Power Query

The imported data can be added to the workbook data model, or analyzed directly in the worksheet.
The following table describes specific considerations for using Power Query in the HDInsight use cases
and models described in this guide.
Use case: Iterative data exploration
Considerations: Power Query is a good choice when HDInsight data processing techniques such as
map/reduce jobs or Pig scripts generate files that contain the results to be analyzed or
reported. The HDInsight cluster can be deleted after the processing is complete, leaving
the results in Azure blob storage ready to be consumed by business users in Excel. With
the addition of a Power BI for Office 365 subscription, queries that return data from files in
Azure blob storage can be shared, making big data processing results discoverable by
other Excel users in the enterprise through the Online Search feature.

Use case: Data warehouse on demand
Considerations: When HDInsight is used to implement a basic data warehouse it usually includes a
schema of Hive tables that are queried over ODBC connections. While it is technically
possible to use Power Query to import data from the files on which the Hive tables are
based, more complex Hive solutions that include partitioned tables would be difficult to
consume this way.

Use case: ETL automation
Considerations: Most ETL scenarios are designed to transform big data into a suitable structure and
volume for storage in a relational data source for further analysis and querying. It is
unlikely that Power Query would be used to consume data files from the blob storage
associated with the HDInsight cluster, though it may be used to consume data from the
relational data store loaded by the ETL process.

Use case: BI integration
Considerations: Importing data from a file in Azure blob storage and combining it with data from a BI data
source (such as a relational data warehouse or corporate data model) is an effective way
to accomplish report-level integration with an enterprise BI solution. Additionally, users
can import the datasets retrieved by Power Query into a PowerPivot data model, and
publish workbooks containing data models and Power View visualizations to Power BI for
Office 365 in a self-service BI scenario.

Guidelines for using Power Query with HDInsight


When using Power Query to consume output files from HDInsight, consider the following guidelines:

Ensure that the big data processing jobs you use to generate data for analysis store their output
in appropriately named folders. This makes it easier for Power Query users to find output files
with generic names such as part-r-00000.

You can apply filters and sophisticated transformations to data in Power Query queries while
importing the output file from a big data processing job. However, you should generally try to
perform as much as possible of the required filtering and shaping within the big data processing
job itself in order to simplify the query that Excel users need to create.

Ensure that Power Query users are familiar with the schema of output files generated by big
data processing jobs. Output files generally do not include column headers.

When a big data processing job generates multiple output files you can use multiple Power
Query queries to combine the data.
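
The guidelines above can also be illustrated outside Excel. The following Python sketch reads several headerless, tab-delimited output files, applies an assumed schema, and combines the rows. This mirrors what a Power Query user does when choosing a delimiter, naming columns, and setting data types. The column names, types, and sample rows are hypothetical and are not taken from a real job.

```python
import csv
import io

# Assumed schema for the job output: column name and converter, in file order.
SCHEMA = [("station", str), ("obs_date", str), ("max_temp", float)]


def read_output_file(handle, delimiter="\t"):
    """Yield one dict per row of a headerless, delimited job output file."""
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        yield {name: convert(value) for (name, convert), value in zip(SCHEMA, row)}


def combine_output_files(handles):
    """Concatenate the rows from several part files into a single list."""
    rows = []
    for handle in handles:
        rows.extend(read_output_file(handle))
    return rows


# In-memory stand-ins for files such as part-r-00000 and part-r-00001.
part0 = io.StringIO("Heathrow\t2012-01-01\t11.5\nLeeds\t2012-01-01\t9.0\n")
part1 = io.StringIO("Cardiff\t2012-01-01\t10.2\n")
rows = combine_output_files([part0, part1])
```

Because the output files carry no header row, the schema has to be agreed between the people who write the processing job and the people who consume its output, which is why documenting it matters.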

PowerPivot
The growing awareness of the value of decisions based on proven data, combined with advances in data
analysis tools and techniques, has resulted in an increased demand for versatile analytical data models
that support ad-hoc analysis (the self-service approach).
PowerPivot is an Excel-based technology in Office 2013 Professional Plus and Office 365 ProPlus that
enables advanced users to create complex data models that include hierarchies for drill-up/drill-down
aggregations, custom Data Analysis Expressions (DAX) calculated measures, key performance indicators
(KPIs), and other features not available in basic data models. PowerPivot is also available as an add-in for
previous releases of Excel. PowerPivot uses xVelocity compression technology to support in-memory
data models that enable high-performance analysis, even with extremely large volumes of data.
You can create and edit PowerPivot data models by using the PowerPivot for Excel interface, which is
accessed from the PowerPivot tab of the Excel ribbon. Through this interface you can enhance tables
that have been added to the workbook data model by other users or processes. You can also import
multiple tables from one or more data sources into the data model and define relationships between
them. Figure 1 shows a data model in PowerPivot for Excel.

Figure 1 - PowerPivot for Excel

254 Implementing big data solutions using HDInsight

PowerPivot brings many of the capabilities of enterprise BI to Excel, enabling business analysts to create
personal data models for sophisticated self-service data analysis. Users can share PowerPivot workbooks
through SharePoint Server, where they can be viewed interactively in a browser, enabling teams of
analysts to collaborate on data analysis and reporting.
The following table describes specific considerations for using PowerPivot in the HDInsight use cases and
models described in this guide.
Use case

Considerations

Iterative data
exploration

PowerPivot is a good choice when HDInsight data processing results can be
encapsulated in Hive tables and imported directly into the PowerPivot data model over an
ODBC connection, or when output files can be imported into Excel using Power Query.
PowerPivot enables business analysts to enhance the data model and share it across
teams through SharePoint Server or Power BI for Office 365.

Data warehouse on
demand

When HDInsight is used to implement a basic data warehouse, it usually includes a
schema of Hive tables that are queried over ODBC connections. PowerPivot makes it
easy to import multiple Hive tables into a data model, define relationships between them,
and refresh the model with new data if the Hive tables in HDInsight are updated.

ETL automation

Most ETL scenarios are designed to transform big data into a suitable structure and
volume for storage in a relational data source for further analysis and querying. In this
scenario, PowerPivot may be used to consume data from the relational data store loaded
by the ETL process.

BI integration

PowerPivot is a core feature in Microsoft-based BI solutions, and enables self-service
analysis that combines enterprise BI data with big data processing results from HDInsight
at the report level. In a self-service BI scenario, the ability to publish PowerPivot
workbooks in SharePoint Server or Power BI sites makes it easy for business analysts to
collaborate and share the insights they discover.

Guidelines for using PowerPivot with HDInsight


When using PowerPivot with HDInsight, consider the following guidelines:

When importing data into tables in the PowerPivot data model, minimize the size of the
workbook document by using filters to remove rows and columns that are not required.

If the PowerPivot workbook includes data from Hive tables, and you plan to share it in
SharePoint Server or in a Power BI site, use an explicit ODBC connection string instead of a DSN.
This will enable the PowerPivot data model to be refreshed when stored on a server where the
DSN is not available.

Hide any numeric columns for which PowerPivot automatically generates aggregated measures.
This ensures that they do not appear as dimension attributes in the PivotTable Fields and
Power View Fields panes.
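
As an illustration of the second guideline, the following Python sketch assembles a DSN-less connection string from its parts. The driver name and the keywords Host, Port, Schema, HiveServerType, and AuthMech are assumptions based on typical Hive ODBC driver settings, and the function name is hypothetical; check the documentation for the driver version you have installed before relying on any of them.

```python
def hive_connection_string(host, user, password, port=443, schema="default"):
    """Build a DSN-less ODBC connection string for a Hive data source.

    All keyword names below are assumptions; verify them against the
    documentation of the installed Hive ODBC driver.
    """
    settings = {
        "Driver": "{Microsoft Hive ODBC Driver}",  # assumed driver name
        "Host": host,
        "Port": str(port),
        "Schema": schema,
        "HiveServerType": "2",  # assumed: connect to HiveServer2
        "AuthMech": "6",        # assumed: HDInsight service authentication
        "UID": user,
        "PWD": password,
    }
    # Values containing ';' would need extra escaping, which this sketch omits.
    return ";".join(f"{key}={value}" for key, value in settings.items())
```

Because the resulting string names the driver and host explicitly rather than a DSN, it does not depend on a DSN being defined on the server where the workbook is refreshed.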


Visualizing data in Excel


After data has been imported using any of the techniques described in this section of the guide, business
users can employ the full range of Excel analytical and visualization functionality to explore the data and
create reports that include color-coded data values, charts, indicators, and interactive slicers and
timelines.
Built-in charts, formatting, PivotTables, and PivotCharts
All editions of Excel support data visualization through charts, data bars, sparklines, and conditional
formatting. These can be used to great effect when creating graphical representations of data in a
worksheet. PivotTables and PivotCharts are a common way to aggregate data values across multiple
dimensions, and enable users to see the relationships between these dimensions. Additionally, you can
use slicers and timelines to support interactive filtering of data visualizations.
For example, having imported a table of weather data, you could use a PivotTable or PivotChart to view
temperature aggregated by month and geographical region, and data bars to make it easy to compare
individual values in the worksheet, as shown in Figure 1.

Figure 1 - Analyzing data with a PivotTable and a PivotChart in Excel
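
To make the aggregation in this example concrete, the following Python sketch computes what a PivotTable showing average temperature by month and region would display. It is only an illustration of the calculation, not of how Excel implements it, and the field names used in the usage note are hypothetical.

```python
from collections import defaultdict


def pivot_average(records, row_key, column_key, value_key):
    """Average value_key for every (row_key, column_key) combination."""
    totals = defaultdict(lambda: [0.0, 0])  # (row, column) -> [sum, count]
    for record in records:
        cell = totals[(record[row_key], record[column_key])]
        cell[0] += float(record[value_key])
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Calling pivot_average(records, "month", "region", "temperature") returns a dictionary keyed by (month, region) pairs, which corresponds to the cells at the intersection of PivotTable rows and columns.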


Power View
Power View is a data visualization technology that enables interactive, graphical exploration of data in a
data model. Power View is available as a component of SQL Server 2012 Reporting Services when
integrated with SharePoint Server, but is also available in Excel 2013 Professional Plus and Office 365
ProPlus. Using Power View you can create interactive visualizations that make it easy to explore
relationships and trends in the data. Figure 2 shows how Power View can be used to visualize the
weather data in the data model described in the topic PowerPivot.

Figure 2 - Visualizing data with Power View


Power Map
Power Map is an Excel add-in for Microsoft Office 365 Power BI subscribers that enables you to visualize
geographic and temporal analytical data on a map. With Power Map you can display geographically
related values on an interactive map, and create a virtual tour that shows the data in 3D. If the data has
a temporal dimension you can incorporate time-lapse animation into the tour to illustrate how the
values change over time. Figure 3 shows a Power Map visualization in Excel.


Figure 3 - Visualizing geographic data with Power Map


At the time this guide was written, Power Map was available only when Excel was installed from an Office
365 site where a Power BI subscription is supported. It is not available for the retail edition of Microsoft
Office 2013. For more information about Power BI and how to obtain it, see the Power BI section of the
Office website.
Guidelines for visualizing data from HDInsight in Excel
Consider the following guidelines when choosing a visualization tool for HDInsight data:

Use Power View when you need to explore data using a range of data visualizations. Power
View is particularly effective when you want to explore relationships between data in multiple
tables in a PowerPivot data model, but can also be used to visualize data in a single worksheet.

Use Power Map when you want to show changes in geographically-related data values over
time. Your data must include at least one geographic field, and must also include a temporal
field if you want to visualize changes to data over time.

Use native PivotCharts and conditional formatting when you want to create data visualizations
in workbooks that will be opened in versions of Excel that do not support Power View or Power
Map.


Power BI for Microsoft Office 365


As businesses increasingly move IT services to the cloud, many organizations are taking advantage of
Office 365 online services for productivity application deployment and management, email and
communications, and document sharing. Power BI is an additional service for Office 365 that enables
business users to publish and share queries, data models, and reports in Excel workbooks across the
enterprise; and to engage in rich, cloud-based data exploration and visualization. Power BI is available as
a standalone Office 365 service or as an add-on for existing Office 365 enterprise-level online service
plans.
Power BI for Office 365 builds on the capabilities of Power Query, Power View, PowerPivot, and Power
Map for Excel to enable self-service reporting and data visualization in Power BI sites. The reports and
visualizations can be accessed in a web browser, or through a Windows Store app on Windows 8 and
Windows RT devices. Business users can use Power BI to take on data stewardship responsibilities for
the data they share, managing access to the data and providing appropriate documentation for other
users in the organization who discover it. Additionally, administrators can use the Data Management
Gateway feature in Power BI to securely publish on-premises data sources to Power BI, enabling
automated refresh of data models in shared workbooks that have been published in Power BI sites.
Sharing Power Query queries with Power BI
One of the key foundations of a self-service BI solution is discoverable data. Not all business users have
the expertise to construct queries against Hive tables or the Azure blob store, but most are already
familiar with searching for data on the web. Power BI enables business analysts to define queries that
return data (including the results of big data processing in HDInsight) and share them with the
organization to make the data discoverable. Business analysts take on the role of data steward for the
queries they publish. They can provide documentation about the data returned by the query, and how
to request access to the underlying data sources. Figure 1 shows a query published by a Power BI user.


Figure 1 - Sharing a Query


After a query has been shared, other users in the organization can discover it by using the Online Search
feature of Power Query, as shown in Figure 2. This ability to share a query for discovery through search
makes it easier for users to incorporate the results of big data processing into their self-service data
models and reports, even if they lack the expertise to build their own queries.


Figure 2 - Discovering shared data with Online Search


After discovering data that has been exposed through shared queries, business users can use
PowerPivot to create their own data models that combine data from multiple sources, and use Power
Map and Power View to visualize that data in Excel workbooks.
Publishing reports in Power BI sites
Organizations that have a Power BI for Office 365 subscription can share workbooks containing data
models, Power View reports, and Power Map tours in special Power BI sites that are hosted in
SharePoint Online. Business users can view these workbooks as interactive reports in a web browser, or
in the Power BI app available from Windows Store. Power BI sites enable administrators to promote
specific workbooks as featured reports, and users can add their own favorite reports to their My Power
BI site. Figure 3 shows a report in a Power BI site.


Figure 3 - Viewing a report in a Power BI site


In addition to publishing Power View visualizations as reports, administrators can add workbooks to the
Power BI Q&A feature, which enables users to query the data models in the workbook by submitting
natural language expressions. For example, a data model containing meteorological data could be
queried using expressions such as "Which regions have the highest average temperature?", "Show
maximum temperature by month", or "Show average wind speed by country as a pie chart". Power BI
interprets these expressions and applies an appropriate query to retrieve and visualize the required
data. Figure 4 shows a report generated by Power BI Q&A.


Figure 4 - Using Power BI Q&A to query a data model using natural language
Power BI for Office 365 is a great choice when you want to empower business users to create and share
their own queries, data models, and reports. Users with the necessary skills can use HDInsight to process
data (for example, by using Pig or Hive scripts), and then import the data directly into Excel data models
using the Hive ODBC Driver or Power Query. The reports generated from these data models can then be
published in a Power BI site where other business users can view them.
Alternatively, you can publish the results of big data processing to the general business user population
through shared queries that you have created with Power Query. Business users can then engage in
self-service data modeling and analysis simply by discovering and consuming the big data processing results
you have shared, without requiring any knowledge of how the results were generated, or even where
the results are stored.
The following table describes specific considerations for using Power BI in the HDInsight use cases and
models described in this guide.


Use case

Considerations

Iterative data
exploration

In an iterative data exploration scenario, users can use Power Query or the Hive ODBC
Driver to consume the results of each data processing iteration in Excel, and then use
native Excel charting, Power View, or Power Map to visualize the data. Power BI makes it
easier for multiple analysts to collaborate by sharing queries and reports in a Power BI
site.

Data warehouse on
demand

When HDInsight is used to implement a basic data warehouse, it usually includes a
schema of Hive tables that are queried over ODBC connections. Data analysts can use
the Hive ODBC Driver to import the data from Hive into a PowerPivot data model and
create Power View reports. They can then share the workbook in a Power BI site so that
users can view and interact with the reports, or query the data model using Q&A.

ETL automation

Most ETL scenarios are designed to transform big data into a suitable structure and
volume for storage in a relational data source for further analysis and querying. In most
cases, the ETL process loads the data into a relational data store for analysis. However, it
would be possible to use HDInsight to filter and shape data into tabular formats, and then
use Power Query or the Hive ODBC Driver to import the data into a PowerPivot data
model that can provide a source for reports and interactive analysis in a Power BI site.

BI integration

Power BI provides a platform for sharing queries, data models, and reports. By sharing
queries that obtain HDInsight output files from Azure storage, organizations can use
Power BI to make big data processing results discoverable for self-service BI.

Guidelines for using Power BI with HDInsight


When using Power BI for big data analysis with HDInsight, consider the following guidelines:

Use Power Query to share queries that retrieve data from HDInsight output files. This makes big
data processing results discoverable by business users who may have difficulty creating their
own queries.

Foster a culture of data stewardship in which users take responsibility for the queries they
define and share. Encourage users to document their queries and to monitor their usage in their
My Power BI site.

Use the Synonyms feature in PowerPivot to specify alternative terms for tables and columns in
your data model. This will improve the ability of the Power BI Q&A feature to interpret natural
language queries.

Choosing an Excel technology


Excel offers a wide range of options for consuming and visualizing data. It can be difficult to understand
how these technologies work together, or how to choose appropriate technologies when you want to
analyze the results generated by HDInsight. The following table shows the specific capabilities of each
technology in relation to common data consumption and visualization tasks in a big data solution based
on HDInsight.


Editions:

Native Data Tools: All editions
PowerPivot, Power Query, Power View: Office 2013 Professional Plus, Office 365 ProPlus
Power Map*, Power BI Sites*: Office 365 ProPlus

Features and tasks

Supporting technologies

Import data from Hive using ODBC

Native Data Tools (one table at a time), PowerPivot (multiple tables)

Import data from Azure blob storage

Power Query

Load data into a data model

Native Data Tools, PowerPivot, Power Query

Design a data model

PowerPivot

Show geographic data on a map

Power View, Power Map*

Show geographic data changes over time

Power Map*

View data in workbooks in a browser

Power BI Sites*

Support natural language queries (Q&A)

PowerPivot* (define synonyms), Power BI Sites*

Share queries with other users

Power Query*

Create interactive charts

Native Data Tools, Power View

* Requires Office 365 ProPlus with a Power BI subscription


Note that the table does not reflect qualitative aspects of the technologies, such as their ease of use or
flexibility. For example, you can use either the native data tools in Excel or Power View to create
interactive charts. However, the range of visualizations available and the user experience when visually
exploring data in Power View are generally better than those offered by PivotCharts and other native
visualization tools.
In many scenarios you are likely to use a combination of technologies. For example, you may use Power
Query to import the results of HDInsight processing jobs into the workbook data model, and then use
PowerPivot to refine the data model to define relationships, hierarchies, and custom fields. You may
then use native PivotTable functionality to analyze data aggregations, before using Power View to
visually explore the data. Finally, you might publish the workbook, including the Power View
visualization, as a report in a Power BI site where it can be viewed in a browser, and its data model can
be included as a data source for Q&A natural language visualization.
Figure 1 shows how Excel and Office 365 technologies work together to help organizations analyze the
big data processing results generated by HDInsight.


Figure 1 - Using Excel and Office 365 technologies to analyze big data processing results
The options discussed here assume that you want to use Excel to consume and visualize data directly
from HDInsight, or from the Azure blob storage it uses. However, in many scenarios the results of
HDInsight processing are transferred to a database (for example, a data warehouse implemented in SQL
Server) or an analytical data model (for example, a SQL Server Analysis Services cube). You can use
native Excel data connectivity, PowerPivot, and Power Query to consume data from practically any data
source, and then use native visualization tools, Power View, Power Map, and Power BI sites as described
in this section of the guide.

SQL Server Reporting Services


SQL Server Reporting Services (SSRS) provides a platform for delivering reports that can be viewed
interactively in a Web browser, printed, or exported and delivered automatically in a wide range of
electronic document formats. These formats include Excel, Word, PDF, image, and XML.
Reports consist of data regions, such as tables of data values or charts, which are based on datasets that
encapsulate queries. The datasets used by a report are, in turn, based on data sources that can include
relational databases, corporate data models in SQL Server Analysis Services, or other OLE DB or ODBC
sources.
Professional report authors, who are usually specialist BI developers, can create reports by using the
Report Designer tool in SQL Server Data Tools. This tool provides a Visual Studio-based interface for
creating data sources, datasets, and reports; and for publishing a project of related items to a report
server. Figure 1 shows Report Designer being used to create a report based on data in HDInsight, which
is accessed through an ODBC data source and a dataset that queries a Hive table.

Figure 1 - Report Designer


While many businesses rely on reports created by BI specialists, an increasing number of organizations
are empowering business users to create their own self-service reports. To support this scenario,
business users can use Report Builder (shown in Figure 2). This is a simplified report authoring tool that
is installed on demand from a report server. To further simplify self-service reporting you can have BI
professionals create and publish shared data sources and datasets that can be easily referenced in
Report Builder, reducing the need for business users to configure connections or write queries.

Figure 2 - Report Builder


Reports are published to a report server, where they can be accessed interactively through a web
interface. When Reporting Services is installed in native mode, the interface is provided by a web
application named Report Manager (shown in Figure 3). You can also install Reporting Services in
SharePoint-integrated mode, in which case the reports are accessed in a SharePoint document library.


Figure 3 - Viewing a report in Report Manager


Reporting Services provides a powerful platform for delivering reports in an enterprise, and can be used
effectively in a big data solution. The following table describes specific considerations for using
Reporting Services in the HDInsight use cases and models described in this guide.
Use case

Considerations

Iterative data
exploration

For one-time analysis and data exploration, Excel is generally a more appropriate tool
than Reporting Services because it requires less in the way of infrastructure configuration
and provides a more dynamic user interface for interactive data exploration.

Data warehouse on
demand

When HDInsight is used to implement a basic data warehouse, it usually includes a
schema of Hive tables that are queried over ODBC connections. In this scenario you can
use Reporting Services to create ODBC-based data sources and datasets that query Hive
tables.

ETL automation

Most ETL scenarios are designed to transform big data into a suitable structure and
volume for storage in a relational data source for further analysis and querying. In this
scenario, Reporting Services may be used to consume data from the relational data store
loaded by the ETL process.


BI integration

You can use Reporting Services to integrate data from HDInsight with enterprise BI data
at the report level by creating reports that display data from multiple data sources. For
example, you could use an ODBC data source to connect to HDInsight and query Hive
tables, and an OLE DB data source to connect to a SQL Server data warehouse.
However, in an enterprise BI scenario that combines corporate data and big data in formal
reports, better integration can generally be achieved by integrating at the data warehouse
or corporate data model level, and by using a single data source in Reporting Services to
connect to the integrated data.

Guidelines for using Reporting Services with HDInsight


When using SQL Server Reporting Services with HDInsight, consider the following guidelines:

When creating a data source for Hive tables, use an explicit ODBC connection string in
preference to a DSN. This ensures that the data source does not depend on a DSN on the report
server.

Consider increasing the default timeout value for datasets that query Hive tables. Hive queries
over ODBC can take a considerable amount of time.

Consider using report snapshots, or cached datasets and reports, to improve performance by
reducing the number of times that queries are submitted to HDInsight.
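
Reporting Services provides snapshots and caching through report server settings rather than through code, but the underlying idea can be sketched in a few lines of Python. In this hypothetical example (the class name and its parameters are illustrative only), a slow query is submitted again only after the cached result has exceeded a maximum age.

```python
import time


class CachedDataset:
    """Re-run an expensive query only when the cached result has expired."""

    def __init__(self, run_query, max_age_seconds, clock=time.monotonic):
        self._run_query = run_query      # callable that submits the query
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so it can be tested
        self._result = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._max_age
        )
        if expired:
            # Only this branch submits a query to the data source.
            self._result = self._run_query()
            self._fetched_at = now
        return self._result
```

Every call to get() within the expiry window returns the stored result, so the number of queries submitted to the cluster depends on the expiry interval rather than on the number of report views.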

The guidance provided here assumes that you want to use SQL Server Reporting Services to consume
and visualize data directly from HDInsight. However, in many scenarios the results of HDInsight
processing are transferred to a database (for example, a data warehouse implemented in SQL Server) or
an analytical data model (for example, a SQL Server Analysis Services cube). You can use Reporting
Services to consume and visualize data from practically any data source.

SQL Server Analysis Services


In many organizations, enterprise BI reporting and analytics is a driver for decision making and business
processes that involve multiple users in different parts of the business. In these kinds of scenarios,
analytical data is often structured into a corporate data model (often referred to colloquially as a cube)
and accessed from multiple client applications and reporting tools. Corporate data models help to
simplify the creation of analytical mashups and reports, and ensure consistent values are calculated for
core business measures and key performance indicators across the organization.
On the Microsoft data platform, SQL Server Analysis Services (SSAS) provides a mechanism for creating
and publishing corporate data models that can be used as a source for PivotTables in Excel, reports in
SQL Server Reporting Services, dashboards in SharePoint Server PerformancePoint Services, and other BI
reporting and analysis tools.
You can install SSAS in one of two modes: Tabular mode or Multidimensional mode. Tabular mode
supports loading data with the ODBC driver, which means that you can connect to the results of a Hive
query to analyze the data. However, the Multidimensional mode does not support ODBC, and so you
must either import the data into SQL Server or configure a linked server to pass through the query and
results.

Tabular mode
When installed in Tabular mode, SSAS can be used to host tabular data models that are based on the
xVelocity in-memory analytics engine. These models use the same technology and design as PowerPivot
data models in Excel but can be scaled to handle much larger volumes of data, and they can be secured
using enterprise-level role based security. You create tabular data models in the Visual Studio-based SQL
Server Data Tools development environment, and you can choose to create the data model from scratch
or import an existing PowerPivot for Excel workbook.
Because Tabular models support ODBC data sources, you can easily include data from Hive tables in the
data model. You can use HiveQL queries to pre-process the data as it is imported in order to create the
schema that best suits your analytical and reporting goals. You can then use the modeling capabilities of
SQL Server Analysis Services to create relationships, hierarchies, and other custom model elements to
support the analysis that users need to perform.
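
As a rough illustration of pre-processing with HiveQL at import time, the following Python sketch composes a query that returns only the columns, rows, and aggregations that a model needs. The function name, and the table and column names in the usage note, are hypothetical, and the sketch assumes that all identifiers and filter expressions come from a trusted source.

```python
def build_import_query(table, dimensions, measures, where=None):
    """Compose a HiveQL SELECT that shapes data before it is imported.

    dimensions: column names to group by.
    measures:   mapping of output alias -> aggregate expression.
    where:      optional filter expression (trusted input only).
    """
    select_items = list(dimensions) + [
        f"{expression} AS {alias}" for alias, expression in measures.items()
    ]
    query = f"SELECT {', '.join(select_items)} FROM {table}"
    if where:
        query += f" WHERE {where}"
    if measures and dimensions:
        # Aggregated measures require grouping by the remaining columns.
        query += f" GROUP BY {', '.join(dimensions)}"
    return query
```

For example, build_import_query("weather", ["region", "obs_month"], {"avg_temp": "AVG(temperature)"}, where="obs_year >= 2013") produces a single statement that filters, aggregates, and renames the data so that less shaping is needed in the data model itself.
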
Figure 1 shows a Tabular model for the weather data used in this section of the guide.


Figure 1 - A Tabular SQL Server Analysis Services data model


The following table describes specific considerations for using tabular SSAS data models in the HDInsight
use cases and models described in this guide.
Use case

Considerations

Iterative data
exploration

If the results of HDInsight data processing can be encapsulated in Hive tables, and
multiple users must perform consistent analysis and reporting of the results, an SSAS
tabular model is an easy way to create a corporate data model for analysis. However, if
the analysis will only be performed by a small group of specialist users, you can probably
achieve this by using PowerPivot in Excel.

Data warehouse on
demand

When HDInsight is used to implement a basic data warehouse, it usually includes a
schema of Hive tables that are queried over ODBC connections. A tabular data model is a
good way to create a consistent analytical and reporting schema based on the Hive
tables.

ETL automation

If the target of the HDInsight-based ETL process is a relational database, you might build
a tabular data model based on the tables in the database in order to enable enterprise-level
analysis and reporting.


BI integration

If the enterprise BI solution already uses tabular SSAS models, you can add Hive tables
to these models and create any necessary relationships to support integrated analysis of
corporate BI data and big data results from HDInsight. However, if the corporate BI data
warehouse is based on a dimensional model that includes surrogate keys and slowly
changing dimensions, it can be difficult to define relationships between tables in the two
data sources. In this case, integration at the data warehouse level may be a better
solution.

Multidimensional mode
As an alternative to Tabular mode, SSAS can be installed in Multidimensional mode. Multidimensional
mode provides support for a more established online analytical processing (OLAP) approach to cube
creation, and is the only mode supported by releases of SSAS prior to SQL Server 2012. Additionally, if
you plan to use SSAS data mining functionality, you must install SSAS in Multidimensional mode.
Multidimensional mode includes some features that are not supported or are difficult to implement in
Tabular data models, such as the ability to aggregate semi-additive measures across accounting
dimensions and use international translations in the cube definition. However, although
Multidimensional data models can be built on OLE DB data sources, some restrictions in the way cube
elements are implemented mean that you cannot use an ODBC data source. Therefore, there is no way
to directly connect dimensions or measure groups in the data model to Hive tables in an HDInsight
cluster.
To use HDInsight data as a source for a Multidimensional data model in SSAS you must either transfer
the data from HDInsight to a relational database system such as SQL Server, or define a linked server in
another SQL Server instance that can act as a proxy and pass queries through to Hive tables in HDInsight.
The use of linked servers to access Hive tables from SQL Server, and techniques for transferring data
from HDInsight to SQL Server, are described in the topic SQL Server database.
Figure 2 shows a Multidimensional data model in Visual Studio. This data model is based on views in a
SQL Server database, which are in turn based on queries against a linked server that references Hive
tables in HDInsight.


Figure 2 - A Multidimensional SQL Server Analysis Services data model
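
The view-over-linked-server arrangement described above can be sketched as follows. This Python function only generates the T-SQL text for a view that wraps a pass-through query; it assumes that a linked server (named HIVE in the usage note) has already been defined in the SQL Server instance, and the function, view, and table names are all hypothetical.

```python
def linked_server_view_sql(view_name, linked_server, hive_query):
    """Return T-SQL that wraps a pass-through Hive query in a SQL Server view."""
    # Single quotes inside the pass-through query must be doubled in T-SQL.
    escaped = hive_query.replace("'", "''")
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}');"
    )
```

For example, linked_server_view_sql("dbo.vWeather", "HIVE", "SELECT * FROM weather") yields a CREATE VIEW statement whose OPENQUERY call passes the query text through to Hive. A Multidimensional data source view could then reference dbo.vWeather like any other relational view.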


The following table describes specific considerations for using Multidimensional SSAS data models in the
HDInsight use cases and models described in this guide.
Use case

Considerations

Iterative data
exploration

For one-time analysis, or analysis by a small group of users, the requirement to use a
relational database such as SQL Server as a proxy or interim host for the HDInsight
results means that this approach involves more effort than using a tabular data model or
just analyzing the data in Excel.

Data warehouse on
demand

When HDInsight is used to implement a basic data warehouse, it usually includes a
schema of Hive tables that are queried over ODBC connections. To include these tables
in a multidimensional SSAS data model you would need to use a linked server in another
SQL Server instance to access the Hive tables on behalf of the data model, or transfer the
data from HDInsight to a relational database. Unless you specifically require advanced
OLAP capabilities that can only be provided in a multidimensional data model, a tabular
data model is probably a better choice.


ETL automation

If the target of the HDInsight-based ETL process is a relational database that can be
accessed through an OLE DB connection, you might build a multidimensional data model
based on the tables in the database to enable enterprise-level analysis and reporting.

BI integration

If the enterprise BI solution already uses multidimensional SSAS models that you want to
extend to include data from HDInsight, you should integrate the data at the data
warehouse level and base the data model on the relational data warehouse.

Guidelines for using SQL Server Analysis Services with HDInsight


When using SSAS with HDInsight, consider the following guidelines:

You cannot use ODBC data sources in a Multidimensional SSAS database. If you must include
Hive tables in a Multidimensional model, consider defining a linked server in a SQL Server
instance and adding SQL Server views that query the Hive tables to the SSAS data source.

If you are including data from Hive tables in a Tabular data model, use an explicit ODBC
connection string instead of a DSN. This will enable the data model to be refreshed when stored
on a server where the DSN is not available.

Consider the life cycle of the HDInsight cluster when scheduling data model processing. When
the model is processed, it refreshes partitions from the original data sources and so you should
ensure that the HDInsight cluster and its Hive tables will be available when partitions based
on them are processed.

SQL Server database


Many organizations already use tools and services that are part of the comprehensive Microsoft data
platform to perform data analysis and reporting, and to implement enterprise BI solutions. Many of
these tools, services, and reporting applications make it easier to query and analyze data in a relational
database, rather than consuming the data directly from HDInsight. For example, Multidimensional SSAS
data models can only be based on relational data sources.
The wide-ranging support for working with data in a relational database means that, in many scenarios,
the optimal way to perform big data analysis is to process the data in HDInsight but consume the results
through a relational database system such as SQL Server or Azure SQL Database. You can accomplish this
by enabling the relational database to act as a proxy or interface that transparently accesses the data in
HDInsight on behalf of its client applications, or by transferring the results of HDInsight data processing
to tables in the relational database.

Linked servers
Linked servers are server-level connection definitions in a SQL Server instance that enable queries in the
local SQL Server engine to reference tables in remote servers. You can use the ODBC driver for Hive to
create a linked server in a SQL Server instance that references an HDInsight cluster, enabling you to
execute Transact-SQL queries that reference Hive tables.


To create a linked server you can either use the graphical tools in SQL Server Management Studio or the
sp_addlinkedserver system stored procedure, as shown in the following code.
Transact-SQL
-- Define a linked server named HDINSIGHT that connects through the HiveDSN ODBC data source.
EXEC master.dbo.sp_addlinkedserver
    @server = N'HDINSIGHT',
    @srvproduct = N'Hive',
    @provider = N'MSDASQL',
    @datasrc = N'HiveDSN',
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=UserName;Password=P@ssw0rd;';

After you have defined the linked server you can use the Transact-SQL OpenQuery function to execute
pass-through queries against the Hive tables in the HDInsight data source, as shown in the following
code.
Transact-SQL
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');

Referencing Hive tables through four-part names in a distributed query, rather than through an OpenQuery
pass-through query, is not always a good idea because the syntax of HiveQL differs from T-SQL in several ways.
By using a linked server you can create views in a SQL Server database that act as pass-through queries
against Hive tables, as shown in Figure 1. These views can then be queried by analytical tools that
connect to the SQL Server database.
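The view-based approach described above can be sketched as a small helper that generates the Transact-SQL for such a view. This is only an illustration written in Python: the function name, the view name, and the station filter column are hypothetical, and the helper does not validate the view or linked server identifiers. The one detail it is meant to show is that OpenQuery takes the remote query as a string literal, so single quotes inside the HiveQL text must be doubled.

```python
def openquery_view_sql(view_name: str, linked_server: str, hiveql: str) -> str:
    """Return a CREATE VIEW statement that wraps a HiveQL pass-through query."""
    # OPENQUERY receives the remote query as a T-SQL string literal, so every
    # single quote inside the HiveQL text has to be doubled.
    escaped = hiveql.replace("'", "''")
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}');"
    )


if __name__ == "__main__":
    print(openquery_view_sql(
        "dbo.vObservations",  # hypothetical view name
        "HDINSIGHT",          # linked server defined with sp_addlinkedserver
        "SELECT * FROM Observations WHERE station = 'KSEA'",  # hypothetical column
    ))
```

Keeping the HiveQL inside the pass-through string means that SQL Server sends it to Hive unchanged, which is why the quoting has to be handled on the SQL Server side.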


Figure 1 - Using HDInsight as a linked server over ODBC


When using a linked server you must be aware of issues such as data type compatibility between HiveQL
and SQL, and some language syntax limitations. The issues and the supported data types are
described in the blog post How to create a SQL Server Linked Server to HDInsight HIVE using Microsoft
Hive ODBC Driver.
The following table describes specific considerations for using linked servers in the HDInsight use cases
and models described in this guide.
| Use case | Considerations |
| --- | --- |
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or just analyzing the data in Excel. |
| Data warehouse on demand | Depending on the volume of data in the data warehouse, and the frequency of queries against the Hive tables, using a linked server with a Hive-based data warehouse might make it easier to support a wide range of client applications. A linked server is a suitable solution for populating data models on a regular basis when they are processed during out-of-hours periods, or for periodically refreshing cached datasets for Reporting Services. However, the performance of pass-through queries over an ODBC connection may not be sufficient to meet your users' expectations for interactive querying and reporting directly in client applications such as Excel. |
| ETL automation | Generally, the target of the ETL processes is a relational database, making a linked server that references Hive tables unnecessary. |
| BI integration | If the ratio of Hive tables to data warehouse tables is small, or they are relatively rarely queried, a linked server might be a suitable way to integrate data at the data warehouse level. However, if there are many Hive tables or if the data in the Hive tables must be tightly integrated into a dimensional data warehouse schema, it may be more effective to transfer the data from HDInsight to local tables in the data warehouse. |

PolyBase
PolyBase is a data integration technology in the Microsoft Analytics Platform System (APS) that enables
data in an HDInsight cluster to be queried as native tables in a relational data warehouse that is
implemented in SQL Server Parallel Data Warehouse (PDW). SQL Server PDW is an edition of SQL Server
that is only available pre-installed in an APS appliance, and it uses a massively parallel processing (MPP)
architecture to implement highly scalable data warehouse solutions.
PolyBase enables parallel data movement between SQL Server and HDInsight, and supports standard
Transact-SQL semantics such as GROUP BY and JOIN clauses that reference large volumes of data in
HDInsight. This enables APS to provide an enterprise-scale data warehouse solution that combines
relational data in data warehouse tables with data in an HDInsight cluster.
The following table describes specific considerations for using PolyBase in the HDInsight use cases and
models described in this guide.
| Use case | Considerations |
| --- | --- |
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, the requirement to use an APS appliance may be cost-prohibitive, unless such an appliance is already present in the organization. |
| Data warehouse on demand | If the volume of data and the number of query requests are extremely high, using an APS appliance as a data warehouse platform that includes HDInsight data through PolyBase might be the most cost-effective way to achieve the levels of performance and scalability your data warehousing solution requires. |
| ETL automation | Generally, the target of the ETL process is a relational database, making PolyBase integration with HDInsight unnecessary. |
| BI integration | If your enterprise BI solution already uses an APS appliance, or the combined scalability and performance requirements for enterprise BI and big data analysis are extremely high, the combination of SQL Server PDW with PolyBase in a single APS appliance might be a suitable solution. However, note that PolyBase does not inherently integrate HDInsight data into a dimensional data warehouse schema. If you need to include big data in dimension members that use surrogate keys, or you need to support slowly changing dimensions, some additional integration effort may be required. |

Sqoop
Sqoop is a Hadoop technology included in HDInsight. It is designed to make it easy to transfer data
between Hadoop clusters and relational databases. You can use Sqoop to export data from HDInsight
data files to SQL Server database tables by specifying the location of the data files to be exported, and a
JDBC connection string for the target SQL Server instance. For example, you could run the following
command on an HDInsight server to copy the data in the /hive/warehouse/observations path to the
observations table in an Azure SQL Database named mydb located in a server named jkty65.
Sqoop command
sqoop export --connect "jdbc:sqlserver://jkty65.database.windows.net:1433;database=mydb;user=username@jkty65;password=Pa$$w0rd;logintimeout=30;" --table observations --export-dir /hive/warehouse/observations
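When the export has to be repeated for several tables, it can help to assemble the Sqoop arguments in code rather than by hand. The following Python sketch is an illustration only: the function name and parameters are hypothetical, and the JDBC URL simply follows the pattern of the command shown above for an Azure SQL Database server. It uses only the --connect, --table, and --export-dir options that appear in that command.

```python
def sqoop_export_args(server, database, user, password, table, export_dir,
                      login_timeout=30):
    """Build the argument list for a `sqoop export` call to Azure SQL Database."""
    # The URL mirrors the pattern used in the guide's example command.
    jdbc_url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};user={user}@{server};password={password};"
        f"logintimeout={login_timeout};"
    )
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    args = sqoop_export_args(
        "jkty65", "mydb", "username", "<password>",
        "observations", "/hive/warehouse/observations",
    )
    for arg in args:
        print(arg)
```

Keeping the arguments as a list, rather than one long string, means that a caller can pass them to a process launcher without a shell, so characters such as $ in a password are not reinterpreted.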

Sqoop is generally a good solution for transferring data from HDInsight to Azure SQL Database servers,
or to instances of SQL Server that are hosted in virtual machines running in Azure, but it can present
connectivity challenges when used with on-premises database servers. A key requirement is that
network connectivity can be successfully established between the HDInsight cluster where the Sqoop
command is executed and the target SQL Server instance. When used with HDInsight this means that
the SQL Server instance must be accessible from the Azure service where the cluster is running, which
may not be permitted by security policies in organizations where the target SQL Server instance is
hosted in an on-premises data center.
In many cases you can enable secure connectivity between virtual machines in Azure and on-premises
servers by creating a virtual network in Azure. However, at the time of writing it was not possible to add
the virtual machines in an HDInsight cluster to an Azure virtual network, so this approach cannot be
used to enable Sqoop to communicate with an on-premises server hosting SQL Server without traversing
the corporate firewall.
You can use Sqoop interactively from the Hadoop command line, or you can use one of the following
techniques to initiate a Sqoop job:

Create a Sqoop action in an Oozie workflow.

Implement a PowerShell script that uses the New-AzureHDInsightSqoopJobDefinition and
Start-AzureHDInsightJob cmdlets to run a Sqoop command.

Implement a custom application that uses the .NET SDK for HDInsight to submit a Sqoop job.

The following table describes specific considerations for using Sqoop in the HDInsight use cases and
models described in this guide.


| Use case | Considerations |
| --- | --- |
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, using Sqoop is a simple way to transfer the results of data processing to SQL Database or a SQL Server instance for reporting or analysis. |
| Data warehouse on demand | When using HDInsight as a data warehouse for big data analysis, the data is generally accessed directly in Hive tables, making transfer to a database using Sqoop unnecessary. |
| ETL automation | Generally, the target of the ETL processes is a relational database, and Sqoop may be the mechanism that is used to load the transformed data into the target database. |
| BI integration | When you want to integrate the results of HDInsight processing with an enterprise BI solution at the data warehouse level you can use Sqoop to transfer data from HDInsight into the data warehouse tables, or (more commonly) into staging tables from where it will be loaded into the data warehouse. However, if network connectivity between HDInsight and the target database is not possible you may need to consider an alternative technique to transfer the data, such as SQL Server Integration Services. |

SQL Server Integration Services


SQL Server Integration Services (SSIS) provides a flexible platform for building ETL solutions that transfer
data between a wide range of data sources and destinations while applying transformation, validation,
and data cleansing operations to the data as it passes through a data flow pipeline. SSIS is a good choice
for transferring data from HDInsight to SQL Server when network security policies make it impossible to
use Sqoop, or when you must perform complex transformations on the data as part of the import
process.
To transfer data from HDInsight to SQL Server using SSIS you can create an SSIS package that includes a
Data Flow task. The Data Flow task minimally consists of a source, which is used to extract the data from
the HDInsight cluster; and a destination, which is used to load the data into a SQL Server database. The
task might also include one or more transformations, which apply specific changes to the data as it flows
from the source to the destination.
The source used to extract data from HDInsight can be an ODBC source that uses a HiveQL query to
retrieve data from Hive tables, or a custom source that programmatically downloads data from files in
Azure blob storage.
Figure 2 shows an SSIS Data Flow task that uses an ODBC source to extract data from Hive tables, applies
a transformation to convert the data types of the columns returned by the query, and then loads the
transformed data into a table in a SQL Server database.


Figure 2 - Using SSIS to transfer data from HDInsight to SQL Server


The following table describes specific considerations for using SSIS in the HDInsight use cases and
models described in this guide.
| Use case | Considerations |
| --- | --- |
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, SSIS can provide a simple way to transfer the results of data processing to SQL Server for reporting or analysis. |
| Data warehouse on demand | When using HDInsight as a data warehouse for big data analysis, the data is generally accessed directly in Hive tables, making transfer to a database through SSIS unnecessary. |
| ETL automation | Generally, the target of the ETL processes is a relational database. In some cases the ETL process in HDInsight might transform the data into a suitable structure, and then SSIS can be used to complete the process by transferring the transformed data to SQL Server. |
| BI integration | SSIS is often used to implement ETL processes in an enterprise BI solution, so it's a natural choice when you need to extend the BI solution to include big data processing results from HDInsight. SSIS is particularly useful when you want to integrate data from HDInsight into an enterprise data warehouse, where you will typically use SSIS to extract the data from HDInsight into a staging table, and perhaps use a different SSIS package to load the staged data in synchronization with data from other corporate sources. |

Guidelines for integrating HDInsight with SQL Server


When planning to integrate HDInsight with SQL Server, consider the following guidelines:

PolyBase is only available in Microsoft APS appliances.

When using Sqoop to transfer data between HDInsight and SQL Server, consider the effect of
firewalls between the HDInsight cluster in Azure and the SQL Server database server.

When using SSIS to transfer data from Hive tables, use an explicit ODBC connection string
instead of a DSN. This enables the SSIS package to run on a server where the DSN is not
available.

When using SSIS to transfer data from Hive tables, specify the DefaultStringColumnLength
parameter in the ODBC connection string. The default value for this setting is 32767, which
results in SSIS treating all strings as DT_TEXT or DT_NTEXT data type values. For optimal
performance, limit strings to 4000 characters or fewer so that SSIS automatically treats them as
DT_STR or DT_WSTR data type values.

When using SSIS to work with Hive ODBC sources, set the ValidateExternalMetadata property
of the ODBC data source component to False. This prevents Visual Studio from validating the
metadata until you open the data source component, reducing the frequency with which the
Visual Studio environment becomes unresponsive while waiting for data from the HDInsight
cluster.
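The two connection string guidelines above (no DSN, and an explicit DefaultStringColumnLength) can be combined in one place. The Python sketch below only assembles the string. The function name is hypothetical, and the keyword names other than DefaultStringColumnLength (DRIVER, Host, Port, Schema, HiveServerType, AuthMech, UID, PWD) and their values are assumptions that should be checked against the documentation for the version of the Hive ODBC driver you have installed.

```python
def hive_odbc_connection_string(host, user, password, max_string_length=4000):
    """Assemble a DSN-less ODBC connection string for Hive on HDInsight."""
    # SSIS treats longer strings as DT_TEXT/DT_NTEXT, so cap the length.
    if not 1 <= max_string_length <= 4000:
        raise ValueError("max_string_length must be between 1 and 4000")
    # Keyword names and values are assumptions; verify them against the
    # documentation for your installed driver version.
    settings = {
        "DRIVER": "{Microsoft Hive ODBC Driver}",
        "Host": host,
        "Port": "443",
        "Schema": "default",
        "HiveServerType": "2",
        "AuthMech": "6",
        "UID": user,
        "PWD": password,
        "DefaultStringColumnLength": str(max_string_length),
    }
    return ";".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(hive_odbc_connection_string(
        "mycluster.azurehdinsight.net", "user-name", "<password>"))
```

Because the string names the driver rather than a DSN, the same value can be used on any server where the driver is installed, which is the portability benefit the guidelines describe.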

Building custom clients


In some cases you may want to implement a custom solution to consume and visualize the big data
processing results generated by HDInsight, instead of using an existing or third-party tool. For example,
using custom code to consume data from HDInsight is common in scenarios where you need to
integrate big data processing into an existing application or service, or where you just want to explore
data by using simple scripts. This section of the guide provides some examples of custom clients built
using PowerShell and the .NET Framework.

Building custom clients with Windows PowerShell scripts


The Azure module for Windows PowerShell includes a range of cmdlets that you can use to access data
generated by HDInsight. You can use these cmdlets to consume data by querying Hive tables or by
downloading output files from Azure blob storage, as demonstrated by the following examples:

Querying Hive tables with Windows PowerShell

Retrieving job output files with Windows PowerShell


Considerations for using Windows PowerShell scripts


The following table describes specific considerations for using PowerShell in the HDInsight use cases and
models described in this guide.
| Use case | Considerations |
| --- | --- |
| Iterative data exploration | For one-time analysis or iterative exploration of data, PowerShell provides a flexible, easy-to-use scripting framework that you can use to upload data and scripts, initiate jobs, and consume the results. |
| Data warehouse on demand | Data warehouses are usually queried by reporting clients such as Excel or SQL Server Reporting Services. However, PowerShell can be useful as a tool to quickly test queries. |
| ETL automation | The target of the ETL processes is typically a relational database. While you may use PowerShell to upload source data to Azure blob storage and to initiate the HDInsight jobs that encapsulate the ETL process, it's unlikely that PowerShell would be an appropriate tool to consume the results. |
| BI integration | In an enterprise BI solution, users generally use established tools such as Excel or SQL Server Reporting Services to visualize data. However, in a similar way to the data warehouse scenario, you may use PowerShell to test queries against Hive tables. |

In addition, consider the following:

You can run PowerShell scripts interactively in a Windows command line window or in a
PowerShell-specific command line console. Additionally, you can edit and run PowerShell scripts
in the Windows PowerShell Interactive Scripting Environment (ISE), which provides IntelliSense
and other user interface enhancements that make it easier to write PowerShell code.

You can schedule the execution of PowerShell scripts using Windows Scheduler, SQL Server
Agent, or other tools as described in Building end-to-end solutions using HDInsight.

Before you use PowerShell to work with HDInsight you must configure the PowerShell
environment to connect to your Azure subscription. To do this you must first download and
install the Azure PowerShell module, which is available through the Web Platform Installer. For
more details see How to install and configure Azure PowerShell.

Building custom clients with the .NET Framework


When you need to integrate big data processing into an application or service, you can use .NET
Framework code to consume the results of jobs executed in HDInsight. The .NET SDK includes numerous
classes for writing custom code that interacts with HDInsight. The following examples demonstrate
some common scenarios:

Using the Microsoft Hive ODBC Driver in a .NET client

Using LINQ To Hive in a .NET client

Retrieving job output files with the .NET Framework


Considerations for using the .NET Framework


The following table describes specific considerations for using the .NET Framework to implement
custom client applications in the HDInsight use cases and models described in this guide.
| Use case | Considerations |
| --- | --- |
| Iterative data exploration | For one-time analysis or iterative exploration of data, writing a custom client application may be an inefficient way to consume the data unless the team analyzing the data have existing .NET development skills and plan to implement a custom client for a future big data processing solution. |
| Data warehouse on demand | In some cases a big data solution consists of a data warehouse based on HDInsight and a custom application that consumes data from the data warehouse. For example, the goal of a big data project might be to incorporate data from an HDInsight-based data warehouse into an ASP.NET web application. In this case, using the .NET Framework libraries for HDInsight is an appropriate choice. |
| ETL automation | The target of the ETL processes is typically a relational database. You might use a custom .NET application to upload source data to Azure blob storage and initiate the HDInsight jobs that encapsulate the ETL process. |
| BI integration | In an enterprise BI solution, users generally use established tools such as Excel or SQL Server Reporting Services to visualize data. However, you may use the .NET libraries for HDInsight to integrate big data into a custom BI application or business process. |

More information
For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference
Documentation.
For information on using the HDInsight SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.

Querying Hive tables with Windows PowerShell


To query a Hive table in a PowerShell script you can use the New-AzureHDInsightHiveJobDefinition and
Start-AzureHDInsightJob cmdlets, or you can use the Invoke-AzureHDInsightHiveJob cmdlet (which can
be abbreviated to Invoke-Hive). Generally, when the purpose of the script is simply to retrieve and
display the results of a Hive SELECT query, the Invoke-Hive cmdlet is the preferred option because using it
requires significantly less code.
The Invoke-Hive cmdlet can be used with a Query parameter to specify a hard-coded HiveQL query, or
with a File parameter that references a HiveQL script file stored in Azure blob storage. The following
code example uses the Query parameter to execute a hard-coded HiveQL query.
Windows PowerShell
$clusterName = "cluster-name"
$hiveQL = "SELECT obs_date, avg(temperature) FROM observations GROUP BY obs_date;"
Use-AzureHDInsightCluster $clusterName
Invoke-Hive -Query $hiveQL


Figure 1 shows how the results of this query are displayed in the Windows PowerShell ISE.

Figure 1 - Using the Invoke-Hive cmdlet in the Windows PowerShell ISE

Retrieving job output files with Windows PowerShell


Hive is the most commonly used Hadoop technology for big data processing in HDInsight. However, in
some scenarios the data may be processed using a technology such as Pig or custom map/reduce code,
which does not overlay the output files with a tabular schema that can be queried. In this case, your
custom PowerShell code must download the files generated by the HDInsight job and display their
contents.
The Get-AzureStorageBlobContent cmdlet enables you to download an entire blob path from an Azure
storage container, replicating the folder structure represented by the blob path on the local file system.
To use the Get-AzureStorageBlobContent cmdlet you must first instantiate a storage context by using
the New-AzureStorageContext cmdlet. This requires a valid storage key for your Azure storage account,
which you can retrieve by using the Get-AzureStorageKey cmdlet.


The Set-AzureStorageBlobContent cmdlet is used to copy local files to an Azure storage container. The
Set-AzureStorageBlobContent and Get-AzureStorageBlobContent cmdlets are often used together
when working with HDInsight to upload source data and scripts to Azure before initiating a data
processing job, and then to download the output of the job.
As an example, the following PowerShell code uses the Set-AzureStorageBlobContent cmdlet to upload
a Pig Latin script named SummarizeWeather.pig, which is then invoked using the New-AzureHDInsightPigJobDefinition and Start-AzureHDInsightJob cmdlets. The output file generated by the
job is downloaded using the Get-AzureStorageBlobContent cmdlet, and its contents are displayed using
the cat command.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"

# Find the folder where this script is saved.
$localFolder = Split-Path -Parent $MyInvocation.MyCommand.Definition

$destFolder = "weather/scripts"
$scriptFile = "SummarizeWeather.pig"
$outputFolder = "weather/output"
$outputFile = "part-r-00000"

# Upload Pig Latin script.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destFolder/$scriptFile"
$filename = "$localFolder\$scriptFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
Write-Host "$scriptFile uploaded to $containerName!"

# Run the Pig Latin script.
$jobDef = New-AzureHDInsightPigJobDefinition -File "wasb:///$destFolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Pig job submitted..."
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError

# Get the job output.
$remoteBlob = "$outputFolder/$outputFile"
Write-Host "Downloading $remoteBlob..."
Get-AzureStorageBlobContent -Container $containerName -Blob $remoteBlob -Context $blobContext -Destination $localFolder
cat $localFolder\$outputFolder\$outputFile


The SummarizeWeather.pig script in this example generates the average wind speed and highest temperature
for each date in the source data, and stores the results in the /weather/output folder as shown in the
following code example.
Pig Latin (SummarizeWeather.pig)
Weather = LOAD '/weather/data' USING PigStorage(',') AS (obs_date:chararray,
obs_time:chararray, weekday:chararray, windspeed:float, temp:float);
GroupedWeather = GROUP Weather BY obs_date;
AggWeather = FOREACH GroupedWeather GENERATE group, AVG(Weather.windspeed) AS
avg_windspeed, MAX(Weather.temp) AS high_temp;
DailyWeather = FOREACH AggWeather GENERATE FLATTEN(group) AS obs_date, avg_windspeed,
high_temp;
SortedWeather = ORDER DailyWeather BY obs_date ASC;
STORE SortedWeather INTO '/weather/output';

Figure 1 shows how the results of this script are displayed in the Windows PowerShell ISE.

Figure 1 - Using the Get-AzureStorageBlobContent cmdlet in the Windows PowerShell ISE
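To make the aggregation in SummarizeWeather.pig easier to follow, the sketch below reproduces the same grouping locally in Python on a few comma-delimited lines. It is an illustration rather than part of the solution: the function name and sample rows are hypothetical, and it assumes that every row has the five fields declared in the LOAD statement with valid numeric values.

```python
import csv
from collections import defaultdict


def summarize_weather(lines):
    """Average wind speed and highest temperature per date, sorted by date."""
    by_date = defaultdict(list)
    # Field order follows the LOAD ... AS clause in SummarizeWeather.pig.
    for obs_date, _obs_time, _weekday, windspeed, temp in csv.reader(lines):
        by_date[obs_date].append((float(windspeed), float(temp)))

    summary = []
    for obs_date in sorted(by_date):  # ORDER ... BY obs_date ASC
        rows = by_date[obs_date]
        avg_windspeed = sum(w for w, _ in rows) / len(rows)  # AVG(windspeed)
        high_temp = max(t for _, t in rows)                  # MAX(temp)
        summary.append((obs_date, avg_windspeed, high_temp))
    return summary


if __name__ == "__main__":
    sample = [
        "2013-01-01,00:00,Tue,4.0,10.0",
        "2013-01-01,12:00,Tue,6.0,14.0",
        "2012-12-31,09:00,Mon,2.0,8.0",
    ]
    for row in summarize_weather(sample):
        print(row)
```

A local version like this can be useful for checking a small sample of the source data before running the Pig job against the full data set in the cluster.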


Note that the script must include the name of the output file to be downloaded. In most cases, Pig jobs
generate files in the format part-r-0000x. Some map/reduce operations may create files with the format
part-m-0000x, and Hive jobs that insert data into new tables generate numeric filenames such as
000000_0. In most cases you will need to determine the specific filename(s) generated by your data
processing job before writing PowerShell code to download the output.
The contents of downloaded files can be displayed in the console using the cat command, as in the
example above, or you could open a file containing delimited text results in Excel.
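One way to avoid hard-coding a single output filename is to list the blobs in the output folder and keep only the names that match the patterns described above. The Python sketch below shows the filtering step only. The function name is hypothetical, the list of blob names is assumed to come from whatever storage listing call you use, and the exact digit counts in the patterns are inferred from the examples in this section.

```python
import re

# Filename patterns taken from the examples in this section.
_OUTPUT_FILE_PATTERNS = [
    re.compile(r"part-r-\d{5}"),  # reducer output, typical for Pig jobs
    re.compile(r"part-m-\d{5}"),  # output of map-only operations
    re.compile(r"\d{6}_\d+"),     # Hive table data files such as 000000_0
]


def output_files(blob_names, output_folder):
    """Return the blob names in output_folder that look like job output files."""
    prefix = output_folder.strip("/") + "/"
    matches = []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        file_name = name[len(prefix):]
        # fullmatch rejects marker files and anything in nested folders.
        if any(p.fullmatch(file_name) for p in _OUTPUT_FILE_PATTERNS):
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    blobs = [
        "weather/output/_SUCCESS",
        "weather/output/part-r-00000",
        "weather/output/part-r-00001",
        "weather/scripts/SummarizeWeather.pig",
    ]
    print(output_files(blobs, "/weather/output"))
```

Jobs that use more than one reducer produce several part files, so downloading every matching name and concatenating them in sorted order gives the complete result.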

Using the Microsoft Hive ODBC Driver in a .NET client


One of the easiest ways to consume data from HDInsight in a custom .NET application is to use the
System.Data.Odbc.OdbcConnection class with the Hive ODBC driver to query Hive tables in the
HDInsight cluster. This approach enables programmers to use the same data access classes and
techniques that are commonly used to retrieve data from relational database sources such as SQL
Server.
Due to the typical latency when using ODBC to connect to HDInsight, and the time taken to execute the
Hive query, you should use asynchronous programming techniques when opening connections and
executing commands. To make this easier, the classes in the System.Data and System.Data.Odbc
namespaces provide asynchronous versions of the most common methods.
The following code example shows a simple Microsoft C# console application that uses an ODBC data
source name (DSN) to connect to HDInsight, execute a Hive query, and display the results. The example
is deliberately kept simple by including the connection string in the code so that you can copy and paste
it while you are experimenting with HDInsight. In a production system you must protect connection
strings, as described in Securing credentials in scripts and applications in the Security section of this
guide.
C#
using System;
using System.Threading.Tasks;
using System.Data;
using System.Data.Odbc;
using System.Data.Common;

namespace HiveClient
{
    class Program
    {
        static void Main(string[] args)
        {
            // Wait for the asynchronous query to finish before prompting the user.
            GetData().Wait();
            Console.WriteLine("----------------------------------------------");
            Console.WriteLine("Press a key to end");
            Console.Read();
        }

        // Returns a Task (rather than void) so that the caller can wait for completion.
        static async Task GetData()
        {
            using (OdbcConnection conn =
                new OdbcConnection("DSN=Hive;UID=user-name;PWD=password"))
            {
                await conn.OpenAsync();
                OdbcCommand cmd = conn.CreateCommand();
                cmd.CommandText =
                    "SELECT obs_date, avg(temp) FROM weather GROUP BY obs_date;";
                using (DbDataReader dr = await cmd.ExecuteReaderAsync())
                {
                    while (await dr.ReadAsync())
                    {
                        Console.WriteLine(dr.GetDateTime(0).ToShortDateString()
                            + ": " + dr.GetDecimal(1).ToString("#00.00"));
                    }
                }
            }
        }
    }
}

The output from this example code is shown in Figure 1.

Figure 1 - Output retrieved from Hive using the OdbcConnection class


Using LINQ To Hive in a .NET client


Language-Integrated Query (LINQ) provides a consistent syntax for querying data sources in a .NET
application. Many .NET developers use LINQ to write object-oriented code that retrieves and
manipulates data from a variety of sources, taking advantage of type checking and IntelliSense as they
do so.
LINQ to Hive is a component of the .NET SDK for HDInsight that enables developers to write LINQ
queries that retrieve data from Hive tables, enabling them to use the same consistent approach to
consuming data from Hive as they do for other data sources.
The following code example shows how you can use LINQ to Hive to retrieve data from a Hive table in a
C# application. The example is deliberately kept simple by including the credentials in the code so that
you can copy and paste it while you are experimenting with HDInsight. In a production system you must
protect credentials, as described in Securing credentials in scripts and applications in the Security
section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.Hive;

namespace LinqToHiveClient
{
    class Program
    {
        static void Main(string[] args)
        {
            var db = new HiveDatabase(
                webHCatUri: new Uri("https://mycluster.azurehdinsight.net"),
                username: "user-name", password: "password",
                azureStorageAccount: "storage-account-name.blob.core.windows.net",
                azureStorageKey: "storage-account-key");
            var q = from x in
                        (from o in db.Weather
                         select new { o.obs_date, temp = o.temperature })
                    group x by x.obs_date into g
                    select new { obs_date = g.Key, temp = g.Average(t => t.temp) };
            q.ExecuteQuery().Wait();
            var results = q.ToList();
            foreach (var r in results)
            {
                Console.WriteLine(r.obs_date.ToShortDateString() + ": "
                    + r.temp.ToString("#00.00"));
            }
            Console.WriteLine("---------------------------------");
            Console.WriteLine("Press a key to end");
            Console.Read();
        }
    }

    public class HiveDatabase : HiveConnection
    {
        public HiveDatabase(Uri webHCatUri, string username, string password,
            string azureStorageAccount, string azureStorageKey)
            : base(webHCatUri, username, password,
                   azureStorageAccount, azureStorageKey) { }

        public HiveTable<WeatherRow> Weather
        {
            get
            {
                return this.GetTable<WeatherRow>("Weather");
            }
        }
    }

    public class WeatherRow : HiveRow
    {
        public DateTime obs_date { get; set; }
        public string obs_time { get; set; }
        public string day { get; set; }
        public float wind_speed { get; set; }
        public float temperature { get; set; }
    }
}

Notice that the code includes a class that inherits from HiveConnection, which provides an abstraction
for the Hive data source. This class contains a collection of tables that can be queried (in this case, a
single table named Weather). The table contains a collection of objects that represent the rows of data
in the table, each of which is implemented as a class that inherits from HiveRow. In this case, each row
from the Weather table contains the following fields:

obs_date

obs_time

day

wind_speed

temperature

The query in this example groups the data by obs_date and returns the average temperature value for
each date. The output from this example code is shown in Figure 1.

Figure 1 - Output retrieved using LINQ to Hive
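
To see what this query computes without connecting to a cluster, the following minimal sketch runs the same grouping with LINQ to Objects over a few in-memory rows. The WeatherSample and DailyAverages names and the sample values are made up for illustration and are not part of the .NET SDK for HDInsight. Note also that LINQ to Hive submits the query to the cluster for execution, whereas this sketch evaluates it in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical stand-in for WeatherRow; not part of the .NET SDK for HDInsight.
public class WeatherSample
{
    public DateTime obs_date { get; set; }
    public float temperature { get; set; }
}

public static class DailyAverages
{
    // Groups rows by observation date and averages the temperature,
    // mirroring the shape of the LINQ to Hive query.
    public static List<KeyValuePair<DateTime, float>> Compute(IEnumerable<WeatherSample> rows)
    {
        var q = from o in rows
                group o by o.obs_date into g
                select new KeyValuePair<DateTime, float>(g.Key, g.Average(t => t.temperature));
        return q.ToList();
    }

    // Made-up sample data for illustration only.
    public static List<WeatherSample> SampleRows()
    {
        return new List<WeatherSample>
        {
            new WeatherSample { obs_date = new DateTime(2014, 1, 1), temperature = 10f },
            new WeatherSample { obs_date = new DateTime(2014, 1, 1), temperature = 14f },
            new WeatherSample { obs_date = new DateTime(2014, 1, 2), temperature = 9f }
        };
    }

    public static void Main()
    {
        foreach (var r in Compute(SampleRows()))
        {
            Console.WriteLine(r.Key.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
                + ": " + r.Value.ToString("#00.00", CultureInfo.InvariantCulture));
        }
    }
}
```

Each distinct obs_date value produces one result row, regardless of how many observations were recorded on that date.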

Retrieving job output files with the .NET Framework


When the HDInsight jobs you have used to process your data do not generate Hive tables, you can
implement code to retrieve the results from the output files generated by the jobs. The following
examples demonstrate this:

Using the Microsoft .NET API for Hadoop WebClient Package

Using the Windows Azure Storage Library

Using the Microsoft .NET API for Hadoop WebClient Package


When your application project includes a reference to the Microsoft .NET API for Hadoop WebClient
package you can use the OpenFile method of the WebHDFSClient class to open the output files and
read their contents. This approach can be particularly convenient if you have already used classes in this
package to upload the source data and initiate the HDInsight jobs.
As an example, the following code shows a simple console application that reads and displays the
contents of an output file that is stored as a blob named /weather/output/part-r-00000. The example
is deliberately kept simple by including the credentials in the code so that you can copy
and paste it while you are experimenting with HDInsight. In a production system you must protect
credentials, as described in Securing credentials in scripts and applications in the Security section of
this guide.
C#
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;

namespace BlobClient
{
    class Program
    {
        static void Main(string[] args)
        {
            // Wait for the file to be read so its contents appear before the prompt.
            GetResult().Wait();
            Console.WriteLine("--------------------------");
            Console.WriteLine("Press a key to end");
            Console.Read();
        }

        // Returns a Task (rather than async void) so that the caller can wait for completion.
        static async Task GetResult()
        {
            var hdInsightUser = "user-name";
            var storageName = "storage-account-name";
            var storageKey = "storage-account-key";
            var containerName = "container-name";
            var outputFile = "/weather/output/part-r-00000";

            // Get the contents of the output file.
            var hdfsClient = new WebHDFSClient(hdInsightUser,
                new BlobStorageAdapter(storageName, storageKey, containerName, false));
            var response = await hdfsClient.OpenFile(outputFile);
            var content = await response.Content.ReadAsStringAsync();
            Console.WriteLine(content);
        }
    }
}

The output from this example code is shown in Figure 1.

Figure 1 - Output retrieved using the OpenFile method of the WebHDFSClient class

For more information about using the .NET SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.

Using the Windows Azure Storage Library

In some cases you may want to download the output files generated by HDInsight jobs so that they can
be opened in client applications such as Excel. You can use the CloudBlockBlob class in the Windows
Azure Storage package to download the contents of the blob to a file.
The following example shows how you can use the Windows Azure Storage package in an application to
download the contents of a blob to a file.
C#
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;

namespace BlobDownloader
{
    class Program
    {
        const string AZURE_STORAGE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;"
            + "AccountName=storage-account-name;AccountKey=storage-account-key";

        static void Main(string[] args)
        {
            CloudStorageAccount storageAccount =
                CloudStorageAccount.Parse(AZURE_STORAGE_CONNECTION_STRING);
            CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
            CloudBlobContainer container = blobClient.GetContainerReference("containername");
            CloudBlockBlob blob = container.GetBlockBlobReference("weather/output/part-r-00000");

            // File.Create truncates an existing file, so stale content from an
            // earlier, longer download cannot remain at the end of the file.
            using (var fileStream = File.Create(@".\results.txt"))
            {
                blob.DownloadToStream(fileStream);
                Console.WriteLine("Results downloaded to " + fileStream.Name);
            }
            Console.WriteLine("Press a key to end");
            Console.Read();
        }
    }
}

The output from this example code is shown in Figure 2.

Figure 2 - Downloading a blob with the CloudBlockBlob.DownloadToStream method


The Windows Azure Storage package enables a more versatile approach to consuming output files
generated by HDInsight jobs than the WebHDFSClient class in the Microsoft .NET API for Hadoop
WebClient package. In particular, you can use the other classes in the Windows Azure Storage package
to browse blob hierarchies in a container, and to download all of the blobs in a specific path. This makes
it easier to download results from HDInsight jobs that generate multiple output files, or in cases where
the exact name of an output file is unknown. The library also provides asynchronous methods, which are
useful when HDInsight jobs generate extremely large output files and you do not want the download to
block the calling thread.

For more information about using the classes in the Windows Azure Storage package see How to use
Blob Storage from .NET.

Building end-to-end solutions using HDInsight


Many scenarios for using a big data solution such as HDInsight will focus on exploring data, perhaps
from newly discovered sources, and then iteratively refining the queries and transformations used to
find insights within that data. After you discover questions that provide useful and valid information
from the data, and determine the tasks that are required to accomplish this, you will probably want to
explore how you can automate and manage the entire solution.
Alternatively, you may already have a definite plan for using HDInsight, perhaps as an ETL automation
mechanism, as a data warehouse, or for integration with an existing BI system. In all of these scenarios,
automation can help you to more easily execute repeated processes in a predictable way, and with a
reduced chance of operator error.
Figure 1 shows the typical stages and some of the tasks in a big data solution, for which you may decide
to automate all or selected parts.

Figure 1 - The typical tasks in an end-to-end big data solution

The automation and orchestration of these tasks must be planned carefully to create an overall solution
that performs efficiently and can be easily integrated into business practices. The more complex your big
data processing requirements, the more important it is to plan the coordination of all the moving
parts in the solution to achieve the required results in as efficient and error-free a way as possible.
This section of the guide focuses on building end-to-end solutions that minimize the need for operator
or administrator intervention, maximize the security of the process and the data, and provide sufficient
information to be able to monitor solutions. This section is divided into two distinct topic areas:

Designing end-to-end solutions. This includes planning the solution to meet the requirements of
dependencies, constraints, and consistency; protecting the application, the data, and the
cluster; and implementing scheduling for the overall process and the individual tasks.

Monitoring and logging. This includes monitoring the cluster itself and the individual tasks,
auditing operations, and accessing log files.

More information
For more information about HDInsight see the Microsoft Azure HDInsight web page.
See Collecting and loading data into HDInsight for more details and considerations for provisioning a
cluster and storage, and uploading data to a big data solution such as HDInsight.
See Processing, querying, and transforming data using HDInsight for more details and considerations for
processing big data with HDInsight.
See Consuming and visualizing data from HDInsight for more details and considerations for consuming
the output of big data processing jobs.
See Appendix A - Tools and technologies reference for information about the many tools, frameworks,
utilities, and technologies you can adopt to help automate an end-to-end solution.

Designing end-to-end solutions


Automation enables you to avoid some or all of the manual operations required to perform your specific
big data processing tasks. Unless you are simply experimenting with some data you will probably want
to create a completely automated end-to-end solution. For example, you may want to make a solution
repeatable without requiring manual interaction every time, perhaps incorporate a workflow, and even
execute the entire solution automatically on a schedule. HDInsight supports a range of technologies and
techniques to help you achieve this, several of which are used in the example scenario you've already
seen in this guide.
You can think of an end-to-end big data solution as being a process that encompasses multiple discrete
sub-processes. Throughout this guide you have seen how to automate these individual sub-processes
using a range of tools such as Windows PowerShell, the .NET SDK for HDInsight, SQL Server Integration
Services, Oozie, and command line tools.

A typical big data process might consist of the following sub-processes:

Data ingestion: Source data is loaded to Azure storage, ready for processing. For details of how
you can automate individual tasks for data ingestion see Custom data upload clients in the
section Collecting and loading data into HDInsight.

Cluster provisioning: When the data is ready to be processed, a cluster is provisioned. For
details of how you can automate cluster provisioning see Custom cluster management clients in
the section Collecting and loading data into HDInsight.

Job submission and management: One or more jobs are executed on the cluster to process the
data and generate the required output. For details of how you can automate individual tasks for
submitting and managing jobs see Building custom clients in the section Processing, querying,
and transforming data using HDInsight.

Data consumption: The job output is retrieved from HDInsight, either directly by a client
application or through data transfer to a permanent data store. For details of how you can
automate data consumption tasks see Building custom clients in the section Consuming and
visualizing data from HDInsight.

Cluster deletion: The cluster is deleted when it is no longer required to process data or service
Hive queries. For details of how you can delete a cluster see Custom cluster management clients
in the section Collecting and loading data into HDInsight.

Data visualization: The retrieved results are visualized and analyzed, or used in a business
application. For details of tools for visualizing and analyzing the results see the section
Consuming and visualizing data from HDInsight.

However, before beginning to design an automated solution, it is sensible to start by identifying the
dependencies and constraints in your specific data processing scenario, and considering the
requirements for each stage in the overall solution. For example, you must consider how to coordinate
the automation of these operations as a whole, as well as planning the scheduling of each discrete task.
This section includes the following topics related to designing automated end-to-end solutions:

Workflow dependencies and constraints

Task design and context

Coordinating solutions and tasks

Scheduling solution and task execution

Security

Considerations
Consider the following points when designing and implementing end-to-end solutions around HDInsight:

Analyze the requirements for the solution before you start to implement automation. Consider
factors such as how the data will be collected, the rate at which it arrives, the timeliness of the
results, the need for quick access to aggregated results, and the consequent impact of the
speed of processing each batch. All of these factors will influence the processes and
technologies you choose, the batch size for each process, and the overall scheduling for the
solution.

Automating a solution can help to minimize errors for tasks that are repeated regularly, and by
setting permissions on the client-side applications that initiate jobs and access the data you can
also limit access so that only your authorized users can execute them. Automation is likely to be
necessary for all types of solutions except those where you are just experimenting with data
and processes.

The individual tasks in your solutions will have specific dependencies and constraints that you
must accommodate to achieve the best overall data processing workflow. Typically these
dependencies are time based and affect how you orchestrate and schedule the tasks and
processes. Not only must they execute in the correct order, but you may also need to ensure
that specific tasks will be completed before the next one begins. See Workflow dependencies
and constraints for more information.

Consider if you need to automate the creation of storage accounts to hold the cluster data, and
decide when this should occur. HDInsight can automatically create one or more linked storage
accounts for the data as part of the cluster provisioning process. Alternatively, you can
automate the creation of linked storage accounts before you create a cluster, and non-linked
storage accounts before or after you create a cluster. For example, you might automate
creating a new storage account, loading the data, creating a cluster that uses the new storage
account, and then executing a job. For more information about linked and non-linked storage
accounts see Cluster and storage initialization in the section Collecting and loading data into
HDInsight.

Consider the end-to-end security of your solution. You must protect the data from unauthorized
access and tampering when it is in storage and on the wire, and secure the cluster as a whole to
prevent unauthorized access. See Security for more details.

As with any complex multi-step solution, it is important to make monitoring and
troubleshooting as easy as possible by maintaining detailed logs of the individual stages of the
overall process. This typically requires comprehensive exception handling as well as planning
how to log the information. See Monitoring and logging for more information.

Workflow dependencies and constraints


A big data batch processing solution typically consists of multiple steps that take some source data,
apply transformations to shape and filter it, and consume the results in an application or analytical
system. As with all workflows, this sequence of tasks usually includes some dependencies and
constraints that you must take into account when planning the solution. Typical dependencies and
constraints include:

Minimum latency toleration in consuming systems. How up-to-date does the data need to be
in reports, data models, and applications that consume the data processing results?

Volatility of source data. How frequently does the source data get updated or added to?

Data source dependencies. Are there data processing tasks for which data from one source
cannot be processed until data from another source is available?

Duration of processing tasks. How long does it typically take to complete each task in the
workflow?

Resource contention for existing workloads. To what degree can data processing operations
degrade the performance and scalability of systems that are in use for ongoing business
processes?

Cost. What is the budget for the employee time and infrastructure resources used to process
the data?

An example scenario
As an example, consider a scenario in which business analysts want to use an Excel report in an Office
365 Power BI site to visualize web server activity for an online retail site. The data in Excel is in a
PivotTable, which is based on a connection to Azure SQL Database. The web server log data must be
processed using a Pig job in HDInsight, and then loaded into SQL Database using Sqoop. The business
analysts want to be able to view daily page activity for each day the site is operational, up to and
including the previous day.
To plan a workflow for this requirement, you might consider the following questions:

How up-to-date does the data need to be in reports, data models, and applications that
consume the data processing results?

The requirement is that the Excel workbook includes all data up to and including the
previous day's web server logs, so a solution is required that refreshes the workbook as
soon as possible after the last log entry of the day has been processed. In a 24x7 system
this means that the data must be processed daily, just after midnight.

How frequently does the source data get updated or added to?

This depends on how active the website is. Many large online retailers handle thousands of
requests per second, so the log files may grow extremely quickly.

Are there data processing tasks for which data from one source cannot be processed until data
from another source is available?

If analysis in Excel is limited to just the website activity, there are no dependencies between
data sources. However, if the web server log data must be combined with sales data
captured by the organization's e-commerce application (perhaps based on mapping
product IDs in web request query strings to orders placed) there may be a requirement to
capture and stage data from both sources before processing it.

How long does it typically take to complete each task in the workflow?

You will need to test samples of data to determine this. Based on the high volatility of the
source data, and the requirement to include log entries right up to midnight, you might find
that it takes a long time to upload a single daily log file and process it with
HDInsight before the Excel workbook can be refreshed with the latest data. You might
therefore decide that a better approach is to use hourly log files and perform multiple
uploads during the day, or capture the log data in real-time using a tool such as Flume and
write it directly to Azure storage. You could also process the data periodically during the
day to reduce the volume of data to be processed at midnight, enabling the Excel workbook
to be refreshed within a smaller time period.

To what degree can data processing operations degrade the performance and scalability of
systems that are in use for ongoing business processes?

There may be some impact on the web servers as the log files are read, and you should test
the resource utilization overhead this causes.

What is the budget for the employee time and infrastructure resources used to process the
data?

The process can be fully automated, which minimizes human resource costs. The main
running cost is the HDInsight cluster, and you can mitigate this by only provisioning the
cluster when it is needed to perform the data processing jobs. For example, you could
design a workflow in which log files are uploaded to Azure storage on an hourly basis, but
the HDInsight cluster is only provisioned at midnight when the last log file has been
uploaded, and then deleted when the data has been processed. If processing the logs for
the entire day takes too long to refresh the Excel workbook in a timely fashion, you could
automate provisioning of the cluster and processing of the data twice per day.

In the previous example, based on measurements you make by experimenting with each stage of the
process, you might design a workflow in which:
1. The web servers are configured to create a new log each hour.
2. On an hourly schedule the log files for the previous hour are uploaded to Azure storage. For
the purposes of this example, uploading an hourly log file takes between three and five
minutes.
3. At noon each day an HDInsight cluster is provisioned and the log files for the day so far are
processed. The results are then transferred to Azure SQL Database, and the cluster and the
log files that have been processed are deleted. For the purposes of the example, this takes
between five and ten minutes.
4. The remaining logs for the day are uploaded on an hourly schedule.
5. At midnight an HDInsight cluster is provisioned and the log files for the day so far are
processed. The results are then transferred to SQL Database and the cluster and the log files
that have been processed are deleted. For the purposes of the example, this takes between
five and ten minutes.
6. Fifteen minutes later the data model in the Excel workbook is refreshed to include the new
data that was added to the SQL Database tables during the two data processing activities
during the day.
The specific dependencies and constraints in each big data processing scenario can vary significantly.
However, spending the time upfront to consider how you will accommodate them will help you plan and
implement an effective end-to-end solution.
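
As a worked check of the timings in the example workflow, the following minimal sketch computes how much slack remains before the workbook refresh. It uses the worst-case durations quoted in the steps (five minutes for an upload, ten minutes for processing) and the fifteen-minute refresh delay. It also assumes that the final hourly upload starts at midnight and that processing starts only when that upload has finished, which the steps above do not state explicitly. The RefreshWindow class is a hypothetical name used only for this illustration.

```csharp
using System;

public static class RefreshWindow
{
    // Worst-case durations quoted in the example workflow.
    public static readonly TimeSpan MaxUpload = TimeSpan.FromMinutes(5);
    public static readonly TimeSpan MaxProcessing = TimeSpan.FromMinutes(10);
    public static readonly TimeSpan RefreshDelay = TimeSpan.FromMinutes(15);

    // Assumption: the last hourly upload starts at midnight and processing
    // starts only when that upload has finished.
    public static TimeSpan WorstCaseCompletion(TimeSpan upload, TimeSpan processing)
    {
        return upload + processing;
    }

    // Time left between the worst-case completion and the scheduled refresh.
    public static TimeSpan Slack(TimeSpan upload, TimeSpan processing, TimeSpan refreshDelay)
    {
        return refreshDelay - WorstCaseCompletion(upload, processing);
    }

    public static void Main()
    {
        TimeSpan done = WorstCaseCompletion(MaxUpload, MaxProcessing);
        TimeSpan slack = Slack(MaxUpload, MaxProcessing, RefreshDelay);
        Console.WriteLine("Worst-case completion: " + done.TotalMinutes + " minutes after midnight");
        Console.WriteLine("Slack before refresh: " + slack.TotalMinutes + " minutes");
        if (slack <= TimeSpan.Zero)
        {
            Console.WriteLine("No margin: consider a later refresh or more frequent processing.");
        }
    }
}
```

Under these assumptions the worst case leaves no margin, which is the kind of finding that measuring each stage is intended to reveal before you fix the schedule.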

Task design and context


When you are designing the individual tasks for an automated big data solution, and how they will be
combined and scheduled, you should consider the following factors:

Task execution context

Task parameterization

Data consistency

Exception handling and logging

Task execution context


When you plan automated tasks, you must determine the user identity under which the task will be
executed, and ensure that it has sufficient permissions to access any files or services it requires. Ensure
that the context under which components, tools, scripts, and custom client applications will execute has
sufficient but not excessive permissions, and for just the necessary resources.
In particular, if the task uses Azure PowerShell cmdlets or .NET SDK for HDInsight classes to access Azure
services, you must ensure that the execution context has access to the required Azure management
certificates, credentials, and storage account keys. However, avoid storing credential information in
scripts or code; instead load these from encrypted configuration files where possible.
When you plan to schedule automated tasks you must identify the account that will be used by the task
when it runs in an unattended environment. The context under which on-premises components, tools,
scripts, and custom client applications execute requires sufficient permission to access certificates,
publishing settings files, files on the local file system (or on a remote file share), SSIS packages that use
DTExec, and other resources, but not HDInsight itself because these credentials will be provided in the
scripts or code.
Windows Task Scheduler enables you to specify Windows credentials for each scheduled task, and the
SQL Server Agent enables you to define proxies that encapsulate credentials with access to specific
subsystems for individual job steps. For more information about SQL Server Agent proxies and
subsystems see Implementing SQL Server Agent Security.
Task parameterization
Avoid hard-coding variable elements in your big data tasks. This may include file locations, Azure service
names, Azure storage access keys, and connection strings. Instead, design scripts, custom applications,
and SSIS packages to use parameters or encrypted configuration settings files to assign these values
dynamically. This can improve security, as well as maximizing reuse, minimizing development effort, and
reducing the chance of errors caused by multiple versions that might have subtle differences. See
Securing credentials in scripts and applications in the Security section of this guide for more
information.
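
The following minimal sketch shows one way a custom .NET client might avoid hard-coded values. It reads each setting from a command-line argument of the form --name=value, falls back to an environment variable of the same name, and fails immediately if neither is present. The TaskSettings class and the CONTAINER_NAME and OUTPUT_PATH setting names are hypothetical and are not part of any SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names here are illustrative, not part of any SDK.
public class TaskSettings
{
    private readonly Dictionary<string, string> values =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Parses arguments of the form --name=value and ignores anything else.
    public TaskSettings(string[] args)
    {
        foreach (string arg in args)
        {
            if (!arg.StartsWith("--", StringComparison.Ordinal)) continue;
            int eq = arg.IndexOf('=');
            if (eq < 3) continue; // no '=' or an empty name
            values[arg.Substring(2, eq - 2)] = arg.Substring(eq + 1);
        }
    }

    // Returns the command-line value if present, otherwise the environment
    // variable of the same name; throws if neither is set.
    public string Get(string name)
    {
        string value;
        if (values.TryGetValue(name, out value)) return value;
        value = Environment.GetEnvironmentVariable(name);
        if (!string.IsNullOrEmpty(value)) return value;
        throw new InvalidOperationException("Missing required setting: " + name);
    }
}

public static class ParameterizedTask
{
    public static void Main(string[] args)
    {
        var settings = new TaskSettings(args);
        try
        {
            Console.WriteLine("Container: " + settings.Get("CONTAINER_NAME"));
            Console.WriteLine("Output path: " + settings.Get("OUTPUT_PATH"));
        }
        catch (InvalidOperationException ex)
        {
            Console.Error.WriteLine(ex.Message);
        }
    }
}
```

This sketch covers only non-secret values such as paths and service names. Storage keys and passwords are better loaded from an encrypted configuration file, as recommended above, than passed on a command line.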
When using SQL Server 2012 Integration Services or later, you can define project-level parameters and
connection strings that can be set using environment variables for a package deployed in an SSIS
catalog. For example, you could create an SSIS package that encapsulates your big data process and
deploy it to the SSIS catalog on a SQL Server instance. You can then define named environments (for
example, "Test" or "Production"), and set default parameter values to be used when the package is run
in the context of a particular environment. When you schedule an SSIS package to be run using a SQL
Server Agent job you can specify the environment to be used.
If you use project-level parameters in an SSIS project, ensure that you set the Sensitive option for any
parameters that must be encrypted and stored securely. For more information see Integration Services
(SSIS) Parameters.
Data consistency
Partial failures in a data processing workflow can lead to inconsistent results. In many cases, analysis
based on inconsistent data can be more harmful to a business than no analysis at all.
When using SSIS to coordinate big data processes, use the control flow checkpoint feature to support
restarting the package at the point of failure.
Consider adding custom fields to enable lineage tracking of all data that flows through the process. For
example, add a field to all source data with a unique batch identifier that can be used to identify data
that was ingested by a particular instance of the workflow process. You can then use this identifier to
reverse all changes that were introduced by a failed instance of the workflow process.
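
The batch identifier approach can be sketched as follows. This is a minimal in-memory illustration, assuming a GUID is used as the batch identifier; the StagedRow and LineageDemo names are hypothetical, and in a real solution the same idea would be applied to the staged files or database tables rather than to a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical staged record; BatchId is the custom lineage field.
public class StagedRow
{
    public Guid BatchId { get; set; }
    public string Payload { get; set; }
}

public static class LineageDemo
{
    // Tags every source line with the identifier of the workflow
    // instance that ingested it.
    public static List<StagedRow> Ingest(IEnumerable<string> sourceLines, Guid batchId)
    {
        return sourceLines
            .Select(line => new StagedRow { BatchId = batchId, Payload = line })
            .ToList();
    }

    // Reverses a failed workflow instance by removing only the rows it
    // introduced. Returns the number of rows removed.
    public static int Rollback(List<StagedRow> store, Guid failedBatchId)
    {
        return store.RemoveAll(r => r.BatchId == failedBatchId);
    }

    public static void Main()
    {
        var store = new List<StagedRow>();
        Guid goodBatch = Guid.NewGuid();
        Guid failedBatch = Guid.NewGuid();
        store.AddRange(Ingest(new[] { "row 1", "row 2" }, goodBatch));
        store.AddRange(Ingest(new[] { "row 3" }, failedBatch));

        int removed = Rollback(store, failedBatch);
        Console.WriteLine("Removed " + removed + " row(s); " + store.Count + " remain.");
    }
}
```

Because every row carries its batch identifier, rows from earlier successful runs are left untouched when a failed run is reversed.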
Exception handling and logging
In any complex workflow, errors or unexpected events can cause exceptions that prevent the workflow
from completing successfully. When an error occurs in a complex workflow, it can be difficult to
determine what went wrong.
Most developers are familiar with common exception handling techniques, and you should ensure that
you apply these to all custom code in your solution. This includes custom .NET applications, PowerShell
scripts, map/reduce components, and Transact-SQL scripts. Implementing comprehensive logging
functionality for both successful and unsuccessful operations in all custom scripts and applications helps
to create a source of troubleshooting information in the event of a failure, as well as generating useful
monitoring data.
If you use PowerShell or custom .NET code to manage job submission and Oozie workflows, capture the
job output returned to the client and include it in your logs. This helps centralize the logged information,
making it easier to find issues that would otherwise require you to examine separate logs in the
HDInsight cluster (which may have been deleted at the end of a partially successful workflow).
If you use SSIS packages to coordinate big data processing tasks, take advantage of the native logging
capabilities in SSIS to record details of package execution, errors, and parameter values. You can also
take advantage of the detailed log reports that are generated for packages deployed in an SSIS catalog.
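
For custom .NET code, the logging guidance above might be sketched as a small wrapper that records the start, the captured output, and the success or failure of each step. The StepRunner class is a hypothetical name, and the step bodies in Main are placeholders for real job submission code.

```csharp
using System;
using System.IO;

public static class StepRunner
{
    // Runs one workflow step and logs its start, its captured output, and
    // whether it succeeded or failed. Returns true on success so that the
    // caller can decide whether to continue with the next step.
    public static bool Run(string name, Func<string> step, TextWriter log)
    {
        log.WriteLine(Stamp() + " START " + name);
        try
        {
            string output = step();
            log.WriteLine(Stamp() + " OUTPUT " + name + ": " + output);
            log.WriteLine(Stamp() + " SUCCESS " + name);
            return true;
        }
        catch (Exception ex)
        {
            log.WriteLine(Stamp() + " FAILED " + name + ": "
                + ex.GetType().Name + ": " + ex.Message);
            return false;
        }
    }

    // UTC timestamp in round-trip (ISO 8601) format.
    private static string Stamp()
    {
        return DateTime.UtcNow.ToString("o");
    }

    public static void Main()
    {
        // Placeholder steps; the second one simulates a failed job.
        bool ok = Run("upload", () => "3 files uploaded", Console.Out)
               && Run("process", () => { throw new InvalidOperationException("job failed"); }, Console.Out)
               && Run("export", () => "not reached", Console.Out);
        Console.WriteLine(ok ? "Workflow completed" : "Workflow stopped; see log above");
    }
}
```

Writing to a TextWriter means the same wrapper can log to the console while you experiment and to a file when the task runs unattended.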

Coordinating solutions and tasks


There are numerous options for coordinating the automation of the end-to-end process. Some common
options on the Windows platform include scripts, custom applications, and tools such as SQL Server
Integration Services (SSIS). This topic describes the following techniques:

Coordinating the process with Windows PowerShell scripts

Coordinating the process with a custom .NET application or service

Coordinating the process with SSIS

Coordinating the process with Windows PowerShell scripts


You can use PowerShell to automate almost all aspects of a typical HDInsight solution. You might use a
single script or (more likely) a collection of related scripts that can be run interactively. These scripts
may be called from a single control script, daisy-chained from one another so that each one starts the
next, or each one could be scheduled to run at predetermined times.
Examples of tasks that can be automated with PowerShell include:

Provisioning and deleting Azure storage accounts and HDInsight clusters.

Uploading data and script files to Azure storage.

Submitting map/reduce, Pig, and Hive jobs that process data.

Running a Sqoop job to transfer data between HDInsight and a relational database.

Starting an Oozie workflow.

Downloading output files generated by jobs.

Executing the DTExec.exe command line tool to run an SSIS package.

Running an XMLA command in SQL Server Analysis Services (SSAS) to process a data model.

PowerShell is often the easiest way to automate individual tasks or sub-processes, and can be a good
choice for relatively simple end-to-end processes that have minimal steps and few conditional branches.
However, the dependency on one or more script files can make it fragile for complex solutions.
The following table shows how PowerShell can be used to automate an end-to-end solution for each of
the big data use cases and models discussed in this guide.
Use case: Iterative data exploration
Considerations: Iterative data exploration is usually an interactive process that is performed by a small
group of data analysts. PowerShell provides an easy-to-implement solution for automating
on-demand provisioning and deletion of an HDInsight cluster, and for uploading commonly
reused data and script files to Azure storage. The data analysts can then use PowerShell
interactively to run jobs that analyze the data.

Use case: Data warehouse on demand
Considerations: In this scenario you can use a PowerShell script to upload new data files to Azure storage,
provision the cluster, recreate Hive tables, refresh reports that are built on them, and then
delete the cluster.

Use case: ETL automation
Considerations: In a simple ETL solution you can encapsulate the jobs that filter and shape the data in an
Oozie workflow, which can be initiated from a PowerShell script. If the source and/or
destination of the ETL process is a relational database that is accessible from the
HDInsight cluster, you can use Sqoop actions in the Oozie workflow. Otherwise you can
use PowerShell to upload source files and download output files, or to run an SSIS
package using the DTExec.exe command line tool.

Use case: BI integration
Considerations: In an enterprise BI integration scenario there is generally an existing established ETL
coordination solution based on SSIS, and the processing of big data with HDInsight can
be added to this solution. Some of the processing tasks may be automated using
PowerShell scripts that are initiated by SSIS packages.
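
The data warehouse on demand sequence described above (upload, provision, recreate tables, refresh reports, delete) can be sketched as a small coordinator. The example uses Python rather than PowerShell purely for illustration, and every step function is a hypothetical placeholder, not part of any HDInsight SDK. It shows one possible way to make sure the cluster is deleted even when a step that runs against the cluster fails.

```python
from typing import Callable, List

Step = Callable[[], None]


def run_on_demand_warehouse(upload: Step, provision: Step,
                            cluster_steps: List[Step], delete: Step) -> None:
    """Upload data, provision a cluster, run the steps, then delete the cluster."""
    upload()
    provision()
    try:
        for step in cluster_steps:
            step()      # an exception here skips the remaining steps
    finally:
        delete()        # runs whether the steps succeeded or failed


# Hypothetical placeholder steps; real ones would call scripts or SDK code.
log: List[str] = []

def upload_data() -> None: log.append("upload")
def provision_cluster() -> None: log.append("provision")
def recreate_hive_tables() -> None: log.append("hive")
def refresh_reports() -> None: log.append("reports")
def delete_cluster() -> None: log.append("delete")

run_on_demand_warehouse(upload_data, provision_cluster,
                        [recreate_hive_tables, refresh_reports],
                        delete_cluster)
print(log)
```

Placing the delete step in a `finally` clause is a design choice in this sketch: it avoids leaving a billable cluster running after a failed job, at the cost of losing any diagnostic state held only on the cluster.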

Coordinating the process with a custom .NET application or service


The .NET SDK for HDInsight provides a range of classes and interfaces that developers can use to interact
with HDInsight, and additional .NET APIs enable integration with other software in the Microsoft data
platform, such as SQL Server. This makes it possible to build custom applications for big data processing,
or to enhance existing applications to integrate them with HDInsight.
Building a custom application or service to coordinate an HDInsight process is appropriate when you
want to encapsulate your big data solution as a Windows service, or when you need to implement a
business application that makes extensive use of big data processing.
The following table shows how custom .NET code can be used to automate an end-to-end solution for
each of the big data use cases and models discussed in this guide.
Use case: Iterative data exploration
Considerations: Iterative data exploration is usually an interactive process that is performed by a small
group of data analysts. Building a custom application to manage this process is usually
not required. However, if the iterative analysis evolves into a useful, repeatable business
practice you may want to implement a custom application that integrates the analytical
processing into a business process.

Use case: Data warehouse on demand
Considerations: If the data warehouse is specifically designed to help analysts examine data that is
generated by a custom business application, you might integrate the process of uploading
new data, provisioning a cluster, recreating Hive tables, refreshing reports, and deleting
the cluster into the application using classes and interfaces from the .NET SDK for
HDInsight.

Use case: ETL automation
Considerations: As in the data warehouse on demand scenario, if the ETL process is designed to take the
output from a particular application and process it for analysis you could manage the
entire ETL process from the application itself.

Use case: BI integration
Considerations: In this scenario there is generally an existing established ETL coordination solution based
on SSIS, and the processing of big data with HDInsight can be added to this solution.
Some of the processing tasks may be automated using custom SSIS tasks, which are
implemented using .NET code.

Coordinating the process with SSIS


SQL Server Integration Services (SSIS) provides a platform for implementing complex control flows and
data pipelines that can be coordinated and managed centrally. Even if you are not planning to use SSIS
data flows to transfer data into and out of Azure storage, you can still make use of SSIS control flows to
coordinate the sub-processes in your HDInsight-based big data process.
SSIS includes a wide range of control flow tasks that you can use in your process, including:

Data Flow. Data flow tasks encapsulate the transfer of data from one source to another, with
the ability to apply complex transformations and data cleansing logic as the data is transferred.

Execute SQL. You can use Execute SQL tasks to run SQL commands in relational databases. For
example, after using Sqoop to transfer the output of a big data processing job to a staging table
you could use an Execute SQL task to load the staged data into a production table.

File System. You can use a File System task to manipulate files on the local file system. For
example, you could use a File System task to prepare files for upload to Azure storage.

Execute Process. You can use an Execute Process task to run a command, such as a custom
command line utility or a PowerShell script.

Analysis Services Processing. You can use an Analysis Services Processing task to process
(refresh) an SSAS data model. For example, after completing a job that creates Hive tables over
new data you could process any SSAS data models that are based on those tables to refresh the
data in the model.

Send Mail. You can use a Send Mail task to send a notification email to an operator when a
workflow is complete, or when a task in the workflow fails.

Additionally, you can use a Script task or create a custom task using .NET code to perform custom
actions.

SSIS control flows use precedence constraints to implement conditional branching, enabling you to
create complex workflows that handle exceptions or perform actions based on variable conditions. SSIS
also provides native logging support, making it easier to troubleshoot errors in the workflow.
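
The branching behavior of precedence constraints can be illustrated with a simplified model. The sketch below is plain Python and is not the SSIS object model; the task names and the `run_control_flow` function are hypothetical. It assumes an acyclic flow in which each constraint links a source task to a target task on success, on failure, or on completion regardless of outcome.

```python
from typing import Callable, Dict, List, Tuple

SUCCESS, FAILURE, COMPLETION = "success", "failure", "completion"


def run_control_flow(tasks: Dict[str, Callable[[], None]],
                     constraints: List[Tuple[str, str, str]],
                     start: str) -> List[str]:
    """Run `start`, then follow (source, condition, target) constraints."""
    executed: List[str] = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        try:
            tasks[name]()
            outcome = SUCCESS
        except Exception:
            outcome = FAILURE
        executed.append(f"{name}:{outcome}")
        # Queue every target whose constraint matches this task's outcome.
        for source, condition, target in constraints:
            if source == name and condition in (outcome, COMPLETION):
                pending.append(target)
    return executed


def failing_job() -> None:
    raise RuntimeError("job failed")

def succeed() -> None:
    pass

result = run_control_flow(
    {"run_job": failing_job, "load_table": succeed, "send_mail": succeed},
    [("run_job", SUCCESS, "load_table"), ("run_job", FAILURE, "send_mail")],
    start="run_job")
print(result)
```

In this run the job fails, so only the failure branch (the notification task) executes and the load step is skipped.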
The following table shows how SSIS can be used to automate an end-to-end solution for each of the big
data use cases and models discussed in this guide.
Use case: Iterative data exploration
Considerations: Iterative data exploration is usually an interactive process that is performed by a small
group of data analysts. Building an SSIS solution to automate this is unlikely to be of any
significant benefit.

Use case: Data warehouse on demand
Considerations: SSIS is designed to coordinate the transfer of data from one store to another, and can be
used effectively for large volumes of data that require transformation and cleansing before
being loaded into the target data warehouse. When the target is a Hive database in
HDInsight, you can use SSIS Execute Process tasks to run command line applications or
PowerShell scripts that provision HDInsight, load data to Azure storage, and create Hive
tables. You can then use an Analysis Services Processing task to process any SSAS data
models that are based on the data warehouse.

Use case: ETL automation
Considerations: Although SSIS itself can be used to perform many ETL tasks, when the data must be
shaped and filtered using big data processing techniques in HDInsight you can use SSIS
to coordinate scripts and commands that provision the cluster, perform the data
processing jobs, export the output to a target data store, and delete the cluster.

Use case: BI integration
Considerations: In an enterprise BI integration scenario there is generally an existing established ETL
coordination solution based on SSIS, and the processing of big data with HDInsight can
be added to this solution.

See SQL Server Integration Services in the MSDN Library for information about how to use SQL Server
Integration Services (SSIS) to automate and coordinate tasks.

Scheduling solution and task execution


Many data processing solutions are based on workflows containing recurring tasks that must execute at
specific times. Data is often uploaded and processed as batches on regular schedules, such as overnight
or at the end of each month. This may depend on a number of factors such as when all of the source
data is available, when the processed results must be available for data consumers, and how long each
step in the process takes to run.
Typically, initiation of an entire end-to-end solution or an individual task is one of the following types:

Interactive: The operation is started on demand by a human operator. For example, a user
might run a PowerShell script to provision a cluster.

Scheduled: The operation is started automatically at a specified time. For example, the
Windows Task Scheduler application could be used to run a PowerShell script or custom tool
automatically at midnight to upload daily log files to Azure storage.

Triggered: The operation is started automatically by an event, or by the completion (or failure)
of another operation. For example, you could implement a custom Windows service that
monitors a local folder. When a new file is created, the service automatically uploads it to Azure
storage.

After the initial process or task has been started, it can start each sub-process automatically.
Alternatively, you can start them on a scheduled basis that allows sufficient time for all dependent
sub-processes to complete.
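
The triggered style of initiation can be sketched as a function that a monitoring service calls repeatedly. The example below is a minimal Python illustration, not a Windows service; the `upload` callback is a hypothetical stand-in for whatever code sends a file to Azure storage, and the function simply reports files it has not seen before.

```python
import os
from typing import Callable, Set


def upload_new_files(folder: str, seen: Set[str],
                     upload: Callable[[str], None]) -> int:
    """Upload files in `folder` that are not yet in `seen`; return the count."""
    count = 0
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and path not in seen:
            upload(path)        # hypothetical upload callback
            seen.add(path)      # remember the file so it is sent only once
            count += 1
    return count
```

A long-running service would call `upload_new_files` in a loop with a short delay between passes, or replace the polling with a file system notification mechanism.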
This topic discusses two different scheduling aspects for automated solutions:

Scheduling automated tasks

Scheduling data refresh in consumers

Scheduling automated tasks


Regardless of how you decide to implement the automated tasks in your big data process, you can
choose from a variety of ways to schedule them for execution automatically at a specific time. Options
for scheduling tasks include:

Windows Task Scheduler. You can use the Task Scheduler administrative tool (or the
schtasks.exe command line program) to configure one-time or recurring commands, and specify
a wide range of additional properties and behavior for each task. You can use this tool to trigger
a command at specific times, when a specific event is written to the Windows event log, or in
response to other system actions. Commands you can schedule with the Windows Task
Scheduler include:

Custom or third party application executables and batch files.

PowerShell.exe (with a parameter referencing the PowerShell script to be run).

DTExec.exe (with a parameter referencing the SSIS package to be run).


See Task Scheduler Overview in the TechNet Library for more information about automating
tasks with the Windows Task Scheduler.

SQL Server Agent. The SQL Server Agent is a commonly used automation tool for SQL Server
related tasks. You can use it to create multistep jobs that can then be scheduled to run at
specific times. The types of step you can use include the following:

Operating System (CmdExec) steps that run a command line program.

PowerShell steps that run specific PowerShell commands.

SQL Server Analysis Services (SSAS) Command steps; for example, to process an SSAS data
model.

SQL Server Integration Services (SSIS) Package steps to run SSIS packages.

Transact-SQL steps to run Transact-SQL commands in a SQL Server database.


See SQL Server Agent for more information about how you can automate tasks using SQL
Server Agent.
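
As an illustration of the Windows Task Scheduler option, the following Python sketch assembles a schtasks.exe command line that would register a daily task running a PowerShell script at midnight. The task name and script path are hypothetical, and the code only builds the argument list; it does not create the task.

```python
from typing import List


def build_daily_task_command(task_name: str, script_path: str,
                             start_time: str = "00:00") -> List[str]:
    """Return schtasks.exe arguments for a daily PowerShell script run."""
    action = f'PowerShell.exe -File "{script_path}"'
    return ["schtasks.exe", "/Create",
            "/TN", task_name,     # name of the scheduled task
            "/TR", action,        # command that the task runs
            "/SC", "DAILY",       # schedule type
            "/ST", start_time]    # start time in 24-hour HH:mm format


# Hypothetical task name and script location.
command = build_daily_task_command("UploadDailyLogs",
                                   r"C:\Scripts\UploadLogs.ps1")
print(command)
```

On a Windows computer the list could be passed to `subprocess.run(command, check=True)` by an account that is allowed to create scheduled tasks.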

SQL Server Agent offers greater flexibility and manageability than Windows Task Scheduler, but it
requires a SQL Server instance. If you are already planning to use SQL Server, and particularly SSIS, in
your solution then SQL Server Agent is generally the best way to automate scheduled execution of tasks.
However, Windows Task Scheduler offers an effective alternative when SQL Server is not available.
You may also be able to use the Azure Scheduler service in your Azure cloud service applications to
execute some types of tasks. Azure Scheduler can make HTTP requests to other services, and monitor
the outcome of these requests. It is unlikely to be used for initiating on-premises applications and tasks.
However, you might find it useful for accessing an HDInsight cluster directly to perform operations such
as transferring data and performing management tasks within the cluster; many of these tasks expose a
REST API that Azure Scheduler could access. For more information see Azure Scheduler on the Azure
website.

Scheduling data refresh in consumers

You can use the Windows Task Scheduler and SQL Server Agent to schedule execution of an SSIS
package, console application, or PowerShell script. However, reports and data models that consume the
output of the processing workflow might need to be refreshed on their own schedule. You can process
SSAS data models in an SSIS control flow by using a SQL Server Agent job or by using PowerShell to run
an XMLA command in the SSAS server, but PowerPivot data models that are stored in Excel workbooks
cannot be processed using this technique and must be refreshed separately. Similarly, the refreshing
of SQL Server Reporting Services (SSRS) reports that make use of caching or snapshots must be managed
separately.

Scheduled data refresh for PowerPivot data models in SharePoint Server

PowerPivot workbooks that are shared in an on-premises SharePoint Server site can be refreshed
interactively on-demand, or the workbook owner can define a schedule for automatic data refresh. In a
regularly occurring big data process, the owners of shared workbooks are the data stewards for the data
models they contain. As such, they must take responsibility for scheduling data refresh at the earliest
possible time after updated data is available in order to keep the data models (and reports based on
them) up to date.
For data refresh to be successful, the SharePoint Server administrator must have enabled an unattended
service account for the PowerPivot service and this account must have access to all data sources in the
workbook, as well as all required system rights for Kerberos delegation. The SharePoint administrator
can also specify a range of business hours during which automatic scheduled refresh can occur.
For more information see PowerPivot Data Refresh with SharePoint 2013.

Scheduled data refresh for reports in Power BI


Workbooks that have been published as reports in a Power BI site can be configured for automatic data
refresh on a specified schedule. This is useful in scenarios where you have used HDInsight to process
data and load it into a database (using Sqoop or SSIS), and then built Excel data models based on the
data in the database. In a regularly occurring big data process the owners of shared workbooks are the
data stewards for the data models they contain, and as such must take responsibility for scheduling data
refresh at the earliest possible time after updated data is available in order to keep the data models
(and reports based on them) up to date.
At the time this guide was written, automatic refresh was only supported for SQL Server and Oracle.
If the workbook makes use of on-premises data sources, the Power BI administrator must configure a
data management gateway that allows access to the on-premises data source from the Power BI service
in Office 365. This may be necessary if, for example, your big data process uses HDInsight to filter and
shape data that is then transferred to an on-premises SQL Server data warehouse.
For information about configuring a data management gateway for Power BI see Create a Data
Management Gateway. For information about scheduling automatic data refresh see Schedule data
refresh for workbooks in Power BI for Office 365.

Scheduled data refresh for SSRS reports

SSRS supports two techniques for minimizing requests to report data sources: caching and snapshots.
Both techniques are designed to improve the responsiveness of reports, and can be especially useful
when data in the data source changes only periodically.
Caching involves fetching the data required to render the report the first time the report is requested,
and maintaining a cached copy of the report until a scheduled time at which the cache expires. New
data is fetched from the data source on the first request after the cache has expired. In SQL Server 2014
Reporting Services you can cache both datasets and reports.
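
The caching behavior described above can be modeled in a few lines. The sketch below is a conceptual Python illustration and not the Reporting Services implementation; the `ReportCache` class is hypothetical, and it simplifies the scheduled expiry time to a fixed lifetime measured from the fetch. The clock is passed in so that the behavior can be exercised without waiting.

```python
from typing import Callable, Optional


class ReportCache:
    """Serve a cached report until it expires, then fetch fresh data."""

    def __init__(self, fetch: Callable[[], str], lifetime_seconds: float,
                 clock: Callable[[], float]) -> None:
        self._fetch = fetch
        self._lifetime = lifetime_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at: Optional[float] = None

    def get(self) -> str:
        now = self._clock()
        # Fetch on the first request, or on the first request after expiry.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._lifetime
        return self._value
```

Requests made before the expiry time never touch the data source, which is why a cached report can still be served while the underlying source is unavailable.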
Snapshots are pre-rendered reports that reflect the data at a specific point in time. You can schedule
snapshot creation and then use the snapshots to satisfy report requests, without needing to query the
data source again until the next scheduled snapshot.
In a big data process, these techniques can be useful not only for improving report rendering
performance but also for including data from Hive tables in reports, even when the HDInsight cluster
hosting the Hive tables has been deleted.
Caching and snapshot creation are controlled through the creation of schedules that are managed by
Reporting Services. For more information about using these techniques in SSRS see Performance,
Snapshots, Caching (Reporting Services).

Security

It is vital to consider how you can maximize security for all the applications and services you build and
use. This is particularly the case with distributed applications and services, such as big data solutions,
that move data over public networks and store data outside the corporate network.
Typical areas of concern for security in these types of applications are:

Securing the infrastructure

Securing credentials in scripts and applications

Securing data passing over the network

Securing data in storage

Securing the infrastructure


HDInsight runs on a set of Azure virtual machines that are provisioned automatically when you create a
cluster, and it uses an Azure SQL Database to store metadata for the cluster. The cluster is isolated and
provides external access through a secure gateway node that exposes a single point of access and
carries out user authentication. However, you must be aware of several points related to security of the
overall infrastructure of your solutions:

Ensure you properly protect the cluster by using passwords of appropriate complexity.

Ensure you protect your Azure storage keys and keep them secret. If a malicious user obtains
the storage key, he or she will be able to directly access the cluster data held in blob storage.

Protect credentials, connection strings, and other sensitive information when you need to use
them in your scripts or application code. See Securing credentials in scripts and applications for
more information.

If you enable remote desktop access to the cluster, use a suitably strong password and
configure the access to expire as soon as possible after you finish using it. Remote desktop users
do not have administrative level permissions on the cluster, but it is still possible to access and
modify the core Hadoop system, and read data and the contents of configuration files (which
contain security information and settings) through a remote desktop connection.

Consider if protecting your clusters by using a custom or third party gatekeeper implementation
that can authenticate multiple users with different credentials would be appropriate for your
scenario.

Use local security policies and features, such as file permissions and execution rights, for tools
or scripts that transmit, store, and process the data.

Securing credentials in scripts and applications


Scripts, applications, tools, and other utilities will require access to credentials in order to load data, run
jobs, or download the results from HDInsight. However, if you store credentials in plain text in your
scripts or configuration files you leave the cluster itself, and the data in Azure storage, open to anyone
who has access to these scripts or configuration files.
In production systems, and at any time when you are not just experimenting with HDInsight using test
data, you should consider how you will protect credentials, connection strings, and other sensitive
information in scripts and configuration files. Some solutions are:

Prompt the user to enter the required credentials as the script or application executes. This is a
common approach in interactive scenarios, but it is obviously not appropriate for automated
solutions where the script or application may run in unattended mode on a schedule, or in
response to a trigger event.

Store the required credentials in encrypted form in the configuration file. This approach is
typically used in .NET applications where sections of the configuration file can be encrypted
using the methods exposed by the .NET framework. See Encrypting Configuration Information
Using Protected Configuration for more information. You must ensure that only authorized
users can execute the application by protecting it using local security policies and features, such
as file permissions and execution rights.

Store the required credentials in a text file, a repository, a database, or Windows Registry in
encrypted form using the Data Protection API (DPAPI). This approach is typically used in
Windows PowerShell scripts. You must ensure that only authorized users can execute the script
by protecting it using local security policies and features, such as file permissions and execution
rights.

The article Working with Passwords, Secure Strings and Credentials in Windows PowerShell on the
TechNet wiki includes some useful examples of the techniques you can use.

Securing data passing over the network

HDInsight uses several protocols for communication between the cluster nodes, and between the
cluster and clients, including RPC, TCP/IP, and HTTP. Consider the following when deciding how to
secure data that passes across the network:

Use a secure protocol for all connections over the Internet to the cluster and to your Azure
storage account. Consider using Secure Sockets Layer (SSL) for the connection to your storage
account to protect the data on the wire (supported and recommended for Azure storage). Use
SSL or Transport Layer Security (TLS), or other secure protocols, where appropriate when
communicating with the cluster from client-side tools and utilities, and keep in mind that some
tools may not support SSL or may require you to specifically configure them to use SSL. When
accessing Azure storage from client-side tools and utilities, use the wasbs secure protocol (you
must specify the full path to a file when you use the wasbs protocol).

Consider if you need to encrypt data in storage and on the wire. This is not trivial, and may
involve writing custom components to carry out the encryption. If you create custom
components, use proven libraries of encryption algorithms to carry out the encryption process.
Note that the encryption keys must be available within your custom components running in
Azure, which may leave them vulnerable.
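
The requirement to specify a full file path with the wasbs protocol can be illustrated with a small helper. This Python sketch assumes the usual address form of container, then storage account, then blob path; the `build_wasbs_uri` function and the account, container, and file names are hypothetical.

```python
def build_wasbs_uri(account: str, container: str, blob_path: str) -> str:
    """Build a wasbs:// address that points at a single file."""
    if not blob_path or blob_path.endswith("/"):
        raise ValueError("wasbs requires the full path to a file")
    path = blob_path.lstrip("/")
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"


# Hypothetical account, container, and file names.
uri = build_wasbs_uri("examplestore", "logs", "2014/06/weblog.txt")
print(uri)
```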

Securing data in storage

Consider the following when deciding how to secure data in storage:

Do not store data that is not associated with your HDInsight processing jobs in the storage
accounts linked to a cluster. HDInsight has full access to all of the containers in linked storage
accounts because the account names and keys are stored in the cluster configuration. See
Cluster and storage initialization for details of how you can isolate parts of your data by using
separate storage accounts.

If you use non-linked storage accounts in an HDInsight job by specifying the storage key for
these accounts in the job files, the HDInsight job will have full access to all of the containers and
blobs in that account. Ensure that these non-linked storage accounts do not contain data that
must be kept private from HDInsight, and that the containers do not have public access
permission. See Using an HDInsight Cluster with Alternate Storage Accounts and Metastores
and Use Additional Storage Accounts with HDInsight Hive for more information.

Consider if using Shared Access Signatures (SAS) to provide access to data in Azure storage
would be an advantage in your scenario. SAS can provide fine-grained, controlled, and time-limited
access to data for clients. For more details see Create and Use a Shared Access
Signature.

Consider if you need to employ monitoring processes that can detect inappropriate access to
the data, and can alert operators to possible security breaches. Ensure that you have a process
in place to lock down access in this case, detect the scope of the security breach, and ensure
validation and integrity of the data afterwards. Hadoop can log access to data. Azure blob
storage also has a built-in monitoring capability; for more information see How To Monitor a
Storage Account and Azure Storage Account Monitoring and Logging.

Consider preprocessing or scrubbing the data to remove nonessential sensitive information
before storing it in remote locations such as Azure storage. If you need to stage the data before
processing, perhaps to remove personally identifiable information (a process sometimes
referred to as de-identification), consider using separate storage (preferably on-premises) for
the intermediate stage rather than the dedicated cluster storage. This provides isolation and
additional protection against accidental or malicious access to the sensitive data.

Consider encrypting sensitive data, sensitive parts of the data, or even whole folders and
subfolders. This may include splitting data into separate files, such as dividing credit card
information into different files that contain the card number and the related card-holder
information. Azure blob storage does not have a built-in encryption feature, and so you will
need to encrypt the data using encryption libraries and custom code, or with third-party tools.

If you are handling sensitive data that must be encrypted you will need to write custom
serializer and deserializer classes and install these in the cluster for use as the SerDe parameter
in Hive statements, or create custom map/reduce components that can manage the
serialization and deserialization. See the Apache Hive Developer Guide for more information
about creating a custom SerDe. However, consider that the additional processing requirements
for encryption impose a trade-off between security and performance.
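
The idea of splitting sensitive data into separate files, mentioned in the list above, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the field names `card_number` and `holder_name` and the `link_id` key are hypothetical, and the two output lists stand in for files that would be stored, and ideally encrypted, separately.

```python
import uuid
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_card_records(records: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate card numbers from holder details, linked by a random key."""
    card_rows: List[Row] = []
    holder_rows: List[Row] = []
    for record in records:
        link_id = uuid.uuid4().hex   # random key that reveals nothing itself
        card_rows.append({"link_id": link_id,
                          "card_number": record["card_number"]})
        holder_rows.append({"link_id": link_id,
                            "holder_name": record["holder_name"]})
    return card_rows, holder_rows


# Made-up sample record for illustration only.
cards, holders = split_card_records(
    [{"card_number": "0000-0000-0000-0000", "holder_name": "A. Example"}])
print(cards, holders)
```

Neither output on its own associates a card number with a person; the association can be rebuilt only by someone who has access to both files.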

Monitoring and logging


There are several ways to monitor an HDInsight cluster and its operation. These include:

Using the Azure cluster status page

Accessing the Hadoop status portals

Accessing the Hadoop-generated log files

Accessing Azure storage metrics

Considerations

Using the Azure cluster status page


The page for an HDInsight cluster in the Azure web management portal displays rudimentary
information for the cluster. This includes a dashboard showing information such as the number of map
and reduce jobs executed over the previous one or four hours, a range of settings and information
about the cluster, and a list of the linked resources such as storage accounts. The cluster status page
also contains monitoring information such as the accumulated, maximum, and minimum data for the
storage containers and running applications.
The configuration section of the portal page enables you to turn on and off Hadoop services for this
cluster and establish a remote desktop connection using RDP to the head node of the cluster. In
addition, the portal page contains a link to open the management page for the cluster.
The management page contains three sections. The Hive Editor provides a convenient way to
experiment with HiveQL commands and statements such as queries, and view the results. It may prove
useful if you are just exploring some data or developing parts of a more comprehensive solution. The
Job History section displays a list of jobs you have executed, and some basic information about each
one. The File Browser section allows you to view the files stored in the cluster's Azure blob storage.

Accessing the Hadoop status portals


Hadoop exposes status and monitoring information in two web portals installed on the cluster,
accessible remotely and through links named Hadoop Name Node Status and Hadoop YARN Status
located on the desktop of the remote cluster head node server. The Hadoop YARN Status portal
provides a wide range of information generated by the YARN resource manager, including information
about each node in the cluster, the applications (jobs) that are executing or have finished, job scheduler
details, current configuration of the cluster, and access to log files and metrics.
The portal also exposes a set of metrics that indicate in great detail the status and performance of each
job. These metrics can be used to monitor and fine tune jobs, and to locate errors and issues with your
solutions.

Accessing the Hadoop-generated log files


HDInsight stores its log files in both the cluster file system and in Azure storage. You can examine log
files in the cluster by opening a remote desktop connection to the cluster and browsing the file system
or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log
files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell
and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For a list of suitable tools and technologies for accessing Azure storage see Appendix A - Tools and
technologies reference. For examples of accessing Azure storage from custom tools see Building custom
clients in the section Consuming and visualizing data from HDInsight of this guide.

Accessing Azure storage metrics


Azure storage can be configured to log storage operations and access. You can use these logs, which
contain a wealth of information, for capacity monitoring and planning, and for auditing requests to
storage. The information includes latency details, enabling you to monitor and fine tune performance of
your solutions.
You can use the .NET SDK for Hadoop to examine the log files generated for the Azure storage that holds
the data for an HDInsight cluster. The HDInsight Log Analysis Toolkit is a command-line tool with utilities
for downloading and analyzing Azure storage logs. For more information see Microsoft .NET SDK For
Hadoop. A series of blogs from the Azure storage team also contains useful information, examples, and
a case study; see posts tagged analytics - logging & metrics for more details.
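
As a rough illustration of working with downloaded storage logs, the Python sketch below reads the leading fields of log entries and averages the end-to-end latency. It is not the HDInsight Log Analysis Toolkit. It assumes the semicolon-delimited version 1.0 log layout in which the sixth and seventh fields hold the end-to-end and server latency in milliseconds, and the sample entries are made up.

```python
import csv
from typing import Dict, List

# Assumed order of the first seven fields in a version 1.0 log entry.
LEADING_FIELDS = ["version", "request_start_time", "operation_type",
                  "request_status", "http_status_code",
                  "end_to_end_latency_ms", "server_latency_ms"]


def parse_log_line(line: str) -> Dict[str, str]:
    """Return the leading fields of one log entry as a dictionary."""
    # Quoted fields (such as URLs) may contain semicolons, so use csv.
    values = next(csv.reader([line], delimiter=";", quotechar='"'))
    return dict(zip(LEADING_FIELDS, values))


def average_latency_ms(lines: List[str]) -> float:
    """Average the end-to-end latency over a list of log entries."""
    latencies = [int(parse_log_line(line)["end_to_end_latency_ms"])
                 for line in lines]
    return sum(latencies) / len(latencies)


# Made-up sample entries, truncated after the latency fields.
SAMPLE_LINES = [
    "1.0;2014-06-19T22:59:23.1967767Z;GetBlob;Success;200;17;16",
    "1.0;2014-06-19T22:59:24.0000000Z;PutBlob;Success;201;23;20",
]
print(average_latency_ms(SAMPLE_LINES))
```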

Considerations

When implementing monitoring and logging for your solutions, consider the following points:

As with any remote service or application, managing and monitoring its operation may appear
to be more difficult than for a locally installed equivalent. However, remote management and
monitoring technologies are widely available, and are an accepted part of almost all
administration tasks. In many cases the extension of these technologies to cloud-hosted
services and applications is almost seamless.

Establish a monitoring and logging strategy that can provide useful information for detecting
issues early, debugging problematic jobs and processes, and for use in planning. For example, as
well as collecting runtime data and events, consider measuring overall performance, cluster
load, and other factors that will be useful in planning for data growth and future requirements.

The YARN portal in HDInsight, accessible remotely, can provide a wide range of information
about performance and events for jobs and for the cluster as a whole.

Configure logging and manage the log files for all parts of the process, not just the jobs within
Hadoop. For example, monitor and log data ingestion and data export where the tools support
this, or consider changing to a tool that can provide the required support for logging and
monitoring. Many tools and services, such as SSIS and Azure storage, will need to be configured
to provide an appropriate level of logging.

Consider maintaining data lineage tracking by adding an identifier to each log entry, or through
other techniques. This allows you to trace back the original source of the data and the
operation, and follow it through each stage to understand its consistency and validity.

Consider how you can collect logs from the cluster, or from more than one cluster, and collate
them for purposes such as auditing, monitoring, planning, and alerting. You might use a custom
solution to access and download the log files on a regular basis, and combine and analyze them
to provide a dashboard-like display with additional capabilities for alerting for security or failure
detection. Such utilities could be created using PowerShell, the HDInsight SDKs, or code that
accesses the Azure Service Management API.

Consider if a monitoring solution or service would be a useful benefit. A management pack for
HDInsight is available for use with Microsoft System Center (see the Microsoft Download Center
for more details). In addition, you can use third-party tools such as Chukwa and Ganglia to
collect and centralize logs. Many companies offer services to monitor Hadoop-based big data
solutions; some examples are Centerity, Compuware APM, Sematext SPM, and Zettaset
Orchestrator.
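
The lineage and log collation points above can be sketched in a few lines of Python. This is a minimal illustration, not part of any HDInsight tool: the function names, the JSON log layout, and the stage names are all assumptions made for the example. It tags each log entry with a lineage identifier, merges logs from more than one source into a single time-ordered stream, and then traces one batch of data through its stages.

```python
import heapq
import json
import uuid
from datetime import datetime, timedelta, timezone


def new_lineage_id() -> str:
    """Create an identifier that follows one batch of data through every stage."""
    return uuid.uuid4().hex


def log_entry(lineage_id: str, stage: str, message: str, when: datetime) -> str:
    """Format one log line as JSON so it can be parsed again when collating."""
    return json.dumps({
        "time": when.isoformat(),
        "lineage_id": lineage_id,
        "stage": stage,
        "message": message,
    })


def collate(*logs):
    """Merge several time-ordered logs into one time-ordered list of entries."""
    parsed = ([json.loads(line) for line in log] for log in logs)
    return list(heapq.merge(*parsed, key=lambda entry: entry["time"]))


def trace(entries, lineage_id):
    """Return the stages that one batch of data passed through, in order."""
    return [e["stage"] for e in entries if e["lineage_id"] == lineage_id]


# Example: two hypothetical sources, an ingestion tool and a cluster job.
start = datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc)
batch = new_lineage_id()
ingest_log = [log_entry(batch, "ingest", "file uploaded", start)]
cluster_log = [log_entry(batch, "hive-job", "query complete",
                         start + timedelta(minutes=5))]
print(trace(collate(ingest_log, cluster_log), batch))
```

The merge relies on every source writing timestamps in the same ISO 8601 form and time zone, so that comparing the strings orders the entries correctly. A real solution would also need to download the log files from the cluster or from storage before collating them.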

The following table illustrates how monitoring and logging considerations apply to each of the use cases
and models described in this guide.

Use case: Iterative data exploration
Considerations: In this model you are typically experimenting with data and do not have a long-term plan
for its use, or for the techniques you will discover for finding useful information in the data.
Therefore, monitoring is not likely to be a significant concern when using this model.
However, you may need to use the logging features of HDInsight to help discover the
optimum techniques for processing the data as you refine your investigation, and to debug
jobs.

Use case: Data warehouse on demand
Considerations: In this model you are likely to have established a regular process for uploading,
processing, and consuming data. Therefore, you should consider implementing a
monitoring and logging strategy that can detect issues early and assist in resolving them.
Typically, if you intend to delete and recreate the cluster on a regular basis, this will
require a custom solution using tools that run on the cluster or on-premises rather than
using a commercial monitoring service.

Use case: ETL automation
Considerations: In this model you may be performing scheduled data transfer operations, and so it is vital
to establish a robust monitoring and logging mechanism to detect errors and to measure
performance.

Use case: BI integration
Considerations: This model is usually part of an organization's core business functions, and so it is vital to
design a strategy that incorporates robust monitoring and logging features, and that can
detect failures early as well as providing ongoing data for forward planning. Monitoring for
security purposes, alerting, and auditing are likely to be important business requirements
in this model.

Tools and technologies reference 317

Appendix A - Tools and technologies reference

This appendix contains descriptions of the tools, APIs, SDKs, and technologies commonly used in
conjunction with big data solutions, including those built on Hadoop and HDInsight. The icons for each
item, listed below, will help you to identify the tools and technologies you should investigate.
Icon

Description
Data consumption: Tools, APIs, SDKs, and technologies commonly used for extracting and consuming
the results from Hadoop-based solutions.

Data ingestion: Tools, APIs, SDKs, and technologies commonly used for extracting data from data
sources and loading it into Hadoop-based solutions.

Data processing: Tools, APIs, SDKs, and technologies commonly used for processing, querying, and
transforming data in Hadoop-based solutions.

Data transfer: Tools, APIs, SDKs, and technologies commonly used for transferring data between
Hadoop and other data stores such as databases and cloud storage.

Data visualization: Tools, APIs, SDKs, and technologies commonly used for visualizing and analyzing
the results from Hadoop-based solutions.

Job submission: Tools, APIs, SDKs, and technologies commonly used for submitting jobs for
processing in Hadoop-based solutions.

Management: Tools, APIs, SDKs, and technologies commonly used for managing and monitoring
Hadoop-based solutions.

Workflow: Tools, APIs, SDKs, and technologies commonly used for creating workflows and managing
multi-step processing in Hadoop-based solutions.

318 Appendix A - Tools and technologies reference

The tools, APIs, SDKs, and technologies are listed in alphabetical order below.

Ambari

A solution for provisioning, managing, and monitoring Hadoop clusters using an intuitive, easy-to-use
Hadoop management web UI backed by REST APIs.
Usage notes: Only the monitoring endpoint was available in HDInsight at the time this guide was
written.
For more info, see Ambari.

Aspera

A tool for high-performance transfer and synchronization of files and data sets of virtually any size, with
full access control, privacy, and security. Provides maximum speed transfer under variable network
conditions.
Usage notes:

Uses a combination of UDP and TCP, which eliminates the latency issues typically encountered
when using only TCP.

Leverages existing WAN infrastructure and commodity hardware.

For more info, see Aspera.

Avro

A data serialization system that supports rich data structures, a compact, fast, binary data format, a
container file to store persistent data, remote procedure calls (RPC), and simple integration with
dynamic languages. Can be used with the client tools in the .NET SDK for Azure.
Usage notes:

Quick and easy to use and simple to understand.

Uses a JSON serialization approach.

The API supports C# and several other languages.

No monitoring or logging features built in.

For more info, see Avro.

AZCopy

A command-line utility designed for high performance that can upload and download Azure storage
blobs and files. Can be scripted for automation. Offers a number of functions to filter and manipulate
content. Provides resume and logging functions.
Usage notes:

Transfers to and from an Azure datacenter will be constrained by the connection bandwidth
available.

Configure the number of concurrent threads based on experimentation.

A PowerShell script can be created to monitor the logging files.

For more info, see AZCopy.

Azkaban

A framework for creating workflows that access Hadoop. Designed to overcome the problem of
interdependencies between tasks.
Usage notes:

Uses a web server to schedule and manage jobs, an executor server to submit jobs to Hadoop,
and either an internal H2 database or a separate MySQL database to store job details.

For more info, see Azkaban.

Azure Intelligent Systems Service (ISS)

A cloud-based service that can be used to collect data from a wide range of devices and applications,
apply rules that define automated actions on the data, and connect the data to business applications
and clients for analysis.
Usage notes:

Use it to capture, store, join, visualize, analyze, and share data.

Supports remote management and monitoring of data transfers.

Service was in preview at the time this guide was written.

For more info, see Azure Intelligent Systems Service (ISS).

Azure Storage Client Libraries

Exposes storage resources through a REST API that can be called by any language that can make
HTTP/HTTPS requests.
Usage notes:

Provides programming libraries for several popular languages that simplify many tasks by
handling synchronous and asynchronous invocation, batching of operations, exception
management, automatic retries, operational behavior, and more.

Libraries are currently available for .NET, Java, and C++. Others will be available over time.

For more info, see Azure Storage Client Libraries.

Azure Storage Explorer

A free GUI-based tool for viewing, uploading, and managing data in Azure blob storage. Can be used to
view multiple storage accounts at the same time in separate tab pages.

Create, view, copy, rename, and delete containers.

Create, view, copy, rename, delete, upload, and download blobs.

Blobs can be viewed as images, video, or text.

Blob properties can be viewed and edited.

For more info, see Azure Storage Explorer.

Azure SQL Database

A platform-as-a-service (PaaS) relational database solution in Azure that offers a minimal configuration,
low maintenance solution for applications and business processes that require a relational database
with support for SQL Server semantics and client interfaces.
Usage notes: A common work pattern in big data analysis is to provision the HDInsight cluster when it is
required, and decommission it after data processing is complete. If you want the results of the big data
processing to remain available in relational format for client applications to consume, you can transfer
the output generated by HDInsight into a relational database. Azure SQL Database is a good choice for
this when you want the data to remain in the cloud, and you do not want to incur the overhead of
configuring and managing a physical server or virtual machine running the SQL Server database engine.
For more info, see Azure SQL Database.

Casablanca

A project to develop support for writing native-code REST for Azure, with integration in Visual Studio.
Provides a consistent and powerful model for composing asynchronous operations based on C++ 11
features.
Usage notes:

Provides support for accessing REST services from native code on Windows Vista, Windows 7,
and Windows 8 through asynchronous C++ bindings to HTTP, JSON, and URIs.

Includes libraries for accessing Azure blob storage from native clients.

Includes a C++ implementation of the Erlang actor-based programming model.

Includes samples and documentation.

For more info, see Casablanca.

Cascading

A data processing API and processing query planner for defining, sharing, and executing data processing
workflows. Adds an abstraction layer over the Hadoop API to simplify development, job creation, and
scheduling.
Usage notes:

Can be deployed on a single node to efficiently test code and process local files before being
deployed on a cluster, or in a distributed mode that uses Hadoop.

Uses a metaphor of pipes (data streams) and filters (data operations) that can be assembled to
split, merge, group, or join streams of data while applying operations to each data record or
groups of records.

For more info, see Cascading.
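
The pipes and filters metaphor can be illustrated without Cascading itself. The following Python sketch is not the Cascading API (which is Java-based); the function names and the "key,value" record layout are assumptions made for the example. Each function is a filter that consumes a stream of records and passes a new stream to the next stage, and the final stage groups the records and applies an operation to each group.

```python
from collections import defaultdict


def parse(lines):
    """Filter: turn each 'key,value' text record into a (key, int) tuple."""
    for line in lines:
        key, value = line.split(",")
        yield key.strip(), int(value)


def keep_positive(records):
    """Filter: drop records whose value is not positive."""
    return (record for record in records if record[1] > 0)


def group_sum(records):
    """Group records by key and apply an operation (sum) to each group."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def run_pipeline(lines):
    """Assemble the pipe: each stage consumes the stream the previous one yields."""
    return group_sum(keep_positive(parse(lines)))


print(run_pipeline(["a,1", "b,2", "a,3", "b,-5"]))
```

Because the stages are generators, records flow through one at a time, which loosely mirrors how a single-node test run processes local files before the same assembly is deployed on a cluster.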

Cerebrata Azure Management Studio

A comprehensive environment for managing Azure-hosted applications. Can be used to access Azure
storage, Azure log files, and manage the life cycle of applications. Provides a dashboard-style UI.
Usage notes:

Connects through a publishing file and enables use of groups and profiles for managing users
and resources.

Provides full control of storage accounts, including Azure blobs and containers.

Enables control of diagnostics features in Azure.

Provides management capabilities for many types of Azure service including SQL Database.

For more info, see Cerebrata Azure Management Studio.

Chef

An automation platform that transforms infrastructure into code. Allows you to automate configuration,
deployment and scaling for on-premises, cloud-hosted, and hybrid applications.
Usage notes: Available as a free open source version, and an enterprise version that includes additional
management features such as a portal, authentication and authorization management, and support for
multi-tenancy. Also available as a hosted service.
For more info, see Chef.

Chukwa

An open source data collection system for monitoring large distributed systems, built on top of HDFS
and map/reduce. Also includes a flexible and powerful toolkit for displaying, monitoring and analyzing
results.
Has five primary components:

Agents that run on each machine and emit data.

Collectors that receive data from the agent and write it to stable storage.

ETL Processes for parsing and archiving the data.

Data Analytics Scripts that aggregate Hadoop cluster health information.

Hadoop Infrastructure Care Center, a web-portal style interface for displaying data.

For more info, see Chukwa.

CloudBerry Explorer

A free GUI-based file manager and explorer for browsing and accessing Azure storage.
Usage notes:

Also available as a paid-for Professional version that adds encryption, compression, multithreaded data transfer, file comparison, and FTP/SFTP support.

For more info, see CloudBerry Explorer.

CloudXplorer

An easy-to-use GUI-based explorer for browsing and accessing Azure storage. Has a wide range of
features for managing storage and transferring data, including access to compressed files. Supports
auto-resume for file transfers.
Usage notes:

Multithreaded upload and download support.

Provides full control of data in Azure blob storage, including metadata.

Auto-resume upload and download of large files.

No logging features.

For more info, see CloudXplorer.

Cross-platform Command Line Interface (X-plat CLI)

An open source command line interface for developers and IT administrators to develop, deploy and
manage Azure applications. Supports management tasks on Windows, Linux, and Mac OS X. Commands can be
extended using Node.js.
Usage notes:

Can be used to manage almost all features of Azure including accounts, storage, databases,
virtual machines, websites, networks, and mobile services.

The open source license allows for reuse of the library.

Does not support SSL for data transfers.

You must add the path to the command line PATH list.

On Windows it is easier to use PowerShell.

For more info, see Cross-platform Command Line Interface (X-plat CLI).

D3.js

A high performance JavaScript library for manipulating documents based on data. It supports large
datasets and dynamic behaviors for interaction and animation, and can be used to generate attractive
and interactive output for reporting, dashboards, and any data visualization task. Based on web
standards such as HTML, SVG and CSS to expose the full capabilities of modern browsers.
Usage notes:

Allows you to bind arbitrary data to a Document Object Model (DOM) and then apply data-driven
transformations to the document, such as generating an HTML table from an array of
numbers and using the same data to create an interactive SVG bar chart with smooth transitions
and interaction.

Provides a powerful declarative approach for selecting nodes and can operate on arbitrary sets
of nodes called selections.

For more info, see D3.js.

Falcon

A framework for simplifying data management and pipeline processing that enables automated
movement and processing of datasets for ingestion, pipelines, disaster recovery, and data retention.
Runs on one server in the cluster and is accessed through the command-line interface or the REST API.
Usage notes:

Replicates HDFS files and Hive Tables between different clusters for disaster recovery and
multi-cluster data discovery scenarios.

Manages data eviction policies.

Uses entity relationships to allow coarse-grained data lineage.

Automatically manages the complex logic of late data handling and retries.

Uses higher-level data abstractions (Clusters, Feeds, and Processes) enabling separation of
business logic from application logic.

Transparently coordinates and schedules data workflows using the existing Hadoop services
such as Oozie.

For more info, see Falcon.

FileCatalyst

A client-server based file transfer system that supports common and secure protocols (UDP, FTP, FTPS,
HTTP, HTTPS), encryption, bandwidth management, monitoring, and logging.
Usage notes:

FileCatalyst Direct features are available by installing the FileCatalyst Server and one of the
client-side options.

Uses a combination of UDP and TCP, which eliminates the latency issues typically encountered
when using only TCP.

For more info, see FileCatalyst.

Flume

A distributed, robust, and fault tolerant tool for efficiently collecting, aggregating, and moving large
amounts of log file data. Has a simple and flexible architecture based on streaming data flows and with a
tunable reliability mechanism. The simple extensible data model allows for automation using Java code.
Usage notes:

Includes several plugins to support various sources, channels, sinks and serializers. Well
supported third party plugins are also available.

Easily scaled due to its distributed architecture.

You must manually configure SSL for each agent. Configuration can be complex and requires
knowledge of the infrastructure.

Provides a monitoring API that supports custom and third party tools.

For more info, see Flume.

Ganglia

A scalable distributed monitoring system that can be used to monitor computing clusters. It is based on
a hierarchical design targeted at federations of clusters, and uses common technologies such as XML for
data representation, XDR for compact, portable data transport, and RRDtool for data storage and
visualization.
Comprises the monitoring core, a web interface, an execution environment, a Python client, a command
line interface, and RSS capabilities.
For more info, see Ganglia.

Hadoop command line

Provides access to Hadoop to execute the standard Hadoop commands. Supports scripting for managing
Hadoop jobs and shows the status of commands and jobs.
Usage notes:

Accessed through a remote desktop connection.

Does not provide administrative level access.

Focused on Hadoop and not HDInsight.

You must create scripts or batch files for operations you want to automate.

Does not support SSL for uploading data.

Requires knowledge of Hadoop commands and operating procedures.

For more info, see Commands Manual.

Hamake

A workflow framework based on directed acyclic graph (DAG) principles for scheduling and managing
sequences of jobs by defining datasets and ensuring that each is kept up to date by executing Hadoop
jobs.
Usage notes:

Generalizes the programming model for complex tasks through dataflow programming and
incremental processing.

Workflows are defined in XML and can include iterative steps and asynchronous operations over
more than one input dataset.

For more info, see Hamake.
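
The idea of keeping each dataset up to date by running only the jobs that are needed can be sketched as follows. This is a simplified Python illustration of the general principle and does not use Hamake's XML workflow syntax; the function name, the rule layout, and the use of plain numbers as modification times are assumptions made for the example. A dataset is rebuilt when it is missing, when any input is newer than it, or when any input is itself due to be rebuilt.

```python
from graphlib import TopologicalSorter


def plan(rules, modified):
    """Return the datasets that must be rebuilt, in dependency order.

    rules maps each output dataset to the list of datasets it is built from.
    modified maps each existing dataset to its last-modified time.
    """
    # static_order() yields every dataset after all of the datasets it depends on.
    order = TopologicalSorter(rules).static_order()
    stale = []
    for dataset in order:
        inputs = rules.get(dataset)
        if not inputs:
            continue  # a source dataset: no job produces it
        missing = dataset not in modified
        outdated = any(
            name in stale or modified.get(name, 0) > modified.get(dataset, 0)
            for name in inputs
        )
        if missing or outdated:
            stale.append(dataset)
    return stale


rules = {"clean": ["raw"], "report": ["clean", "lookup"]}
modified = {"raw": 10, "clean": 5, "lookup": 1, "report": 7}
print(plan(rules, modified))
```

In the example, "raw" is newer than "clean", so "clean" must be rebuilt, and "report" must then be rebuilt because it depends on "clean". A real workflow engine would submit a Hadoop job for each stale dataset rather than simply listing it.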

HCatalog

Provides a tabular abstraction layer that helps unify the way that data is interpreted across processing
interfaces, and provides a consistent way for data to be loaded and stored, regardless of the specific
processing interface being used. This abstraction exposes a relational view over the data, including
support for partitions.
Usage notes:

Easy to incorporate into solutions. Files in JSON, SequenceFile, CSV, and RC format can be read
and written by default, and a custom SerDe can be used to read and write files in other formats.

Enables notification of data availability, making it easier to write applications that perform
multiple jobs.

Additional effort is required in custom map/reduce components because custom load and store
functions must be created.

For more info, see HCatalog.

HDInsight SDK and Microsoft .NET SDK for Hadoop

The HDInsight SDKs provide the capability to create clients that can manage the cluster, and execute
jobs in the cluster. Available for .NET development and other languages such as Node.js. WebHDFS
client is a .NET wrapper for interacting with WebHDFS compliant end-points in Hadoop and Azure
HDInsight. WebHCat is the REST API for HCatalog, a table and storage management layer for Hadoop.
Can be used for a wide range of tasks including:

Creating, customizing, and deleting clusters.

Creating and submitting map/reduce, Pig, Hive, Sqoop, and Oozie jobs.

Configuring Hadoop components such as Hive and Oozie.

Serializing data with Avro.

Using Linq To Hive to query and extract data.

Accessing the Ambari monitoring system.

Performing storage log file analysis.

Accessing the WebHCat (for HCatalog) and WebHDFS services.

For more info, see HDInsight SDK and Microsoft .NET SDK For Hadoop.

Hive

An abstraction layer over the Hadoop query engine that provides a query language called HiveQL, which
is syntactically very similar to SQL and supports the ability to create tables of data that can be accessed
remotely through an ODBC connection. Hive enables you to create an interface to your data that can be
used in a similar way to a traditional relational database.
Usage notes:

Data can be consumed from Hive tables using tools such as Excel and SQL Server Reporting
Services, or through the ODBC driver for Hive.

HiveQL allows you to plug in custom mappers and reducers to perform more sophisticated
processing.

A good choice for processes such as summarization, ad hoc queries, and analysis on data that
has some identifiable structure; and for creating a layer of tables through which users can easily
query the source data, and data generated by previously executed jobs.

For more info, see Hive.

Kafka

A distributed, partitioned, replicated service with the functionality of a messaging system. Stores data as
logs across servers in a cluster and exposes the data through consumers to implement common
messaging patterns such as queuing and publish-subscribe.
Usage notes:

Uses the concept of topics that are fed to Kafka by producers. The data is stored in the
distributed cluster servers, each of which is referred to as a broker, and accessed by consumers.

Data is exposed over TCP, and clients are available in a range of languages.

Data lifetime is configurable, and the system is fault tolerant through the use of replicated
copies.

For more info, see Kafka.
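
The way a stored log can serve both queuing and publish-subscribe patterns can be shown with a toy model. The following Python sketch is an in-memory stand-in written for this illustration only; it is not the Kafka client API, and the class name, method names, and the idea of a named consumer group holding a read position are assumptions made for the example. Messages are appended to a per-topic log and are never removed by reading; each group simply remembers how far it has read.

```python
class Broker:
    """Toy in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = {}      # topic -> list of messages, in arrival order
        self._offsets = {}   # (group, topic) -> position of the next unread message

    def produce(self, topic, message):
        """Append a message to the end of the topic's log."""
        self._logs.setdefault(topic, []).append(message)

    def consume(self, group, topic):
        """Return the next unread message for this consumer group, or None."""
        log = self._logs.get(topic, [])
        position = self._offsets.get((group, topic), 0)
        if position >= len(log):
            return None
        self._offsets[(group, topic)] = position + 1
        return log[position]


broker = Broker()
broker.produce("clicks", "page-1")
broker.produce("clicks", "page-2")
print(broker.consume("reporting", "clicks"))
print(broker.consume("archiving", "clicks"))
```

Consumers that share a group name share one read position, so each message is handed to only one of them (queuing). Consumers in different groups each see every message (publish-subscribe). Partitioning, replication, and data lifetime are deliberately left out of this sketch.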

Knox

A system that provides a single point of authentication and access for Hadoop services in a cluster.
Simplifies Hadoop security for users who access the cluster data and execute jobs, and for operators
who control access and manage the cluster.
Usage notes:

Provides perimeter security to make Hadoop security setup easier.

Supports authentication and token verification security scenarios.

Delivers users a single cluster end-point that aggregates capabilities for data and jobs.

Enables integration with enterprise and cloud identity management environments.

Manages security across multiple clusters and multiple versions of Hadoop.

For more info, see Knox.

LINQ to Hive

A technology that supports authoring Hive queries using Language-Integrated Query (LINQ). The LINQ is
compiled to Hive and then executed on the Hadoop cluster.
Usage notes: The LINQ code can be executed within a client application or as a user-defined function
(UDF) within a Hive query.
For more info, see LINQ to Hive.

Mahout

A scalable machine learning and data mining library used to examine data files to extract specific types
of information. It provides an implementation of several machine learning algorithms, and is typically
used with source data files containing relationships between the items of interest in a data processing
solution.
Usage notes:

A good choice for grouping documents or data items that contain similar content;
recommendation mining to discover users' preferences from their behavior; assigning new
documents or data items to a category based on the existing categorizations; and performing
frequent data mining operations based on the most recent data.

For more info, see Mahout.

Management portal

The Azure Management portal can be used to configure and manage clusters, execute HiveQL
commands against the cluster, browse the file system, and view cluster activity. It shows a range of
settings and information about the cluster, and a list of the linked resources such as storage accounts. It
also provides the ability to connect to the cluster through RDP.
Provides rudimentary monitoring features including:

The number of map and reduce jobs executed.

Accumulated, maximum, and minimum data for containers in the storage accounts.

Accumulated, maximum, and minimum data and running applications.

A list of jobs that have executed and some basic information about each one.

For more info, see Get started using Hadoop 2.2 in HDInsight.

Map/reduce

Map/reduce code consists of two functions: a mapper and a reducer. The mapper is run in parallel on
multiple cluster nodes, each node applying it to its own subset of the data. The output from the mapper
function on each node is then passed to the reducer function, which collates and summarizes the results
of the mapper function.
Usage notes:

A good choice for processing completely unstructured data by parsing it and using custom logic
to obtain structured information from it; for performing complex tasks that are difficult (or
impossible) to express in Pig or Hive without resorting to creating a UDF; for refining and
exerting full control over the query execution process, such as using a combiner in the map
phase to reduce the size of the map process output.

For more info, see Map/reduce.
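
The mapper, combiner, and reducer roles described above can be sketched as a word count that runs in a single Python process. This is an illustration of the programming model only and does not use the Hadoop API; the function names and the use of in-memory lists to stand in for the data held by each node are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate one node's mapper output to reduce the data passed on."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reducer(key, values):
    """Collate all the values emitted for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run the job; each split stands in for the subset of data on one node."""
    combined = [
        combiner(chain.from_iterable(mapper(line) for line in split))
        for split in splits
    ]
    # Shuffle: bring together the values for each key from every node.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(combined):
        grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


print(run_job([["a b a"], ["b c"]]))
```

Without the combiner the first node would pass three pairs to the reducer; with it, the two pairs for "a" are merged into one before they leave the node.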

Microsoft Excel

One of the most commonly used data analysis and visualization tools in BI scenarios. It includes native
functionality for importing data from a wide range of sources, including HDInsight (via the Hive ODBC
driver) and relational databases such as SQL Server. Excel also provides native data visualization tools,
including tables, charts, conditional formatting, slicers, and timelines.
Usage notes: After HDInsight has been used to process data, the results can be consumed and visualized
in Excel. Excel can consume output from HDInsight jobs directly from Hive tables in the HDInsight cluster
or by importing output files from Azure storage, or through an intermediary querying and data modeling
technology such as SQL Server Analysis Services.
For more info, see Microsoft Excel.

Azure SDK for Node.js

A set of modules for Node.js that can be used to manage many features of Azure.
Includes separate modules for:

Core management

Compute management

Web Site management

Virtual Network management

Storage Account management

SQL Database management

Service Bus management

For more info, see Azure SDK for Node.js.

Oozie

A tool that enables you to create repeatable, dynamic workflows for tasks to be performed in a Hadoop
cluster. Actions encapsulated in an Oozie workflow can include Sqoop transfers, map/reduce jobs, Pig
jobs, Hive jobs, and HDFS commands.
Usage notes:

Defining an Oozie workflow requires familiarity with the XML-based syntax used to define the
Directed Acyclic Graph (DAG) for the workflow actions.

You can initiate Oozie workflows from the Hadoop command line, a PowerShell script, a custom
.NET application, or any client that can submit an HTTP request to the Oozie REST API.

For more info, see Oozie.
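
As an illustration of submitting a workflow through an HTTP request, the following Python sketch builds (but does not send) a job submission request. It assumes the request format of the Oozie Web Services API as generally documented: a POST to the v1/jobs resource with an XML configuration document as the body. The base URL, user name, and workflow path shown are placeholders, the helper function names are hypothetical, and authentication is omitted; check the Oozie documentation for the version you are using before relying on the details.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Serialize job properties as the XML configuration document Oozie expects."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="utf-8")


def build_submit_request(base_url, properties):
    """Create a POST request that submits and starts a workflow job."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/jobs?action=start",
        data=build_configuration(properties),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )


# Placeholder values: replace with the details of your own cluster and workflow.
request = build_submit_request(
    "https://example.invalid/oozie",
    {
        "user.name": "admin",
        "oozie.wf.application.path": "/example/workflow",
    },
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Building the request separately from sending it makes the payload easy to inspect and test before it is used against a live cluster.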

Phoenix

A client-embedded JDBC driver designed to perform low latency queries over data stored in Apache
HBase. It compiles standard SQL queries into a series of HBase scans, and orchestrates the running of
those scans to produce standard JDBC result sets. It also supports client-side batching and rollback.
Usage notes:

Supports all common SQL query statement clauses including SELECT, FROM, WHERE, GROUP BY,
HAVING, ORDER BY, and more.

Supports a full set of DML commands and DDL commands including table creation and versioned
incremental table alteration.

Allows columns to be defined dynamically at query time. Metadata for tables is stored in an
HBase table and versioned so that snapshot queries over prior versions automatically use the
correct schema.

For more info, see Phoenix.

Pig

A high-level data-flow language and execution framework for parallel computation that provides a
workflow semantic for processing data in HDInsight. Supports complex processing of the source data to
generate output that is useful for analysis and reporting. Pig statements generally involve defining
relations that contain data. Relations can be thought of as result sets, and can be based on a schema or
can be completely unstructured.
Usage notes: A good choice for restructuring data by defining columns, grouping values, or converting
columns to rows; transforming data such as merging and filtering data sets, and applying functions to all
or subsets of records; and as a sequence of operations that is often a logical way to approach many
map/reduce tasks.
For more info, see Pig.

Power BI

A service for Office 365 that builds on the data modeling and visualization capabilities of PowerPivot,
Power Query, Power View, and Power Map to create a cloud-based collaborative platform for
self-service BI. Provides a platform for users to share the insights they have found when analyzing and
visualizing the output generated by HDInsight, and to make the results of big data processing
discoverable for other, less technically proficient, users in the enterprise.
Usage notes:

Users can share queries created with Power Query to make data discoverable across the
enterprise through Online Search. Data visualizations created with Power View can be published
as reports in a Power BI site, and viewed in a browser or through the Power BI Windows Store
app. Data models created with PowerPivot can be published to a Power BI site and used as a
source for natural language queries using the Power BI Q&A feature.

By defining queries and data models that include the results of big data processing, users
become data stewards and publish the data in a way that abstracts the complexities of consuming
and modeling data from HDInsight.

For more info, see Power BI.

Power Map

An add-in for Excel that is available to Office 365 enterprise-level subscribers. Power Map enables users
to create animated tours that show changes in geographically-related data values over time, overlaid on
a map.
Usage notes: When the results of big data processing include geographical and temporal fields you can
import the results into an Excel worksheet or data model and visualize them using Power Map.
For more info, see Power Map.

Power Query

An add-in for Excel that you can use to define, save, and share queries. Queries can be used to retrieve,
filter, and shape data from a wide range of data sources. You can import the results of queries into
worksheets or into a workbook data model, which can then be refined using PowerPivot.
Usage notes: You can use Power Query to consume the results of big data processing in HDInsight by
defining a query that reads files from the Azure blob storage location that holds the output of big data
processing jobs. This enables Excel users to consume and visualize the results of big data processing,
even after the HDInsight cluster has been decommissioned.
For more info, see Power Query.

Power View

An add-in for Excel that enables users to explore data models by creating interactive data visualizations.
It is also available as a SharePoint Server application service when SQL Server Reporting Services is
installed in SharePoint-Integrated mode, enabling users to create data visualizations from PowerPivot
workbooks and Analysis Services data models in a web browser.
Usage notes: After the results of a big data processing job have been imported into a worksheet or data
model in Excel you can use Power View to explore the data visually. With Power View you can create a
set of related interactive data visualizations, including column and bar charts, pie charts, line charts, and
maps.
For more info, see Power View.

PowerPivot

A data modeling add-in for Excel that can be used to define tabular data models for slice and dice
analysis and visualization in Excel. You can use PowerPivot to combine data from multiple sources into a
tabular data model that defines relationships between data tables, hierarchies for drill-up/down
aggregation, and calculated fields and measures.

Usage notes: In a big data scenario you can use PowerPivot to import a result set generated by
HDInsight as a table into a data model, and then combine that table with data from other sources to
create a model for mash-up analysis and reporting.
For more info, see PowerPivot.

PowerShell

A powerful scripting language and environment designed to manage infrastructure and perform a wide
range of operations. Can be used to implement almost any manual or automated scenario. A good
choice for automating big data processing when there is no requirement to build a custom user
interface or integrate with an existing application. Additional packages of cmdlets and functionality are
available for Azure and HDInsight. The PowerShell Integrated Scripting Environment (ISE) also provides a
useful client environment for testing and exploring.
When working with HDInsight it can be used to perform a wide range of tasks including:

Provisioning and decommissioning HDInsight clusters and Azure storage.

Uploading data and code files to Azure storage.

Submitting map/reduce, Pig, Hive, Sqoop, and Oozie jobs.

Downloading and displaying job results.

Usage notes:

It supports SSL and includes commands for logging and monitoring actions.

Installed with Windows, though not all early versions offer optimal performance and range of
operations.

For optimum performance and capabilities all systems must be running the latest version.

Very well formed and powerful language, but has a reasonably steep learning curve for new
adopters.

You can schedule PowerShell scripts to run automatically, or initiate them on-demand.

For more info, see PowerShell.

Puppet

Automates repetitive tasks, such as deploying applications and managing infrastructure, both on-premises and in the cloud.
Usage notes: The Enterprise version can automate tasks at any stage of the IT infrastructure lifecycle,
including: discovery, provisioning, OS and application configuration management, orchestration, and
reporting.
For more info, see Puppet.

Reactive Extensions (Rx)

A library that can be used to compose asynchronous and event-based programs using observable
collections and LINQ-style query operators. Can be used to create stream-processing solutions for
capturing, storing, processing, and uploading data. Supports multiple asynchronous data streams from
different sources.
Usage notes:

Download as a NuGet package.

SSL support can be added using code.

Can be used to address very complex streaming and processing scenarios, but all parts of the
solution must be created using code.

Requires a high level of knowledge and coding experience, although plenty of documentation
and samples are available.

For more info, see Reactive Extensions (Rx).
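
The composition style described above can be illustrated with a minimal sketch. It is written in Python and does not use the Rx library itself; the Observable class and its where, select, and subscribe methods are hypothetical names chosen only to echo LINQ-style query operators, and the sensor readings are invented sample data.

```python
class Observable:
    """Hypothetical push-based sequence for illustration; this is not the Rx API."""

    def __init__(self, subscribe_fn):
        # subscribe_fn receives an on_next callback and pushes items to it.
        self._subscribe_fn = subscribe_fn

    @staticmethod
    def from_iterable(items):
        # Push every item to the observer as soon as it subscribes.
        def subscribe_fn(on_next):
            for item in items:
                on_next(item)
        return Observable(subscribe_fn)

    def where(self, predicate):
        # Forward only the items that satisfy the predicate.
        def subscribe_fn(on_next):
            self.subscribe(lambda x: on_next(x) if predicate(x) else None)
        return Observable(subscribe_fn)

    def select(self, projection):
        # Forward a transformed version of every item.
        def subscribe_fn(on_next):
            self.subscribe(lambda x: on_next(projection(x)))
        return Observable(subscribe_fn)

    def subscribe(self, on_next):
        self._subscribe_fn(on_next)


def collect_high_readings(readings, threshold):
    """Compose a filter and a projection, then collect what the stream emits."""
    results = []
    (Observable.from_iterable(readings)
        .where(lambda r: r["value"] > threshold)
        .select(lambda r: (r["sensor"], r["value"]))
        .subscribe(results.append))
    return results
```

In a real solution the source would be a live event stream rather than a list, and the subscriber would store or upload the data instead of appending it to a list.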

Remote Desktop Connection

Allows you to remotely connect to the head node of the HDInsight cluster and gain access to the
configuration and command line tools for the underlying HDP as well as the YARN and NameNode status
portals.

Usage notes:

You must specify a validity period after which the connection is automatically disabled.

Not recommended for use in production applications but is useful for experimentation and one-off jobs, and for accessing Hadoop files and configuration on the cluster.

For more info, see Remote Desktop Connection.

REST APIs

Provide access to Hadoop services and Azure services.
Hadoop REST APIs include WebHCat (HCatalog), WebHDFS, and Ambari.
Azure REST APIs include storage management and file access, and SQL Database management features.
Requires a client tool that can create REST calls, or use of a custom application (typically using the
HDInsight SDKs). REST-capable clients include:

Simple Microsoft Azure REST API Sample Tool.

Fiddler add-in for the Internet Explorer browser.

Postman add-in for the Google Chrome browser.

RESTClient add-in for the Mozilla Firefox browser.

Utilities such as cURL, which is available for a wide range of platforms including Windows.
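
As a rough illustration of what a REST-capable client sends, the following Python sketch builds a WebHDFS request URL of the general form http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>. The function name webhdfs_url is hypothetical, and the host name and port used in the usage note are placeholders. The sketch only constructs the URL; it does not send a request or handle authentication, and the endpoint exposed by a particular cluster may differ.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, operation, **params):
    """Build a WebHDFS URL such as .../webhdfs/v1/<path>?op=LISTSTATUS."""
    query = {"op": operation.upper()}
    query.update(params)
    # quote() leaves "/" unescaped by default, so the folder structure is kept.
    return "http://{}:{}/webhdfs/v1/{}?{}".format(
        host, port, quote(path.lstrip("/")), urlencode(query))
```

For example, webhdfs_url("headnode.example.com", 50070, "/example/data", "liststatus") produces a URL that lists the contents of a folder, and the result could be passed to cURL or any of the browser add-ins listed above.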

Samza

A distributed stream processing framework that uses Kafka for messaging and Hadoop YARN to provide
fault tolerance, processor isolation, security, and resource management.
Usage notes:

Provides a very simple callback-based API comparable to map/reduce for processing messages in
the stream.

Pluggable architecture allows use with many other messaging systems and environments.

Some fault-tolerance features are still under development.

For more info, see Samza.

Signiant

A system that uses managers and agents to automate media technology and file-based transfers and
workflows. Can be integrated with existing IT infrastructure to enable highly efficient file-based
workflows.
Usage notes:

Subscription-based software for accelerated movement of large files between users.

Signiant Managers+Agents is a system-to-system solution that handles the administration,
control, management and execution of all system activity, including workflow modeling, from a
single platform.

For more info, see Signiant.

Solr

A highly reliable, scalable, and fault tolerant enterprise search platform from the Apache Lucene project
that provides powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic
clustering, database integration, rich document (such as Word and PDF) handling, and geospatial search.
Usage notes:

Includes distributed indexing, load-balanced querying, replication, and automated failover and
recovery.

REST-like HTTP/XML and JSON APIs make it easy to use from virtually any programming
language.

Wide ranging customization is possible using external configuration and an extensive plugin
architecture.

For more info, see Solr.
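
To show what the HTTP API looks like from a client, the following Python sketch builds a Solr select query URL that requests JSON output and, optionally, facet counts for one field. The function name solr_select_url is hypothetical, and the base URL, core name, and field names in the usage note are placeholders. The sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10, facet_field=None):
    """Build a /select URL using the standard q, rows, wt, and facet parameters."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to return value counts for the given field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return "{}/{}/select?{}".format(base_url.rstrip("/"), core, urlencode(params))
```

For example, solr_select_url("http://localhost:8983/solr", "products", "title:hadoop", facet_field="category") returns a URL that any HTTP client can request.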

SQL Server Analysis Services (SSAS)

A component of SQL Server that enables enterprise-level data modeling to support BI. SSAS can be
deployed in multidimensional or tabular mode, and in either mode can be used to define a dimensional
model of the business to support reporting, interactive analysis, and key performance indicator (KPI)
visualization through dashboards and scorecards.
Usage notes: SSAS is commonly used in enterprise BI solutions where large volumes of data in a data
warehouse are pre-aggregated in a data model to support BI applications and reports. As organizations
start to integrate the results of big data processing into their enterprise BI ecosystem, SSAS provides a
way to combine traditional BI data from an enterprise data warehouse with new dimensions and
measures that are based on the results generated by HDInsight data processing jobs.
For more info, see SQL Server Analysis Services (SSAS).

SQL Server Database Engine

The core component of SQL Server that provides an enterprise-scale database engine to support online
transaction processing (OLTP) and data warehouse workloads. You can install SQL Server on an on-premises server (physical or virtual) or in a virtual machine in Azure.
Usage notes: A common work pattern in big data analysis is to provision the HDInsight cluster when it is
required, and decommission it after data processing is complete. If you want the results of the big data
processing to remain available in relational format for client applications to consume, you must transfer
the output generated by HDInsight into a relational database. The SQL Server database engine is a good
choice for this when you want to have full control over server and database engine configuration, or
when you want to combine the big data processing results with data that is already stored in a SQL
Server database.
For more info, see SQL Server Database Engine.

SQL Server Data Quality Services (DQS)

A SQL Server instance feature that consists of knowledge base databases containing rules for data
domain cleansing and matching, and a client tool that enables you to build a knowledge base and use it
to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and
de-duplication of your data.
Usage notes:

The DQS Cleansing component can be used to cleanse data as it passes through a SQL Server
Integration Services (SSIS) data flow. A similar DQS Matching component is available on
CodePlex to support data de-duplication in a data flow.

Master Data Services (MDS) can make use of a DQS knowledge base to find duplicate business
entity records that have been imported into an MDS model.

DQS can use cloud-based reference data services provided by reference data providers to
cleanse data, for example by verifying parts of mailing addresses.

For more info, see SQL Server Data Quality Services (DQS).

SQL Server Integration Services (SSIS)

A component of SQL Server that can be used to coordinate workflows that consist of automated tasks.
SSIS workflows are defined in packages, which can be deployed and managed in an SSIS Catalog on an
instance of the SQL Server database engine.
SSIS packages can encapsulate complex workflows that consist of multiple tasks and conditional
branching. In particular, SSIS packages can include data flow tasks that perform full ETL processes to
transfer data from one data store to another while applying transformations and data cleaning logic
during the workflow.
Usage notes:

Although SSIS is often primarily used as a platform for implementing data transfer solutions, in a
big data scenario it can also be used to coordinate the various disparate tasks required to ingest,
process, and consume data using HDInsight.

SSIS packages are created using the SQL Server Data Tools for Business Intelligence (SSDT-BI)
add-in for Visual Studio, which provides a graphical package design interface.

Completed packages can be deployed to an SSIS Catalog in SQL Server 2012 or later instances, or
they can be deployed as files.

Package execution can be automated using SQL Server Agent jobs, or you can run them from the
command line using the DTExec.exe utility.

To use SSIS in a big data solution, you require at least one instance of SQL Server.

For more info, see SQL Server Integration Services (SSIS).

SQL Server Reporting Services (SSRS)

A component of SQL Server that provides a platform for creating, publishing, and distributing reports.
SSRS can be deployed in native mode, where reports are viewed and managed in a Report
Manager website, or in SharePoint-Integrated mode, where reports are viewed and managed in a
SharePoint Server document library.
Usage notes: When big data analysis is incorporated into enterprise business operations, it is common
to include the results in formal reports. Report developers can create reports that consume big data
processing results directly from Hive tables (via the Hive ODBC Driver) or from intermediary data models
or databases, and publish those reports to a report server for on-demand viewing or automated
distribution via email subscriptions.
For more info, see SQL Server Reporting Services (SSRS).

Sqoop

An easy-to-use tool designed for efficiently transferring bulk data between Apache Hadoop and
structured data stores such as relational databases. It automates most of this process, relying on the
database to describe the schema for the data to be imported, and uses map/reduce to import and
export the data in order to provide parallel operation and fault tolerance.
Usage notes:

Simple to use and supports automation as part of a solution.

Can be included in an Oozie workflow.

Uses a pluggable architecture with drivers and connectors.

Can support SSL by using Oracle Wallet.

Transfers data to HDFS or Hive by using HCatalog.

For more info, see Sqoop.
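
As a sketch of how a Sqoop transfer might be automated as part of a solution, the following Python function assembles the argument list for a sqoop export command that copies delimited files from cluster storage into a relational table. The function name sqoop_export_args is hypothetical, and the connection string, table, and folder shown in the usage note are placeholders. The sketch only builds the arguments; running them assumes that the sqoop command and a suitable JDBC driver are available where the script executes.

```python
def sqoop_export_args(jdbc_url, table, export_dir, username=None,
                      field_delimiter="\\t", num_mappers=None):
    """Return the argument list for a 'sqoop export' invocation."""
    args = [
        "sqoop", "export",
        "--connect", jdbc_url,          # JDBC connection string for the target database
        "--table", table,               # existing table that receives the rows
        "--export-dir", export_dir,     # folder that holds the files to export
        # The escape sequence is passed as text for Sqoop to interpret.
        "--input-fields-terminated-by", field_delimiter,
    ]
    if username:
        args += ["--username", username]
    if num_mappers is not None:
        # Controls how many parallel map tasks perform the transfer.
        args += ["--num-mappers", str(num_mappers)]
    return args
```

For example, sqoop_export_args("jdbc:sqlserver://server.example.com;database=sales", "Results", "/example/output", num_mappers=4) returns a list that could be passed to subprocess.run or written into a scheduled script.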

Storm

A distributed real-time computation system that provides a set of general primitives. It is simple, and
can be used with any programming language. Supports a high-level abstraction called Trident that provides
exactly-once processing, transactional data store persistence, and a set of common stream analytics
operations.
Usage notes:

Supports SSL for data transfer.

Tools are available to create and manage the processing topology and configuration.

Topology and parallelism must be manually fine-tuned, requiring some expertise.

Supports logging that can be viewed through the Storm Web UI and a reliability API that allows
custom tools and third-party services to provide performance monitoring. Some third-party tools
support full real-time monitoring.

For more info, see Storm.

StreamInsight

A component of SQL Server that can be used to perform real-time analytics on streaming and other
types of data. Supports using the Observable/Observer pattern and an Input/Output adaptor model with
LINQ processing capabilities and an administrative GUI. Could be used to capture events to a local file
for batch upload to the cluster, or write the event data directly to the cluster storage. Code could
append events to an existing file, create a new file for each event, or create a new file based on
temporal windows in the event stream.
Usage notes:

Can be deployed on-premises and in an Azure Virtual Machine.

Events are implemented as classes or structs, and the properties defined for the event class
provide the data values for visualization and analysis.

Logging can be done in code to any suitable sink.

Monitoring information is available by using the diagnostic views API, which requires the Management
Web Service to be enabled and connected.

Provides a complex event processing (CEP) solution out of the box, including debugging tools.

For more info, see StreamInsight.
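
The option of creating a new file based on temporal windows in the event stream can be sketched as follows. This is Python rather than StreamInsight code, and it assumes fixed-size (tumbling) windows aligned to the Unix epoch; the function names and the file naming pattern are hypothetical. It groups event payloads under the name of the file that each event would be written to.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(timestamp, window):
    """Return the start of the tumbling window that contains timestamp."""
    epoch = datetime(1970, 1, 1)
    # Integer division of two timedeltas gives the number of whole windows elapsed.
    return epoch + ((timestamp - epoch) // window) * window


def group_events_by_window(events, window=timedelta(minutes=5)):
    """Map a file name per window to the payloads of the events in that window."""
    files = defaultdict(list)
    for timestamp, payload in events:
        start = window_start(timestamp, window)
        name = "events-{}.txt".format(start.strftime("%Y%m%dT%H%M%S"))
        files[name].append(payload)
    return dict(files)
```

Each resulting group could then be written to a local file for batch upload, or directly to the cluster storage, as described above.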

System Center management pack for HDInsight

Simplifies the monitoring process for HDInsight by providing capabilities to discover, monitor, and
manage HDInsight clusters deployed on an Analytics Platform System (APS) Appliance or Azure. Provides
views for proactive monitoring alerts, health and performance dashboards, and performance metrics for
Hadoop at the cluster and node level.
Usage notes:

Enables near real-time diagnosis and resolution of issues detected in HDInsight.

Includes a custom diagram view that has detailed knowledge about cluster structure and the
health states of host components and cluster services.

Requires the 2012 or 2012 SP1 version of System Center.

Provides context-sensitive tasks to stop or start a host component, a cluster service, or all cluster
services at once.

For more info, see System Center management pack for HDInsight.

Visual Studio Server Explorer

A feature available in all except the Express versions of Visual Studio. Provides a GUI-based explorer for
Azure features, including storage, with facilities to upload, view, and download files.
Usage notes:

Simple and convenient to use.

No cost solution for existing Visual Studio users.

Also provides access and management features for SQL Database, useful when using a custom
metastore with an HDInsight cluster.

For more info, see Server Explorer.

More information
For more details about pre-processing and loading data, and the considerations you should be aware of,
see the section Collecting and loading data into HDInsight of this guide.
For more details about processing the data using queries and transformations, and the considerations
you should be aware of, see the section Processing, querying, and transforming data using HDInsight of
this guide.
For more details about consuming and visualizing the results, and the considerations you should be
aware of, see the section Consuming and visualizing data from HDInsight of this guide.
For more details about automating and managing solutions, and the considerations you should be aware
of, see the section Building end-to-end solutions using HDInsight of this guide.

Copyright
This document is provided as-is. Information and views expressed in this document, including URL and
other Internet web site references, may change without notice.
Some examples depicted herein are provided for illustration only and are fictitious. No real association or
connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft
product. You may copy and use this document for your internal, reference purposes.
© 2014 Microsoft. All rights reserved.
Microsoft, Bing, Bing logo, C++, Excel, HDInsight, MSDN, Office 365, SQL Azure, Visual Studio, Windows,
and Windows PowerShell are trademarks of the Microsoft group of companies. All other trademarks are
property of their respective owners.